Learning dense vector representations of words from text
Word2Vec, introduced by Mikolov et al. at Google in 2013, revolutionized NLP by learning dense vector representations of words in which semantic similarity corresponds to geometric proximity. The famous analogy “king − man + woman ≈ queen” demonstrated that these vectors capture meaningful relationships.
The Key Insight
Instead of one-hot vectors (sparse, no similarity), learn dense embeddings where:
- Similar words are close: vec(“cat”) ≈ vec(“dog”)
- Relationships are linear: vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)
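As a toy illustration of “close in vector space”, here is a cosine-similarity check on hand-made 3-dimensional vectors (the numbers are invented for illustration, not taken from a trained model):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: ~1.0 for similar directions, ~0 for unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings (illustrative values only)
vec = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

print(cosine(vec["cat"], vec["dog"]))  # high: similar words are close
print(cosine(vec["cat"], vec["car"]))  # low: dissimilar words are far
```

Real Word2Vec embeddings have 100–300 dimensions, but the geometry works the same way.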
Two Architectures
Skip-gram
Given a word, predict surrounding context words:
Objective: maximize the average log probability of context words given the center word:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$
CBOW (Continuous Bag of Words)
Given context words, predict the center word:
$$p(w_t \mid \text{context}) = \frac{\exp(u_{w_t}^{\top}\bar{v})}{\sum_{w \in V} \exp(u_{w}^{\top}\bar{v})}$$

where $\bar{v}$ is the average of the context word vectors.
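A minimal numpy sketch of the CBOW forward pass, using a tiny hypothetical vocabulary and random weights (a real model learns these weights by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "quick", "brown", "fox", "jumps"]
dim = 8
W_in = rng.normal(size=(len(vocab), dim))   # input (context) embeddings
W_out = rng.normal(size=(len(vocab), dim))  # output (prediction) embeddings

def cbow_probs(context_ids):
    """Average the context vectors, then softmax over all output vectors."""
    h = W_in[context_ids].mean(axis=0)   # averaged context vector
    scores = W_out @ h
    e = np.exp(scores - scores.max())    # stable softmax
    return e / e.sum()

# Context {"quick", "fox"} -> distribution over candidate center words
p = cbow_probs([vocab.index("quick"), vocab.index("fox")])
print(p)  # probabilities over the 5-word vocabulary, summing to 1
```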
Training with Negative Sampling
Computing the full softmax over the vocabulary is expensive (one term per vocabulary word for every update). Negative sampling approximates it: instead of normalizing over all words, contrast each positive (center, context) pair against $k$ random negative samples drawn from a noise distribution $P_n(w)$ (in practice, the unigram distribution raised to the 3/4 power). The per-pair objective is

$$\log \sigma\!\left(u_{w_O}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-u_{w_i}^{\top} v_{w_I}\right)\right]$$
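A numpy sketch of the per-pair skip-gram negative-sampling loss (vector names and dimensions are illustrative; a trainer would also compute gradients of this quantity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_center, u_pos, u_negs):
    """Negative-sampling loss for one (center, context) pair.

    The positive pair is scored with sigmoid(u_pos . v); each of the k
    negatives with sigmoid(-u_neg . v). No normalization over the full
    vocabulary is needed.
    """
    pos = np.log(sigmoid(u_pos @ v_center))
    neg = np.sum(np.log(sigmoid(-(u_negs @ v_center))))
    return -(pos + neg)  # minimized during training

rng = np.random.default_rng(0)
d, k = 16, 5
v = rng.normal(size=d)            # center word vector
u_pos = rng.normal(size=d)        # true context word
u_negs = rng.normal(size=(k, d))  # k sampled negative words
print(sgns_loss(v, u_pos, u_negs))
```

Minimizing this loss raises the score of the true pair and lowers the scores of the sampled negatives.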
The Training Process
Sentence: "The quick brown fox jumps"
Window size: 2
For center word "brown":
Positive pairs: (brown, The), (brown, quick), (brown, fox), (brown, jumps)
Negative samples: (brown, computer), (brown, elephant), ...
Objective: Push positive pairs together, negative pairs apart
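The positive-pair generation step above can be sketched in plain Python (note that a symmetric window of size 2 yields four positive pairs for “brown”):

```python
def training_pairs(tokens, window):
    """Generate (center, context) positive pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sent = "The quick brown fox jumps".split()
pairs = training_pairs(sent, window=2)
print([p for p in pairs if p[0] == "brown"])
# -> [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```

Negative samples are then drawn per positive pair from the noise distribution over the vocabulary, as described above.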
Why Linear Relationships?
The optimization objective creates a structure in which consistent relationships become consistent vector offsets:

vec(“king”) − vec(“man”) + vec(“woman”) ≈ vec(“queen”)

This emerges because words appearing in similar contexts get similar vectors, and the gender/royalty contrasts are expressed consistently across the corpus.
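A toy demonstration of solving an analogy by nearest-neighbor search over hand-built 2-d vectors (the vectors are invented so that the gender offset is shared; real embeddings come from training, and “prince” is added only as a distractor):

```python
import numpy as np

# Hand-built toy embeddings: the second axis encodes "male vs. female"
vec = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, 0.0]),
    "man":    np.array([0.0, 1.0]),
    "woman":  np.array([0.0, 0.0]),
    "prince": np.array([0.9, 1.0]),
}

def analogy(a, b, c):
    """Return the word whose vector is closest (cosine) to vec(a) - vec(b) + vec(c)."""
    target = vec[a] - vec[b] + vec[c]
    def cos(u, w):
        return u @ w / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-9)
    # Exclude the query words themselves, as standard analogy evaluation does
    candidates = {w: v for w, v in vec.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(target, candidates[w]))

print(analogy("king", "man", "woman"))  # -> "queen"
```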
Hyperparameters
| Parameter | Typical Value | Effect |
|---|---|---|
| Embedding dimension | 100-300 | Higher = more capacity |
| Window size | 5-10 | Smaller = more syntactic, larger = more topical |
| Negative samples | 5-20 | More = better for small datasets |
| Min word count | 5 | Filter rare words |
| Subsampling | 1e-3 to 1e-5 | Downsample frequent words |
Word2Vec vs. Later Methods
| Method | Year | Key Difference |
|---|---|---|
| Word2Vec | 2013 | Static embeddings, one vector per word |
| GloVe | 2014 | Global co-occurrence statistics |
| FastText | 2016 | Subword embeddings (handles OOV) |
| ELMo | 2018 | Context-dependent embeddings |
| BERT | 2018 | Deep bidirectional context |
Limitations
- One vector per word: “bank” (river) = “bank” (financial)
- No morphology: “run”, “running”, “runs” are unrelated
- Fixed vocabulary: Out-of-vocabulary words get no embedding
- Shallow context: Just neighboring words, not deep semantics
These limitations led to contextual embeddings (ELMo, BERT).
Historical Impact
Word2Vec:
- Made NLP research accessible (fast to train)
- Introduced the embedding paradigm
- Enabled transfer learning in NLP
- Demonstrated emergent structure in learned representations
- Inspired similar approaches for graphs, products, etc.
Key Papers
- Efficient Estimation of Word Representations in Vector Space – Mikolov et al., 2013 – https://arxiv.org/abs/1301.3781
- Distributed Representations of Words and Phrases and their Compositionality – Mikolov et al., 2013 – https://arxiv.org/abs/1310.4546
- GloVe: Global Vectors for Word Representation – Pennington et al., 2014 – https://nlp.stanford.edu/pubs/glove.pdf
- Enriching Word Vectors with Subword Information (FastText) – Bojanowski et al., 2016 – https://arxiv.org/abs/1607.04606