Word2Vec: Word Embeddings

Learning dense vector representations of words from text

Word2Vec, introduced by Mikolov et al. at Google in 2013, revolutionized NLP by learning dense vector representations of words where semantic similarity corresponds to geometric proximity. The famous analogy “king − man + woman ≈ queen” demonstrated that these vectors capture meaningful relationships.

The Key Insight

Instead of one-hot vectors (sparse, no similarity), learn dense embeddings where:

  • Similar words are close: vec(“cat”) ≈ vec(“dog”)
  • Relationships are linear: vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)
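These two properties can be illustrated with a tiny sketch. The vectors below are hand-made, illustrative values (not trained embeddings), and the vocabulary is hypothetical:

```python
import numpy as np

# Toy 3-d embeddings (illustrative values, not trained vectors).
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "apple": np.array([0.05, 0.1, 0.05]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen.
target = vec["king"] - vec["man"] + vec["woman"]
nearest = max((w for w in vec if w not in {"king", "man", "woman"}),
              key=lambda w: cosine(target, vec[w]))
print(nearest)  # queen
```

Real embeddings live in hundreds of dimensions, but the same nearest-neighbor search over the analogy vector is how “king − man + woman ≈ queen” is actually evaluated.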

Two Architectures

Skip-gram

Given a word, predict surrounding context words:

P(w_{context} \mid w_{center}) = \frac{\exp(v_{w_c}^T v_{w_t})}{\sum_{w \in V} \exp(v_w^T v_{w_t})}

Objective: Maximize probability of context words given center word.
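A minimal NumPy sketch of this softmax, using random vectors and a hypothetical 6-word vocabulary (Word2Vec keeps separate “input” vectors for center words and “output” vectors for context words):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                      # vocabulary size, embedding dimension
v_in  = rng.normal(size=(V, d))  # center-word ("input") vectors
v_out = rng.normal(size=(V, d))  # context-word ("output") vectors

def skipgram_probs(center):
    """P(w | center) for every w in V: softmax over output-vector scores."""
    scores = v_out @ v_in[center]        # v_w^T v_{w_t} for all w in V
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    return exp / exp.sum()

p = skipgram_probs(center=2)
print(round(p.sum(), 6))  # 1.0
```

Note the denominator touches every word in the vocabulary; this is the cost that negative sampling (below) avoids.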

CBOW (Continuous Bag of Words)

Given context words, predict the center word:

P(w_{center} \mid w_{context}) = \frac{\exp(v_{w_t}^T \bar{v}_{context})}{\sum_{w \in V} \exp(v_w^T \bar{v}_{context})}

where \bar{v}_{context} is the average of the context word vectors.
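The same sketch adapted to CBOW, again with random vectors and hypothetical word indices; the only change from Skip-gram is averaging the context vectors before scoring:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
v_in  = rng.normal(size=(V, d))  # context ("input") vectors
v_out = rng.normal(size=(V, d))  # center ("output") vectors

def cbow_probs(context_ids):
    """P(w | context): softmax of scores against the averaged context vector."""
    v_bar = v_in[context_ids].mean(axis=0)  # \bar{v}_context
    scores = v_out @ v_bar
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

p = cbow_probs([0, 1, 3, 4])  # context word indices around the unknown center
```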

Interactive Demo

Explore word embeddings and vector arithmetic:

[Interactive demo: words plotted by category (royalty, gender, animal, place), e.g. king/queen/prince/princess, man/woman/boy/girl, cat/dog/kitten/puppy, paris/france/tokyo/japan.]

Vector analogies: the relationship between “king” and “man” is captured as a vector; adding that same vector to “woman” gives “queen”:

vec(king) - vec(man) + vec(woman) ≈ vec(queen)

Skip-gram training example: in the sentence “the quick brown fox jumps”, given “brown”, predict “the”, “quick”, “fox”, “jumps”.

Training with Negative Sampling

Computing the full softmax over vocabulary is expensive. Negative sampling approximates it:

\log \sigma(v_{w_O}^T v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v_{w_i}^T v_{w_I}) \right]

Instead of normalizing over all words, contrast positive pairs against k random negative samples.
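A sketch of this objective in NumPy, with random vectors, a hypothetical vocabulary, and a uniform noise distribution for brevity (the original paper draws negatives from the unigram distribution raised to the 3/4 power):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
V, d, k = 100, 16, 5
v_in  = rng.normal(size=(V, d))  # input vectors  v_{w_I}
v_out = rng.normal(size=(V, d))  # output vectors v_{w_O}

def neg_sampling_loss(center, positive, noise_dist):
    """Negative of the objective above: one positive pair vs. k noise words."""
    negatives = rng.choice(V, size=k, p=noise_dist)
    pos = np.log(sigmoid(v_out[positive] @ v_in[center]))
    neg = np.log(sigmoid(-v_out[negatives] @ v_in[center])).sum()
    return -(pos + neg)  # minimize this; no sum over the full vocabulary

uniform = np.ones(V) / V  # stand-in for the unigram**0.75 noise distribution
loss = neg_sampling_loss(center=3, positive=7, noise_dist=uniform)
```

Only k + 1 dot products are computed per training pair, versus V for the full softmax.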

The Training Process

Sentence: "The quick brown fox jumps"
Window size: 2

For center word "brown":
  Positive pairs: (brown, quick), (brown, fox)
  Negative samples: (brown, computer), (brown, elephant), ...

Objective: Push positive pairs together, negative pairs apart
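The pair-generation step above can be sketched as follows; the helper name and the uniform negative sampling are illustrative simplifications (real implementations also skip the rare case where a noise word equals the true context word):

```python
import random

def training_pairs(tokens, window=2, k=2, seed=0):
    """Yield (positive_pair, negative_pairs) for each center/context position.
    Negatives are drawn uniformly from the vocabulary for brevity."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            positive = (center, tokens[j])
            negatives = [(center, rng.choice(vocab)) for _ in range(k)]
            yield positive, negatives

tokens = "the quick brown fox jumps".split()
pairs = list(training_pairs(tokens))
```

With window size 2, the center word "brown" yields exactly the positive contexts listed above: "the", "quick", "fox", "jumps".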

Why Linear Relationships?

The optimization objective creates a structure where:

\vec{v}_{king} - \vec{v}_{man} \approx \vec{v}_{queen} - \vec{v}_{woman}

This emerges because words appearing in similar contexts get similar vectors, and gender/royalty patterns are consistent across the corpus.

Hyperparameters

Parameter            Typical value    Effect
-------------------  ---------------  ------------------------------------------------
Embedding dimension  100-300          Higher = more capacity
Window size          5-10             Smaller = more syntactic, larger = more topical
Negative samples     5-20             More = better for rare words
Min word count       5                Filters out rare words
Subsampling          1e-3 to 1e-5     Downsamples frequent words
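Assuming the third-party gensim library (4.x API), these hyperparameters map onto the Word2Vec constructor roughly as follows; this is a configuration sketch on a toy corpus, not a tuned setup:

```python
from gensim.models import Word2Vec  # third-party: pip install gensim

sentences = [["the", "quick", "brown", "fox", "jumps"]]  # toy corpus

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimension
    window=5,         # context window size
    negative=5,       # number of negative samples
    min_count=1,      # min word count (1 here so the toy corpus survives)
    sample=1e-3,      # subsampling threshold for frequent words
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)
```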

Word2Vec vs. Later Methods

Method     Year   Key difference
---------  -----  --------------------------------------
Word2Vec   2013   Static embeddings, one vector per word
GloVe      2014   Global co-occurrence statistics
FastText   2016   Subword embeddings (handles OOV)
ELMo       2018   Context-dependent embeddings
BERT       2018   Deep bidirectional context

Limitations

  1. One vector per word: “bank” (river) = “bank” (financial)
  2. No morphology: “run”, “running”, “runs” are unrelated
  3. Fixed vocabulary: Out-of-vocabulary words get no embedding
  4. Shallow context: Just neighboring words, not deep semantics

These limitations led to contextual embeddings (ELMo, BERT).

Historical Impact

Word2Vec:

  • Made NLP research accessible (fast to train)
  • Introduced the embedding paradigm
  • Enabled transfer learning in NLP
  • Demonstrated emergent structure in learned representations
  • Inspired similar approaches for graphs, products, etc.

Key Papers

  • Mikolov et al. (2013), “Efficient Estimation of Word Representations in Vector Space” (introduces the CBOW and Skip-gram architectures)
  • Mikolov et al. (2013), “Distributed Representations of Words and Phrases and their Compositionality” (negative sampling and subsampling)
