A word is characterized by the company it keeps. Word embeddings turn that idea into geometry: they map a vocabulary into a vector space where distance reflects similarity of meaning.
1. Capturing Meaning with Dense Vectors
Older techniques like Bag of Words only count how often each word occurs, so they treat "car" and "automobile" as completely unrelated tokens. To capture similarity of meaning, we use Word Embeddings.
Instead of a massive, sparse array that is mostly zeros (one position per vocabulary word), an embedding is a small Dense Vector (typically 100 to 300 floating-point numbers). This vector places the word in a "semantic space" where related words lie close together, so a model can measure that "king" and "queen" are closely related concepts.
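To make the contrast concrete, here is a minimal sketch that compares the two representations with cosine similarity. The 5-word vocabulary and the dense values are made up for illustration (they are assumptions, not output from a trained model).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors over a hypothetical 5-word vocabulary.
car_sparse = [0, 0, 1, 0, 0]
automobile_sparse = [0, 0, 0, 1, 0]

# Made-up dense vectors, chosen only to illustrate the idea.
car_dense = [0.88, -0.23, 0.45]
automobile_dense = [0.85, -0.20, 0.50]

sparse_similarity = cosine(car_sparse, automobile_sparse)
dense_similarity = cosine(car_dense, automobile_dense)

print(sparse_similarity)  # 0.0: two different one-hot vectors never overlap
print(dense_similarity)   # close to 1.0 for these illustrative values
```

Two different one-hot vectors always have a similarity of zero, no matter how related the words are. Dense vectors can express degrees of relatedness.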
"""
Sparse Vector (Bag of Words):
'car' -> [0, 0, 1, 0, 0, 0...]
Dense Vector (Embedding):
'car' -> [0.88, -0.23, 0.45, ...]
"""
2. Word2Vec: Learning from Context
How do we find these floating-point numbers? We let a neural network learn them. The best-known algorithm for this is Word2Vec, introduced by researchers at Google in 2013.
Word2Vec operates on the Distributional Hypothesis: words that appear in similar contexts share similar meanings. By sliding a window across millions of sentences, the neural network adjusts the vectors so that words used in the same kinds of contexts, and words that often appear near each other (like "bark" and "dog"), end up close together in the vector space.
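The Distributional Hypothesis can be checked by hand on a toy corpus. The sketch below is a count-based stand-in, not Word2Vec itself: it collects the words seen inside a sliding window around each word and compares those context sets. The corpus, the function names, and the choice of Jaccard overlap are all assumptions made for illustration.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word].add(tokens[j])
    return contexts

def jaccard(a, b):
    """Share of context words that two sets have in common."""
    return len(a & b) / len(a | b)

# Hypothetical toy corpus, already tokenized.
corpus = [
    "the dog chased the ball".split(),
    "the puppy chased the ball".split(),
    "the bank approved the loan".split(),
]

ctx = context_sets(corpus)
dog_puppy = jaccard(ctx["dog"], ctx["puppy"])
dog_bank = jaccard(ctx["dog"], ctx["bank"])

print(dog_puppy > dog_bank)  # True: "dog" and "puppy" share more context
```

Word2Vec replaces this explicit counting with learned vectors, but the signal it learns from is the same: which words turn up in each other's windows.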
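from gensim.models import Word2Vec
# Toy corpus of tokenized sentences (far too small to give meaningful vectors)
sentences = [["the", "king", "rules"], ["the", "queen", "rules"]]
# The neural network learns the arrays automatically (gensim 4.x parameter names)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
king = model.wv["king"]    # illustrative values after real training: [0.95, -0.12, 0.44, ...]
queen = model.wv["queen"]  # illustrative values after real training: [0.92, -0.10, 0.48, ...]
3. CBOW vs Skip-Gram Architectures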
Word2Vec comes in two architectural flavors. Continuous Bag of Words (CBOW) looks at the surrounding context words and tries to predict the missing target word in the middle.
Skip-Gram does the opposite: it takes a single target word and tries to predict the surrounding context words. CBOW trains faster and handles frequent words well, while Skip-Gram is generally regarded as better at capturing fine-grained relationships and representing rare vocabulary.
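The difference between the two architectures is easiest to see in the training examples each one builds from the same window. The sketch below is a simplified illustration, with a hypothetical helper name and a fixed window of 2. It only constructs the examples and does not train anything.

```python
def training_examples(tokens, window=2):
    """Build CBOW and Skip-Gram training examples from one tokenized sentence."""
    cbow = []      # (context words, target word)
    skipgram = []  # (target word, one context word)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

tokens = "the cat sat on the mat".split()
cbow, skipgram = training_examples(tokens, window=2)

print(cbow[2])
# (['the', 'cat', 'on', 'the'], 'sat')
print([pair for pair in skipgram if pair[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

CBOW gets one example per position, with the whole context as input. Skip-Gram gets one example per (target, context) pair, which is part of why it trains more slowly. In gensim, the architecture is selected with the `sg` argument of `Word2Vec` (`sg=0` for CBOW, `sg=1` for Skip-Gram).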
# Sentence: "The cat sat on the mat", window of 2 words on each side
# CBOW: Predicts Target
# [The, cat, __, on, the] -> 'sat'
# Skip-Gram: Predicts Context
# 'sat' -> [The, cat, on, the]
4. GloVe: Global Statistics
Word2Vec is fundamentally a predictive neural network model. An alternative approach is GloVe (Global Vectors for Word Representation), developed at Stanford.
Instead of predicting within local windows one example at a time, GloVe first builds a large matrix that counts how often each pair of words co-occurs (within a context window) across the entire corpus. It then fits dense vectors whose dot products approximate the logarithm of those counts, a weighted least-squares fit that works much like matrix factorization. It achieves similar semantic power, but from aggregated global statistics rather than local prediction.
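The two stages can be sketched in a few lines: count co-occurrences over the whole corpus, then score a set of vectors with a GloVe-style weighted least-squares objective. This is a simplified illustration under stated assumptions. The toy corpus and function names are hypothetical, and raw counts are used without distance weighting. The weighting constants (`x_max=100`, `alpha=0.75`) follow the values reported in the GloVe paper. The optimization loop is omitted.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric co-occurrence counts over the whole corpus."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            hi = min(len(tokens), i + window + 1)
            for j in range(i + 1, hi):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

def weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(counts, w, w_ctx, b, b_ctx):
    """Weighted squared error between dot products (plus biases) and log counts."""
    total = 0.0
    for (i, j), x in counts.items():
        dot = sum(p * q for p, q in zip(w[i], w_ctx[j]))
        err = dot + b[i] + b_ctx[j] - math.log(x)
        total += weight(x) * err * err
    return total

# Hypothetical toy corpus, already tokenized.
corpus = ["ice is cold".split(), "ice is solid".split(), "steam is hot".split()]
counts = cooccurrence_counts(corpus)

# Start from all-zero vectors and biases; training would adjust these
# to drive the loss down.
vocab = {word for pair in counts for word in pair}
w = {word: [0.0, 0.0] for word in vocab}
w_ctx = {word: [0.0, 0.0] for word in vocab}
b = {word: 0.0 for word in vocab}
b_ctx = {word: 0.0 for word in vocab}

loss = glove_loss(counts, w, w_ctx, b, b_ctx)
print(counts[("ice", "is")])  # 2
print(loss > 0)               # True: zero vectors cannot match log(2)
```

Because the counts are gathered once for the whole corpus, each training step sees global statistics, which is the main design difference from Word2Vec's window-by-window prediction.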
# GloVe vs Word2Vec
# Word2Vec: Neural Prediction (Local windows)
# GloVe: Matrix Factorization (Global counts)
5. Vector Mathematics & Analogies
The most striking property of Word Embeddings is that some linguistic relationships can be expressed with vector addition and subtraction.
If you take the vector for "King", subtract the vector for "Man", and add the vector for "Woman", the result typically lands closest to the vector for "Queen" (once the three input words themselves are excluded from the search). The embedding space learns directions that correspond to relationships such as gender, geography, and syntax. These analogies hold approximately, and how well they work depends on the training corpus.
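# Analogical reasoning via math
# Assumes `model` is a trained gensim Word2Vec model (gensim 4.x)
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=1
)
print(result) # e.g. [('queen', 0.85)]; the exact score depends on the model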
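The arithmetic behind a call like `most_similar` can be sketched without any library. The 2-D vectors below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"), and the `analogy` helper is hypothetical. A library implementation such as gensim's differs in details like normalization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words are excluded, as analogy searches usually do.
    exclude = set(positive) | set(negative)
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

best = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(best)  # queen
```

Excluding the query words matters in practice: with real embeddings, the raw result of king - man + woman is often still closest to "king" itself.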