Word Embeddings:
Decoding Word2Vec
Module 2: Core Architecture & Components
"You shall know a word by the company it keeps." - John Rupert Firth. This linguistic theory is the mathematical foundation of Word2Vec.
The Problem with Bag of Words
Before neural embeddings, Natural Language Processing relied heavily on Bag of Words (BoW) and TF-IDF. These methods are built on One-Hot Encoding: each word is represented as a massive, sparse vector (mostly zeros) whose length equals the size of the vocabulary.
The critical flaw? Orthogonality. In a one-hot representation, the distance between "King" and "Queen" is the same as the distance between "King" and "Apple". These representations capture frequency, but completely fail to capture semantic relationships.
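This flaw is easy to see numerically. In the sketch below (a toy vocabulary, not from the original text), every pair of distinct one-hot vectors has a cosine similarity of exactly zero, no matter how related the words are:

```python
import numpy as np

# Toy vocabulary of 4 words, one-hot encoded (hypothetical example)
vocab = ["king", "queen", "apple", "dog"]
one_hot = np.eye(len(vocab))  # each row is one word's sparse vector

def cosine(u, v):
    # cosine of the angle between u and v
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Every distinct pair is orthogonal: similarity 0.0, regardless of meaning
print(cosine(one_hot[0], one_hot[1]))  # king vs queen -> 0.0
print(cosine(one_hot[0], one_hot[2]))  # king vs apple -> 0.0
```

The model literally cannot tell that "King" is closer to "Queen" than to "Apple".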
Enter Dense Embeddings
Word embeddings solve this by mapping words to a low-dimensional, dense vector space (typically 100 to 300 dimensions). Instead of a vector whose length equals the size of the entire dictionary, a word like "Dog" becomes a 300-value array of real numbers.
How Word2Vec Learns: The Architectures
Developed by Google researchers in 2013, Word2Vec employs a shallow, two-layer neural network to reconstruct linguistic context. It does this via two primary algorithms:
1. CBOW (Continuous Bag of Words)
The model predicts the target word by looking at the surrounding context words. It treats context as a single observation. It is significantly faster to train and has slightly better accuracy for frequent words.
2. Skip-Gram
The inverse of CBOW. It uses the target word to predict the surrounding context words. Skip-gram is slower to train but performs exceptionally well with small amounts of training data and represents rare words better.
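The two architectures differ only in how they frame the prediction task over the same sliding window. A minimal sketch (toy sentence and window size chosen for illustration) of the training pairs each one generates:

```python
# How CBOW and Skip-gram frame the same window differently
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1  # words on each side of the target

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window),
                              min(len(sentence), i + window + 1))
               if j != i]
    # CBOW: the whole context (one observation) predicts the target
    cbow_pairs.append((context, target))
    # Skip-gram: the target predicts each context word separately
    for c in context:
        skipgram_pairs.append((target, c))

print(cbow_pairs[1])  # (['the', 'brown'], 'quick')
```

Because Skip-gram emits one training pair per (target, context-word) combination, it produces more updates per sentence, which is why it is slower but gives rare words more learning signal.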
⚡ Frequently Asked NLP Questions
What is Cosine Similarity in Word2Vec?
Cosine similarity measures the angle between two vectors in n-dimensional space. A cosine value of 1 means the vectors point in the exact same direction (highly similar semantics), 0 means they are orthogonal (unrelated), and -1 means they are exactly opposite.
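The definition above translates directly into a few lines of NumPy (a sketch, not tied to any particular library's implementation):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))    # ≈  1.0 (same direction)
print(cosine_similarity(a, -a))   # ≈ -1.0 (opposite direction)
print(cosine_similarity(np.array([1.0, 0.0]),
                        np.array([0.0, 1.0])))  # 0.0 (orthogonal)
```

Note that cosine similarity ignores vector magnitude, which is why it is preferred over raw Euclidean distance for comparing embeddings.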
Why is vector math (King - Man + Woman = Queen) possible?
Because Word2Vec implicitly learns abstract features (like gender, royalty, plurality) distributed across its dimensions. When you subtract the vector for "Man" from the vector for "King" and add the vector for "Woman", the resulting point in the embedding space lands closest to the vector for "Queen".
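This can be demonstrated with a toy 2-D embedding where the axes are hand-chosen to stand for "royalty" and "gender" (illustrative values, not a trained model):

```python
import numpy as np

# Toy 2-D embeddings; axes roughly mean [royalty, gender]
emb = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.9, 0.0]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# King - Man + Woman: remove the "male" offset, add the "female" offset
query = emb["king"] - emb["man"] + emb["woman"]

# Nearest word by cosine similarity, excluding the three input words
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], query))
print(best)  # -> queen
```

Real embeddings have hundreds of entangled dimensions rather than two clean axes, but the same nearest-neighbour search (as in gensim's `most_similar`) recovers the analogy.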