MODULE 2 /// WORD EMBEDDINGS /// WORD2VEC /// GENSIM

Word Embeddings

Teach machines the meaning of language. Map vocabulary into dense n-dimensional vector space using Word2Vec architecture.


How do we make computers understand words? They only understand numbers. We need to turn words into vectors (lists of numbers).

Logic Tree

One-Hot to Dense

Understanding why large, sparse arrays fail to capture context, and how compressing them to dense vectors creates 'semantic space'.

Validation Node

What is the primary flaw of One-Hot Encoding in NLP?

Word Embeddings:
Decoding Word2Vec

Module 2: Core Architecture & Components

"You shall know a word by the company it keeps." - John Rupert Firth. This linguistic theory, known as the distributional hypothesis, is the mathematical foundation of Word2Vec.

The Problem with Bag of Words

Before neural embeddings, Natural Language Processing relied heavily on Bag of Words (BoW) and TF-IDF. These methods rest on One-Hot Encoding, representing each word as a massive, sparse vector (mostly zeros) with one dimension per vocabulary entry.

The critical flaw? Orthogonality. In a one-hot representation, the distance between "King" and "Queen" is the same as the distance between "King" and "Apple". They capture frequency, but completely fail to capture semantic relationships.
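A quick sketch of that flaw, using a hypothetical four-word vocabulary: every pair of distinct one-hot vectors is orthogonal and equidistant, so "King" is no closer to "Queen" than to "Apple".

```python
import numpy as np

# Toy vocabulary with made-up indices, purely for illustration.
vocab = {"king": 0, "queen": 1, "apple": 2, "dog": 3}

def one_hot(word, size=len(vocab)):
    v = np.zeros(size)
    v[vocab[word]] = 1.0  # exactly one 'hot' element
    return v

king, queen, apple = one_hot("king"), one_hot("queen"), one_hot("apple")

# Distinct one-hot vectors always have dot product 0 (orthogonal)
# and identical Euclidean distance sqrt(2) -- no semantics survive.
print(np.dot(king, queen))           # 0.0
print(np.linalg.norm(king - queen))  # 1.414...
print(np.linalg.norm(king - apple))  # 1.414... (the same!)
```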

Enter Dense Embeddings

Word embeddings solve this by mapping words to a low-dimensional, dense vector space (typically 100 to 300 dimensions). Instead of a vector as long as the entire dictionary, a word like "Dog" becomes, say, a 300-value array of real numbers.
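Mechanically, an embedding is just a lookup table: a (vocab_size x dimensions) matrix where each row is one word's dense vector. The matrix below is filled with random stand-in values; in a trained model those values are learned from context.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"dog": 0, "cat": 1, "car": 2}  # hypothetical tiny vocabulary

# One 300-dimensional row per word; random placeholders for learned weights.
embedding_matrix = rng.standard_normal((len(vocab), 300))

# Looking up "dog" is a single row index -> a dense 300-value array.
dog_vector = embedding_matrix[vocab["dog"]]
print(dog_vector.shape)  # (300,)
```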

How Word2Vec Learns: The Architectures

Developed by Google researchers in 2013, Word2Vec employs a shallow, two-layer neural network to reconstruct linguistic context. It does this via two primary algorithms:

1. CBOW (Continuous Bag of Words)

The model predicts the target word by looking at the surrounding context words. It treats context as a single observation. It is significantly faster to train and has slightly better accuracy for frequent words.

2. Skip-Gram

The inverse of CBOW. It uses the target word to predict the surrounding context words. Skip-gram is slower to train but performs exceptionally well with small amounts of training data and represents rare words better.

⚡ Frequently Asked NLP Questions

What is Cosine Similarity in Word2Vec?

Cosine similarity measures the angle between two vectors in n-dimensional space. A cosine value of 1 means the vectors point in the exact same direction (highly similar semantics), 0 means they are orthogonal (unrelated), and -1 means they are exactly opposite.
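The formula is simply the dot product of the two vectors divided by the product of their magnitudes. A minimal NumPy sketch showing the three boundary cases described above:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])

print(cosine_similarity(a, a))       # 1.0  -> same direction
print(cosine_similarity(a, -a))      # -1.0 -> exactly opposite
print(cosine_similarity(np.array([1.0, 0.0]),
                        np.array([0.0, 1.0])))  # 0.0 -> orthogonal
```

Note that the result ignores vector length entirely, which is why it is preferred over Euclidean distance for comparing word vectors.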

Why is vector math (King - Man + Woman = Queen) possible?

Because Word2Vec implicitly learns abstract features (like gender, royalty, and plurality) across its dimensions. When you subtract the "Man" vector from "King" and add the "Woman" vector, the resulting point in vector space lands closest to the word "Queen".
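The geometry can be shown with hand-crafted 2-D toy vectors whose axes we label "royalty" and "gender"; real embeddings learn such axes implicitly and never label them.

```python
import numpy as np

# Made-up 2-D vectors (dims: royalty, gender), purely for illustration.
vecs = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman: royalty is kept, gender flips from male to female.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

With a trained Gensim model, the equivalent query is `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`.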

NLP Glossary

Word Embedding
A learned representation for text where words with the same meaning have a similar representation in vector space.
One-Hot Encoding
A method of converting categorical variables into a binary matrix, where only one element is 'hot' (1).
CBOW
Continuous Bag of Words. An architecture that predicts a current word based on a window of surrounding context words.
Skip-Gram
An architecture that predicts surrounding context words based on a single current target word.
Gensim
An open-source library for unsupervised topic modeling and natural language processing in Python.
Cosine Similarity
A mathematical metric used to determine how similar two vectors are irrespective of their size.