To a computer, 'happiness' is not a feeling—it's a coordinate in a high-dimensional space. Vectorization is the bridge between text and computation.
1. Language to Math
Computers process numbers, not letters. A machine learning model cannot run matrix multiplication on the word "apple". Before any NLP model can understand text—whether it's a simple spam filter or a complex LLM—we must convert our text into numerical arrays.
This process is called Vectorization (or Feature Extraction). It is the bridge between human language and machine computation. The goal is to represent text in a way that captures its meaning or structure mathematically.
"""
Raw Text:
"I love AI"
Vectorized Representation:
[1, 0, 1, 0, 0, 1]
"""
2. Bag of Words (BoW)
The most fundamental vectorization technique is the Bag of Words (BoW).
BoW works by first scanning the entire dataset to create a 'Vocabulary'—a master list of every unique word. Then, for each document, it creates an array equal in length to the vocabulary, counting how many times each word appears. It's called a 'bag' because it throws away all grammar, word order, and context. All that matters is frequency.
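The two steps described here (build a vocabulary, then count) can be sketched by hand in plain Python. This is a minimal illustration, not how any library implements it: the helper names `build_vocabulary` and `bag_of_words` are hypothetical, and it assumes simple lowercasing and whitespace tokenization.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Map each unique lowercase token to a column index (sorted for a stable order)."""
    tokens = {word for doc in corpus for word in doc.lower().split()}
    return {word: idx for idx, word in enumerate(sorted(tokens))}

def bag_of_words(doc, vocabulary):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

corpus = ["I love AI", "AI is the future"]
vocabulary = build_vocabulary(corpus)
vectors = [bag_of_words(doc, vocabulary) for doc in corpus]

print(vocabulary)  # {'ai': 0, 'future': 1, 'i': 2, 'is': 3, 'love': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 0, 1, 0], [1, 1, 0, 1, 0, 1]]
```

Note that scikit-learn's `CountVectorizer` tokenizes differently by default: it also lowercases, but its default token pattern ignores one-letter tokens such as "I", so its vocabulary for the same corpus is slightly smaller.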
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['I love AI', 'AI is the future']
vectorizer = CountVectorizer()
# Creates the frequency matrix
X = vectorizer.fit_transform(corpus)
3. The Context Flaw
While BoW is fast and easy to implement, it has a massive limitation: it completely destroys context.
Because it only counts frequencies, BoW sees the sentences "The dog bit the man" and "The man bit the dog" as mathematically identical. Furthermore, common words like "the", "is", and "and" will dominate the counts, overshadowing the rare, meaningful words that actually define the topic of the text.
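Both problems are easy to verify with a few lines of plain Python. The sketch below uses a hypothetical helper, `bow_counts`, with naive lowercasing and punctuation stripping; it only demonstrates the idea and is not a production tokenizer.

```python
from collections import Counter

def bow_counts(sentence):
    """Lowercase, strip basic punctuation, and count tokens (word order is discarded)."""
    cleaned = sentence.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())

a = bow_counts("The dog bit the man")
b = bow_counts("The man bit the dog")

# Opposite meanings, identical bags
same_bag = a == b

# The most frequent token is a function word, not a content word
most_common_word, most_common_count = a.most_common(1)[0]

print(same_bag)                             # True
print(most_common_word, most_common_count)  # the 2
```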
# Illustrative vocab: {'I':0, 'love':1, 'AI':2, 'is':3}
# 'I love AI' -> [1, 1, 1, 0]
# (CountVectorizer itself lowercases and drops one-letter tokens like 'I')
# Warning: "Good, not bad"
# and "Bad, not good" look identical.
4. TF-IDF: Smart Weighting
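To stop common words from dominating the counts, we use TF-IDF (Term Frequency-Inverse Document Frequency).
TF-IDF doesn't just count words; it scores their importance. A word that appears often in one specific document has a high term frequency (TF), which raises its score. But if that same word appears in *every* document in the dataset, its inverse document frequency (IDF) is low, and TF-IDF scales the score down. As a result, uninformative words like "the" end up with the lowest weights (exactly zero in the textbook formula; scikit-learn's default smoothed IDF leaves them with a small non-zero weight), while distinctive keywords that define a document are boosted.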
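To make the weighting concrete, here is a minimal sketch of the unsmoothed textbook formula, score = TF × log(N / DF), where N is the number of documents and DF is the number of documents containing the word. The function name `tf_idf` and the three-sentence corpus are made up for illustration, and scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Score each word per document with tf * log(N / df) (unsmoothed textbook form)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

corpus = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(corpus)

# 'the' appears in all three documents, so log(3 / 3) = 0 wipes out its score.
# 'sat' appears in only one document, so it outranks 'cat', which appears in two.
print(scores[0])
```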
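from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['I love AI', 'AI is the future']
# Down-weights words common across documents, boosts rare ones
tfidf = TfidfVectorizer()
# Returns a sparse matrix of TF-IDF scores, one row per document
X_tfidf = tfidf.fit_transform(corpus)
5. Sparse Matrices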
When you vectorize a large dataset (like Wikipedia), your vocabulary might contain 500,000 unique words. Every document then becomes an array of 500,000 numbers, and since a single document uses only a tiny fraction of the vocabulary, 99.9% or more of those numbers are zeros.
Storing all of this as ordinary dense arrays would exhaust memory quickly: 100,000 documents at 500,000 eight-byte values each would need about 400 GB. Scikit-learn avoids this by returning SciPy sparse matrices, a data structure that stores only the non-zero values and their positions, saving massive amounts of memory.
# High Weight: Rare, meaningful words
# Low Weight: Common 'stop words'
# Stored as a SciPy Sparse Matrix to save RAM
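The idea behind sparse storage can be sketched in plain Python by keeping only (row, column, value) triples for the non-zero cells. This is a toy coordinate-style layout with hypothetical helper names and a made-up 1,000-word vocabulary; SciPy's real formats (such as CSR) are more compact and much faster.

```python
def to_sparse(dense_rows):
    """Keep only non-zero entries as (row, column, value) triples."""
    return [
        (r, c, value)
        for r, row in enumerate(dense_rows)
        for c, value in enumerate(row)
        if value != 0
    ]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix to show that no information was lost."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r, c, value in triples:
        dense[r][c] = value
    return dense

# Two documents over a hypothetical 1,000-word vocabulary
n_cols = 1000
dense_rows = [[0] * n_cols for _ in range(2)]
dense_rows[0][3] = 1
dense_rows[0][42] = 2
dense_rows[1][999] = 1

triples = to_sparse(dense_rows)
dense_cells = len(dense_rows) * n_cols
stored_numbers = 3 * len(triples)  # row, column, and value per non-zero entry

print(triples)                      # [(0, 3, 1), (0, 42, 2), (1, 999, 1)]
print(dense_cells, stored_numbers)  # 2000 9
```

Even in this tiny case the sparse form stores 9 numbers instead of 2,000, and the gap widens as the vocabulary grows.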