Bag of Words & TF-IDF: Translating Text to Math

Pascual Vila
AI/NLP Instructor // Code Syllabus
Machine Learning algorithms cannot process raw text. They require numerical input. Text Vectorization is the bridge between human language and mathematical modeling, and it all starts with Bag of Words and TF-IDF.
The Baseline: Bag of Words (BoW)
The Bag of Words model is exactly what it sounds like: a text is represented as the bag of its words, disregarding grammar and even word order, but keeping track of frequency. It extracts a vocabulary from all the documents and creates a matrix where each row is a document, and each column represents a word from the vocabulary.
However, it suffers from two major issues: Sparsity (most documents don't use most words, resulting in matrices filled with zeros) and Lack of Semantic Meaning (the sentences "Dog bites man" and "Man bites dog" have the exact same BoW vector).
Enhancing with TF-IDF
To solve the issue of frequent but meaningless words (like "the", "a", "is") dominating the vectors, we use TF-IDF (Term Frequency-Inverse Document Frequency). It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
- Term Frequency (TF): Measures how frequently a term occurs in a document. The higher the count, the more important it seems.
- Inverse Document Frequency (IDF): Measures how important a term is across the entire corpus. It weighs down the frequent terms and scales up the rare ones.
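The two components multiply together. Here is a hand-rolled sketch of the classic textbook formula on a made-up corpus (note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization, so its exact numbers differ):

```python
import math

# Toy corpus, pre-tokenized by whitespace
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    # Term Frequency: raw count of the term in this document
    return doc.count(term)

def idf(term, corpus):
    # Inverse Document Frequency: log(total docs / docs containing the term)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in 2 of 3 docs -> low IDF; "mat" appears in only 1 -> high IDF
print(tfidf("the", docs[0], docs))  # 2 * ln(3/2)
print(tfidf("mat", docs[0], docs))  # 1 * ln(3/1)
```

Even though "the" occurs twice as often as "mat" in the first document, its low IDF pulls its final weight below that of the rarer, more descriptive "mat".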
Implementation Tip: Sparse Matrices
When using sklearn.feature_extraction.text.TfidfVectorizer, the returned object is a SciPy Sparse Matrix, not a standard NumPy array. This saves massive amounts of RAM because it only stores the non-zero values. If you try to run .toarray() on a large corpus, your kernel will likely crash due to memory exhaustion!
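You can confirm this with a quick check (a tiny made-up corpus here; the memory danger only appears at scale):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["sparse matrices save memory", "only the non zero values are stored"]
X = TfidfVectorizer().fit_transform(docs)

print(issparse(X))  # True: a SciPy sparse matrix, not a NumPy array
print(X.nnz)        # only the non-zero entries are actually stored
# Avoid X.toarray() on a large corpus: it materializes every zero in RAM.
```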
❓ Frequently Asked Questions (NLP)
What is the main difference between Bag of Words and TF-IDF?
Bag of Words simply counts the number of times each word appears in a document. All words are treated equally.
TF-IDF not only looks at the frequency of the word in a specific document but also considers how often it appears across all documents. It penalizes highly frequent words, giving higher weights to words that are unique and descriptive of a specific text.
Why are BoW matrices considered "Sparse"?
In a large corpus, the total vocabulary might be 50,000 unique words. A single short document (like a tweet) might only contain 10 words. The resulting vector for that document will have 10 non-zero values and 49,990 zeros. A matrix full of mostly zeros is called a "sparse matrix".
Can Bag of Words understand context?
No. Because it entirely ignores the order of words, "This movie is not good, it is bad" and "This movie is not bad, it is good" will yield the exact same BoW vector, even though their meanings are opposites. For context awareness, we move to Word Embeddings (like Word2Vec) or Transformers (like BERT).
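This loss of word order is easy to demonstrate: vectorizing the two opposite-meaning sentences from above produces identical rows.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this movie is not good it is bad",
    "this movie is not bad it is good",
]
X = CountVectorizer().fit_transform(docs)

# Same multiset of words -> the exact same count vector
print((X.toarray()[0] == X.toarray()[1]).all())  # True
```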