Building a Content-Based Model
A Content-Based Recommender System analyzes the properties of items a user has liked in the past to recommend new items with similar properties. It's the "If you liked this, you'll love that" approach based purely on item metadata.
Vectorizing Text with TF-IDF
Machines cannot understand text directly; they require numerical representations. The most common technique for text features in content-based filtering is Term Frequency-Inverse Document Frequency (TF-IDF).
TF-IDF assigns a weight to each word in a document. The weight increases proportionally to the number of times the word appears in that document (term frequency, $tf$), but is offset by how many documents in the corpus contain the word (document frequency, $df$). This downweights words like "the" or "movie" that appear frequently everywhere and therefore carry little distinguishing information.
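As a minimal sketch, scikit-learn's TfidfVectorizer performs exactly this weighting. The item descriptions below are invented for illustration; in practice you would feed it a real metadata field such as a plot summary or a tag string:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item descriptions (illustrative only)
docs = [
    "space adventure with aliens",
    "space station thriller",
    "romantic comedy in paris",
]

# Each row becomes a TF-IDF vector over the corpus vocabulary;
# common English stop words are dropped before weighting
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)

print(tfidf_matrix.shape)  # (3, vocabulary_size)
```

The result is a sparse matrix with one row per item, which is the vector space we measure similarity in next.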
Calculating Cosine Similarity
Once we have mapped our items into a multi-dimensional vector space, we need a way to measure the distance or similarity between them. Cosine Similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space.
A value of 1 implies the items are identical in their features, while 0 means they share no attributes. Because TfidfVectorizer L2-normalizes its output vectors by default, we can simply compute the dot product (using linear_kernel in scikit-learn) to obtain the cosine similarities efficiently.
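A minimal sketch of that dot-product shortcut, reusing the same invented item descriptions (the variable names are assumptions, not a fixed API):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = [
    "space adventure with aliens",   # item 0
    "space station thriller",        # item 1
    "romantic comedy in paris",      # item 2
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Rows are L2-normalized, so the dot product equals the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Every item is perfectly similar to itself (diagonal of 1.0);
# items 0 and 1 share the term "space", so their score is positive,
# while items 0 and 2 share no terms and score 0.0
print(round(cosine_sim[0, 0], 2))  # 1.0
```

To recommend, you would sort a row of cosine_sim in descending order and return the top-scoring items, excluding the item itself.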
❓ Recommender FAQ
What is the Cold Start Problem, and does Content-Based Filtering fix it?
The User Cold Start problem occurs when a new user joins, and you have no history to base recommendations on. Content-based models handle this slightly better than Collaborative Filtering because they only need a user to interact with a single item to start finding similar items via metadata.
However, they still suffer from the Item Cold Start problem if new items are added without rich metadata (tags, descriptions) to compare against.
Why use linear_kernel over cosine_similarity in scikit-learn?
When using TfidfVectorizer, the resulting sparse matrix rows are already L2-normalized, so computing the dot product is mathematically equivalent to computing the cosine similarity. linear_kernel skips the redundant normalization step that cosine_similarity performs, making it faster on large matrices.
from sklearn.metrics.pairwise import linear_kernel

# linear_kernel is significantly faster than cosine_similarity here
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

What is the "Serendipity" issue?
Content-based recommenders create a "filter bubble." Because they only recommend items highly similar to what a user has already consumed, they struggle to recommend surprising or novel items (serendipity) from completely different genres that the user might actually enjoy.
