User-User CF

Leverage the wisdom of crowds. Learn to structure data, calculate user similarity, and predict preferences mathematically.

System: Welcome to Collaborative Filtering. 'Users who liked what you liked, also liked...' is the core philosophy here.

Data Structuring

Before algorithms run, interaction logs (clicks, purchases, ratings) must be aggregated into a User-Item Matrix.

User-User Collaborative Filtering: Finding Doppelgängers

Author

Pascual Vila

ML Instructor // Code Syllabus

"Tell me who your neighbors are, and I'll tell you what movie you'll watch next." The essence of memory-based collaborative filtering relies entirely on the wisdom of similar crowds.

The User-Item Matrix

Every recommender starts with data mapping. In Collaborative Filtering, we construct a 2D matrix where rows represent users ($u$) and columns represent items ($i$). Each cell holds the rating user $u$ gave item $i$ ($r_{u, i}$). Because users rate only a tiny fraction of all items, most cells in this matrix are empty. This is known as sparsity.
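As a rough illustration of this step, the sketch below aggregates an interaction log into a user-item structure and measures how empty it is. The log rows, the nested-dict layout (user -> item -> rating), and the helper names `build_user_item_matrix` and `sparsity` are assumptions made for this example, not part of the original lesson.

```python
from collections import defaultdict

# Hypothetical interaction log: (user, item, rating) rows.
logs = [
    ("alice", "matrix", 5.0),
    ("alice", "inception", 4.0),
    ("bob", "matrix", 4.0),
    ("bob", "titanic", 2.0),
    ("carol", "inception", 5.0),
]


def build_user_item_matrix(rows):
    """Aggregate (user, item, rating) rows into user -> {item: rating}."""
    matrix = defaultdict(dict)
    for user, item, rating in rows:
        # Assumption: if a user rates an item twice, the latest rating wins.
        matrix[user][item] = rating
    return dict(matrix)


def sparsity(matrix):
    """Fraction of user-item cells that hold no rating."""
    items = {item for ratings in matrix.values() for item in ratings}
    total_cells = len(matrix) * len(items)
    if total_cells == 0:
        return 0.0
    filled = sum(len(ratings) for ratings in matrix.values())
    return 1.0 - filled / total_cells


matrix = build_user_item_matrix(logs)
print(round(sparsity(matrix), 3))  # 0.444 (5 of 9 cells are filled)
```

Storing only the observed ratings keeps memory proportional to the number of interactions rather than to users times items, which matters once the matrix is very sparse.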

Measuring Similarity

To predict what User A will think of a movie, we first find Users B, C, and D who have similar tastes. We calculate the similarity between User A's vector and every other user's vector using Cosine Similarity or Pearson Correlation.
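A minimal sketch of Cosine Similarity between two users follows. It assumes each user is stored as a dict of item -> rating and that unrated items count as zero, which is one common convention but not the only one; the function name and sample ratings are hypothetical.

```python
import math


def cosine_similarity(ratings_u, ratings_v):
    """Cosine of the angle between two sparse rating vectors.

    Assumption: items a user has not rated contribute 0 to the vector.
    """
    shared = ratings_u.keys() & ratings_v.keys()
    dot = sum(ratings_u[i] * ratings_v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in ratings_u.values()))
    norm_v = math.sqrt(sum(r * r for r in ratings_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no ratings, so no evidence of similarity
    return dot / (norm_u * norm_v)


alice = {"matrix": 5.0, "inception": 4.0}
bob = {"matrix": 4.0, "titanic": 2.0}
print(round(cosine_similarity(alice, bob), 3))  # 0.698
```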

Pearson Correlation is often preferred because it accounts for rating scale biases (e.g., users who rate everything 5 stars vs users who average 3 stars) by subtracting the mean rating of each user:

$$ sim(u, v) = \frac{\sum (r_{u, i} - \bar{r}_u)(r_{v, i} - \bar{r}_v)}{\sqrt{\sum (r_{u, i} - \bar{r}_u)^2} \sqrt{\sum (r_{v, i} - \bar{r}_v)^2}} $$
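The Pearson formula can be sketched in plain Python as below. Two choices here are assumptions rather than something the text specifies: the sums run over items both users have rated, and each user's mean is taken over all of that user's ratings. The function names and sample users are hypothetical.

```python
import math


def mean_rating(ratings):
    """Average of all ratings a user has given."""
    return sum(ratings.values()) / len(ratings)


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation over co-rated items (assumed convention)."""
    shared = ratings_u.keys() & ratings_v.keys()
    if not shared:
        return 0.0
    mean_u = mean_rating(ratings_u)
    mean_v = mean_rating(ratings_v)
    numerator = sum(
        (ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in shared
    )
    spread_u = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in shared))
    spread_v = math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in shared))
    if spread_u == 0 or spread_v == 0:
        return 0.0  # a user with no variation carries no taste signal
    return numerator / (spread_u * spread_v)


# A generous rater and a strict rater with the same relative preferences.
generous = {"a": 5.0, "b": 4.0, "c": 5.0}
strict = {"a": 3.0, "b": 2.0, "c": 3.0}
print(round(pearson_similarity(generous, strict), 3))  # 1.0
```

The two sample users never give the same score, yet mean-centering reveals that they rank the items identically, which is the rating-scale bias the formula is meant to remove.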

Predicting the Rating

Once we have the similarities, we predict user $u$'s rating for item $i$ ($\hat{r}_{u, i}$) from $N$, the set of $u$'s most similar neighbors who have rated item $i$. We start from $u$'s mean rating and add the similarity-weighted average of each neighbor's deviation from their own mean on item $i$:

$$ \hat{r}_{u, i} = \bar{r}_u + \frac{\sum_{v \in N} sim(u,v) (r_{v, i} - \bar{r}_v)}{\sum_{v \in N} |sim(u,v)|} $$
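The prediction formula can be sketched as follows. The similarities are passed in precomputed, and the values used here are made up for illustration; the function name `predict_rating`, the parameter `k`, and the fallback to the user's own mean when no neighbor qualifies are assumptions of this sketch.

```python
def predict_rating(user, item, matrix, similarities, k=2):
    """Predict a rating as the user's mean plus a weighted neighbor offset.

    matrix:       user -> {item: rating}
    similarities: other user -> sim(user, other), computed beforehand
    k:            how many of the most similar neighbors to use (assumed)
    """
    mean_u = sum(matrix[user].values()) / len(matrix[user])

    # Only neighbors who actually rated the item can contribute.
    candidates = [
        (sim, v)
        for v, sim in similarities.items()
        if v != user and item in matrix[v]
    ]
    neighbors = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]

    weight_total = sum(abs(sim) for sim, _ in neighbors)
    if weight_total == 0:
        return mean_u  # assumed fallback: no usable neighbors

    offset = 0.0
    for sim, v in neighbors:
        mean_v = sum(matrix[v].values()) / len(matrix[v])
        offset += sim * (matrix[v][item] - mean_v)
    return mean_u + offset / weight_total


matrix = {
    "alice": {"matrix": 5.0, "inception": 4.0},
    "bob": {"matrix": 4.0, "inception": 3.0, "titanic": 2.0},
    "carol": {"matrix": 5.0, "inception": 5.0, "titanic": 2.0},
}
# Hypothetical similarities of each user to alice.
sims_to_alice = {"bob": 0.8, "carol": 0.2}
print(round(predict_rating("alice", "titanic", matrix, sims_to_alice), 2))  # 3.3
```

Both neighbors rated the item below their own averages, so the prediction lands below alice's mean of 4.5.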

❓ FAQ - Recommender Engines

What is the Cold Start Problem?

It occurs when a new user enters the system. Since they haven't rated anything, we cannot calculate their similarity to other users. Workarounds include recommending overall popular items or asking them to rate a few genres upon signup.
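The first workaround, recommending overall popular items, could be sketched like this. Ranking by number of ratings and then by mean rating is only one possible definition of "popular", and the function name, parameters, and sample data are hypothetical.

```python
from collections import defaultdict


def popular_items(matrix, top_n=3, min_ratings=1):
    """Rank items by how many users rated them, then by mean rating."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ratings in matrix.values():
        for item, rating in ratings.items():
            totals[item] += rating
            counts[item] += 1

    eligible = [item for item in counts if counts[item] >= min_ratings]
    ranked = sorted(
        eligible,
        key=lambda item: (counts[item], totals[item] / counts[item]),
        reverse=True,
    )
    return ranked[:top_n]


matrix = {
    "alice": {"matrix": 5.0, "inception": 4.0},
    "bob": {"matrix": 4.0, "titanic": 2.0},
    "carol": {"matrix": 5.0, "inception": 5.0},
}
# A brand-new user has no ratings, so fall back to popularity.
print(popular_items(matrix, top_n=2))  # ['matrix', 'inception']
```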

Why use Item-Item over User-User?

In massive e-commerce sites (like Amazon), there are far more users than items, and user tastes change rapidly. Item similarities are more static, making Item-Item CF faster to compute and often more stable.
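Structurally, Item-Item CF works on the same data viewed from the other side: each item becomes a vector of the ratings users gave it. A minimal sketch of that transposition, assuming a nested-dict layout (user -> item -> rating) and a hypothetical helper name `transpose`:

```python
from collections import defaultdict


def transpose(matrix):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    item_view = defaultdict(dict)
    for user, ratings in matrix.items():
        for item, rating in ratings.items():
            item_view[item][user] = rating
    return dict(item_view)


matrix = {
    "alice": {"matrix": 5.0, "inception": 4.0},
    "bob": {"matrix": 4.0, "titanic": 2.0},
}
item_vectors = transpose(matrix)
print(item_vectors["matrix"])  # {'alice': 5.0, 'bob': 4.0}
```

Any similarity measure that compares two user vectors can then be applied to two item vectors instead.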

RecSys Glossary

User-Item Matrix
A 2D array where rows are users, columns are items, and cells are interaction scores (ratings).
Cosine Similarity
Metric that measures the cosine of the angle between two multi-dimensional vectors.
Pearson Correlation
Mean-centered cosine similarity. Accounts for users being strict or generous raters.
Sparsity
The phenomenon where the vast majority of cells in a user-item matrix are empty (NaN).