User-User Collaborative Filtering: Finding Doppelgängers

Pascual Vila
ML Instructor // Code Syllabus
"Tell me who your neighbors are, and I'll tell you what movie you'll watch next." Memory-based collaborative filtering relies entirely on the wisdom of similar crowds.
The User-Item Matrix
Every recommender starts with a mapping of the data. In Collaborative Filtering, we construct a 2D matrix where rows represent users ($u$) and columns represent items ($i$); each cell holds the rating $r_{u,i}$. Because users rate only a tiny fraction of all items, this matrix is mostly empty. This is known as Sparsity.
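As a toy illustration of sparsity (the ratings below are invented for this sketch), we can mark unrated cells with NaN and count them:

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix; np.nan marks unrated cells.
R = np.array([
    [5.0, 3.0, np.nan, 1.0, np.nan],
    [4.0, np.nan, np.nan, 1.0, np.nan],
    [np.nan, 1.0, np.nan, 5.0, 4.0],
    [np.nan, np.nan, 5.0, 4.0, np.nan],
])

# Sparsity = fraction of cells with no rating.
sparsity = np.isnan(R).sum() / R.size
print(f"Sparsity: {sparsity:.0%}")  # half the cells are empty even in this tiny example
```

Real catalogs are far worse: with millions of items and a few dozen ratings per user, sparsity routinely exceeds 99%.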
Measuring Similarity
To predict what User A will think of a movie, we first find Users B, C, and D who have similar tastes. We calculate the similarity between User A's vector and every other user's vector using Cosine Similarity or Pearson Correlation.
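A minimal sketch of Cosine Similarity between two user vectors, restricted to the items both have rated (the ratings and the NaN-masking convention are assumptions of this sketch):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over the items both users have rated (NaN = unrated)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return 0.0  # no co-rated items: no evidence of similarity
    a, b = a[mask], b[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

u = np.array([5.0, 3.0, np.nan, 1.0])
v = np.array([4.0, np.nan, 2.0, 1.0])
print(round(cosine_sim(u, v), 3))
```

Only items 0 and 3 are co-rated here, so the similarity is computed on the vectors [5, 1] and [4, 1], which point in nearly the same direction.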
Pearson Correlation is often preferred because it corrects for rating-scale bias (e.g., a user who rates everything 5 stars vs. one who averages 3 stars) by subtracting each user's mean rating before comparing.
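One standard way to write this mean-centered similarity, over the set $I_{uv}$ of items rated by both users $u$ and $v$:

$$\text{sim}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}}$$

where $\bar{r}_u$ is user $u$'s mean rating. Note that Pearson on mean-centered vectors is exactly cosine similarity applied after centering.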
Predicting the Rating
Once we have the similarities, we predict user $u$'s rating for item $i$, written $\hat{r}_{u,i}$, as a weighted average of the ratings given to item $i$ by the $N$ most similar neighbors.
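A common mean-centered form of this weighted average, with $N$ denoting the neighbor set:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N} \text{sim}(u, v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N} \lvert \text{sim}(u, v) \rvert}$$

A minimal sketch of that formula in code (the neighbor tuples below are invented for illustration):

```python
def predict_rating(user_mean, neighbors):
    """Mean-centered weighted-average prediction.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples
    for neighbors who actually rated the target item.
    """
    num = sum(sim * (r - mean) for sim, r, mean in neighbors)
    den = sum(abs(sim) for sim, r, mean in neighbors)
    # No usable neighbors: fall back to the user's own average.
    return user_mean if den == 0 else user_mean + num / den

# Hypothetical: the target user averages 3.5; two neighbors rated the item.
print(predict_rating(3.5, [(0.9, 5.0, 4.0), (0.4, 3.0, 3.5)]))
```

Centering each neighbor's rating on their own mean means a "harsh" neighbor's 4/5 can push the prediction up just as much as a generous neighbor's 5/5.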
FAQ - Recommender Engines
What is the Cold Start Problem?
It occurs when a new user enters the system. Since they haven't rated anything, we cannot calculate their similarity to other users. Workarounds include recommending overall popular items or asking them to rate a few genres upon signup.
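A minimal sketch of the popularity fallback (function and data names are invented for illustration):

```python
def recommend(user_ratings, item_mean_ratings, k=3):
    """Fall back to globally popular items when the user has no ratings yet."""
    if not user_ratings:  # cold start: no similarities can be computed
        return sorted(item_mean_ratings, key=item_mean_ratings.get, reverse=True)[:k]
    ...  # otherwise run the user-user CF pipeline (omitted in this sketch)

popular = {"A": 4.6, "B": 4.2, "C": 3.1, "D": 4.8}
print(recommend({}, popular, k=2))
```

Once the new user has rated even a handful of items, the system can switch to the similarity-based path.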
Why use Item-Item over User-User?
In massive e-commerce sites (like Amazon), there are far more users than items, and user tastes change rapidly. Item similarities are more static, making Item-Item CF faster to compute and often more stable.