User-User Collaborative Filtering: Finding Doppelgängers

Pascual Vila
ML Instructor // Code Syllabus
"Tell me who your neighbors are, and I'll tell you what movie you'll watch next." Memory-based collaborative filtering relies entirely on the wisdom of similar crowds.
The User-Item Matrix
Every recommender starts with a mapping of the data. In Collaborative Filtering, we construct a 2D matrix where rows represent users ($u$) and columns represent items ($i$); each cell holds the rating $r_{u,i}$. Because users rate only a tiny fraction of all items, this matrix is mostly empty. This is known as Sparsity.
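As a toy illustration of sparsity (the ratings below are invented for this sketch), we can mark unrated cells with NaN and count them:

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix; np.nan marks unrated cells.
R = np.array([
    [5.0, 3.0, np.nan, 1.0, np.nan],
    [4.0, np.nan, np.nan, 1.0, np.nan],
    [np.nan, 1.0, np.nan, 5.0, 4.0],
    [np.nan, np.nan, 5.0, 4.0, np.nan],
])

# Sparsity = fraction of cells with no rating.
sparsity = np.isnan(R).sum() / R.size
print(f"Sparsity: {sparsity:.0%}")  # half the cells are empty even in this tiny example
```

Real catalogs are far worse: with millions of items and a few dozen ratings per user, sparsity routinely exceeds 99%.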
Measuring Similarity
To predict what User A will think of a movie, we first find Users B, C, and D who have similar tastes. We calculate the similarity between User A's vector and every other user's vector using Cosine Similarity or Pearson Correlation.
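A minimal sketch of Cosine Similarity between two user vectors, restricted to the items both have rated (the ratings and the NaN-masking convention are assumptions of this sketch):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over the items both users have rated (NaN = unrated)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return 0.0  # no co-rated items: no evidence of similarity
    a, b = a[mask], b[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

u = np.array([5.0, 3.0, np.nan, 1.0])
v = np.array([4.0, np.nan, 2.0, 1.0])
print(round(cosine_sim(u, v), 3))
```

Only items 0 and 3 are co-rated here, so the similarity is computed on the vectors [5, 1] and [4, 1], which point in nearly the same direction.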
Pearson Correlation is often preferred because it corrects for rating-scale bias (e.g., a user who rates everything 5 stars vs. one who averages 3 stars) by subtracting each user's mean rating before comparing.
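One standard way to write this mean-centered similarity, over the set $I_{uv}$ of items rated by both users $u$ and $v$:

$$\text{sim}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}}$$

where $\bar{r}_u$ is user $u$'s mean rating. Note that Pearson on mean-centered vectors is exactly cosine similarity applied after centering.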
Predicting the Rating
Once we have the similarities, we predict user $u$'s rating for item $i$, written $\hat{r}_{u,i}$, as a weighted average of the ratings given to item $i$ by the $N$ most similar neighbors.
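A common mean-centered form of this weighted average, with $N$ denoting the neighbor set:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N} \text{sim}(u, v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N} \lvert \text{sim}(u, v) \rvert}$$

A minimal sketch of that formula in code (the neighbor tuples below are invented for illustration):

```python
def predict_rating(user_mean, neighbors):
    """Mean-centered weighted-average prediction.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples
    for neighbors who actually rated the target item.
    """
    num = sum(sim * (r - mean) for sim, r, mean in neighbors)
    den = sum(abs(sim) for sim, r, mean in neighbors)
    # No usable neighbors: fall back to the user's own average.
    return user_mean if den == 0 else user_mean + num / den

# Hypothetical: the target user averages 3.5; two neighbors rated the item.
print(predict_rating(3.5, [(0.9, 5.0, 4.0), (0.4, 3.0, 3.5)]))
```

Centering each neighbor's rating on their own mean means a "harsh" neighbor's 4/5 can push the prediction up just as much as a generous neighbor's 5/5.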
FAQ - Recommender Engines
What is the Cold Start Problem?
It occurs when a new user enters the system. Since they haven't rated anything, we cannot calculate their similarity to other users. Workarounds include recommending overall popular items or asking them to rate a few genres upon signup.
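A minimal sketch of the popularity fallback (function and data names are invented for illustration):

```python
def recommend(user_ratings, item_mean_ratings, k=3):
    """Fall back to globally popular items when the user has no ratings yet."""
    if not user_ratings:  # cold start: no similarities can be computed
        return sorted(item_mean_ratings, key=item_mean_ratings.get, reverse=True)[:k]
    ...  # otherwise run the user-user CF pipeline (omitted in this sketch)

popular = {"A": 4.6, "B": 4.2, "C": 3.1, "D": 4.8}
print(recommend({}, popular, k=2))
```

Once the new user has rated even a handful of items, the system can switch to the similarity-based path.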
Why use Item-Item over User-User?
In massive e-commerce sites (like Amazon), there are far more users than items, and user tastes change rapidly. Item similarities are more static, making Item-Item CF faster to compute and often more stable.