Similarity is not a feeling; it is an angle. In a high-dimensional space of millions of items, Cosine Similarity is the lighthouse that finds the nearest shore.
1. The Angle of Preference
When we treat users or items as Vectors (lists of ratings), we can visualize them as points in space. Euclidean Distance measures the 'Straight-line' distance between two points. If one user rates everything 5/5 and another rates everything 3/5, they will be far apart in Euclidean space, even though their tastes have the same shape. Cosine Similarity instead measures the Angle between the vectors and ignores their length. If both users loved Item A twice as much as Item B, their vectors point in the same direction, resulting in a high similarity score. This often makes Cosine the better choice when people rate on different personal scales, which is common with subjective human ratings.
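To make the contrast concrete, here is a minimal sketch in plain Python. The two users and their ratings are invented for illustration, and the function names are not taken from any particular library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)

# Hypothetical ratings for Items A, B, C. Both users like A and C
# twice as much as B, but one user is more generous overall.
generous = [4.0, 2.0, 4.0]
strict = [2.0, 1.0, 2.0]

distance = euclidean_distance(generous, strict)   # 3.0: far apart
similarity = cosine_similarity(generous, strict)  # 1.0: same direction
```

The two vectors differ only in length, so the distance is large while the angle between them is zero.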
2. The Dot Product
The numerator of the Cosine formula is the Dot Product. It multiplies the ratings of corresponding entries and sums them up. When each vector holds one item's ratings across all users, the dot product is large if the two items are often rated highly by the same users. We then Normalize this by dividing by the product of the two vectors' magnitudes. This step ensures that a popular item with thousands of ratings doesn't automatically dominate the results simply because it has 'more numbers'. For non-negative ratings the result always falls between 0 and 1; once vectors can contain negative values, the range widens to -1 to 1.
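The effect of normalization can be sketched as follows. The item vectors are invented, and storing an unrated item as 0 is a simplifying assumption of this example, not something the formula requires.

```python
import math

def dot(a, b):
    """Multiply corresponding ratings and sum the products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    """Length of a rating vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Dot product divided by the product of the magnitudes."""
    denom = magnitude(a) * magnitude(b)
    return dot(a, b) / denom if denom else 0.0

# Each vector holds one item's ratings from four hypothetical users.
# 0 marks "not rated" (an assumption made for this sketch).
target = [5, 5, 0, 0]
niche = [4, 4, 0, 0]        # rated by the same two users
blockbuster = [5, 5, 5, 5]  # rated highly by everyone

raw_niche = dot(target, niche)              # 40
raw_blockbuster = dot(target, blockbuster)  # 50: wins on sheer volume

cos_niche = cosine(target, niche)              # 1.0
cos_blockbuster = cosine(target, blockbuster)  # about 0.71
```

On the raw dot product the blockbuster looks like the closer match only because it has more ratings. After dividing by the magnitudes, the niche item, which shares the target's audience exactly, comes out ahead.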
3. Removing the Bias
A common problem in RecSys is the 'Optimistic User' who averages around 4 stars, and the 'Pessimist' who averages around 2 stars. To the algorithm a 3 is just a 3, yet the Optimist's 3 is effectively a 'dislike', while the Pessimist's 3 might be a 'rave review'. We solve this with Mean Centering. We subtract the user's average rating from every individual rating that user gave. Now, a positive number means 'Above Average' and a negative number means 'Below Average'. Computing Cosine Similarity between items on these centered ratings gives 'Adjusted Cosine Similarity', a widely used refinement for item-based collaborative filtering.
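A minimal sketch of mean centering followed by an item-to-item cosine is shown below. The ratings dictionary, user names, and helper functions are all hypothetical; a production system would typically use sparse matrices rather than nested dictionaries.

```python
import math

# Hypothetical ratings: user -> {item: stars}
ratings = {
    "optimist": {"A": 5, "B": 3, "C": 4},   # average 4
    "pessimist": {"A": 3, "B": 1, "C": 2},  # average 2
}

def mean_center(ratings):
    """Subtract each user's own average from that user's ratings."""
    centered = {}
    for user, items in ratings.items():
        mean = sum(items.values()) / len(items)
        centered[user] = {item: r - mean for item, r in items.items()}
    return centered

def item_cosine(ratings, item_i, item_j):
    """Cosine between two items over the users who rated both."""
    pairs = [
        (items[item_i], items[item_j])
        for items in ratings.values()
        if item_i in items and item_j in items
    ]
    dot = sum(x * y for x, y in pairs)
    norm_i = math.sqrt(sum(x * x for x, _ in pairs))
    norm_j = math.sqrt(sum(y * y for _, y in pairs))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumption: no signal means no similarity
    return dot / (norm_i * norm_j)

centered = mean_center(ratings)
# The optimist's 3 becomes -1.0, the pessimist's 3 becomes +1.0.

raw = item_cosine(ratings, "A", "B")        # roughly 0.98
adjusted = item_cosine(centered, "A", "B")  # roughly -1.0
```

On the raw stars, Items A and B look almost identical because every rating is positive. After centering, both users agree that A is above their average and B is below it, so the adjusted score correctly reports that the two items are liked in opposite ways.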
