A computer can't 'read' a movie description, but it can compute with one once the text is turned into numbers. TF-IDF is the bridge between human language and machine-readable profiles.
1. Term Frequency (TF)
The first step in describing an item is counting. Term Frequency measures how many times a word appears in a specific document relative to the total number of words in that document. If the word 'Magic' appears 10 times in a Harry Potter summary, it's a strong signal. However, TF alone is misleading: common words like 'the' will usually have the highest TF, yet they tell us nothing about the genre or specific content of the item.
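To make the counting step concrete, here is a minimal sketch of Term Frequency in Python. The `term_frequency` function, the whitespace tokenization, and the sample summary are assumptions made for illustration; a real pipeline would normally also strip punctuation and handle other text cleanup.

```python
from collections import Counter

def term_frequency(text):
    """Return {word: count / total_words} for a single document."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical one-line summary, invented for this example.
summary = "the boy learns magic at the school and the magic grows"
tf = term_frequency(summary)

# 'the' outranks 'magic' even though 'magic' says far more about the item.
print(sorted(tf.items(), key=lambda pair: pair[1], reverse=True)[:2])
```

In this toy summary 'the' gets the highest score, which is exactly the weakness of raw TF described above.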
2. Inverse Document Frequency (IDF)
IDF is the 'Filter for Commonality'. It looks at the entire catalog (all documents) and counts how many documents contain each word. If a word appears in every single document (like 'Director' or 'Movie'), its IDF score sits at its minimum, which is exactly zero under the common log(N / df) formula. If a word appears in only a few documents (like 'Dinosaur' or 'Vampire'), its IDF score will be much higher. By multiplying **TF * IDF**, we get a score that is high only for words that are frequent in *one* document but rare in the rest, which captures the 'Essence' of that item.
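The multiplication can be sketched in a few lines of Python. This example assumes the unsmoothed formula IDF = log(N / df), where N is the number of documents and df is the number of documents containing the word; libraries often use smoothed variants that give different numbers. The function names and the three-item catalog are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Return {word: log(N / df)}, df = number of documents containing the word."""
    n_docs = len(documents)
    doc_freq = {}
    for text in documents:
        # set() so each word is counted at most once per document.
        for word in set(text.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def tf_idf(text, idf):
    """Return {word: TF * IDF} for one document, given catalog-wide IDF scores."""
    words = text.lower().split()
    total = len(words)
    return {
        word: (words.count(word) / total) * idf.get(word, 0.0)
        for word in set(words)
    }

# Hypothetical catalog, invented for this example.
catalog = [
    "a movie about a dinosaur park",
    "a movie about a vampire family",
    "a movie about a space ship",
]
idf = inverse_document_frequency(catalog)
scores = tf_idf(catalog[0], idf)
```

Here 'movie' and 'a' occur in every document, so their IDF is log(3 / 3) = 0 and their TF-IDF score vanishes, while 'dinosaur' occurs in one document only and keeps a positive score.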
3. The Feature Space
Combining these scores results in an Item Profile Vector: one TF-IDF weight per word in the vocabulary. Each item in your catalog becomes a point in a high-dimensional space, and the distance (or angle) between two points represents how 'Similar' the items are. For example, a movie with high weights for 'Space', 'Ship', and 'Star' will be mathematically closer to other sci-fi movies than to a romantic comedy. This numerical representation is the starting point for content-based filtering algorithms that compare items by their descriptions.
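As a rough illustration of the feature space, the sketch below builds one TF-IDF vector per item and compares items with cosine similarity. Cosine similarity is an assumption here: it is a common choice for TF-IDF vectors, but the text above does not prescribe a particular distance measure. The function names and the tiny catalog are hypothetical, and the IDF is the unsmoothed log(N / df) form.

```python
import math

def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [text.lower().split() for text in documents]
    # Fixed word order so every vector uses the same dimensions.
    vocabulary = sorted({word for words in tokenized for word in words})
    n_docs = len(documents)
    doc_freq = {w: sum(1 for words in tokenized if w in words) for w in vocabulary}
    vectors = [
        [(words.count(w) / len(words)) * math.log(n_docs / doc_freq[w])
         for w in vocabulary]
        for words in tokenized
    ]
    return vocabulary, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical keyword descriptions, invented for this example.
catalog = [
    "space ship star crew",      # sci-fi
    "space star ship battle",    # sci-fi
    "love wedding comedy date",  # romantic comedy
]
vocabulary, vectors = tf_idf_vectors(catalog)
scifi_pair = cosine_similarity(vectors[0], vectors[1])
mixed_pair = cosine_similarity(vectors[0], vectors[2])
```

The two sci-fi items share weighted terms, so `scifi_pair` is positive, while the sci-fi item and the romantic comedy share no words and `mixed_pair` comes out as zero.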
