A model is just a static file until it's deployed. In the world of recommendations, deployment means handling high-concurrency requests with sub-100ms latency while processing millions of user events.
1. The Retrieval-Ranking Pipeline
When a user opens an app, you can't score every one of your 10 million items with a heavy model in real time and still meet the latency budget. Instead, we use a two-stage pipeline. The first stage is Retrieval (or Candidate Generation), which uses a cheap, fast method to narrow the catalog to the ~100 items most likely to interest the user. The second stage is Ranking, where a more complex, 'heavy' model (such as a deep neural network) scores only those ~100 candidates to produce the final top-10 list shown to the user.
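The two stages can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a production design: the catalog, the `ITEM_POPULARITY` feature, and the `heavy_score` function are all hypothetical stand-ins, and the retrieval stage here is a linear scan over 1,000 items, which a real system with millions of items would replace with an index.

```python
import heapq
import math
import random

random.seed(0)
DIM = 8  # embedding size (assumed for the sketch)

def random_unit_vector(dim):
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical catalog: 1,000 items stand in for the millions in production.
ITEM_EMBEDDINGS = {item_id: random_unit_vector(DIM) for item_id in range(1000)}
# Hypothetical extra feature that only the ranking stage looks at.
ITEM_POPULARITY = {item_id: random.random() for item_id in ITEM_EMBEDDINGS}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, n_candidates=100):
    """Stage 1: cheap similarity score per item, keep the top n_candidates."""
    scored = ((dot(user_vec, vec), item_id)
              for item_id, vec in ITEM_EMBEDDINGS.items())
    return [item_id for _, item_id in heapq.nlargest(n_candidates, scored)]

def heavy_score(user_vec, item_id):
    """Stage 2 stand-in for an expensive model (e.g. a deep network)."""
    sim = dot(user_vec, ITEM_EMBEDDINGS[item_id])
    logit = 3.0 * sim + 1.0 * ITEM_POPULARITY[item_id]
    return 1.0 / (1.0 + math.exp(-logit))

def rank(user_vec, candidates, top_k=10):
    """Score only the retrieved candidates and keep the best top_k."""
    scored = sorted(((heavy_score(user_vec, i), i) for i in candidates),
                    reverse=True)
    return [item_id for _, item_id in scored[:top_k]]

def recommend(user_vec, n_candidates=100, top_k=10):
    return rank(user_vec, retrieve(user_vec, n_candidates), top_k)

if __name__ == "__main__":
    print(recommend(random_unit_vector(DIM)))
```

The point of the split is that `heavy_score` runs 100 times per request rather than once per catalog item, so its cost stays fixed as the catalog grows.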
2. Latency Optimization with ANN
To make the Retrieval stage fast enough, we represent items and users as Embeddings (vectors) in a shared space and use Approximate Nearest Neighbor (ANN) search to find the item vectors closest to the user vector. Algorithms like HNSW (Hierarchical Navigable Small World) can search millions of vectors in milliseconds by building a layered graph that links each vector to its near neighbors, then walking that graph toward the query instead of comparing against every vector. The approximation trades a small amount of recall (a few true nearest neighbors may be missed) for a massive gain in speed, which is a fundamental trade-off in production-grade recommender systems.
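To make the graph idea concrete, here is a minimal sketch of greedy graph search in plain Python. It is deliberately not HNSW itself: it uses a single layer with no hierarchy, builds the neighbor graph by brute force, and always starts from node 0. The names `build_graph` and `search` and the parameters `m` (links per node) and `ef` (size of the working result set) are chosen for this illustration. In practice you would rely on an existing ANN library rather than code like this.

```python
import heapq
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(vectors, m=12):
    """Offline step: link every vector to its m nearest neighbors.
    Built by brute force here; edges are made bidirectional."""
    neighbors = [set() for _ in vectors]
    for i, v in enumerate(vectors):
        dists = ((l2(v, w), j) for j, w in enumerate(vectors) if j != i)
        for _, j in heapq.nsmallest(m, dists):
            neighbors[i].add(j)
            neighbors[j].add(i)
    return [sorted(s) for s in neighbors]

def search(vectors, graph, query, k=10, ef=30, entry=0):
    """Best-first walk over the graph toward the query.
    Returns (k closest found as sorted (distance, id) pairs,
             number of distance evaluations)."""
    d0 = l2(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    best = [(-d0, entry)]        # max-heap (negated): ef closest seen so far
    evaluations = 1
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(best) >= ef and dist > -best[0][0]:
            break  # nothing left to explore can improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(query, vectors[nb])
            evaluations += 1
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(best, (-d, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst
    results = sorted((-neg_d, node) for neg_d, node in best)[:k]
    return results, evaluations

# Toy data: 500 random 8-dimensional vectors (sizes assumed for the sketch).
random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(500)]
graph = build_graph(vectors, m=12)
query = [random.gauss(0, 1) for _ in range(8)]

approx, evaluations = search(vectors, graph, query, k=10, ef=30)
exact = heapq.nsmallest(10, ((l2(query, v), i) for i, v in enumerate(vectors)))
recall = len({i for _, i in approx} & {i for _, i in exact}) / 10
print(f"distance evaluations: {evaluations} of {len(vectors)}, "
      f"recall@10: {recall:.2f}")
```

The printed line compares the number of distance computations against a full scan and reports how many of the true top 10 were found. Raising `ef` generally improves recall at the cost of more evaluations, which is the accuracy-for-speed dial described above.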
