Online Evaluation: A/B Testing

Dr. Data Syllabus
Lead ML Engineer // Code Syllabus
"A model that minimizes RMSE offline does not guarantee an increase in user engagement online. Real-world validation is non-negotiable."
The Gap Between Offline and Online
In previous modules, we evaluated models with metrics such as Precision, Recall, and NDCG computed on historical datasets. While essential for model selection, offline metrics suffer from intrinsic biases (like position bias and popularity bias) because users never actually "saw" the alternative recommendations. To validate business impact, we deploy models into production using A/B Testing.
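The offline metrics mentioned above can be sketched for a single user, assuming binary relevance labels; `precision_at_k` and `ndcg_at_k` are illustrative helper names, not functions from a specific library:

```python
from math import log2

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items the user actually found relevant."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k

def ndcg_at_k(recommended, relevant, k):
    """NDCG with binary relevance: discounted gain of hits vs. the ideal ranking."""
    dcg = sum(1 / log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1 / log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Model ranked "a" first and "c" third; historical logs show the user
# engaged with "a" and "c".
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=3))  # 2/3
print(ndcg_at_k(["a", "b", "c", "d"], {"a", "c"}, k=3))
```

Note that both scores are computed purely from logged interactions, which is exactly where the position and popularity biases discussed above creep in.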
Traffic Routing Strategies
The core of an A/B test is randomly, yet deterministically, splitting users into two cohorts: the Control Group (receiving the current V1 model) and the Treatment Group (receiving the new V2 model). Hashing user IDs ensures that if a user logs in via a mobile app in the morning and a web browser at night, they fall into the same testing cohort.
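The hash-based split can be sketched as follows; the experiment salt name and the 50/50 split are illustrative assumptions, not fixed conventions:

```python
import hashlib

def assign_cohort(user_id: str, experiment: str = "ranker_v2_test",
                  treatment_pct: int = 50) -> str:
    """Deterministically bucket a user: the same ID + salt always maps
    to the same cohort, regardless of device or session."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 100  # 0..99
    return "treatment" if bucket < treatment_pct else "control"

# The same user on mobile and on web receives the same assignment.
assert assign_cohort("user_42") == assign_cohort("user_42")
```

Salting with the experiment name also keeps concurrent experiments independent: the same user can land in treatment for one test and control for another.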
❓ Frequently Asked Questions
Why use A/B testing for recommender systems?
A/B testing is used to evaluate how a recommendation engine performs in a live environment with real users. While offline metrics evaluate predictive accuracy using past data, A/B testing measures actual business outcomes like Click-Through Rate (CTR), Conversion Rate, and Revenue Lift, which offline datasets cannot predict due to changing user behavior.
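Deciding whether an observed CTR lift is real or noise is typically framed as a significance test; as a minimal sketch, a two-proportion z-test with invented click/impression counts looks like this:

```python
from math import sqrt

def ctr_lift_z(clicks_c, views_c, clicks_t, views_t):
    """Absolute CTR lift (treatment - control) and its two-proportion z-score."""
    p_c, p_t = clicks_c / views_c, clicks_t / views_t
    p_pool = (clicks_c + clicks_t) / (views_c + views_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_c + 1 / views_t))
    return p_t - p_c, (p_t - p_c) / se

# Hypothetical counts: control at 5.0% CTR, treatment at 6.0% CTR,
# 10k impressions each.
lift, z = ctr_lift_z(500, 10_000, 600, 10_000)
print(f"lift={lift:.3f}, z={z:.2f}")  # z > 1.96 → significant at the 5% level
```

In practice you would fix the sample size before launching the experiment and avoid peeking at the z-score mid-run, since repeated early looks inflate the false-positive rate.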
What is the difference between offline and online evaluation in RecSys?
Offline Evaluation: Uses historical datasets to compute metrics like RMSE, Precision@K, and NDCG without interacting with live users. It is fast and risk-free but may not align with business goals.
Online Evaluation (A/B Testing): Deploys the model to a fraction of live users to measure real-world interactions and metrics like CTR and Engagement. It is slower and carries risk but provides definitive proof of a model's value.
How do you split traffic for A/B testing in RecSys?
Traffic is split by hashing a unique identifier (like user_id or session_id) alongside a "salt" (experiment name). The hash is converted to an integer, and a modulo operation determines the assignment (e.g., `hash_int % 100 < 50` for variant A). This ensures deterministic, consistent assignments across all devices for the same user.