
A/B Testing for Recommendations

Validate algorithms in the real world. Master traffic assignment, telemetry collection, and significance testing.


Tutor: Offline metrics like RMSE or NDCG don't always translate to business value. To know whether a model truly works, we must evaluate it online via A/B Testing.


Evaluation Pipeline


Traffic Splitting

A valid test requires consistent bucketing. We achieve this by hashing identifiers.

Hypothesis Validation

Why do we use a 'salt' alongside the user_id when hashing?


Experimentation Lab (Community)

Debate Hypothesis Strategies


Struggling with the Cold Start problem or noisy telemetry? Join the discussion on advanced testing.

Online Evaluation: A/B Testing

Author

Dr. Data Syllabus

Lead ML Engineer // Code Syllabus

"A model that minimizes RMSE offline does not guarantee an increase in user engagement online. Real-world validation is non-negotiable."

The Gap Between Offline and Online

In previous modules, we computed metrics such as Precision, Recall, and NDCG on historical datasets. While essential for model selection, offline metrics suffer from intrinsic biases (such as position bias and popularity bias) because users never actually "saw" the alternative recommendations. To validate business impact, we deploy models into production using A/B Testing.

Traffic Routing Strategies

The core of an A/B test is splitting users randomly, yet deterministically, into two cohorts: the Control Group (receiving the current V1 model) and the Treatment Group (receiving the new V2 model). Hashing user IDs ensures that a user who logs in via the mobile app in the morning and a web browser at night falls into the same testing cohort.
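
The sketch below shows one minimal way to implement salted bucketing in Python; the function name `assign_variant`, the `"experiment:user"` salt format, and the 50/50 split are illustrative assumptions, not a prescribed API:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a user into 'control' or 'treatment'.

    The experiment name acts as a salt, so the same user can land in
    different buckets across different experiments.
    """
    # MD5 is sufficient here: we need a uniform, stable mapping,
    # not cryptographic security.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash onto 0..99
    return "treatment" if bucket < treatment_pct else "control"

# The same user_id always yields the same assignment, on any device.
assert assign_variant("user_42", "ranker_v2_test") == assign_variant("user_42", "ranker_v2_test")
```

Because the assignment depends only on the (experiment, user_id) pair, no assignment table has to be stored or synchronized across services.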

Frequently Asked Questions

Why use A/B testing for recommender systems?

A/B testing is used to evaluate how a recommendation engine performs in a live environment with real users. While offline metrics evaluate predictive accuracy using past data, A/B testing measures actual business outcomes like Click-Through Rate (CTR), Conversion Rate, and Revenue Lift, which offline datasets cannot predict due to changing user behavior.

What is the difference between offline and online evaluation in RecSys?

Offline Evaluation: Uses historical datasets to compute metrics like RMSE, Precision@K, and NDCG without interacting with live users. It is fast and risk-free but may not align with business goals.

Online Evaluation (A/B Testing): Deploys the model to a fraction of live users to measure real-world interactions and metrics like CTR and Engagement. It is slower and carries risk but provides definitive proof of a model's value.

How do you split traffic for A/B testing in RecSys?

Traffic is split by hashing a unique identifier (like user_id or session_id) alongside a "salt" (experiment name). The hash is converted to an integer, and a modulo operation determines the assignment (e.g., `hash_int % 100 < 50` for variant A). This ensures deterministic, consistent assignments across all devices for the same user.
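
Assuming the `assign_variant` sketch from the Traffic Routing section above, a quick simulation verifies that the modulo bucketing produces the intended 50/50 split:

```python
from collections import Counter

# Simulate assignments for many synthetic users and check the split.
N = 100_000
counts = Counter(assign_variant(f"user_{i}", "ranker_v2_test") for i in range(N))
print({variant: round(n / N, 3) for variant, n in counts.items()})
# Expect values close to {'treatment': 0.5, 'control': 0.5};
# a large skew would point to a bucketing bug (see A/A Testing below).
```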

A/B Testing Terminology

Control Group
The baseline cohort of users receiving the current production model (Model A).
Treatment Group
The experimental cohort of users receiving the new model version (Model B).
CTR (Click-Through Rate)
Ratio of users who click on a recommended item to the total number of users who viewed the recommendation.
P-value
The probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
Statistical Significance
A determination that the observed difference between Control and Treatment is unlikely to be due to random chance (typically p < 0.05); see the worked sketch after this glossary.
A/A Testing
Testing two identical variants against each other to ensure the testing platform, traffic splitting, and telemetry are functioning correctly before launching an A/B test.
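
As a worked illustration of the P-value and Statistical Significance entries above, the following sketch runs a two-sided, two-proportion z-test on CTR; the click and impression counts are invented for the example:

```python
import math

def two_proportion_ztest(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """z-test for a difference in CTR between Control (A) and Treatment (B)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)  # pooled CTR under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical telemetry: Control shows a 5.0% CTR, Treatment 6.0%.
z, p = two_proportion_ztest(clicks_a=500, n_a=10_000, clicks_b=600, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # z ≈ 3.10, p ≈ 0.002: significant at p < 0.05
```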