
A/B Testing for Recommendations

Validate algorithms in the real world. Master traffic assignment, telemetry collection, and significance testing.


Tutor: Offline metrics like RMSE or NDCG don't always translate to business value. To know whether a model truly works, we must evaluate it online via A/B Testing.


Evaluation Pipeline


Traffic Splitting

A valid test requires consistent bucketing. We achieve this by hashing identifiers.

Hypothesis Validation

Why do we use a 'salt' alongside the user_id when hashing?


Experimentation Lab (Community)

Debate Hypothesis Strategies


Struggling with the Cold Start problem or noisy telemetry? Join the discussion on advanced testing.

Online Evaluation: A/B Testing

Author

Dr. Data Syllabus

Lead ML Engineer // Code Syllabus

"A model that minimizes RMSE offline does not guarantee an increase in user engagement online. Real-world validation is non-negotiable."

The Gap Between Offline and Online

In previous modules, we computed metrics such as Precision, Recall, and NDCG on historical datasets. While essential for model selection, offline metrics suffer from intrinsic biases (such as position bias and popularity bias) because users never actually "saw" the alternative recommendations. To validate business impact, we deploy models into production using A/B Testing.

Traffic Routing Strategies

The core of an A/B test is splitting users randomly, yet deterministically, into two cohorts: the Control Group (receiving the current V1 model) and the Treatment Group (receiving the new V2 model). Hashing user IDs ensures that a user who logs in via the mobile app in the morning and a web browser at night falls into the same testing cohort.
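
The sketch below shows one minimal way to implement salted bucketing in Python; the function name `assign_variant`, the `"experiment:user"` salt format, and the 50/50 split are illustrative assumptions, not a prescribed API:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a user into 'control' or 'treatment'.

    The experiment name acts as a salt, so the same user can land in
    different buckets across different experiments.
    """
    # MD5 is sufficient here: we need a uniform, stable mapping,
    # not cryptographic security.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash onto 0..99
    return "treatment" if bucket < treatment_pct else "control"

# The same user_id always yields the same assignment, on any device.
assert assign_variant("user_42", "ranker_v2_test") == assign_variant("user_42", "ranker_v2_test")
```

Because the assignment depends only on the (experiment, user_id) pair, no assignment table has to be stored or synchronized across services.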

Frequently Asked Questions

Why use A/B testing for recommender systems?

A/B testing is used to evaluate how a recommendation engine performs in a live environment with real users. While offline metrics evaluate predictive accuracy using past data, A/B testing measures actual business outcomes like Click-Through Rate (CTR), Conversion Rate, and Revenue Lift, which offline datasets cannot predict due to changing user behavior.

What is the difference between offline and online evaluation in RecSys?

Offline Evaluation: Uses historical datasets to compute metrics like RMSE, Precision@K, and NDCG without interacting with live users. It is fast and risk-free but may not align with business goals.

Online Evaluation (A/B Testing): Deploys the model to a fraction of live users to measure real-world interactions and metrics like CTR and Engagement. It is slower and carries risk but provides definitive proof of a model's value.

How do you split traffic for A/B testing in RecSys?

Traffic is split by hashing a unique identifier (like user_id or session_id) alongside a "salt" (experiment name). The hash is converted to an integer, and a modulo operation determines the assignment (e.g., `hash_int % 100 < 50` for variant A). This ensures deterministic, consistent assignments across all devices for the same user.
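
Assuming the `assign_variant` sketch from the Traffic Routing section above, a quick simulation verifies that the modulo bucketing produces the intended 50/50 split:

```python
from collections import Counter

# Simulate assignments for many synthetic users and check the split.
N = 100_000
counts = Counter(assign_variant(f"user_{i}", "ranker_v2_test") for i in range(N))
print({variant: round(n / N, 3) for variant, n in counts.items()})
# Expect values close to {'treatment': 0.5, 'control': 0.5};
# a large skew would point to a bucketing bug (see A/A Testing below).
```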

A/B Testing Terminology

Control Group
The baseline cohort of users receiving the current production model (Model A).
Treatment Group
The experimental cohort of users receiving the new model version (Model B).
CTR (Click-Through Rate)
Ratio of users who click on a recommended item to the total number of users who viewed the recommendation.
P-value
The probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
Statistical Significance
A determination that the observed difference between Control and Treatment is unlikely to be due to random chance (typically p < 0.05); see the worked sketch after this glossary.
A/A Testing
Testing two identical variants against each other to ensure the testing platform, traffic splitting, and telemetry are functioning correctly before launching an A/B test.
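
As a worked illustration of the P-value and Statistical Significance entries above, the following sketch runs a two-sided, two-proportion z-test on CTR; the click and impression counts are invented for the example:

```python
import math

def two_proportion_ztest(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """z-test for a difference in CTR between Control (A) and Treatment (B)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)  # pooled CTR under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical telemetry: Control shows a 5.0% CTR, Treatment 6.0%.
z, p = two_proportion_ztest(clicks_a=500, n_a=10_000, clicks_b=600, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # z ≈ 3.10, p ≈ 0.002: significant at p < 0.05
```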