A/B Testing for ML Models

Pascual Vila
MLOps Engineer // Code Syllabus
Deploying a machine learning model locally is easy. Proving that it creates actual business value in production without disrupting the user experience: that is engineering.
Champion vs Challenger
In the MLOps lifecycle, a deployed model currently serving users is known as the Champion. When data scientists train a new, potentially better model, it becomes the Challenger. Instead of a risky, overnight replacement, we use A/B testing to directly compare their performance on live data.
Traffic Routing Implementation
To conduct an A/B test, we use an API Gateway or a custom routing layer (e.g., in FastAPI) to split the incoming traffic. Usually, 80-90% of traffic continues to hit the Champion model to ensure stability, while the remaining 10-20% is routed to the Challenger.
It is vital to ensure consistent routing. If a user receives a recommendation from the Challenger on Monday, they should receive Challenger recommendations on Tuesday as well, both for a coherent user experience and to keep the experiment's measurements clean. This is often solved by hashing the User ID.
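A minimal sketch of hash-based routing, using only the standard library. The split percentage and the `route_model` function name are illustrative; in a real FastAPI service this function would be called inside the prediction endpoint before dispatching to the chosen model.

```python
import hashlib

CHALLENGER_SHARE = 0.1  # assumed 90/10 split; tune to your rollout plan

def route_model(user_id: str) -> str:
    """Deterministically assign a user to the Champion or the Challenger.

    Hashing the user ID (instead of sampling randomly per request)
    pins each user to the same model across sessions.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "challenger" if bucket < CHALLENGER_SHARE * 100 else "champion"
```

Because the hash is deterministic, `route_model("user-42")` returns the same assignment on every call, while across many distinct user IDs roughly 10% land on the Challenger.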
Defining the Metrics
ML A/B testing goes beyond simple model accuracy. You must evaluate:
- Latency/Throughput: Does the Challenger take twice as long to infer?
- Business Metrics: Does the new model actually increase conversion rates or user retention?
- Statistical Significance: You must wait until you have a large enough sample size to ensure the difference isn't due to random variance (p-value < 0.05).
Frequently Asked Questions
What is the difference between A/B Testing and Shadow Deployment?
In Shadow Deployment, live traffic is sent to both the Champion and the Challenger. However, the user ONLY sees the response from the Champion. The Challenger's output is just logged for analysis. In A/B Testing, the user actually receives the prediction from whichever model they were routed to.
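The shadow pattern described above can be sketched in a few lines. The two `*_predict` functions are stand-ins for real model calls; the key property is that the Challenger's output is only logged and can never affect, or break, the live response.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def champion_predict(features):    # stand-in for the live model
    return sum(features) > 1.0

def challenger_predict(features):  # stand-in for the shadow model
    return sum(features) > 0.8

def handle_request(features):
    """Shadow deployment: the user only ever sees the Champion's answer."""
    champion_out = champion_predict(features)
    try:
        challenger_out = challenger_predict(features)
        log.info("shadow: champion=%s challenger=%s agree=%s",
                 champion_out, challenger_out, champion_out == challenger_out)
    except Exception:
        # A failing Challenger must never break the live response path
        log.exception("challenger failed in shadow mode")
    return champion_out
```

In production the Challenger call would typically run asynchronously (or from mirrored traffic) so it adds no latency to the user-facing path.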
How long should an ML A/B test run?
An A/B test should run until statistical significance is achieved. This depends on your traffic volume and the expected effect size. Additionally, you should run it across full business cycles (e.g., at least a full week to account for weekend vs. weekday traffic differences).
What is a Multi-Armed Bandit compared to A/B testing?
While A/B testing keeps the traffic split static (e.g., 90/10) until the end of the test, a Multi-Armed Bandit (MAB) algorithm dynamically adjusts the traffic split in real-time. If the Challenger starts performing well, the MAB automatically routes more traffic to it immediately, minimizing lost opportunity costs.
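A minimal epsilon-greedy router illustrates the MAB idea (class and method names are illustrative; production systems often use Thompson sampling instead):

```python
import random

class EpsilonGreedyRouter:
    """Epsilon-greedy bandit over two model variants.

    With probability epsilon we explore (pick a random model);
    otherwise we exploit the model with the best observed mean
    reward (e.g., conversions per request).
    """
    def __init__(self, arms=("champion", "challenger"), epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.rewards = {a: 0.0 for a in arms}

    def choose(self) -> str:
        # Explore until every arm has at least one observation
        if random.random() < self.epsilon or any(c == 0 for c in self.counts.values()):
            return random.choice(list(self.counts))
        return max(self.counts, key=lambda a: self.rewards[a] / self.counts[a])

    def record(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        self.rewards[arm] += reward
```

As rewards accumulate, `choose()` shifts traffic toward the better-performing model automatically, which is exactly the "minimize lost opportunity" behavior that distinguishes a bandit from a fixed 90/10 split.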