
A/B Testing ML Models

Evaluate model performance in production safely. Master traffic splitting, telemetry analysis, and Champion/Challenger workflows.

You've trained a new ML model (the Challenger). But how do you know whether it performs better in the real world than your current model (the Champion)?




A/B Testing for ML Models

Author

Pascual Vila

MLOps Engineer // Code Syllabus

Deploying a machine learning model locally is easy. Proving that it creates real business value in production without disrupting the user experience: that is engineering.

Champion vs Challenger

In the MLOps lifecycle, a deployed model currently serving users is known as the Champion. When data scientists train a new, potentially better model, it becomes the Challenger. Instead of a risky, overnight replacement, we use A/B testing to directly compare their performance on live data.

Traffic Routing Implementation

To conduct an A/B test, we use an API Gateway or a custom routing layer (for example, a FastAPI endpoint) to split the incoming traffic. Typically, 80-90% of traffic continues to hit the Champion model to ensure stability, while the remaining 10-20% is routed to the Challenger.

It is vital to ensure consistent routing: if a user receives a recommendation from the Challenger on Monday, they should receive Challenger recommendations on Tuesday as well, otherwise their experience flickers between models and outcomes become hard to attribute. This is commonly solved by hashing the User ID.
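Below is a minimal FastAPI sketch of this hash-based routing. The endpoint path, the 90/10 split, and the two predict_* stubs are illustrative assumptions, not the article's actual router.py.

import hashlib

from fastapi import FastAPI

app = FastAPI()

CHALLENGER_TRAFFIC_SHARE = 0.10  # assumed split: 10% of users see the Challenger


def predict_champion(user_id: str) -> float:
    return 0.0  # stub: call your deployed Champion model here


def predict_challenger(user_id: str) -> float:
    return 0.0  # stub: call your deployed Challenger model here


def assign_variant(user_id: str) -> str:
    """Hash the User ID so the same user always lands on the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash onto 100 stable buckets
    return "challenger" if bucket < CHALLENGER_TRAFFIC_SHARE * 100 else "champion"


@app.get("/recommend/{user_id}")
def recommend(user_id: str):
    variant = assign_variant(user_id)
    prediction = (
        predict_challenger(user_id)
        if variant == "challenger"
        else predict_champion(user_id)
    )
    # Log the variant alongside the prediction so telemetry can attribute outcomes.
    return {"variant": variant, "prediction": prediction}

Because the bucket is derived from a hash rather than a random draw, the assignment is deterministic across requests and across stateless replicas of the router.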

Defining the Metrics

ML A/B testing goes beyond simple model accuracy. You must evaluate:

  • Latency/Throughput: Does the Challenger take twice as long to infer?
  • Business Metrics: Does the new model actually increase conversion rates or user retention?
  • Statistical Significance: You must wait until you have a large enough sample size to ensure the difference isn't due to random variance (p-value < 0.05); a worked check is sketched after this list.
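As a rough illustration of that significance check, the sketch below runs a two-proportion z-test with statsmodels on conversion counts. The numbers are made up for the example, not real results.

from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts: Champion ~4.79% conversion, Challenger ~5.12% conversion.
champion_conversions, champion_users = 4_310, 90_000
challenger_conversions, challenger_users = 512, 10_000

stat, p_value = proportions_ztest(
    count=[challenger_conversions, champion_conversions],
    nobs=[challenger_users, champion_users],
)

print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level.")
else:
    print("Not significant yet -- keep collecting data.")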

? Frequently Asked Questions

What is the difference between A/B Testing and Shadow Deployment?

In Shadow Deployment, live traffic is sent to both the Champion and the Challenger. However, the user ONLY sees the response from the Champion. The Challenger's output is just logged for analysis. In A/B Testing, the user actually receives the prediction from whichever model they were routed to.
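Here is a minimal sketch of the shadow pattern with placeholder model calls; in a real system the Challenger call would typically run asynchronously so it cannot add user-facing latency.

import logging

logger = logging.getLogger("shadow")


def predict_champion(features: dict) -> float:
    return 0.42  # stub: the model the user actually sees


def predict_challenger(features: dict) -> float:
    return 0.37  # stub: the model being evaluated silently


def handle_request(features: dict) -> float:
    champion_output = predict_champion(features)
    try:
        challenger_output = predict_challenger(features)
        # The Challenger's output is only logged for later comparison.
        logger.info(
            "shadow_compare champion=%s challenger=%s",
            champion_output,
            challenger_output,
        )
    except Exception:
        logger.exception("Challenger failed in shadow mode; user is unaffected.")
    return champion_output  # only the Champion's prediction is returned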

How long should an ML A/B test run?

An A/B test should run until statistical significance is achieved. This depends on your traffic volume and the expected effect size. Additionally, you should run it across full business cycles (e.g., at least a full week to account for weekend vs. weekday traffic differences).
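To get a feel for the required sample size, one common approach is a power calculation before the test starts. The sketch below uses statsmodels with an assumed 5% baseline conversion rate and a minimum detectable lift of half a percentage point; both figures are illustrative.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05   # assumed Champion conversion rate
expected_rate = 0.055  # smallest lift worth detecting (+0.5 percentage points)

effect_size = proportion_effectsize(expected_rate, baseline_rate)
users_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,  # significance level
    power=0.8,   # 80% chance of detecting the lift if it is real
)
print(f"~{users_per_arm:,.0f} users needed per arm")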

What is a Multi-Armed Bandit compared to A/B testing?

While A/B testing keeps the traffic split static (e.g., 90/10) until the end of the test, a Multi-Armed Bandit (MAB) algorithm dynamically adjusts the traffic split in real-time. If the Challenger starts performing well, the MAB automatically routes more traffic to it immediately, minimizing lost opportunity costs.
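For contrast, here is a toy two-arm Thompson Sampling bandit, one common way to implement a MAB. It is a generic sketch (binary conversion reward, Beta priors), not a specific production implementation.

import random


class ThompsonBandit:
    def __init__(self, arms=("champion", "challenger")):
        # Beta(1, 1) prior per arm: a success/failure counter pair for each model.
        self.successes = {arm: 1 for arm in arms}
        self.failures = {arm: 1 for arm in arms}

    def choose_arm(self) -> str:
        # Sample a plausible conversion rate for each arm; route to the best draw.
        samples = {
            arm: random.betavariate(self.successes[arm], self.failures[arm])
            for arm in self.successes
        }
        return max(samples, key=samples.get)

    def record(self, arm: str, converted: bool) -> None:
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


bandit = ThompsonBandit()
arm = bandit.choose_arm()           # pick which model serves this request
bandit.record(arm, converted=True)  # feed the observed outcome back in

As an arm accumulates conversions, its sampled rates trend higher and it automatically receives more traffic, which is exactly the dynamic reallocation described above.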

ML Deployment Glossary

Champion Model
The current baseline model that is running in production and serving the majority of user traffic.
Challenger Model
The newly trained model that is attempting to prove it performs better than the current Champion.
Shadow Mode
A deployment strategy where the Challenger receives live data and makes predictions, but its outputs are not returned to the user.
Canary Release
Deploying the Challenger to a very small subset of users (e.g., 1%) to monitor for critical system crashes before scaling up.
Telemetry
The automated collection of data regarding the model's performance (latency, memory usage, prediction outputs).
P-Value
A statistical metric used to determine if the difference in performance between the Champion and Challenger is significant, or just random noise.