Models are living things. They require constant observation to ensure they remain healthy, fast, and accurate in a changing world.
1. The Scrape Architecture
Unlike traditional push-based logging, Prometheus uses a Pull (Scrape) model. Your application exposes a /metrics endpoint, and Prometheus visits it on a configurable interval (often every few seconds) to record the current state of your system. This is highly efficient for high-scale microservices, as the application doesn't have to wait for a logging server to acknowledge every request; it simply updates an internal counter in memory.
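To make the pull model concrete, here is a minimal sketch that uses only the Python standard library. It is not how the official client is implemented; the names `record_prediction`, `render_metrics`, `MetricsHandler`, and `scrape_once` are hypothetical, and the metric name is borrowed from the snippet later in this lesson. The application only increments a number in memory, and the text a scraper would read is rendered on demand.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state: the app only ever increments this number.
_lock = threading.Lock()
_predictions_total = 0


def record_prediction() -> None:
    """Called by the application on every prediction; performs no network I/O."""
    global _predictions_total
    with _lock:
        _predictions_total += 1


def render_metrics() -> str:
    """Render the current state in the Prometheus text exposition format."""
    with _lock:
        value = _predictions_total
    return (
        "# HELP model_predictions_total Total predictions\n"
        "# TYPE model_predictions_total counter\n"
        f"model_predictions_total {value}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics at /metrics and nothing else."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep the example quiet


def scrape_once() -> str:
    """Start the endpoint on a free local port, fetch it once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `record_prediction()` a few times and then `scrape_once()` returns the text a scraper would store for that moment. In a real service you would normally let the `prometheus_client` library serve this endpoint (for example with its `start_http_server` helper) instead of writing the handler by hand.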
# Monitoring with Prometheus & Grafana
# Visualizing the Health of Your ML Services
2. The Four Golden Signals
When monitoring ML services, track the Four Golden Signals: 1) Latency (how long a prediction takes), 2) Traffic (the number of requests), 3) Errors (the rate of HTTP 4xx/5xx responses), and 4) Saturation (how close your CPU/GPU is to its limit). In MLOps, we also add a fifth signal: Model Distribution, which tracks whether the model's outputs are suddenly shifting in an unexpected direction.
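The fifth signal is the least standardized, so a small illustration may help. The sketch below is one possible way to quantify a shift, assuming a classifier with discrete labels: it compares the share of each predicted label in a baseline window against a recent window using total variation distance. The function names, the example labels, and the 0.2 threshold are all hypothetical choices, not a prescribed method.

```python
import collections

# Hypothetical threshold: flag drift when 20% of the probability mass has moved.
DRIFT_THRESHOLD = 0.2


def class_distribution(predictions: list[str]) -> dict[str, float]:
    """Return the share of each predicted label in the sample."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    counts = collections.Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(
    baseline: dict[str, float], recent: dict[str, float]
) -> float:
    """Half the sum of absolute differences between two label distributions.

    0.0 means the distributions are identical; 1.0 means they share no labels.
    """
    labels = set(baseline) | set(recent)
    return 0.5 * sum(
        abs(baseline.get(label, 0.0) - recent.get(label, 0.0)) for label in labels
    )


def has_drifted(baseline_preds: list[str], recent_preds: list[str]) -> bool:
    """True when the recent predictions differ from the baseline by more than the threshold."""
    distance = total_variation_distance(
        class_distribution(baseline_preds), class_distribution(recent_preds)
    )
    return distance > DRIFT_THRESHOLD
```

For example, a baseline of 80% "ok" and 20% "fraud" compared with a recent window split 50/50 gives a distance of about 0.3, which exceeds the assumed threshold. The resulting distance could itself be exported as a metric so that it appears on the same dashboard as the other four signals.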
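from prometheus_client import Counter, Histogram
# Counts every prediction served; a counter only ever increases.
PRED_COUNT = Counter("model_predictions_total", "Total predictions")
# Records how long each prediction takes, in seconds, grouped into buckets.
LATENCY = Histogram("model_latency_seconds", "Prediction time")
3. Proactive Alerting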
Monitoring is useless without alerting. You define alerting rules in Prometheus, and Alertmanager routes the resulting alerts as notifications to Slack, email, or PagerDuty. For example, if your average prediction latency stays above 200 ms for more than 5 minutes, an alert can fire. This allows your MLOps team to investigate and resolve issues (like memory leaks or model crashes) before they affect the end-user experience.
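Prometheus evaluates alerting rules itself, so you would not normally write this logic by hand. The plain-Python sketch below only mimics the "for more than 5 minutes" behaviour so the state transitions are easy to see. The class name `LatencyAlert` and the constants are hypothetical; the three state names match the alert states Prometheus reports.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY_THRESHOLD_SECONDS = 0.200  # 200 ms, from the example in the text
FOR_DURATION_SECONDS = 5 * 60      # the condition must hold this long before firing

# In Prometheus itself, the average latency over a window could be computed
# from the histogram's sum and count series, for example:
#   rate(model_latency_seconds_sum[5m]) / rate(model_latency_seconds_count[5m]) > 0.2


@dataclass
class LatencyAlert:
    """Tracks how long the latency threshold has been continuously exceeded."""

    pending_since: Optional[float] = None

    def evaluate(self, now: float, avg_latency_seconds: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation.

        `now` is a timestamp in seconds; evaluations are assumed to arrive
        in chronological order.
        """
        if avg_latency_seconds <= LATENCY_THRESHOLD_SECONDS:
            # The condition cleared, so the timer resets.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= FOR_DURATION_SECONDS:
            return "firing"
        return "pending"
```

A single slow evaluation only moves the alert to "pending". It becomes "firing" once the threshold has been exceeded continuously for the full duration, which is what prevents a brief latency spike from paging the team.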
Dashboard: [ML Production Health]
Panel 1: Latency (ms) - [Green]
Panel 2: Request Rate - [Steady]
Panel 3: Error Rate - [0%]