
TF Serving

Take your models from the notebook to the cloud. Deploy robust, high-performance APIs using Google's official serving framework.




TF Serving Architecture

TF Serving lets you expose models without writing a custom Flask/FastAPI wrapper. It natively loads SavedModels and exposes REST and gRPC endpoints out of the box.



Production ML: Intro To TF Serving

Author

Pascual Vila

MLOps Engineer // Code Syllabus

Training a model is only half the battle. To generate business value, models must be reliably served to consumers. TensorFlow Serving provides a flexible, high-performance architecture for deploying ML models into production.

The Core Concept: Servables

In TensorFlow Serving architecture, the fundamental unit is the Servable. A servable is simply the underlying object that clients use to perform computation (like a trained ML model).

TF Serving is designed to handle multiple servables and multiple versions of a servable simultaneously. This enables robust MLOps practices such as A/B testing, canary releases, and seamless zero-downtime rollbacks.
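Running multiple versions side by side (e.g., for a canary release) can be configured with a model config file passed to the server via --model_config_file. A minimal sketch, assuming a model named my_model mounted at the default container path:

```
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    # Serve versions 1 and 2 simultaneously instead of only the latest.
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
}
```

Clients can then pin a request to a specific version, while traffic-splitting between the two is handled at the routing layer.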

Exporting to SavedModel

TF Serving expects models to be strictly formatted as a SavedModel. This isn't just a weights file (like an H5); it's a complete directory containing:

  • saved_model.pb - The serialized computation graph.
  • variables/ - The trained weights.

Crucially, the SavedModel directory itself must sit inside an integer-named folder (e.g., /1/, /2/) that signifies the model version.
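As a minimal sketch of the export step (using a tiny tf.Module standing in for a real trained model, and the placeholder path /tmp/my_model), the versioned layout looks like:

```python
import tensorflow as tf

# A toy module purely for illustration; a trained Keras model
# would be exported the same way.
class Doubler(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
    def __call__(self, x):
        return 2.0 * x

# TF Serving discovers versions by the integer folder name,
# so we export directly into .../my_model/1/.
export_path = "/tmp/my_model/1"
tf.saved_model.save(Doubler(), export_path)
```

After this runs, /tmp/my_model/1/ contains saved_model.pb and variables/, ready to be bind-mounted into the serving container.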

gRPC vs REST API Endpoints

When the Docker container spins up, TF Serving binds two ports by default: 8500 (gRPC) and 8501 (REST).

While REST is easier to test with standard curl commands using JSON, gRPC is heavily preferred in production microservices for its low-latency Protocol Buffer serialization, resulting in much faster inference times for large tensors.
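As a sketch of a REST client (assuming a container serving a model named my_model on localhost:8501; stdlib only, no external dependencies):

```python
import json
import urllib.request

# TF Serving's REST predict endpoint follows the pattern /v1/models/<name>:predict.
URL = "http://localhost:8501/v1/models/my_model:predict"

def build_request(instances):
    """Build the POST request carrying a TF Serving 'instances' payload."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )

def predict(instances):
    """Send the request and return the server's 'predictions' array."""
    with urllib.request.urlopen(build_request(instances)) as resp:
        return json.load(resp)["predictions"]
```

Against a running container, predict([[0.0, 1.0, 50.0]]) returns the model's output for that single input row.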

Frequently Asked Questions

What is TensorFlow Serving and how does it work?

TensorFlow Serving is an open-source serving system created by Google for deploying Machine Learning models to production. It works by using a Manager to monitor local file paths or cloud storage for new model versions. When a new valid SavedModel is detected, a Loader loads the graph into memory, and it is exposed as a Servable via both REST and gRPC API endpoints without dropping active client connections.

How to deploy a model with TensorFlow Serving Docker?

To deploy via Docker, you map the host port (e.g., 8501) and bind-mount the directory containing your integer-versioned SavedModel folder to the container's `/models/` directory.

docker run -p 8501:8501 \
--mount type=bind,source=/path/to/my_model,target=/models/my_model \
-e MODEL_NAME=my_model -t tensorflow/serving

How do I structure the JSON payload for a TF Serving predict request?

The REST API strictly expects a JSON payload containing an instances key, mapping to an array of input data (or an array of arrays for batched multidimensional inputs).

{
  "instances": [
    [0.0, 1.0, 50.0],
    [1.0, 0.5, 20.0]
  ]
}

MLOps Lexicon

Servable
The core abstraction representing an ML model (or a lookup table) being served to clients. Servables can be of any type, but are typically TensorFlow SavedModels.

SavedModel
The universal serialization format for TensorFlow models, containing variables, assets, and graphs.
/my_model/
  └── 1/
      ├── saved_model.pb
      └── variables/

Model Manager
The TF Serving component responsible for loading, serving, and unloading servables to manage memory. It automatically handles transitions between model versions without downtime.

gRPC
A high-performance Remote Procedure Call framework. TF Serving exposes it on port 8500 by default.
import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

Batching
A performance feature that combines multiple individual inference requests into a single tensor operation. Enabled by passing --enable_batching=true as an argument to the Docker entrypoint.