
TF Serving

Take your models from the notebook to the cloud. Deploy robust, high-performance APIs using Google's official serving framework.




TF Serving Architecture

TF Serving lets you expose models without writing a custom Flask/FastAPI wrapper. It natively loads SavedModels and exposes REST and gRPC endpoints out of the box.



Production ML: Intro To TF Serving

Author

Pascual Vila

MLOps Engineer // Code Syllabus

Training a model is only half the battle. To generate business value, models must be reliably served to consumers. TensorFlow Serving provides a flexible, high-performance architecture for deploying ML models into production.

The Core Concept: Servables

In TensorFlow Serving architecture, the fundamental unit is the Servable. A servable is simply the underlying object that clients use to perform computation (like a trained ML model).

TF Serving is designed to handle multiple servables and multiple versions of a servable simultaneously. This enables robust MLOps practices such as A/B testing, canary releases, and seamless zero-downtime rollbacks.
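Running multiple versions side by side (e.g., for a canary release) can be configured with a model config file passed to the server via --model_config_file. A minimal sketch, assuming a model named my_model mounted at the default container path:

```
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    # Serve versions 1 and 2 simultaneously instead of only the latest.
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
}
```

Clients can then pin a request to a specific version, while traffic-splitting between the two is handled at the routing layer.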

Exporting to SavedModel

TF Serving expects models to be strictly formatted as a SavedModel. This isn't just a weights file (like an H5); it's a complete directory containing:

  • saved_model.pb - The serialized computation graph.
  • variables/ - The trained weights.

Crucially, the SavedModel directory itself must sit inside an integer-named folder (e.g., /1/, /2/) that signifies the model version.
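As a minimal sketch of the export step (using a tiny tf.Module standing in for a real trained model, and the placeholder path /tmp/my_model), the versioned layout looks like:

```python
import tensorflow as tf

# A toy module purely for illustration; a trained Keras model
# would be exported the same way.
class Doubler(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
    def __call__(self, x):
        return 2.0 * x

# TF Serving discovers versions by the integer folder name,
# so we export directly into .../my_model/1/.
export_path = "/tmp/my_model/1"
tf.saved_model.save(Doubler(), export_path)
```

After this runs, /tmp/my_model/1/ contains saved_model.pb and variables/, ready to be bind-mounted into the serving container.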

gRPC vs REST API Endpoints

When the Docker container spins up, TF Serving binds two ports by default: 8500 (gRPC) and 8501 (REST).

While REST is easier to test with standard curl commands using JSON, gRPC is heavily preferred in production microservices for its low-latency Protocol Buffer serialization, resulting in much faster inference times for large tensors.
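As a sketch of a REST client (assuming a container serving a model named my_model on localhost:8501; stdlib only, no external dependencies):

```python
import json
import urllib.request

# TF Serving's REST predict endpoint follows the pattern /v1/models/<name>:predict.
URL = "http://localhost:8501/v1/models/my_model:predict"

def build_request(instances):
    """Build the POST request carrying a TF Serving 'instances' payload."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )

def predict(instances):
    """Send the request and return the server's 'predictions' array."""
    with urllib.request.urlopen(build_request(instances)) as resp:
        return json.load(resp)["predictions"]
```

Against a running container, predict([[0.0, 1.0, 50.0]]) returns the model's output for that single input row.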

Frequently Asked Questions

What is TensorFlow Serving and how does it work?

TensorFlow Serving is an open-source serving system created by Google for deploying Machine Learning models to production. It works by using a Manager to monitor local file paths or cloud storage for new model versions. When a new valid SavedModel is detected, a Loader loads the graph into memory, and it is exposed as a Servable via both REST and gRPC API endpoints without dropping active client connections.

How to deploy a model with TensorFlow Serving Docker?

To deploy via Docker, you map the host port (e.g., 8501) and bind-mount the directory containing your integer-versioned SavedModel folder to the container's `/models/` directory.

docker run -p 8501:8501 \
--mount type=bind,source=/path/to/my_model,target=/models/my_model \
-e MODEL_NAME=my_model -t tensorflow/serving

How do I structure the JSON payload for a TF Serving predict request?

The REST API strictly expects a JSON payload containing an instances key, mapping to an array of input data (or an array of arrays for batched multidimensional inputs).

{
  "instances": [
    [0.0, 1.0, 50.0],
    [1.0, 0.5, 20.0]
  ]
}

MLOps Lexicon

Servable
The core abstraction representing an ML model (or a lookup table) being served to clients. Servables can be of any type, but are typically TensorFlow SavedModels.

SavedModel
The universal serialization format for TensorFlow models, containing variables, assets, and graphs.
/my_model/
  └── 1/
      ├── saved_model.pb
      └── variables/

Model Manager
The TF Serving component responsible for loading, serving, and unloading servables to manage memory. It automatically handles transitions between model versions without downtime.

gRPC
A high-performance Remote Procedure Call framework. TF Serving exposes it on port 8500 by default.
import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

Batching
A performance feature that combines multiple individual inference requests into a single tensor operation. Enabled by passing --enable_batching=true as an argument to the Docker entrypoint.