Tensorflow has grown to be the de facto ML platform, popular within both industry and research. The demand and support for Tensorflow has contributed to a host of OSS libraries, tools and frameworks around training and serving ML models. Tensorflow Serving is a project built to focus on the inference aspect of serving ML models in a distributed, production environment.
Mux uses Tensorflow Serving in several parts of its infrastructure, and we’ve previously discussed using Tensorflow Serving to power our per-title-encoding feature. Today, we’ll focus on techniques that improve latency by optimizing both the prediction server and client. Model predictions are usually “online” operations (on the critical application request path), so our primary optimization objective is to handle high volumes of requests with as little latency as possible.
First, let’s do a quick overview of Tensorflow Serving.
Tensorflow Serving provides a flexible server architecture designed to deploy and serve ML models. Once a model is trained and ready to be used for prediction, Tensorflow Serving requires the model to be exported to a Servable-compatible format.
A Servable is the central abstraction that wraps Tensorflow objects. For example, a model could be represented as one or more Servables. Thus, Servables are the underlying objects that clients use to perform computation such as inference. The size of a Servable matters, as smaller models use less memory and less storage, and load faster. Servables expect models to be in SavedModel format for loading and serving with the Predict API.
Tensorflow Serving puts together the core serving components to build a gRPC/HTTP server that can serve multiple ML models (or multiple versions), and that provides monitoring components and a configurable architecture.
Let’s get a baseline prediction latency metric with the standard Tensorflow Serving (no CPU optimizations).
First, pull the latest serving image from the Tensorflow Docker Hub repository:
For the purpose of this post, all containers are run on a 4 core, 15GB, Ubuntu 16.04 host machine.
When a model is trained using Tensorflow, the output can be saved as variable checkpoints (files on disk). Inference can be run directly by restoring model checkpoints or on its converted frozen graph (binary).
In order to serve these models with Tensorflow Serving, the frozen graph has to be exported into SavedModel format. The Tensorflow documentation has examples of exporting trained models in SavedModel format.
Tensorflow also provides a host of official and research models as a starting point for experiments, research or production use.
As an example, we will use the deep residual network (ResNet) model that can be used to classify ImageNet’s dataset of 1000 classes. Download the pre-trained ResNet-50 v2 model, specifically the channels_last (NHWC) convolution SavedModel, which is generally better for CPUs.
Copy the ResNet model directory into the following structure:
Tensorflow Serving expects models to be in a numerically ordered directory structure to manage model versioning. In this case, the directory 1/ corresponds to model version 1, which contains the model architecture saved_model.pb along with a snapshot of the model weights (variables).
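To make the versioned layout concrete, here is a minimal sketch that builds a throwaway directory tree and picks the highest numeric version, on the assumption that the default behaviour is to serve the highest-numbered version directory. The helper name latest_model_version and the temporary paths are hypothetical and not part of Tensorflow Serving.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_model_version(model_base_path: Path) -> Optional[int]:
    """Return the highest numeric version directory under model_base_path."""
    versions = [
        int(child.name)
        for child in model_base_path.iterdir()
        if child.is_dir() and child.name.isdigit()
    ]
    return max(versions) if versions else None


# Build a throwaway layout that mirrors <model_name>/<version>/saved_model.pb.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "resnet"
    for version in ("1", "2"):
        (base / version / "variables").mkdir(parents=True)
        (base / version / "saved_model.pb").touch()
    (base / "notes").mkdir()  # non-numeric directories are ignored by this helper
    found_version = latest_model_version(base)
    print(found_version)
```

Adding a new numbered directory next to 1/ is then all it takes to stage a new model version under the same base path.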
The following command spins up a Tensorflow Serving model server in a Docker container. In order to load the SavedModel, the model’s host directory needs to be mounted into the expected container directory.
Inspecting the container logs shows that the ModelServer is running and ready to serve inference requests for the resnet model on gRPC and HTTP endpoints:
Tensorflow Serving defines the API service schema as protocol buffers (protobufs). The gRPC client implementation for the prediction API is packaged as the tensorflow_serving.apis Python package. We will also need the tensorflow Python package for utility functions.
Let’s install the dependencies to create a simple client:
The ResNet-50 v2 model expects floating point Tensor inputs in a channels_last (NHWC) formatted data structure. Hence, the input image is read using opencv-python, which loads it into a numpy array (height x width x channels) that is passed along as float32 data. The script below creates the prediction client stub, loads the JPEG image data into a numpy array, and converts it to a Tensor proto to make the gRPC prediction request:
The output of running the client with an input JPEG image is shown below:
The output Tensor has the prediction result as an integer value and probabilities of features.
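Latency figures like the ones discussed in this post can be collected with a small timing wrapper around the prediction call. The sketch below is only an illustration: fake_predict is a hypothetical stand-in that sleeps instead of issuing a gRPC request, and measure_latency is a name made up for this example.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(call: Callable[[], object], runs: int = 10) -> List[float]:
    """Time `call` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


def fake_predict() -> None:
    """Hypothetical stand-in for the gRPC Predict call; it only sleeps."""
    time.sleep(0.01)


latencies = measure_latency(fake_predict, runs=5)
print(f"mean={statistics.mean(latencies):.4f}s max={max(latencies):.4f}s")
```

In a real client, the callable would wrap the stub's prediction request so that both serialization and the network round trip are included in the measurement.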
For a single request, this kind of prediction latency is unacceptable. However, this is not totally unexpected; the default Tensorflow Serving binary targets the broadest range of hardware to cover most use cases. You may have noticed from the standard Tensorflow Serving container logs:
This indicates that the Tensorflow Serving binary is running on a CPU platform that it was not optimized for.
According to Tensorflow documentation, it is recommended to compile Tensorflow from source with all the optimizations available for the CPU of the host platform the binary will run on. The Tensorflow build options expose flags to enable building for platform-specific CPU instruction sets:
| Instruction Set | Flags |
| --- | --- |
| AVX | --copt=-mavx |
| AVX2 | --copt=-mavx2 |
| FMA | --copt=-mfma |
| SSE 4.1 | --copt=-msse4.1 |
| SSE 4.2 | --copt=-msse4.2 |
| All supported by processor | --copt=-march=native |
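As a rough way to decide which of these flags apply to a given host, the sketch below maps the CPU flag names that Linux reports in /proc/cpuinfo to the build options listed above. It assumes a Linux x86 host; the helper names copts_for and read_cpu_flags are hypothetical, and on other platforms the detected list simply comes back empty.

```python
from typing import Iterable, List

# Maps the flag names Linux reports in /proc/cpuinfo to the build options above.
CPUINFO_TO_COPT = {
    "avx": "--copt=-mavx",
    "avx2": "--copt=-mavx2",
    "fma": "--copt=-mfma",
    "sse4_1": "--copt=-msse4.1",
    "sse4_2": "--copt=-msse4.2",
}


def copts_for(cpu_flags: Iterable[str]) -> List[str]:
    """Return the build options whose instruction set appears in cpu_flags."""
    present = set(cpu_flags)
    return [copt for name, copt in CPUINFO_TO_COPT.items() if name in present]


def read_cpu_flags(path: str = "/proc/cpuinfo") -> List[str]:
    """Return the flags of the first CPU listed, or [] if they cannot be read."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass
    return []


example_copts = copts_for(["fpu", "sse4_1", "sse4_2", "avx"])
print(example_copts)
print(copts_for(read_cpu_flags()))
```

Note that the flags should match the machine that will run the binary, which is not necessarily the machine that builds it.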
Clone Tensorflow Serving pinned to a specific version. In this case, we’ll be using 1.13 (the latest as of publishing this post):
The Tensorflow Serving development image uses Bazel as the build tool. Build targets for processor-specific CPU instruction sets can be specified as follows:
If memory is a constraint, limit the consumption of the memory-intensive build process with the --local_resources=2048,.5,1.0 flag. See the Tensorflow Serving with Docker and Bazel docs as resources on such build flags.
Build the serving image with the development image as base:
ModelServer can be configured with Tensorflow-specific flags to enable Session parallelism. The following options configure two thread pools to parallelize executions:
Both options default to 0. This means the system picks an appropriate number, which most often entails one thread per available CPU core. However, this can be manually controlled for multi-core CPU parallelism.
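As an illustration of controlling these values manually, the sketch below assembles ModelServer arguments from the host's core count. The flag names --tensorflow_intra_op_parallelism and --tensorflow_inter_op_parallelism are an assumption about the tensorflow_model_server binary and should be checked against its --help output for the version in use; parallelism_args is a hypothetical helper.

```python
import os
from typing import List


def parallelism_args(intra_op: int = 0, inter_op: int = 0) -> List[str]:
    """Build ModelServer arguments; 0 leaves the choice to Tensorflow."""
    if intra_op < 0 or inter_op < 0:
        raise ValueError("thread counts must be zero or positive")
    return [
        # Assumed flag names; verify them against the server's --help output.
        f"--tensorflow_intra_op_parallelism={intra_op}",
        f"--tensorflow_inter_op_parallelism={inter_op}",
    ]


cores = os.cpu_count() or 1
pinned_args = parallelism_args(intra_op=cores, inter_op=cores)
print(" ".join(pinned_args))
```

Pinning both pools to the core count is only a starting point; the best values depend on the model graph and on what else shares the host.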
Next, start the serving container similarly to before, this time with the docker image built from source and with Tensorflow specific CPU optimization flags:
The container logs should not show CPU guard warnings anymore. Without changing any code, running the same prediction request drops the prediction latency by ~35.8%:
Can we do better? The server side has been optimized for its CPU platform, but a prediction latency of over 1s still seems too high.
It just so happens that there is a large latency cost to loading the tensorflow libraries. Each call to tf.contrib.util.make_tensor_proto also adds unnecessary latency overhead.
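The import cost is easy to check for yourself. The following sketch times a first import of a module; json is used only as a lightweight stand-in so the example runs anywhere, and time_import is a hypothetical helper. Swapping in tensorflow on a machine where it is installed should make the difference obvious.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing module_name."""
    # Best effort: forget the cached top-level module so it is executed again.
    sys.modules.pop(module_name, None)
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# "json" is only a stand-in; replace it with "tensorflow" where it is installed.
elapsed = time_import("json")
print(f"import json took {elapsed * 1000:.2f} ms")
```

For a short-lived client process, this one-off import time counts directly towards the latency of every prediction.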
“Hold up”, you might be thinking. “Don’t I need the Tensorflow Python packages to actually make prediction requests to Tensorflow Server?”
The simple answer is no, we don’t actually need the tensorflow or tensorflow_serving packages to make prediction requests.
As noted previously, Tensorflow prediction APIs are defined as protobufs. Hence, the two external dependencies can be replaced by generating the necessary tensorflow_serving protobuf Python stubs. This avoids the need to pull in the entire (heavy) Tensorflow library on the client itself.
To start with, get rid of the tensorflow and tensorflow_serving dependencies. Then clone the tensorflow/tensorflow and tensorflow/serving repositories and copy the following protobuf files into the client project:
Copying the above protobuf files into a protos/ directory and preserving the original paths:
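Preserving the original paths matters because protobuf import statements refer to other files by their path relative to the proto root. A minimal sketch of such a copy step is shown here; copy_preserving_paths and the example file name are hypothetical and only exercise the helper, they are not the actual list of files to copy.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def copy_preserving_paths(
    src_root: Path, dst_root: Path, relative_paths: Iterable[str]
) -> List[Path]:
    """Copy each file to dst_root, keeping its path relative to src_root."""
    copied = []
    for relative in relative_paths:
        target = dst_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_root / relative, target)
        copied.append(target)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    # Hypothetical file name, used only to exercise the helper.
    example = "pkg/apis/example.proto"
    (repo / example).parent.mkdir(parents=True)
    (repo / example).write_text('syntax = "proto3";\n')
    copied = copy_preserving_paths(repo, Path(tmp) / "protos", [example])
    copied_relative = [p.relative_to(Path(tmp)).as_posix() for p in copied]
    print(copied_relative)
```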
The prediction_service.proto can be trimmed down to only implement the Predict RPC. This avoids pulling in nested dependencies of the other RPCs defined in the service. Here is an example of the simplified prediction_service.proto:
Generate the gRPC python implementations using
Now the entire tensorflow_serving module can be removed:
and replaced with the generated protobufs from
The Tensorflow library is imported in order to use the helper function make_tensor_proto, which wraps a python/numpy object as a TensorProto object.
Thus, we can replace the following dependency and code snippet:
with protobuf imports and building the TensorProto object:
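Conceptually, a TensorProto carries a data type, a shape (a list of dimension sizes) and the values themselves, for example in its repeated float_val field. The sketch below shows only the data preparation step in pure Python: deriving the dimensions and a row-major flat list of floats from a nested structure. The helpers shape_of and flatten are hypothetical, and the assignment onto the generated protobuf message is deliberately left out because it depends on the generated stubs.

```python
from typing import List, Tuple


def shape_of(nested) -> Tuple[int, ...]:
    """Return the dimensions of a rectangular nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


def flatten(nested) -> List[float]:
    """Flatten a nested list into floats in row-major (C) order."""
    if not isinstance(nested, list):
        return [float(nested)]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


# A tiny 1 x 2 x 2 x 3 (NHWC) stand-in for a decoded image.
image_batch = [[[[0, 1, 2], [3, 4, 5]], [[6, 7, 8], [9, 10, 11]]]]
dims = shape_of(image_batch)
values = flatten(image_batch)
print(dims, len(values))
```

With a numpy array, the same two pieces of information are available as the array's shape and its flattened contents, which is what makes it possible to drop the Tensorflow helper.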
The full Python script is available here. Run the updated inception client, which makes a prediction request to the optimized Tensorflow Serving:
The following chart shows the latency of a prediction request over 10 runs against standard Tensorflow Serving, optimized Tensorflow Serving, and the optimized client:
The average latency dropped by ~70.4% from the standard Tensorflow Serving setup to the optimized version.
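For reference, a relative drop like this is computed from the mean latency of each set of runs. The sketch below shows the arithmetic; the latency values are made up purely to exercise the function and are not the measurements from this post, and percent_reduction is a hypothetical helper.

```python
import statistics
from typing import Sequence


def percent_reduction(baseline: Sequence[float], optimized: Sequence[float]) -> float:
    """Return how much the mean of optimized is below the mean of baseline, in percent."""
    base_mean = statistics.mean(baseline)
    if base_mean <= 0:
        raise ValueError("baseline mean must be positive")
    return (base_mean - statistics.mean(optimized)) / base_mean * 100.0


# Made-up latencies in seconds, chosen only to exercise the arithmetic.
baseline_runs = [2.0, 2.5, 1.5]
optimized_runs = [1.0, 1.25, 0.75]
reduction = percent_reduction(baseline_runs, optimized_runs)
print(f"{reduction:.1f}%")
```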
Tensorflow Serving can also be configured for high throughput processing. Optimizing for throughput is usually done for "offline" batch processing where tight latency bounds are not a strict requirement.
Server-side batching is supported out of the box by Tensorflow Serving as mentioned in docs here.
The trade-offs between latency and throughput are governed by the batching parameters supported. Tensorflow Serving batching works best to unlock the high throughput promised by hardware accelerators.
To enable batching, set the --enable_batching and --batching_parameters_file flags. Batching parameters can be set as defined by SessionBundleConfig. For CPU-only systems, consider setting num_batch_threads to the number of cores available. See here for batching configuration approaches with GPU-enabled systems.
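As a sketch of what such a parameters file might contain, the snippet below renders a few batching fields in protobuf text format, defaulting num_batch_threads to the host's core count as suggested above. The numeric values are illustrative placeholders rather than tuned recommendations, and batching_parameters is a hypothetical helper; check the field names against the batching documentation for the version you run.

```python
import os
from typing import Optional


def batching_parameters(
    max_batch_size: int,
    batch_timeout_micros: int,
    num_batch_threads: Optional[int] = None,
    max_enqueued_batches: int = 100,
) -> str:
    """Render batching parameters in protobuf text format."""
    threads = num_batch_threads or os.cpu_count() or 1
    fields = {
        "max_batch_size": max_batch_size,
        "batch_timeout_micros": batch_timeout_micros,
        "max_enqueued_batches": max_enqueued_batches,
        "num_batch_threads": threads,
    }
    return "".join(f"{name} {{ value: {value} }}\n" for name, value in fields.items())


# Placeholder values for illustration only.
params_text = batching_parameters(32, 1000, num_batch_threads=4)
print(params_text)
```

The rendered text would be written to the file that --batching_parameters_file points at.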
Upon reaching full batch on server-side, inference requests are merged internally into a single large request (tensor) and a Tensorflow Session is run on the merged request. Running a batch of requests on a single Session is where CPU/GPU parallelism can really be leveraged.
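The merging step can be pictured with a toy model. The sketch below groups pending requests into batches of at most max_batch_size and runs each batch in one call; it is a conceptual illustration only and not how Tensorflow Serving implements batching (for one thing, it ignores the batch timeout). Both merge_into_batches and run_batch are hypothetical names.

```python
from typing import List, Sequence


def merge_into_batches(
    requests: Sequence[List[float]], max_batch_size: int
) -> List[List[List[float]]]:
    """Group pending requests, in arrival order, into batches of limited size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(requests[start:start + max_batch_size])
        for start in range(0, len(requests), max_batch_size)
    ]


def run_batch(batch: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for one Session run: sums each input in the batch."""
    return [sum(inputs) for inputs in batch]


pending = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batches = merge_into_batches(pending, max_batch_size=2)
results = [output for batch in batches for output in run_batch(batch)]
print(len(batches), results)
```

Three requests are served with two model invocations here, which is where the throughput gain comes from, at the cost of requests waiting for their batch to fill.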
Some general use cases to consider for server-side batching with Tensorflow Serving:
Client-side batching groups multiple inputs together into a single request.
Since the ResNet model expects input in NHWC format (first dimension being the number of inputs), we can aggregate multiple input images into a single RPC request:
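The aggregation itself amounts to stacking same-sized images along a new leading dimension. Here is a pure-Python sketch using nested lists so that it runs without extra packages; with numpy arrays the equivalent would be np.stack(images, axis=0). The helpers stack_nhwc and blank_image are hypothetical, and the 4 x 4 images are placeholders rather than real ResNet inputs.

```python
from typing import List, Tuple

Image = List[List[List[float]]]  # height x width x channels


def image_shape(image: Image) -> Tuple[int, int, int]:
    """Return (height, width, channels) of a rectangular nested-list image."""
    return (len(image), len(image[0]), len(image[0][0]))


def stack_nhwc(images: List[Image]) -> Tuple[Tuple[int, int, int, int], List[Image]]:
    """Stack equally sized HWC images into one NHWC batch and return (shape, batch)."""
    if not images:
        raise ValueError("at least one image is required")
    first = image_shape(images[0])
    for image in images[1:]:
        if image_shape(image) != first:
            raise ValueError("all images in a batch must share height, width and channels")
    return (len(images),) + first, list(images)


def blank_image(height: int, width: int, channels: int = 3, fill: float = 0.0) -> Image:
    """Create a placeholder image filled with a constant value."""
    return [[[fill] * channels for _ in range(width)] for _ in range(height)]


batch_shape, batch = stack_nhwc([blank_image(4, 4), blank_image(4, 4, fill=1.0)])
print(batch_shape)
```

Because every image in a batch must share the same height, width and channel count, inputs of different sizes need to be resized before they can be grouped.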
For a batch of N images, the output Tensor in the response contains prediction results for each of the N inputs in the request batch, in this case N = 2:
A few words on GPUs.
For training, parallelization can be exploited by GPUs more intuitively, since building deep neural networks requires massive calculations to arrive at an optimal solution.
However, this is not always the case for inference. Often, CNNs will see inference sped up when graph execution steps are placed on GPU devices. However, picking hardware that hits the price-performance sweet spot requires rigorous testing and in-depth technical and cost analysis. Hardware-accelerated parallelization is more valuable for "offline" inference batch processing (massive volumes).
Before inviting GPUs to the party, consider the business requirements with a thorough cost (monetary, operational, technical) analysis over benefits (strict latency, high throughput).
Photo by Fancycrave on Unsplash