Unveiling the Core of Instacart’s Griffin 2.0: A Deep Dive Into the Model Serving Platform

Zihan Li · Published in tech-at-instacart · Feb 5, 2024

Authors: Zihan Li, Joseph Haraldson, Adway Dhillon, Keith Lazuka, Vaibhav Agarwal, James Matthews, Sahil Khanna, Rajpal Paryani

Background

Model serving is a critical step in the machine learning life cycle in which a trained model is deployed to a production environment as a web service to process input data, perform inference, and provide real-time predictions for various applications. Effective model serving infrastructure requires the model to be readily available, function efficiently under different load conditions, and maintain acceptable latency and throughput levels to support the corresponding application. Key aspects of model serving include:

  • Model Deployment: Packaging and deploying the trained model to a serving infrastructure.
  • Model Experimentation: Managing different versions of a model to allow for A/B testing and rollback.
  • Monitoring: Ensuring model serving is conducted within acceptable latency and accuracy.
  • Scaling & Load Balancing: Dynamically adjusting resources to handle varying numbers of inference requests.

In June 2022, we introduced Griffin 1.0, an extensible platform that supports diverse data management systems and integrates with multiple machine learning tools and workflows, and described its approach to model serving. In November 2023, we published a tech blog on the high-level design of Griffin 2.0. In this post, we go deeper into how we evolved the model serving system from Griffin 1.0 to Griffin 2.0 to improve the aspects above while also improving ease of use.

Design Considerations

In Griffin 1.0, teams ran custom model serving services built on the Gunicorn framework, and we identified a few problems with this approach:

  • Each team implemented its own custom model serving service. Common logic such as feature loading, feature preprocessing, model experimentation, monitoring, and feature logging was therefore re-implemented per service. This practice causes a few problems: (1) It is an inefficient use of developer time; if consolidated, the common logic can be reused to make predictions for all models. (2) Each team needs to manage its own codebase and service, which adds significant DevOps overhead. (3) There is no standard, straightforward way to deploy a new model or run model experimentation, which is a poor experience for machine learning engineers.
  • Due to the use of the Gunicorn framework and Python being an interpreted language, latency and resource usage were not satisfactory. Take the Ads team’s click-through-rate prediction model as an example: its P99 latency accounted for 15% of the overall ads serving latency. In addition, Gunicorn forks multiple worker processes to serve concurrent requests, and each process loads the same model, so the memory footprint grows linearly with the number of worker processes.

In Griffin 2.0, we resolved the above issues by consolidating common logic and creating a unified model serving platform (MSP for short) with several improvements:

  • Re-used model serving logic: The common logic for feature loading, feature preprocessing, model experimentation, monitoring, and feature logging is consolidated into the MSP. We expose a generic model serving interface to all applications.
  • Performant model serving: Golang is used to build the unified model serving service because it is a compiled language and its concurrency model is better suited than Gunicorn’s for building high-concurrency online systems; with this switch, we reduced P99 latency by over 80%. In addition, since each model only needs to be loaded once per model serving instance, the overall memory footprint is greatly reduced.
  • Improved DevOps experience: Teams owning machine learning applications no longer need to manage their own model serving services. They are now simply customers of MSP, and a single team is responsible for new enhancements and maintenance.
  • Improved experience for machine learning engineers: Machine learning engineers interact with MSP via configuration files and a self-service Control Plane UI. Feature loading, feature preprocessing, and model experimentation are all configuration-driven.

In summary, for the four aspects of a model serving system mentioned in the Background section, Griffin 2.0:

  • Improves Model Deployment via an improved experience for DevOps and machine learning engineers.
  • Improves Model Experimentation by consolidating the logic into MSP and making it configuration-driven.
  • Improves Monitoring by monitoring MSP centrally, instead of having each team manage its own monitoring.
  • Preserves Scaling & Load Balancing by deploying the unified model serving service as an Amazon ECS service.

How It’s Built

System Architecture

Fig-1 shows the system diagram for the model serving platform (MSP) in Griffin 2.0.

Fig-1: The architecture of MSP

MSP comprises four distinct components: Proxy, Worker, Control Plane, and Model Registry. The Proxy is designed to manage the routing of items to be scored, directing them to the appropriate workers. This is particularly beneficial for model experimentation. The Worker, on the other hand, is tasked with executing model inference logic.

  • Proxy: The Proxy has a routing configuration and a worker configuration which jointly help route items to be scored to the right workers. The routing configuration specifies which worker the proxy should direct a scoring item towards by identifying the worker endpoint alias. The worker configuration is responsible for defining the worker endpoint alias and its corresponding URL.
  • Worker: Each worker service (deployed using Amazon ECS) operates a single model. This model is initialized in a sidecar container. The primary container is dedicated to the tasks of feature loading and preprocessing, which precedes the invocation of the sidecar to perform the actual model inference.
  • Control Plane: This component is in charge of managing model deployments. When a worker is deployed for the first time, the Control Plane also generates the necessary worker configuration. Consequently, the Proxy utilizes this configuration to properly route requests to the worker.
  • Model Registry: This is where model artifacts are securely stored. Each model artifact contains the model and configuration files for feature loading and feature preprocessing.

Shown as (1) in Fig-1, when a model serving request for model_A arrives at the Proxy, it runs the following logic:

  • Depending on whether there is an experiment, there are two cases: (1) If there is no experiment, the routing config determines the default worker endpoint to which requests for model_A should be routed. For example, the default worker endpoint could be the one for model_A_v1. (2) If there is an experiment, the routing config defines the experiment setup, including the key used to distribute items across experiment arms (experiment_traffic_splitting_key). Shown as (2) and (3) in the diagram, items can be routed to different workers, some to the worker for model_A_v1 and some to the worker for model_A_v2 (a simplified sketch of this routing and merging flow follows this list).
  • Proxy merges responses from the workers, renders prediction results in the same order as they appear in the request, and returns the response to the client.
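
To make this flow concrete, here is a minimal Go sketch of how a proxy could partition items between a default worker and experiment arms, then merge per-worker predictions back into request order. It is an illustration only, not MSP’s actual code; types and functions such as Item, Route, routeItems, and mergeResponses are hypothetical.

```go
package proxy

// Item is a hypothetical item to be scored, keyed by the value of the
// experiment_traffic_splitting_key (e.g. product_id).
type Item struct {
	Key      string
	Features map[string]string
}

// Route describes a group of items destined for one worker.
type Route struct {
	WorkerURL string
	Items     []Item
	Positions []int // original positions, used to restore request order
}

// routeItems sends every item either to the default worker or, when an
// experiment is configured, to the worker behind its assigned arm.
func routeItems(items []Item, defaultURL string, arms map[string]string,
	assign func(key string) string) []Route {

	groups := map[string]*Route{}
	for i, it := range items {
		url := defaultURL
		if len(arms) > 0 {
			url = arms[assign(it.Key)] // e.g. hash(product_id) -> arm -> worker URL
		}
		g, ok := groups[url]
		if !ok {
			g = &Route{WorkerURL: url}
			groups[url] = g
		}
		g.Items = append(g.Items, it)
		g.Positions = append(g.Positions, i)
	}
	out := make([]Route, 0, len(groups))
	for _, g := range groups {
		out = append(out, *g)
	}
	return out
}

// mergeResponses writes per-worker predictions back into the original
// request order before the proxy returns the response to the client.
func mergeResponses(total int, routes []Route, scores map[string][]float64) []float64 {
	merged := make([]float64, total)
	for _, r := range routes {
		for j, pos := range r.Positions {
			merged[pos] = scores[r.WorkerURL][j]
		}
	}
	return merged
}
```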

How machine learning engineers use MSP

Machine learning engineers interact with MSP by:

  1. Registering the model via a Control Plane UI.
  2. Deploying model to its worker via the Control Plane UI.
  3. If they just want to onboard a new model to MSP, they set up the default endpoint in the routing config, so that Proxy knows how to route requests to it.
  4. If they want to run model experimentation, they set up experiments via the Traffic Assignment Engine UI. In the routing config, they list the experiment setup from the Traffic Assignment Engine and associate each experiment arm with the corresponding worker endpoint known to the Proxy. Note that the Proxy validates that the experiment setup in the Traffic Assignment Engine matches the routing config at bootstrap time.

In addition, teams owning the models can set up their own custom training pipelines, which publish models and trigger worker redeployment via the Control Plane upon successful model promotion (shown as (4) in Fig-1).

Key Design Decisions

Separation of routing config and worker config

We intentionally separate the routing config from the worker config and have the Proxy combine them to construct its view of routing. The routing config is environment-agnostic and is defined at the application layer (for a given model use case, we define which worker(s) the request should be routed to). The worker config is environment-sensitive and is defined at the physical layer (we define different worker URLs for different environments). Fig-2 shows an example of a routing config, while Fig-3 and Fig-4 show examples of worker configs for the two worker endpoints in Fig-2.

Fig-2: example of routing config
Fig-3: example of worker config for model_a_v1
Fig-4: example of worker config for model_a_v2
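
The figures above carry the real examples; the hedged Go sketch below only illustrates the idea behind the split. The routing config references workers by endpoint alias, the worker config maps an alias to a per-environment URL, and the Proxy joins the two at startup. The type and field names here are assumptions, not MSP’s actual schema.

```go
package proxy

import "fmt"

// RoutingConfig is application-layer and environment-agnostic: it maps a
// model use case to worker endpoint aliases.
type RoutingConfig struct {
	ModelUseCase         string
	DefaultEndpointAlias string
	Experiments          []Experiment
}

type Experiment struct {
	Name                string
	TrafficSplittingKey string            // e.g. "product_id"
	ArmToEndpointAlias  map[string]string // arm name -> worker endpoint alias
}

// WorkerConfig is physical-layer and environment-sensitive: it maps a
// worker endpoint alias to a concrete URL (different per environment).
type WorkerConfig struct {
	EndpointAlias string
	URL           string
}

// resolveDefaultURL joins the two configs to find the URL the Proxy should
// call for this use case when no experiment is running.
func resolveDefaultURL(rc RoutingConfig, wcs []WorkerConfig) (string, error) {
	for _, wc := range wcs {
		if wc.EndpointAlias == rc.DefaultEndpointAlias {
			return wc.URL, nil
		}
	}
	return "", fmt.Errorf("no worker config for alias %q", rc.DefaultEndpointAlias)
}
```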

Routing

Routing is configuration-driven, and it supports running multiple parallel experiments for a given model. In Fig-2, for model_a, there is only one experiment, “model_a_v1_vs_v2”, defined to test version v2 of model_a against v1. The field experiment_traffic_splitting_key represents the key used to distribute items to be scored across experiment arms; here, we use product_id. The routing config is designed as part of the MSP: machine learning engineers send code reviews to change it, and the machine learning infra team reviews and approves the changes. This process enforces team ownership of the MSP and places a gatekeeper on config changes.
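
As one way to picture how a splitting key like product_id could map items to experiment arms deterministically, here is a hedged Go sketch that hashes the key into 100 buckets and compares against per-arm traffic percentages. The real Traffic Assignment Engine logic is not shown in this post and may differ; Arm and assignArm are illustrative names.

```go
package proxy

import "hash/fnv"

// Arm is a hypothetical experiment arm with a traffic percentage and the
// worker endpoint alias it maps to; percentages are assumed to sum to 100.
type Arm struct {
	Name          string
	Percent       uint32
	EndpointAlias string
}

// assignArm hashes the splitting key (e.g. product_id) into [0, 100) and
// picks the arm whose cumulative range covers the bucket, so the same
// product_id always lands in the same arm. Assumes at least one arm.
func assignArm(splittingKey string, arms []Arm) Arm {
	h := fnv.New32a()
	h.Write([]byte(splittingKey))
	bucket := h.Sum32() % 100

	var cumulative uint32
	for _, a := range arms {
		cumulative += a.Percent
		if bucket < cumulative {
			return a
		}
	}
	return arms[len(arms)-1] // fallback if percentages don't sum to 100
}
```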

Batching

We enable batching for proxy-to-worker calls as a mechanism to reduce long-tail latency (e.g., P99). Batch size is configurable on a per-worker/model basis via the “batch_size” parameter in the worker config.
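
A hedged sketch of what per-worker batching could look like: the proxy slices the items routed to a worker into chunks of batch_size and scores the chunks concurrently, which is one way such batching can tame tail latency for large requests. scoreInBatches and scoreBatch are hypothetical; the actual MSP implementation is not shown in the post.

```go
package proxy

import "sync"

// scoreInBatches splits the items routed to one worker into chunks of
// batchSize (from the worker config's "batch_size") and scores the chunks
// concurrently, writing results back in input order.
func scoreInBatches[T any](items []T, batchSize int,
	scoreBatch func([]T) []float64) []float64 {

	if batchSize <= 0 {
		batchSize = len(items) // guard: fall back to a single batch
	}
	scores := make([]float64, len(items))
	var wg sync.WaitGroup
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Each goroutine writes a disjoint slice of the results.
			copy(scores[start:end], scoreBatch(items[start:end]))
		}(start, end)
	}
	wg.Wait()
	return scores
}
```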

Interface design for Proxy

The Proxy interface is what machine learning applications are calling to perform online model inference, and it is designed to be application-agnostic. Fig-5 shows the contract for the Proxy Request. The contract for Worker Request is similar (Fig-6).

Fig-5: Protobuf for ProxyRequest
Fig-6: Protobuf for WorkerRequest

Notice that the only difference between the Proxy Request and the Worker Request is that the former contains model_use_case_name while the latter does not. The shared_features, feature_data_frame, and query_data_frame fields are introduced to reduce data redundancy in the request while also structuring it in a reasonable way. In Fig-1, “model_use_case_name” refers to “model_A” or “model_B”. This API design makes testing easy: in some cases, when machine learning engineers onboard new models to MSP, they first send requests directly to the worker to verify that everything works end-to-end and to measure performance; later, they switch to sending requests through the Proxy for model experimentation.
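
The actual contracts live in the Protobuf definitions of Fig-5 and Fig-6, which are not reproduced here. The Go structs below are only an assumed mirror of the fields named in the text, meant to show the shape of the two requests and why they differ.

```go
package proxy

// ProxyRequest is addressed to a model use case; the Proxy resolves which
// worker(s) should score it. Field names follow the text; the real Protobuf
// types and shapes may differ.
type ProxyRequest struct {
	ModelUseCaseName string              // e.g. "model_A" or "model_B" in Fig-1
	SharedFeatures   map[string]string   // features common to every item, sent once
	FeatureDataFrame map[string][]string // per-item feature columns
	QueryDataFrame   map[string][]string // per-item query columns
}

// WorkerRequest is identical except that it omits ModelUseCaseName: a
// single-tenant worker serves exactly one model, so the use case is implicit.
type WorkerRequest struct {
	SharedFeatures   map[string]string
	FeatureDataFrame map[string][]string
	QueryDataFrame   map[string][]string
}
```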

Multi-container architecture for Worker

At Instacart, TensorFlow is the most widely used framework, and therefore it is the first one we want MSP to support. At the same time, we want the worker design to be able to support other algorithm frameworks in the future. Given this, we employ a multi-container architecture: the model is loaded into a sidecar container (running the TensorFlow Serving image), while the main container accepts requests, performs feature loading and feature preprocessing, constructs and sends requests to the sidecar container, and handles feature logging.
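
For a sense of what the main-container-to-sidecar hop involves, here is a hedged Go sketch that calls TensorFlow Serving’s REST predict endpoint. The real worker may use the gRPC PredictionService, a different port, or different request shapes; the localhost:8501 address and the scalar-output decoding below are assumptions for brevity.

```go
package worker

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// predictViaSidecar sends preprocessed feature rows to a TensorFlow Serving
// sidecar over its REST API (POST /v1/models/{name}:predict).
func predictViaSidecar(modelName string, instances []map[string]interface{}) ([]float64, error) {
	body, err := json.Marshal(map[string]interface{}{"instances": instances})
	if err != nil {
		return nil, err
	}

	// The sidecar runs in the same task, so TF Serving's default REST port
	// on localhost is assumed here.
	url := fmt.Sprintf("http://localhost:8501/v1/models/%s:predict", modelName)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Decoding into []float64 assumes the model returns one scalar per
	// instance; multi-output models return richer structures.
	var out struct {
		Predictions []float64 `json:"predictions"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Predictions, nil
}
```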

Using single-tenancy for Worker

We chose single-tenancy over multi-tenancy for workers (i.e., deploying a single model rather than multiple models on one worker) because it provides better failure isolation on a per-model basis and speeds up worker restarts, since the worker only needs to load one model. It is also much easier to implement single-tenancy than multi-tenancy.

Mitigate model version discrepancy

The model version discrepancy issue occurs when a new version of a given model is promoted but model serving instances are not relaunched to load it atomically. Before MSP, we needed to write custom Airflow jobs to restart each model serving service upon a successful model promotion. What made things worse in Griffin 1.0 was that the model serving service created for each model use case usually held multiple models for experimentation purposes, which made service restarts slow and therefore lengthened the window of model version discrepancy.

In MSP, the Control Plane automatically relaunches the Worker once a new version of a given model is promoted, and since we use single-tenancy for Workers, the model version discrepancy issue is effectively mitigated.

Feature location config

Each item to be scored is associated with a list of features. Each feature can come from either the real-time inference request or the feature store. We introduce the feature location config to let model owners decide which features to fetch from which locations in a configuration-driven way. This config is serialized as a Protobuf file and embedded in the model artifact. The worker downloads model artifacts from S3, deserializes the feature location config, and applies it during the feature loading step. Fig-7 shows an example. The model has three features: “product_id”, “search_term”, and “product_id_search_term_l90d_ctr”. The features “product_id” and “search_term” come from the worker request, while “product_id_search_term_l90d_ctr” is fetched from the ML feature store using (product_id, search_term) as the lookup key.

Fig-7: example of feature location config
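
The actual feature location config is a Protobuf file inside the model artifact (Fig-7). The Go sketch below only illustrates the lookup behavior described above, using hypothetical types (FeatureLocation, FeatureStore) and the example features from the text.

```go
package worker

import "fmt"

// FeatureLocation says where a feature's value comes from: directly from the
// worker request, or from the feature store via the listed lookup keys.
type FeatureLocation struct {
	Name        string
	FromRequest bool
	LookupKeys  []string // e.g. ["product_id", "search_term"] for a CTR feature
}

// FeatureStore is a hypothetical interface over the ML feature store.
type FeatureStore interface {
	Get(featureName string, key []string) (string, error)
}

// loadFeatures assembles the feature vector for one item by applying the
// feature location config: request-borne features are copied as-is, the rest
// are fetched from the feature store using the configured lookup keys.
func loadFeatures(locations []FeatureLocation, request map[string]string,
	store FeatureStore) (map[string]string, error) {

	features := map[string]string{}
	for _, loc := range locations {
		if loc.FromRequest {
			features[loc.Name] = request[loc.Name]
			continue
		}
		key := make([]string, 0, len(loc.LookupKeys))
		for _, k := range loc.LookupKeys {
			v, ok := request[k]
			if !ok {
				return nil, fmt.Errorf("missing lookup key %q for feature %q", k, loc.Name)
			}
			key = append(key, v)
		}
		value, err := store.Get(loc.Name, key)
		if err != nil {
			return nil, err
		}
		features[loc.Name] = value
	}
	return features, nil
}
```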

Feature preprocessor config

Prior to this project, each application team implemented its own custom feature preprocessing logic in both the training pipeline and the inference service. In MSP, we decided to write a unified preprocessor library in Python, which is used for both training and serving across Instacart. We also introduced the feature preprocessor config to define DAGs for applying feature preprocessors to features. The config is serialized as a Protobuf file and embedded in the model artifact. The worker downloads model artifacts from S3, deserializes the config, and applies it during the feature preprocessing step.

There are multiple benefits to this approach: the preprocessor library is agnostic to the algorithm framework (TensorFlow, PyTorch, etc.); feature preprocessors are consistent between training and serving across the whole company; and Python is well-known among machine learning engineers and allows for fast prototyping.
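
The unified preprocessor library itself is written in Python, so the Go sketch below is purely a conceptual illustration of what applying a config-defined preprocessor DAG involves, assuming the serialized config lists steps in topological order. Every type and field name here is made up.

```go
package worker

// PreprocessorStep is a hypothetical node in the feature preprocessor DAG:
// it reads some input columns and produces one output column.
type PreprocessorStep struct {
	Name   string
	Inputs []string // input feature/column names
	Output string   // produced column name
	Apply  func(inputs []string) string
}

// applyPreprocessors runs the steps in the (topologically sorted) order given
// by the config; each step reads columns produced by the raw features or by
// earlier steps, so the whole DAG resolves in one pass.
func applyPreprocessors(steps []PreprocessorStep, features map[string]string) map[string]string {
	out := make(map[string]string, len(features))
	for k, v := range features {
		out[k] = v
	}
	for _, s := range steps {
		inputs := make([]string, 0, len(s.Inputs))
		for _, in := range s.Inputs {
			inputs = append(inputs, out[in])
		}
		out[s.Output] = s.Apply(inputs)
	}
	return out
}
```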

Feature logging

We set up a near-real-time data pipeline to log the following information for compliance, debugging, and model performance monitoring purposes: the features associated with items to be scored (we log both the pre-processed and post-processed versions), the inference results, and the model name and version.
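
As a rough picture of what one logged record could contain (the exact schema is not shown in this post), here is a hypothetical Go struct covering the fields listed above.

```go
package worker

import "time"

// FeatureLogRecord is a hypothetical shape for what the near-real-time
// pipeline could log per scored item.
type FeatureLogRecord struct {
	ModelName            string
	ModelVersion         string
	RawFeatures          map[string]string // features before preprocessing
	PreprocessedFeatures map[string]string // features after preprocessing
	Prediction           float64           // inference result
	ScoredAt             time.Time
}
```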

Model Deployment Workflow

The Control Plane is responsible for executing the model deployment workflow. Once a model is published to the Control Plane, a machine learning engineer can click a button on the Control Plane UI to trigger worker deployment for the model (as shown in Fig-8). Three things happen accordingly (sketched after Fig-8):

  • A Worker service is deployed just for this model.
  • A worker config is created for the Worker and consumed by the Proxy.
  • Datadog metrics and alerts are created for the Worker.
Fig-8: deploy model test_e2e_training_griffin to worker endpoint by using the Control Plane UI
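
A sketch of these three steps as a single Control Plane workflow. Every interface and function name below is hypothetical; the real implementation drives Amazon ECS deployments and Datadog APIs, which are not reproduced here.

```go
package controlplane

import "context"

// Deployer captures the external systems the Control Plane talks to during
// model deployment. All method names are hypothetical.
type Deployer interface {
	DeployWorkerService(ctx context.Context, modelName, modelVersion string) (endpointURL string, err error)
	PublishWorkerConfig(ctx context.Context, modelName, endpointURL string) error
	CreateMonitoring(ctx context.Context, modelName, endpointURL string) error
}

// deployModel sketches the three steps triggered by the "deploy" button in
// the Control Plane UI (Fig-8).
func deployModel(ctx context.Context, d Deployer, modelName, modelVersion string) error {
	// 1. Deploy a single-tenant Worker service (on ECS) for this model.
	endpoint, err := d.DeployWorkerService(ctx, modelName, modelVersion)
	if err != nil {
		return err
	}
	// 2. Generate the worker config (endpoint alias and URL) for the Proxy.
	if err := d.PublishWorkerConfig(ctx, modelName, endpoint); err != nil {
		return err
	}
	// 3. Create Datadog metrics and alerts scoped to this Worker service.
	return d.CreateMonitoring(ctx, modelName, endpoint)
}
```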

Monitoring

We use Datadog for real-time monitoring of both the Proxy and the Workers. Proxy metrics are shared across all Workers. When a Worker service is created for the first time, the model deployment workflow calls Datadog APIs to automatically create metrics and alerts for it on a per-worker-service basis. In addition, we use Arize to monitor model performance. Take the Ads team’s predictive click-through-rate (pCTR) model for example: we log features and inference results (pCTR scores) at online inference time and periodically push this logged data into Arize. We also periodically push the feature data and actual ad click data used for training into Arize. Arize compares these two sources and provides prediction accuracy, prediction score distributions, discrepancies between features used for serving and those used for training, and data drift over time for both serving and training features.

Impact & Future Work

We’ve adopted MSP for several models at Instacart. After employing it for the Ads team’s pCTR model, we saw some notable improvements:

  • It reduced model serving latency (both P99 and P50), measured on MSP’s client side, by over 80% compared to the Gunicorn-based service. As a result, pCTR model prediction latency accounts for only 3% of the overall ads serving latency, down from 15%.
  • Substantial EC2 cost savings. Thanks to Go’s concurrency model, each worker instance only needs to load one copy of a given model, instead of one copy per worker process as in the Gunicorn-based service, which greatly reduces the memory footprint.
  • Model experimentation, feature loading, and feature preprocessing are all configuration-driven.
  • It reduced the time to launch an ML model from weeks to minutes.

In the future, we plan to support algorithm frameworks other than TensorFlow (e.g., PyTorch), as well as adaptive experimentation in addition to traditional static experiments.
