Using Contextual Bandit models in large action spaces at Instacart

Published in tech-at-instacart · Jun 15, 2023 · 13 min read

David Vengerov, Vinesh Gudla, Tejaswi Tenneti, Haixun Wang, Kourosh Hakhamaneshi

At Instacart, we strive to provide our customers with the most personalized experience possible by combining multiple considerations they may have when searching for products on Instacart. These considerations may include personal relevance of search results, popularity of recommended products, preference for low-priced items, the availability of those items at the store, etc. In order to better personalize the Instacart shopping experience for our customers across many of these considerations, we started using the latest Contextual Bandit (CB) models, which can be trained to recommend the best action to take in each observed context.

Contextual bandit models are a popular approach for personalizing the user experience by recommending relevant products. However, these models become challenging to train and evaluate when the number of actions, or products, in the recommendation pool becomes large. In this blog post, we will explore the difficulties associated with using contextual bandit models in large action spaces and propose potential solutions to overcome these challenges. One of these solutions was recently launched into production at Instacart.

Background on Contextual Bandit Models

Multi-armed Bandits vs Contextual Bandits

First, it is useful to clarify the difference between Contextual Bandit (CB) models and the more traditional and well-known Multi-Armed Bandit (MAB) models.

In the MAB framework, the system learns the values of possible actions through online exploration. After an action is taken, a stochastic outcome (that is, the reward from the action) is observed before the next action is taken. After some period of time during which the MAB model learns the reward distribution for each action, the action with the best average performance across all customers will be deployed as the production policy.

Fig. 1 During learning, the MAB model can sometimes try suboptimal actions if the uncertainty in their reward is high.

In contrast, a CB model observes some additional information (which is called context) that might be useful for selecting the action that is expected to give the largest reward. If we include user-specific features into the context vector, the CB model will learn to personalize the recommended action to each user, thereby potentially receiving a much higher average reward for all customers relative to an MAB model. Hence, a CB model can be viewed as a direct generalization of an MAB model, which does not use any context information.

In order to accurately predict the impact of each action in new contexts, the training data for a CB model should contain multiple examples of each action taken in each discrete context (or in each region of a continuous state space). This means that training of a standard discrete-action CB model becomes infeasible as the number of actions grows very large, as it will be hard or impossible to collect enough training data where each action is taken multiple times in each context. Unfortunately, the problem of optimally ranking the list of products shown in response to a search query falls in this infeasible problem category because of the very large number of potential product lists that will be observed in response to the search queries issued by the customers. The next section dives deeper into this problem and into the possible solutions we investigated.

Applying the CB Framework to Ranking and Retrieval Problems

At Instacart, we wanted to explore if we can make item retrieval and ranking much more personalized using Contextual Bandits. Item retrieval and ranking refers to the process of finding and ordering products that are relevant to a user’s search query. The item retrieval task involves searching through the store’s product database to find items that match the user’s search terms or browsing history. The ranking task then involves ordering the retrieved items based on their relevance to the user’s query, with the most relevant products appearing at the top of the search results. The ultimate goal is to provide users with a personalized and efficient shopping experience, where they can easily find the products they are looking for and make informed purchase decisions.

Traditionally, the relevance of each item to the user’s query is computed using a machine learning model that predicts the probability of each item being added to the cart. Such a model is trained on past data, where user and item features are the model inputs, and the label is a binary value: 1 for each product that was shown to the user and then added to the cart, and 0 otherwise.
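
For illustration, here is a minimal sketch of such an add-to-cart relevance model trained with XGBoost; the toy data, feature names, and settings below are purely hypothetical, not our production features.

import pandas as pd
import xgboost as xgb

# Toy logged impressions: one row per (user, item) pair shown in search results (illustrative values).
impressions = pd.DataFrame({
    "user_purchase_count": [12, 3, 40, 7],
    "item_popularity": [0.8, 0.2, 0.9, 0.4],
    "query_item_similarity": [0.7, 0.3, 0.9, 0.5],
    "added_to_cart": [1, 0, 1, 0],  # label: was the shown item added to the cart?
})
feature_cols = ["user_purchase_count", "item_popularity", "query_item_similarity"]

dtrain = xgb.DMatrix(impressions[feature_cols], label=impressions["added_to_cart"])
relevance_model = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# At serving time, candidate items are sorted by their predicted add-to-cart probability.
scores = relevance_model.predict(xgb.DMatrix(impressions[feature_cols]))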

Unfortunately, no obvious small set of actions exists for the problem of ranking recommended items in response to a search query or on a landing page. Several hundred potential candidate items are usually first retrieved from a large pool of millions of items and then sorted according to their ML-assigned ranking score. The entire sorted candidate list cannot be treated as a discrete action because this would lead to a very large number of possible actions (distinct sorted lists formed from all possible items). Moreover, each individual item in the list cannot be treated as a discrete action because the reward for that item (click or purchase) depends on the other items shown to the customer, which violates the independent reward attribution assumption of the CB framework.

One possible way of dealing with the above problem is to embed each possible action that a model can take into a low-dimensional vector, following the suggestion in Off-Policy Evaluation for Large Action Spaces via Embeddings. For example, for an item ranking problem, we can describe the list of items presented to the user using features such as the average ranking score and the average price of top 5, 10, 15, 20 items. We can add many other numerical features, but as the dimensionality of the vector that describes each action increases, learning an accurate CB model that would recommend the best vector for each input context will become increasingly difficult.
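
For illustration, here is a minimal sketch of such an action embedding, built from the top-k summary statistics mentioned above; the exact features and cutoffs are assumptions, not the ones we use.

import numpy as np

def embed_ranked_list(ranking_scores, prices, cutoffs=(5, 10, 15, 20)):
    """Summarize a ranked item list (one candidate action) as a small embedding vector.

    ranking_scores, prices: arrays ordered by rank position (assumed inputs).
    """
    embedding = []
    for k in cutoffs:
        embedding.append(np.mean(ranking_scores[:k]))  # average ranking score of the top-k items
        embedding.append(np.mean(prices[:k]))          # average price of the top-k items
    return np.array(embedding)

# Example: an 8-dimensional embedding for one candidate list of 100 items.
action_vec = embed_ranked_list(np.sort(np.random.rand(100))[::-1], np.random.uniform(1, 20, 100))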

In 2022, we started with a simpler approach for applying CBs to the ranking problem. Our work started with two observations. First, for well-defined queries such as “milk,” a linear model that predicts the relevance score for each item (the items are then ranked in the decreasing order of these scores) without relying on user features as inputs performs really well by recommending the most popular milk products for that retailer. Second, for broad/undefined queries such as “healthy snack,” a nonlinear item scoring model that uses many user features as inputs performs really well by recommending personalized items for the customer.

These observations motivated us to test whether a CB model can learn to select the best search ranker for each <user, query> context so as to increase the average relevance of search results as measured by cart_adds_per_search (items added to cart per search query). We trained a CB model using the X-learner framework described below, and during an A/B test such a model showed a 0.66% increase in cart_adds_per_search relative to the neural network item scoring model, which is a very large gain given Instacart’s scale.

After this success, we started investigating the possibility of personalizing multi-objective ranking, which is a problem faced by all real-world search and recommendation systems. All such systems need to perform a trade-off between the personalized relevance, popularity and novelty of recommended items. In e-commerce companies such as Instacart, the price and availability of recommended products are also important considerations. For our first online experiment with personalizing multi-objective ranking, we defined several different formulas for ranking the results according to considered objectives (with each formula giving a preference to one or two objectives) and treated them as the possible actions for the CB model.

For example, if we want to rank items according to a linear combination of relevance, popularity, price and availability scores, action0 can correspond to using weights (0.4, 0.2, 0.2, 0.2), action1 can correspond to using weights (0.2, 0.4, 0.2, 0.2), etc. As a result, action0 would imply that the final ranking score is computed as

score = 0.4 * relevance + 0.2 * popularity + 0.2 * price + 0.2 * availability,

action1 would imply

score = 0.2 * relevance + 0.4 * popularity + 0.2 * price + 0.2 * availability,

etc.
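
As a small illustration, the sketch below translates a discrete CB action into a weight vector and applies it to candidate items; the weights mirror the example above, while the function itself and the score inputs are hypothetical.

import numpy as np

# Each discrete CB action indexes a weight vector over (relevance, popularity, price, availability).
ACTION_WEIGHTS = {
    0: np.array([0.4, 0.2, 0.2, 0.2]),
    1: np.array([0.2, 0.4, 0.2, 0.2]),
    # ... remaining perturbed weight vectors
}

def rank_candidates(action_id, candidate_scores):
    """candidate_scores: (num_items, 4) array of per-objective scores for the retrieved candidates."""
    final_scores = candidate_scores @ ACTION_WEIGHTS[action_id]  # linear combination of objective scores
    return np.argsort(-final_scores)                             # item indices in descending score order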

With the above framework in mind, we first performed a randomized data collection experiment, for which we defined eight actions that explored the space of coefficients for the above ranking objectives around the values used by the production model at that time (perturbing these coefficients by 20% up or down). We then trained CB models on the collected data and evaluated them counterfactually as described below. The X-learner model and the XGBoost model showed the most promising tradeoffs between counterfactual estimates for Cart Adds Per Search (CAPS) and Gross Merchandise Value (GMV) per search, which are the top-line metrics we use in the A/B tests for search ranking experiments. We then performed an online A/B test with these models and observed a statistically significant increase close to 0.6% in CAPS for Android users (our most price-sensitive users) for both models, and a positive but not statistically significant increase in GMV per user. Based on these results, the Search team decided to launch the XGBoost model into production, as it had the lower latency of the two models. We plan to continue experimenting in this domain with different action spaces for the CB model.

In addition to investigating the use of CB for personalizing search ranking, we also plan to investigate the use of CB models for recommending a vector of preference scores over the available retrieval sources so as to maximize a certain objective that can be computed from each sorted list of results. There are several ways to translate the preference vector produced by the CB model into the particular treatment received by the user. If the system retrieves more than enough candidates, then we can sample candidates from each source so as to have more candidates from higher preference sources. In addition, we can heuristically uprank the candidates from the most preferred source by multiplying their ranking scores by a number greater than 1. Finally, it is possible to use the CB preference of each item’s source as an extra ranking feature.

Training contextual bandit models

In order to train a CB model, we first need to collect a dataset where the actions we want the CB model to use have already been taken in the various contexts, and the outcomes have been observed. During data collection, customers would be randomly assigned to different variants, with each variant always using one particular “action” for ranking the items. We can then train a CB model on this data to learn the predicted reward for each action in a given context (defined by input features) and also evaluate the trained model counterfactually (see below for a discussion of counterfactual evaluation).
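
Conceptually, each logged search becomes a (context, action, reward) record; a small sketch of what such a dataset might look like is shown below (the column names and values are purely illustrative). The assignment probability is also recorded because it is needed for the counterfactual evaluation described later.

import pandas as pd

# Each row is one search from the randomized data collection experiment.
logged_data = pd.DataFrame({
    "user_feature_1": [0.3, 1.2, 0.7],    # context features describing the user and the query
    "query_is_broad": [0, 1, 1],
    "action": [0, 3, 1],                  # which ranking variant the user was randomly assigned to
    "propensity": [0.125, 0.125, 0.125],  # probability of that assignment (1/8 for eight variants)
    "reward": [1, 0, 2],                  # e.g., items added to cart from that search
})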

A few different algorithms can be used to train a CB model on top of the collected data. The simplest approach, which can be implemented with a few lines of Python code, consists of training an XGBoost or LightGBM model with “action” being a categorical feature in the model. In order to obtain the recommended action from such a model for a new input vector, all allowed actions can be sequentially substituted into the trained model in place of the “action” feature, and the action with the highest predicted reward can then be selected.
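
A minimal sketch of this approach using LightGBM is shown below (XGBoost with categorical-feature support would work similarly); the toy data, feature names, and parameter values are assumptions.

import lightgbm as lgb
import pandas as pd

# Toy logged (context, action, reward) data; "action" is treated as a categorical input feature.
train_df = pd.DataFrame({
    "user_feature_1": [0.3, 1.2, 0.7, 0.5],
    "query_is_broad": [0, 1, 1, 0],
    "action": pd.Categorical([0, 3, 1, 2]),
    "reward": [1, 0, 2, 1],
})
feature_cols = ["user_feature_1", "query_is_broad", "action"]

booster = lgb.train(
    {"objective": "regression", "min_data_in_leaf": 1, "verbose": -1},
    lgb.Dataset(train_df[feature_cols], label=train_df["reward"], categorical_feature=["action"]),
    num_boost_round=50,
)

def recommend_action(context_row, all_actions):
    """Substitute each allowed action into the model and pick the one with the highest predicted reward."""
    candidates = pd.concat([context_row] * len(all_actions), ignore_index=True)
    candidates["action"] = pd.Categorical(all_actions, categories=train_df["action"].cat.categories)
    return all_actions[int(booster.predict(candidates[feature_cols]).argmax())]

# Example: recommend_action(train_df[["user_feature_1", "query_is_broad"]].iloc[[0]], [0, 1, 2, 3])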

A more sophisticated approach is to represent the CB model by a neural network with as many input nodes as there are context features and as many output nodes as there are possible actions. The final activations of the output nodes will then represent a vector of predicted rewards for all actions in the observed context. At serving time, the CB model can be configured to recommend the action with the highest predicted reward or to recommend actions probabilistically based on their predicted rewards. During training, for each datapoint in the format (context, action, reward), the error between the predicted reward and the observed reward for the action that was taken will be backpropagated through the output node corresponding to that action.
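
For illustration, here is a minimal PyTorch sketch of this network shape and of the masked training update; the layer sizes and names are placeholders and do not come from our production code.

import torch
import torch.nn as nn

NUM_CONTEXT_FEATURES, NUM_ACTIONS = 16, 8  # assumed sizes

# One output node per action; each output predicts the reward of that action in the given context.
cb_net = nn.Sequential(
    nn.Linear(NUM_CONTEXT_FEATURES, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(cb_net.parameters(), lr=1e-3)

def training_step(context, action, reward):
    """context: (batch, NUM_CONTEXT_FEATURES) float, action: (batch,) long, reward: (batch,) float."""
    predicted_all = cb_net(context)                                    # predicted rewards for all actions
    predicted_taken = predicted_all.gather(1, action.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(predicted_taken, reward)             # error only on the taken action
    optimizer.zero_grad()
    loss.backward()                                                    # gradient flows through that output node only
    optimizer.step()
    return loss.item()

def recommend(context):
    """Greedy serving for a single context row of shape (1, NUM_CONTEXT_FEATURES)."""
    with torch.no_grad():
        return int(cb_net(context).argmax(dim=1).item())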

Recently, a more advanced framework for training CB models (called X-learners) was proposed in the 2019 paper Metalearners for estimating heterogeneous treatment effects using machine learning. The method starts by grouping the data according to the treatment (action) that was applied, and then for each group k a first-level ML model is trained to predict the observed reward. Then, a second-level ML model is trained for each group to predict the incremental lift (also known as the Conditional Average Treatment Effect, CATE) of that group’s action relative to the “control” action (in practice, the “control” action is the currently deployed policy upon which we want to improve). At serving time, the X-learner CB model recommends the action with the highest predicted lift relative to control (or recommends actions probabilistically based on their predicted lifts). The diagram below shows the training steps for an X-learner model with 2 actions: treatment0 (control) and treatment1. In that diagram, X represents the set of input features, T represents the action (treatment) recorded in the data, Y represents the observed reward (Y0 is the vector of observed rewards for cases where T=0 was applied, while Y1 is the vector of observed rewards for cases where T=1 was applied), and Y* represents the predicted value of Y according to the first-level model.

If more treatments are present, the steps in the diagram are repeated for each treatment (action). Our Python implementation of the X-learner framework stores the trained models internally inside a single X-learner object, on which a single “predict” method can be called to get the expected rewards for all actions.
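
To make these steps concrete, here is a minimal sketch of an X-learner for two treatments using XGBoost regressors as the base models; the constant propensity weight and the model settings are simplifying assumptions, not our production implementation.

import numpy as np
from xgboost import XGBRegressor

def fit_x_learner(X0, Y0, X1, Y1, propensity=0.5):
    """X0/Y0: contexts and rewards under control (T=0); X1/Y1: under treatment (T=1), as numpy arrays."""
    # Step 1: first-level models predict the observed reward within each group.
    mu0 = XGBRegressor(n_estimators=100).fit(X0, Y0)
    mu1 = XGBRegressor(n_estimators=100).fit(X1, Y1)

    # Step 2: imputed individual treatment effects (lift relative to control).
    D1 = Y1 - mu0.predict(X1)   # treated group: observed reward minus predicted control reward
    D0 = mu1.predict(X0) - Y0   # control group: predicted treated reward minus observed reward

    # Step 3: second-level models predict the lift (CATE) from the context.
    tau1 = XGBRegressor(n_estimators=100).fit(X1, D1)
    tau0 = XGBRegressor(n_estimators=100).fit(X0, D0)

    # Step 4: combine the two CATE estimates, here with a constant propensity weight.
    def predict_lift(X):
        return propensity * tau0.predict(X) + (1 - propensity) * tau1.predict(X)
    return predict_lift

# At serving time, the treatment is recommended when its predicted lift over control is positive.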

Counterfactual Evaluation of Trained CB Models

The most commonly used approach for evaluating trained CB models is Inverse Propensity Sampling (IPS). This estimator weighs the reward in each observed (state, action, reward) = (s, a, r) tuple in the dataset by the ratio of probabilities pnew/p0, where pnew is the probability of taking the observed action under the new (trained CB) policy and p0 is the probability of taking this action under the logging policy. The intuition for this weighting is that if a certain action a is going to be taken twice as often in a state s under the new policy as under the old policy, then the reward r is also expected to be observed twice as often. The IPS estimator has been proven to produce an unbiased estimate of the expected reward under the new policy, assuming the dynamics of the process do not change over time.
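
A minimal sketch of the IPS estimate computed from logged tuples, assuming the logging propensities were recorded (as in the randomized data collection described above):

import numpy as np

def ips_estimate(rewards, logging_probs, new_policy_probs):
    """Average reward of the new policy estimated from logged data.

    rewards: observed rewards r for each logged (s, a, r) tuple.
    logging_probs: p0, probability the logging policy assigned to the logged action.
    new_policy_probs: pnew, probability the new (trained CB) policy would take that same action.
    """
    weights = np.asarray(new_policy_probs) / np.asarray(logging_probs)
    return float(np.mean(weights * np.asarray(rewards)))

# Example: for a deterministic new policy, pnew is 1 when it agrees with the logged action and 0 otherwise.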

We have also used a more advanced Doubly Robust (DR) estimator to evaluate the trained models. The DR method is nicely summarized on slide 53 of the RecSys2021 Tutorial Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances.
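
For reference, a minimal sketch of the DR estimator in the same setting: it combines a learned reward model with an IPS correction applied to that model’s residual. The reward-model interface below is an assumption for illustration.

import numpy as np

def dr_estimate(rewards, logging_probs, new_policy_probs,
                predicted_reward_logged_action, predicted_reward_new_action):
    """Doubly Robust estimate of the new policy's average reward.

    predicted_reward_logged_action: reward model's prediction q(s, a) for the logged action.
    predicted_reward_new_action: expected q(s, a') under the action the new policy would choose.
    """
    weights = np.asarray(new_policy_probs) / np.asarray(logging_probs)
    correction = weights * (np.asarray(rewards) - np.asarray(predicted_reward_logged_action))
    return float(np.mean(np.asarray(predicted_reward_new_action) + correction))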

The two estimators described above are only applicable to evaluating CB models with discrete action spaces. In cases where the action space is continuous (which arises when we use action embeddings), more sophisticated approaches need to be used. Luckily, such approaches have started appearing in the scientific literature. For example, equation (4) in Off-Policy Evaluation for Embedded Spaces describes an elegant and practical approach, which we are planning to use for problems at Instacart where action embeddings will be necessary.

Tools we created for training and evaluating CB models

We started out by implementing the X-learner framework as a Python script for training CB models with discrete action spaces. We used XGBoost as the basic model inside this framework because of its fast training time, its ability to use both continuous and categorical features without any preprocessing, and its very good predictive power (which can often match or exceed that of neural networks, especially on small datasets).

We have also experimented with using Ray’s RLlib for training neural-network-based CB models with discrete action spaces (using the DQN algorithm). RLlib is an open-source library for training contextual bandit and reinforcement learning (RL) models, offering support for production-level, highly distributed RL workloads while maintaining unified and simple APIs for a large variety of industry applications. When we start considering problems with continuous action spaces, RLlib will be our main choice since X-learners cannot be used in that case; RLlib, on the other hand, implements algorithms such as DDPG and SAC that are designed to work in continuous action spaces. The reinforcement learning algorithms in RLlib are also able to handle sequential decision problems, where we want to make a tradeoff between taking the action that maximizes the expected immediate reward and the one that will place the customer into a better state, from which higher future rewards can be obtained.

We worked with engineers from Anyscale (Kourosh Hakhamaneshi and others) to create a small Python script that can use any CB/RL algorithm currently available in RLlib and can be run on a laptop or in the cloud (Amazon’s SageMaker, the Anyscale cloud, etc.), using all available cores on the machine to speed up computation. As that script is running, it prints the IPS and DR estimates of the policy after each training epoch (a complete pass through all of the data) and also stores them in a TensorBoard logging directory, from which the training progress can be nicely visualized using TensorBoard. At the end of training (specified by the maximum number of epochs to be used), we can then use the model that corresponds to the epoch for which the best IPS and DR metrics have been observed.

Using RLlib can be as easy as defining a config object and calling tuner.fit() on it. It also gives you flexibility in defining different levels of customization for both the algorithm and the logging:

from ray import tune
from ray.rllib.algorithms.dqn import DQNConfig

path = "path/to/offline_data.json"  # placeholder: location of the logged training data saved in JSON format

config = (
    DQNConfig()
    # Read the logged training data saved in the JSON format.
    .offline_data(input_="dataset", input_config={"format": "json", "paths": [path]})
    # gamma=0 treats the problem as a contextual bandit (no future rewards);
    # "fcnet_hiddens" specifies the hidden-layer sizes of the CB model.
    .training(gamma=0, model={"fcnet_hiddens": tune.grid_search([[32], [64]])})
)
tuner = tune.Tuner(
    "DQN",
    param_space=config,
)
results = tuner.fit()

Conclusions

At Instacart, we have an exciting roadmap for 2023, filled with novel applications of Contextual Bandits (CB) to retrieval and ranking problems across many different surfaces, such as Search, Autocomplete, and Ads. We also have the tools to quickly train CB models and evaluate their predicted impact on various business metrics, which allows us to A/B test only the most promising models. Stay tuned for follow-up blog posts where we will share some success stories of using CB models at Instacart!
