How Instacart Modernized the Prediction of Real Time Availability for Hundreds of Millions of Items While Saving Costs

Jack He · Published in tech-at-instacart · Jul 17, 2023


This is Part 2 of a three-part blog post series in which we outline how we addressed inventory challenges through product, machine learning, and engineering innovations. See Part 1 here and Part 3 here.

Introduction

At Instacart, we serve customers with the goal of finding and delivering all the products that they want to purchase from their favorite grocery stores. But, as we described in Part 1, this is no easy task. Inventory fluctuates at physical stores, and supply chains have been volatile, especially following the COVID pandemic. Therefore, predicting the inventory status of each product (in stock, low in stock, or out of stock) is a big challenge that we’ve taken on using machine learning.

We aim to provide our customers with up-to-date inventory predictions so that they can expect to get what they order. Our community of shoppers scans millions of items per day and delivers them to customers. Every time a shopper scans an item into their cart or marks an item as “not found”, we treat it as another signal that helps us make granular predictions of an item’s in-store availability. When items are low in stock, we sometimes communicate this to customers in the Instacart app (see Fig 1) so they can pick alternative products or choose replacements beforehand.

Fig 1. Product-specific availability indicator in the Instacart app

Within Instacart, an item is defined as a product within a particular store. We treat the same product sold in different stores as different items, because the same product may have different inventory status in each individual store.
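As a minimal illustration (the class and field names below are hypothetical, not Instacart’s actual schema), an item can be modeled as a keyed pair of product and store:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ItemKey:
    """Hypothetical item key: the same product in two stores is two items."""
    product_id: str  # catalog-level product identifier
    store_id: str    # a specific physical store location

# The same product sold at two stores yields two distinct items,
# each with its own availability prediction.
banana_store_a = ItemKey(product_id="organic-banana", store_id="store-123")
banana_store_b = ItemKey(product_id="organic-banana", store_id="store-456")
assert banana_store_a != banana_store_b
```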

Several years ago, Instacart built an item availability model that has since served as the basis of our approach to this challenge of scale. The model predicts the probability of finding an item in store, represented as a value between 0 and 1, for each item in our catalog. However, over the years, particularly during the pandemic period, we have identified some opportunities to improve the existing model, especially around better interpretability and modernizing the legacy infrastructure.

In order to support timely and informed shopping decisions and address the opportunities we have identified, we have built and deployed a new real-time item availability model. The new model has shown significant improvement in terms of business metrics and engineering performance.

Challenges with the previous model

Our previous model pipeline, shown in Fig 2, was a batch scoring job that ran every 2–4 hours and focused primarily on generating comprehensive features in our data warehouse, Snowflake. The machine learning layer’s responsibility ended once the pipeline wrote scores to a Postgres database for scalable serving.

Fig 2. Previous model pipeline

While the previous model delivered satisfactory results, several new challenges have arisen since the pandemic, and we wanted to address them when designing the new model and serving pipeline.

Interpretability

The first challenge was interpretability, which was particularly problematic when the model incorrectly predicted that an in-stock item was unavailable. We want a structured model whose predictions are reliable and whose reasoning behind a particular prediction is easy to understand. This matters because the correct interpretation can prompt follow-up actions, such as collecting more observations in the store or asking retailers to replenish.

Sparse Data

The second challenge was a sparse data problem. During the pandemic, our catalog grew several times larger than before, thanks to our continued catalog data work with our retailer partners. While we receive millions of signals every day from Instacart shoppers who fulfill customer orders in store, our catalog is hundreds of times larger, so the majority of items in our catalog have very limited or even no shopper signals. To still predict scores accurately for this large number of tail items, we need to add and generate new features in a principled and reliable manner, and revisit features that may be less relevant and useful.

Business use case mismatch

The third challenge was the so-called “business use case mismatch”, with use cases shown in Fig 3. There are two major use cases: customers ordering (cart building, post-checkout, etc.) and shoppers picking items up in stores. When the order time and picking time are relatively close (instant delivery), one set of scores is sufficient to support all stages of a shopping journey. However, since the pandemic, we have seen more diverse scenarios that raise new challenges.

Fig 3. Use case mismatch

For example, customers increasingly like to plan ahead. A customer may place an order at night for early next-day delivery, when some items have run out at ordering time but will be restocked by store opening the next day. In the previous pipeline, because there was only one version of scores, the customer would see the item as out-of-stock (“OOS”) at order time and likely would not be able to buy it, even though the delivery is not imminent, since it is scheduled for the following morning.

In contrast, routing our shoppers to stores should always use the most recent signals to identify the store with the most available items right now. Our new pipeline therefore needs to be context aware and support multiple versions of scores to power different use cases, e.g., ordering for a scheduled later delivery vs. sending shoppers to stores.

Legacy Infrastructure

The last problem arose from the capability and cost of the legacy engineering infrastructure, which led to score staleness. Because of the rapid growth of our catalog, it is neither economically feasible nor necessary to score hundreds of millions of items very frequently. What’s more, there was a delay of several hours between signals arriving at the source and reaching model serving. We therefore needed to completely rebuild the pipeline on Griffin, the MLOps stack developed by the Instacart Machine Learning Infra team, to reduce costs while improving the quality and freshness of our scores.

Our new model structure mainly addresses the first and second challenges, while our new infrastructure and pipeline tackle the last two.

New Model

To address the challenges around interpretability and data sparsity, we came up with a new hierarchical model structure with three distinct components, abbreviated G-T-R and illustrated in Fig 4. Each component plays an important yet distinct role in item availability predictions; a rough sketch of how the components fit together follows Fig 4.

  • G stands for General availability, which represents typical item availability patterns. It addresses the long-term pattern of availability for an item over 7–180 days and helps solve the data sparsity challenge.
  • T stands for Trending, which quantifies deviations from the general baseline. It answers the question: what is the observed availability over the past 0.5–30 days?
  • R stands for Real-Time, which responds to the very latest observation, one that is more relevant to the likely short-term availability status than any earlier signal: what is the most recent pattern of shopper data and retailer inventory data today?
Fig 4. Overview of new model structure
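As a rough sketch (the field names and fallback logic are our illustration of the structure described above, not the exact production interface), each item’s prediction can be thought of as a record of the three component scores, with the later components refining the earlier ones:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AvailabilityScores:
    """Illustrative per-item output of the hierarchical G-T-R model."""
    general: float             # long-term baseline found rate (7-180 days)
    trending: float            # prediction adjusting the baseline for recent deviations (0.5-30 days)
    realtime: Optional[float]  # inference from the latest signal; populated only for frequently observed items

    def serving_score(self) -> float:
        # Serve the real-time score when a fresh signal exists; otherwise fall
        # back to the trending prediction, which already uses the general
        # score as its prior.
        return self.realtime if self.realtime is not None else self.trending
```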

General scores — describing the typical patterns

General availability addresses a simple question: what is the typical probability of finding an item? To answer this question with a high level of confidence, we need a sufficiently large sample of observations. Once we find the relevant sample, we can draw a relatively reliable conclusion even with simple statistics like found rates. How can we find such a sample, given that many items in our catalog have a limited number of observations?

We employed an algorithm to identify the “most relevant sample of data” by searching outward from the most local and recent data, aggregating purchase history of the same product and similar items in an area over a period of time, as illustrated in Fig 5. The intuition is that the more local and recent a data point is, the more relevant it is for scoring. By imposing a required sample size K, determined by offline empirical evaluations, we can identify the minimally sufficient scope and confidently answer the “typical” availability question.

Fig 5. Finding the relevant samples

For example, for a frequent item like organic bananas, we observe more than K (the required sample size) shopper events at a single store over the last 7 days, so the found rate at that store over the last 7 days can serve as the general score. However, a rarely purchased product (a tail item) may have only K sales nationwide over the last 180 days, in which case we need to aggregate the past 180 days of data across nationwide stores to infer the general score for this product.
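A minimal sketch of this search, assuming a scope ladder and warehouse query helpers that are purely illustrative: starting from the most local and recent slice of data, we widen the scope until at least K observations are found, then use that slice’s found rate as the general score.

```python
from typing import Callable, Optional, Sequence, Tuple

# Candidate scopes ordered from most local and recent to broadest (illustrative).
SCOPES: Sequence[Tuple[str, int]] = [
    ("store", 7),       # one store, last 7 days
    ("store", 30),
    ("region", 30),
    ("region", 90),
    ("national", 180),  # nationwide, last 180 days
]

def general_score(
    item_id: str,
    k: int,
    count_events: Callable[[str, str, int], int],
    found_rate: Callable[[str, str, int], float],
) -> Optional[float]:
    """Return the found rate from the smallest scope with at least k observations.

    `count_events` and `found_rate` stand in for warehouse queries over shopper
    found / not-found events for the item (and similar items) within a scope.
    """
    for scope, days in SCOPES:
        if count_events(item_id, scope, days) >= k:
            return found_rate(item_id, scope, days)
    return None  # not enough data anywhere; a broader prior would be needed
```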

Trending — quantifying deviations from the typical patterns

Since the COVID pandemic began, supply chain disruption has been commonplace. Some examples include the paper towel demand surge, the baby formula shortage, water shortages caused by snowstorms or hurricanes, and special events like sports championship games. The so-called “trending” model is intended to detect and quantify such near-term deviations from the long-term patterns.

As in our previous model, an XGBoost model is trained to quantify how much deviation from the baseline we should expect due to a recent event, whether trending up, trending down, or holding steady, and it outputs the final predictions. Unlike our previous model, however, we consolidated multiple long-term features into the general scores as part of the input, so that the new model can have a reliable prior with fewer features. This approach has achieved superior offline and online performance, likely due to the more stable prior of the general score. By comparing the trending output against the general score, which has the highest weight among all features in the trained trending model, we can also understand how recent events affect the item availability predictions, thereby providing interpretability.
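A hedged sketch of what training such a trending model could look like (the feature names and values below are made up for illustration; the production feature set is richer): the general score enters as an input feature alongside short-window aggregates of shopper signals, and the label is whether the item was found on a subsequent pick.

```python
import pandas as pd
import xgboost as xgb

# Illustrative training frame: one row per (item, time) observation.
# "general_score" is the long-term prior from the G component; the other
# columns summarize recent shopper signals over short windows.
train = pd.DataFrame({
    "general_score":  [0.95, 0.40, 0.80, 0.10],
    "found_rate_12h": [1.00, 0.00, 0.50, 0.00],
    "found_rate_3d":  [0.90, 0.20, 0.70, 0.05],
    "num_events_3d":  [40,   3,    12,   1],
    "label_found":    [1,    0,    1,    0],  # was the item found on the next pick?
})

dtrain = xgb.DMatrix(
    train.drop(columns=["label_found"]),
    label=train["label_found"],
)

# Binary logistic objective so the model outputs a probability of finding the item.
booster = xgb.train(
    params={"objective": "binary:logistic", "max_depth": 6, "eta": 0.1},
    dtrain=dtrain,
    num_boost_round=100,
)

predictions = booster.predict(dtrain)  # final availability scores in [0, 1]
```

Inspecting the learned feature importances of such a model would show the weight carried by the general score relative to the short-window features, which is the basis for the interpretability argument above.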

Real time — inferring from the last known status

Real-time inference is based on the latest observation of an item’s status and is able to respond to recent signals in a more timely manner, thanks to the real-time streaming infrastructure. The assumption is that the most recent observation has a higher impact on the inferred short-term state than all previous observations.

We make an inference using real-time features generated from the latest signals, in addition to historical features. The real-time features include, but are not limited to: time since the last observation, last known status observed by an Instacart shopper, last status reported by a particular retailer, and retailer inventory levels. As a concrete example shown in Fig 6, once an Instacart shopper reports an item as out of stock on the shelf, we can look at historical data to understand how soon it is typically back in stock (restocking-to-shelf time), represented as an unnormalized probability density function (PDF).

Fig 6. Restocking time used for real time inference

We can easily convert the learned PDF to a cumulative distribution function (CDF), an example of which is shown in Fig 7, and then use the mean or 75th percentile of restocking time, the area under the CDF curve over time, or other statistics as part of the input to the real-time inference.

Fig 7. Example CDF for real time inference
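A small sketch of the conversion described above, using a made-up hourly restocking histogram (the numbers are illustrative only): normalizing and accumulating the PDF yields a CDF, which can then be read as the probability that the item is back on the shelf a given number of hours after the out-of-stock report.

```python
import numpy as np

# Hypothetical unnormalized histogram of restocking-to-shelf time, in hourly
# buckets: pdf[h] ~ relative frequency of the item reappearing h hours after
# being reported out of stock.
pdf = np.array([0.0, 0.1, 0.3, 0.8, 1.0, 0.9, 0.6, 0.4, 0.2, 0.1])

cdf = np.cumsum(pdf / pdf.sum())  # normalize, then accumulate

def prob_restocked(hours_since_oos: float) -> float:
    """Probability the item is back on the shelf, given hours since the OOS report."""
    bucket = min(int(hours_since_oos), len(cdf) - 1)
    return float(cdf[bucket])

print(prob_restocked(4.0))  # e.g. ~4 hours after the report
```

Statistics such as the mean or 75th percentile of this distribution, or the CDF value at the current elapsed time, can then be supplied to the real-time inference as features.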

Another important aspect of real-time inference is that the predictive power of a single signal wanes over time. We can therefore control the scope of real-time scored items by filtering on the last observed time, e.g., scoring only items whose last known status is within the past 10 hours. This has been a major cost saving factor, as only about 1% of all items are eligible for real-time scoring.

The new three-component model has improved interpretability: by checking the final score against each component, we can understand which source of features and observations contributes most to a particular prediction. It also handles sparse data better by taking sample size into account during feature generation.

New Infrastructure and Pipeline

The previous model generated features and availability probability scores for hundreds of millions of items every few hours, which was very costly. We believe items with different purchase frequencies, i.e., head vs. torso/tail items, can benefit from different scoring schedules to reduce latency and costs. While there is no set-in-stone definition for each group, in our empirical study we found the following grouping to work well: head (frequent) items have their latest signals within the past 10 hours (about 1% of all unique items scored), torso items within the past month (about 85%), and the remaining items are considered tail.

The new model running schedule is shown in Fig 8 below. All items receive daily general and trending scoring, but only head items get hourly trending scores and real-time inference. This stratified model serving structure reduces computation costs by about 80%, thanks to the less frequent scoring of torso and tail items, and has shown better metrics in the online experiment.

Fig 8. New model pipeline end to end
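A simplified sketch of the stratification described above (the thresholds mirror the ones in the text; the cadence labels are placeholders for the actual schedulers): items are bucketed by how recently we last saw a signal, and each bucket maps to a scoring plan.

```python
from datetime import datetime, timedelta

def item_tier(last_signal_at: datetime, now: datetime) -> str:
    """Bucket an item by signal recency: head (~1%), torso (~85%), or tail."""
    age = now - last_signal_at
    if age <= timedelta(hours=10):
        return "head"
    if age <= timedelta(days=30):
        return "torso"
    return "tail"

def scoring_plan(tier: str) -> dict:
    """Which model components run for an item in this tier, and how often."""
    plan = {"general": "daily", "trending": "daily", "realtime": None}
    if tier == "head":
        plan["trending"] = "hourly"     # only head items get hourly trending scores
        plan["realtime"] = "streaming"  # ...and real-time inference
    return plan
```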

The real-time item availability model has been an early adopter of some of Instacart’s latest engineering tech stacks. It is the first use case of the real-time data streaming pipeline and of Griffin, the latest MLOps tool suite at Instacart, including its feature store.

Now, with an API that supports multiple model versions, we can also support different product layers by specifying a version in the query. The API can therefore distinguish between logistics-related models and customer cart-building models, and return different scores for the same item in different contexts. In this way, we can use a real-time model for shopper routing while serving our customers with a model that predicts availability at a future fulfillment time.
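As a hedged illustration of what such a versioned query might look like from a client’s perspective (the field names, version labels, and item identifier format are assumptions, not the actual API), the same item can legitimately receive different scores depending on the requested context:

```python
from typing import TypedDict

class AvailabilityQuery(TypedDict):
    """Illustrative query shape; the real API fields are not public."""
    item_id: str
    model_version: str  # which score variant to serve
    target_time: str    # when picking will actually happen (ISO 8601)

# Shopper routing: use the real-time model so the score reflects the store right now.
routing_query: AvailabilityQuery = {
    "item_id": "organic-banana@store-123",
    "model_version": "realtime",
    "target_time": "2023-07-17T09:00:00Z",
}

# Overnight order for next-morning delivery: score availability at pick time, so
# an item that is out of stock tonight but restocks at opening is not hidden.
cart_query: AvailabilityQuery = {
    "item_id": "organic-banana@store-123",
    "model_version": "scheduled",
    "target_time": "2023-07-18T08:00:00Z",
}
```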

Conclusion

While we are now in a good position to better serve our customers and collaborate with our retailer partners, the road to this point has not been easy. This is a systems engineering problem whose solution requires a deep understanding of data, business and product processes, and machine learning theory, as well as full-stack, large-scale systems. It requires iterative trial and error to align the different components and produce optimal online performance and customer experience.

We achieved better performance in both offline evaluation and the online experiment without trading off costs. We substantially reduced model complexity and meaningfully decreased feature engineering time. We also saved costs by reducing the scoring needs for torso and tail items.

The rich learnings from building the new real-time model and its system have prepared us for the even greater impact and value we will bring to our customers and retailer partners in the near future.

Acknowledgement

I would like to express my sincere gratitude to all those who have contributed to the success of this project, a result of highly effective cross functional teamwork. Firstly, I would like to acknowledge the unwavering support and leadership from the management team, Allan Stewart, Haixun Wang, and Shishir Kumar Prasad, who have been instrumental in providing the resources and guidance. I would also like to extend my greatest appreciation to the machine learning project colleagues, Dingxiong Deng, Hechao Sun, Yiming Sun, Lev Kokotov, Xinyu Li who have put in tireless effort and dedication to make this project a reality. Their hands-on contributions and teamwork have been invaluable. Additionally, I would like to acknowledge the support from our partner members and teams, including the core shopping engineering team VMK Peddinti, Damon Ding, Frank Chung, Yanhua Liu, Michael Prescott, Joe Cohen, Victoria Spiegel, Alysia Echevarria, Eugene Agronin, Claire Rossi-de Vries; the product team Matt Izant, Joey Loi, Ken Yuhas, Nicholas Turley; the machine learning infrastructure team Guanghua Shu, Jacob Jesen, Han Li, Wideet Shende, Matt Crane, Sahil Khanna, Rajpal Paryani, Siddharth Choudhuri, Reza Sadri; the shopper team Kevin Ryan, Asif Haque, Graham Adeson, Aneesh Mannava, Fred Liang, Sai Teja Chava, Atul Gupte, Amod Mital, Darren Johnson, Guangshi Chen, Xianlei Qiu, Sean Cashin, Montana Low; the catalog team Suor Kim, Charles Wesley, Oliver Chan; as well as the data engineering team, the search team, and the catalog store team who have been instrumental in providing the necessary tools, resources and efforts for us to achieve our goals. I would also like to thank all my former and current teammates in the quality and recommendation machine learning team for all the reviews, discussions, and encouragement along the way. Last but not the least, I really appreciate the engineering council reviewers who provided useful and constructive feedback and suggestions.
