Predicting the real-time availability of 200 million grocery items

Abhay Pawar
Published in tech-at-instacart
Dec 5, 2018


Ever wished there was a way to know if your favorite Ben & Jerry’s ice cream flavor is currently available in a grocery store near you? Instacart’s machine learning team has built tools to figure that out!

Our marketplace’s scale lets us build sophisticated prediction models. Our community of over 100,000 shoppers scans millions of items per day across 20,000 physical stores and delivers them to customers. These stores belong to our grocery retail partners like Aldi, Costco, Kroger, Safeway, and Wegmans. Every time a shopper scans an item into their cart or marks an item as “not found”, we get information that helps us make granular predictions of an item’s in-store availability. This lets us set accurate expectations for items that are out of stock and recommend appropriate replacements for items likely to be out of stock.

As a quick overview of how Instacart works, customers place orders online to be fulfilled from one of our grocery retail partners. A shopper picks items in the store and delivers them in as little as an hour. We have millions of grocery products listed on our website. Each product at a particular store is defined as an “item” and we want to know the availability of each item.

The problem: understanding “not founds”

If a shopper cannot find an item in the store, we label the item as “not found”. A not-found item is bad for every stakeholder in our marketplace — customers don’t get what they want, retail partners lose out on revenue, shoppers spend more time searching for them, and Instacart fails to deliver the best customer experience.

Not-founds occur primarily for two reasons:

  1. Availability: Instacart doesn’t own the logistics supply chain for products listed on its platform (our retail partners do), which makes it difficult for us as a third party to know whether a store has an item at a given time. We get regular updates (typically once a day) from our retail partners on the availability of all items. But items can sell out quickly within a day. We realized that we needed more granular data throughout the day — we needed to know the real-time availability of each item.
  2. Find-ability: Because our product catalog is exhaustive, it includes items that shoppers sometimes cannot locate even when they are in stock. Items may be moved to the front of the store for a seasonal promotion, or paired with other products to drive sales. For example, chips are placed next to salsa instead of in their usual aisle. Recommending easy-to-find items saves shoppers’ time and cuts down on replacements for customers.

Hence, to infer real-time availability and capture find-ability, we built an item availability model that predicts the availability of all 200 million grocery items every 60 minutes.

Building the Model

As we set out to build the model, we formulated it as a classification problem where every ordered item is a training example. To capture both an item’s availability and its find-ability, the model is trained to predict whether the shopper found the item. Making this model work is challenging, both in training a model that performs well and in scoring at the scale required. Let’s look at the modeling aspects first.

Features

For each training sample of found/not-found, we use data from the several months prior to that order to create features. All feature engineering is geared towards tree-based models, since the model uses XGBoost. We use three broad categories of features to train the model: item-level features, time-based features, and categorical features.
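As a rough illustration of this setup (not our production pipeline), the sketch below trains an XGBoost classifier on found/not-found labels. The file name, feature names, and hyperparameters are hypothetical:

```python
# A minimal sketch of the training setup, assuming one row per ordered
# item with a 0/1 "found" label. Names here are illustrative only.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

orders = pd.read_csv("order_items.csv")  # hypothetical training extract
features = [
    "historical_found_rate",       # item-level
    "mins_since_last_not_found",   # item-level
    "hour_of_day", "day_of_week",  # time-based
    "store_found_rate",            # mean-encoded categorical
]
X_train, X_val, y_train, y_val = train_test_split(
    orders[features], orders["found"], test_size=0.2, random_state=42
)

model = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```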

Item level features

We build item-level features using an item’s past order data and the associated found/not-found events. Since we’re trying to predict whether an item will be found, these should be the most important set of features. An item with a not-found in the last sixty minutes is very likely to be unavailable in store. Similarly, an item with a very low historical found rate is difficult to find and hence very likely to be a not-found.

Up to sixty minutes after a not-found event, items have a very low found rate

The most important features from this set are the item’s historical found rate, time since its last not-found, and the expected time to next not-found (based on the historical time between two not-founds of that item). We also use item availability data that retailers send us daily to create more features.
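To make this concrete, here is a minimal sketch of how such item-level features could be computed with pandas. The schema (item_id, picked_at, found) and feature names are illustrative, not our production code:

```python
import pandas as pd

def item_level_features(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Illustrative item-level features from past found/not-found events.

    `events` has columns: item_id, picked_at (timestamp), found (0/1).
    Only events strictly before `as_of` are used, mirroring how features
    are built from the months of data preceding each training order.
    """
    past = events[events["picked_at"] < as_of]

    # Historical found rate: mean of the 0/1 found label per item.
    feats = past.groupby("item_id")["found"].mean() \
                .rename("historical_found_rate").to_frame()

    # Minutes since the item's last not-found event.
    last_nf = past[past["found"] == 0].groupby("item_id")["picked_at"].max()
    feats["mins_since_last_not_found"] = (as_of - last_nf).dt.total_seconds() / 60

    # Expected gap between not-founds: mean time between consecutive
    # not-found events, a rough proxy for how often the item sells out.
    nf = past[past["found"] == 0].sort_values("picked_at")
    gaps = nf.groupby("item_id")["picked_at"].diff().dt.total_seconds() / 60
    feats["mean_mins_between_not_founds"] = gaps.groupby(nf["item_id"]).mean()

    return feats.reset_index()
```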

Time-based features

We use time-based features like the time of day and the day of the week that the order was picked in store. We typically see better availability of items in mornings. (Pro tip: Always go grocery shopping in the mornings! The shelves are re-stocked then.)

Every found event leads to a higher availability score, whereas a not-found leads to a lower one. Notice how the model gradually “forgets” a not-found event and increases the score with time. The model also assigns low scores in the early morning and late evening when the deli is closed: a pattern picked up from past orders with higher not-found rates during those times
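Because the model is tree-based, time features can stay as raw integers: the trees can split on ranges such as “hour of day < 9” to capture the morning restock window. A minimal sketch, with illustrative names:

```python
import pandas as pd

def time_features(picked_at: pd.Series) -> pd.DataFrame:
    # Raw integer encodings work well for tree-based models like XGBoost;
    # no one-hot or cyclic encoding is needed for splits on time ranges.
    return pd.DataFrame({
        "hour_of_day": picked_at.dt.hour,       # 0-23
        "day_of_week": picked_at.dt.dayofweek,  # 0 = Monday
    })
```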

The problem of high cardinality categorical features

We also have several categorical features (identifiers for the store, product, retailer, department, aisle, brand, region, etc.) that could be used directly in the model. But some of these have extremely high cardinality (number of unique categories): cardinality runs into the millions for the product identifier and into the tens of thousands for the store identifier. Using these as one-hot-encoded features leads to ineffective learning, and they are inefficient from a scaling perspective as well (one-hot-encoded features blow up the data size and the model training/scoring time, and we need to score hundreds of millions of items as frequently as possible!). Training embeddings is also probably not a good idea at a cardinality of millions: for the model to learn proper embeddings, each category needs sufficient samples in the training data, which again blows up the training data size. We fixed this problem with something simple that also addressed another important issue, explained below.

Items in the long tail

While many staples are purchased over and over again, there are always items in the long tail that have sold maybe once in the past six months. Sparse order history leads to weak or nonexistent item-level features, which is concerning because the most important feature set doesn’t work for this population of items.

In addition to the high-cardinality and long-tail item problems, the features described above still don’t capture many factors: the efficiency of supply chains, store-specific restocking patterns, product seasonality, products being discontinued, and so on. Getting granular data on these is next to impossible, but we do have explicit data on item found rates, which are a direct result of these (and possibly other) factors. We therefore use found rates at the granularity of item metadata (such as product, brand, region, and their combinations) for this purpose.

We do something similar to mean encoding for all categorical features and their combinations. In mean encoding, a categorical value is imputed with the mean of the dependent variable for that category. The dependent variable in our case is the found rate, and instead of using the found rate from within the training data, we use historical found rates. For example, for the store identifier feature, the identifier is replaced with that store’s historical found rate, converting it into a continuous feature.
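As a concrete illustration, here is a minimal sketch of this encoding, assuming pandas DataFrames of training rows and past pick events. The column names and the fallback rate are hypothetical:

```python
import pandas as pd

def encode_with_found_rate(train: pd.DataFrame, history: pd.DataFrame,
                           col: str, global_rate: float) -> pd.Series:
    """Replace a categorical column with its historical found rate.

    `history` holds pick events (col, found) from before the training
    window, so the encoding never leaks the training labels themselves.
    Unseen categories fall back to the global found rate.
    """
    rates = history.groupby(col)["found"].mean()
    return train[col].map(rates).fillna(global_rate)

# e.g. store_id -> that store's historical found rate (illustrative names):
# train["store_found_rate"] = encode_with_found_rate(train, history, "store_id", 0.91)
```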

Feature importance of the three feature groups

Since these features depend not on an item’s order history but on its metadata’s order history, they are well-populated for tail items. These features significantly improved the model for tail items and proved to be among the most important features.

Specifically, the most important feature for the model is the found rate of the item’s parent product aggregated over the region in which the item is sold. This feature largely captures a product’s find-ability and how good its supply chain is in that region: a product with an inefficient supply chain will have low found rates across the region. This feature also captures a product being discontinued or going out of season. It picks up the product’s low found rate across different stores and propagates that information to all of the product’s items, giving them low scores even if an individual item has never been bought before!

The AUC improvement due to mean-encoded categorical features is drastic for tail items, because item-level features don’t work well for them.

Scale of scoring

Scoring Architecture

An item’s availability changes in near real-time as items are sold and restocked, so we want to predict availability as often as possible. Sometimes training at scale is the bottleneck, but for this problem, scoring at scale is the larger one. Hence, we spent more effort optimizing the scoring pipeline so that it can score over 200 million items every 60 minutes. In this pipeline, about 130 features are created for each item and tens of terabytes of data are processed every 60 minutes. The new scoring architecture that we built from scratch scores 15x more items using 1/5 of the resources in 1/4 of the time. Here are a few things that helped us achieve this massive scaling:

  1. Performing complicated feature engineering in our Snowflake data warehouse instead of in Python.
  2. Identifying and caching features that don’t change frequently, which decreased our feature engineering time.
  3. Optimizing data transfers from the data warehouse to an AWS instance.
  4. Better parallelization of the Python scoring code (see the sketch after this list).
  5. Faster, more efficient uploads of scores to a Postgres table.
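To illustrate the parallelization step, here is a minimal sketch of chunked, multi-process scoring. The file layout, model file, and worker count are assumptions, not our actual architecture:

```python
# Sketch of parallelized batch scoring, assuming features have already been
# computed in the warehouse and exported to Parquet chunks on local disk.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd
import xgboost as xgb

def score_chunk(path: Path) -> pd.DataFrame:
    # Each worker process loads the model and scores one chunk of items.
    booster = xgb.Booster()
    booster.load_model("availability.model")  # hypothetical model file
    items = pd.read_parquet(path)
    feature_cols = [c for c in items.columns if c != "item_id"]
    scores = booster.predict(xgb.DMatrix(items[feature_cols]))
    return pd.DataFrame({"item_id": items["item_id"], "score": scores})

if __name__ == "__main__":
    chunks = sorted(Path("features/").glob("*.parquet"))
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(score_chunk, chunks))
    # In practice, scores would then be uploaded to a Postgres table.
    pd.concat(results).to_csv("scores.csv", index=False)
```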

Evolving the model

We currently use item availability predictions in many ways across the product. One use case is deciding which items customers can order: we hide items with very low availability scores and low relevance in search. We also use these predictions to route shoppers to stores with better availability of the ordered items.
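As a hypothetical illustration of that first use case, a simple gating rule might look like the following. The thresholds are made up for the sketch:

```python
def visible_in_search(availability_score: float, relevance: float,
                      min_availability: float = 0.10,
                      min_relevance: float = 0.30) -> bool:
    # Hide an item only when BOTH availability and relevance are very low,
    # mirroring the "very low availability scores and low relevance" rule.
    return not (availability_score < min_availability
                and relevance < min_relevance)
```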

We are just beginning to understand the factors affecting item availability. Looking ahead, we are always identifying better data sources that might drastically improve the model. Currently, we’re exploring the idea of assigning each item separate find-ability and availability scores for an improved customer and shopper experience. There’s lots more work to be done!

Interested in working on such large-scale, high-impact projects at Instacart? Check out our careers page at careers.instacart.com.

Feel free to reach out with any feedback or questions through the comments, mail me at abhay.pawar@instacart.com, or message me on LinkedIn or Twitter.

Special thanks to everyone who worked on this over the years to bring it to its current state: Shishir Prasad, Angadh Singh, and Sharath Rao. Also, thanks to Jeremy Stanley, Rachel Holm, Tyler Tate, and many others whose feedback helped make this post significantly better.
