Calibrating CTR Prediction with Transfer Learning in Instacart Ads

Published in tech-at-instacart · Sep 15, 2022

Authors: Zhenbang Chen, Peng Qi

Calibration in deep recommender systems is a tough problem to crack, especially in the ads ranking space. In this post, we explain how Instacart uses transfer learning to improve the calibration of deep predicted click-through-rate (pCTR) models.

A bit of context: pCTR with deep neural networks

When a user searches or browses for items on Instacart, they are likely to see an ad format called Sponsored Product. Our model predicts the probability p̂ of a user clicking on the ad displayed to them. This probability is called the predicted click-through rate (pCTR).

Sponsored Products

Because pCTR prediction is a binary classification problem, many different machine learning models can be applied, such as logistic regression, XGBoost, and deep learning. A better model helps improve user engagement and generally helps allocate positions to ads that drive long-term marketplace value. Our current model is a deep learning model based on the Wide & Deep architecture [2].
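For illustration only, here is a minimal Wide & Deep binary classifier sketched in Keras. The framework choice, layer sizes, and feature names are our assumptions for this post, not the production setup.

```python
import tensorflow as tf

def build_wide_and_deep(num_wide_features: int, vocab_size: int, embed_dim: int = 16):
    # Wide part: dense/cross features fed straight into the final linear layer.
    wide_in = tf.keras.Input(shape=(num_wide_features,), name="wide_features")
    # Deep part: a sparse id feature embedded and passed through feed-forward layers.
    deep_in = tf.keras.Input(shape=(1,), dtype="int32", name="item_id")
    emb = tf.keras.layers.Embedding(vocab_size, embed_dim, name="item_embedding")(deep_in)
    emb = tf.keras.layers.Flatten()(emb)
    deep = tf.keras.layers.Dense(128, activation="relu", name="deep_1")(emb)
    deep = tf.keras.layers.Dense(64, activation="relu", name="deep_2")(deep)
    # Final layer: a weighted sum over the wide and deep parts with a sigmoid,
    # so the model outputs a probability (the pCTR).
    merged = tf.keras.layers.Concatenate()([wide_in, deep])
    out = tf.keras.layers.Dense(1, activation="sigmoid", name="pctr")(merged)
    model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=out)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```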

The problem: calibration

What exactly is calibration?

Calibration measures whether a machine learning model’s predicted probabilities match the actual (observed) probabilities, either across all samples or within a subgroup of samples. For example, if the model predicts that the probability of a click is 0.2 (20% certain that the ad will lead to a click) and the probability of a non-click is 0.8, then impressions with that prediction should lead to a click 20% of the time. An observed click rate higher or lower than 20% means the model’s predicted probability isn’t calibrated.

Models such as logistic regression guarantee calibration at convergence, but more complicated models, like decision tree models and deep neural networks, are often miscalibrated (more background).

It’s worth noting that a highly accurate model doesn’t necessarily have good calibration. For example, if we have a binary classification model that accurately outputs a score between 0 and 1 for all samples, dividing all predicted scores by 2 doesn’t affect the model’s accuracy, but results in much worse calibration. The same holds for XGBoost used as a binary classifier: its calibration is sometimes poorer than logistic regression’s despite better accuracy.

In our pCTR problem, the model predicts whether a sample (an ad impression) with features X generates a click (Y=1) or not (Y=0), and outputs the predicted probability p̂ of a click. For every prediction level p, we want the observed frequency of Y=1 to be as close to p as possible. Formally,

P(Y = 1 | p̂(X) = p) = p

for all p in [0,1] over X. [1]

Why does calibration matter for ads businesses?

Generalized second-price (GSP) auctions have become an industry standard for performance-based ads. In such auctions, ads are typically ranked by their effective cost per thousand impressions (eCPM), defined below. Note that this is a simplified formula just to illustrate the core idea; the actual ranking function usually involves many factors other than bid and pCTR and is sometimes referred to as the utility.

eCPM = bid × p̂ × 1000

where p̂ represents the pCTR for an impression.

If the prediction is poorly calibrated and doesn’t reflect the real probability, ads might be ranked out of order, and the price charged for a clicked ad might not be aligned with its real value. In addition, real-world systems have many operating parameters and downstream applications that rely on the model’s predictions, such as filtering thresholds and ranking weights. Miscalibrated predictions undermine those parameters: they become less accurate over time, and continuously re-tuning them can be very costly.
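To make this concrete, here is a toy example (the bids and CTRs are made up for illustration) showing how an underpredicted pCTR for one ad can both reorder the auction and distort the eCPM used for pricing:

```python
# Toy example with made-up numbers, using the simplified formula above:
# eCPM = bid * pCTR * 1000.
def ecpm(bid, pctr):
    return bid * pctr * 1000

# Well-calibrated predictions: ad A wins the auction (30.0 vs 27.0 eCPM).
print(ecpm(1.00, 0.030))  # ad A: bid $1.00, true CTR 3.0%  -> 30.0
print(ecpm(1.50, 0.018))  # ad B: bid $1.50, true CTR 1.8%  -> 27.0

# If the model underpredicts CTR for ads like A by 2x (a subgroup
# miscalibration), A's score drops to 15.0 and B is ranked first, even though
# A actually delivers more expected value; the price charged to the winner
# is distorted as well.
print(ecpm(1.00, 0.030 / 2))  # ad A, miscalibrated -> 15.0
```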

Ground truth for calibration

How do we know whether the calibration is good or bad? It needs to be evaluated against some data. However, even obtaining ground truth data is non-trivial for ad recommender systems: the data is often subject to selection biases and position biases, which affect the correctness of calibration.

Measuring calibration

Calibration score

The calibration score measures the overall calibration and is defined as the ratio of the average pCTR to the average observed CTR:

calibration score = average(p̂) / average(observed CTR)

It can be used to measure the calibration over the entire dataset or on a specific subset D of the dataset (such as calibration on new users, calibration across product categories, etc.).
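A minimal sketch of this metric, assuming NumPy arrays of predictions and binary click labels:

```python
import numpy as np

def calibration_score(pctr: np.ndarray, clicks: np.ndarray) -> float:
    """Ratio of the average predicted CTR to the average observed CTR.
    1.0 means perfectly calibrated on this (sub)set; >1 means the model
    over-predicts, <1 means it under-predicts."""
    return float(pctr.mean() / clicks.mean())

# Example: calibration on a subset D, e.g. impressions from new users
# (`is_new_user` is a hypothetical boolean mask over the same rows):
# calibration_score(pctr[is_new_user], clicks[is_new_user])
```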

Reliability diagram

Another common way to measure calibration is the reliability diagram. It first splits predictions into k bins with equal intervals ([0.0, 0.1), [0.1, 0.2), etc.). For each bin, we calculate the average pCTR as the x-coordinate and the observed CTR as the y-coordinate of a point.

For example, the charts below show reliability diagrams of a poorly calibrated model and a well-calibrated model. In the ideal case, the average prediction equals the average CTR in each bin; put another way, prediction(b) = observed(b) for each bin b. As a result, the x-coordinate equals the y-coordinate for every point, and all such points fall on the diagonal line.

Poor calibration vs good calibration

Expected Calibration Error (ECE)

A reliability diagram is a useful visualization of the calibration error, but we can also measure it quantitatively. One such measurement is the Expected Calibration Error (ECE), defined as a weighted average of the absolute difference between prediction(b) and observed(b) across bins:

ECE = Σ_b (n_b / N) · |prediction(b) − observed(b)|

where n_b denotes the number of samples in bin b, and N denotes the total number of samples in the dataset. The lower the ECE, the better the model is calibrated. In the ideal case, prediction(b) = observed(b) for all bins, so ECE would be 0; this is equivalent to all points falling on the diagonal line in the reliability diagram.
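The reliability-diagram points and the ECE can be computed from the same equal-width bins; here is a minimal NumPy sketch (the bin count and array inputs are illustrative):

```python
import numpy as np

def reliability_and_ece(pctr: np.ndarray, clicks: np.ndarray, k: int = 10):
    """Return the reliability-diagram points (avg prediction, observed CTR,
    sample count) for k equal-width bins, plus the ECE."""
    edges = np.linspace(0.0, 1.0, k + 1)
    bin_ids = np.clip(np.digitize(pctr, edges) - 1, 0, k - 1)
    points, ece, n_total = [], 0.0, len(pctr)
    for b in range(k):
        mask = bin_ids == b
        n_b = int(mask.sum())
        if n_b == 0:
            continue
        prediction_b = pctr[mask].mean()  # x-coordinate in the diagram
        observed_b = clicks[mask].mean()  # y-coordinate in the diagram
        points.append((prediction_b, observed_b, n_b))
        ece += (n_b / n_total) * abs(prediction_b - observed_b)
    return points, ece
```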

Existing methods for calibrating models

Let’s quickly go through several existing methods that are commonly used:

Platt scaling

Platt scaling was first introduced for Support Vector Machines (SVMs) but also applies to other classification models. It takes two parameters α and β and uses the original output of the model as a feature. That is,

p̂ᵢ = σ(α·zᵢ + β)

where zᵢ is the original output of the model and σ stands for the sigmoid function. The two parameters are fit on a calibration dataset. [7]

Platt scaling is effective for SVMs and boosted trees. It’s less effective on models that are already well-calibrated based on probabilistic modeling such as logistic regression and DL classification models with the sigmoid function as the last layer.
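As a sketch of the idea, Platt scaling can be fit as a one-feature logistic regression. Using scikit-learn, and taking the logit of the base model’s score as zᵢ, are our assumptions here, not a prescribed implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def _logit(p: np.ndarray) -> np.ndarray:
    p = np.clip(p, 1e-6, 1.0 - 1e-6)
    return np.log(p / (1.0 - p))

def fit_platt(raw_scores: np.ndarray, clicks: np.ndarray) -> LogisticRegression:
    # Fits p_hat = sigmoid(alpha * z + beta); z is the logit of the base
    # model's score (a common choice when the model already outputs a
    # probability). A large C means essentially no regularization.
    return LogisticRegression(C=1e6).fit(_logit(raw_scores).reshape(-1, 1), clicks)

def apply_platt(calibrator: LogisticRegression, raw_scores: np.ndarray) -> np.ndarray:
    return calibrator.predict_proba(_logit(raw_scores).reshape(-1, 1))[:, 1]
```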

Platt scaling with features

A common practice is to extend the Platt scaling method by adding one-hot encoding for additional categorical features such as browse, country, day of the week, etc., inside the sigmoid function. In this case, this method is essentially a logistic regression model on top of the existing model. It can be shown that the predictions for subgroups specified by the categorical features will be calibrated upon the convergence of the corresponding model parameters.

This logistic-regression-like method will be referred to as “Platt scaling with features” in the sections below.
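In the same spirit, a minimal sketch of Platt scaling with features is a logistic regression over the logit plus one-hot encoded categorical columns (the column names below are hypothetical):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# `df` is assumed to be a pandas DataFrame with a "logit" column (the base
# model's output as a logit) plus categorical columns such as "surface" and
# "day_of_week" (hypothetical names for illustration).
calibrator = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"),
          ["surface", "day_of_week"])],
        remainder="passthrough")),  # the logit column passes through unchanged
    ("lr", LogisticRegression(max_iter=1000)),
])
# calibrator.fit(df[["logit", "surface", "day_of_week"]], clicks)
# calibrated = calibrator.predict_proba(df[["logit", "surface", "day_of_week"]])[:, 1]
```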

Non-parametric methods

Some non-parametric methods, such as Isotonic Regression and Bayesian Binning into Quantiles, are based on the intuition of finding optimal bin splits and adjusting the probabilities within each bin. However, they are not widely used in recommender systems: although their calibration can be better than Platt scaling’s, they sometimes come at the cost of model performance. The relevant references with more details are listed below.
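For reference, a sketch of an isotonic-regression calibrator with scikit-learn (variable names are illustrative):

```python
from sklearn.isotonic import IsotonicRegression

# Learns a monotone, piecewise-constant mapping from raw scores to calibrated
# probabilities; "clip" handles scores outside the range seen during fitting.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
# iso.fit(raw_scores, clicks)
# calibrated = iso.predict(raw_scores)
```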

Our approach

Starting with hold-back traffic (a standard industry practice)…

Similar to common practice on search advertising platforms, we hold back a tiny fraction of user request traffic from our production ranker. As we’ll see later, this fraction of “hold-back” traffic provides better ground truth for calibration.

Transfer learning!

Transfer learning is an ML technique: we start with a model trained on domain A, usually with a large number of training samples, and fine-tune the same model on domain B to make predictions for a problem in that new domain. Transfer learning preserves a lot of the lower-level information in the original model, which can benefit predictions on domain B.

When the two problems (A and B) don’t share the same objective, a common approach is to drop the last couple of layers of the original model (for problem A) and replace them with a new set of final layers (for problem B).

How does transfer learning apply to our case?

Two stages proposed

Although the objective is the same (pCTR accuracy), the distribution of data is different between domain A (ranked ads) and domain B (hold-back) in the following aspects:

  1. Selection and popularity biases. Ranked ad impressions result from the output of the production model. As a consequence, domain A is subject to more biases than domain B. The model’s previous (possibly incorrect) predictions determine the display positions and labels of the subsequent training dataset in domain A. Therefore, domain B is a better proxy for the ground truth.
  2. Dataset size. Only a small percentage of the traffic is held back, meaning that domain A is much larger than domain B. There are far fewer data points to train the model and embeddings on in domain B, resulting in worse model performance.

Our objective is to make a fair and accurate prediction of CTR for each ad (domain B). Because training the model on ranked ads (domain A) suffers from biased feature/label distributions, while the hold-back dataset (domain B) is too small to train on alone, our solution is to use transfer learning.

Our approach consists of two stages (as illustrated in the figure above):

  1. Normal training, i.e. train on ranked ad impressions as we did before.
  2. Transfer learning.
    1. Freeze the parameters of the lower layers. For example, all the embeddings won’t be updated during backpropagation.
    2. Train only the last few layers, especially the last layer, in our case a weighted sum of neurons with the sigmoid activation. Due to the sigmoid and binary cross entropy loss function, this fine-tuning step will recalibrate our predictions.

Compared with the Platt scaling methods, our approach doesn’t require an additional model for calibration. Instead, the same model is reused, which reduces operational complexity significantly. The transfer learning approach can also fine-tune the parameters in the layers other than the last layer.
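Here is a minimal sketch of the two stages, reusing the toy Keras Wide & Deep model from earlier in this post. The framework, which layers are frozen, and the dataset names are illustrative assumptions, not our production code.

```python
import tensorflow as tf

# Stage 1: normal training on ranked ad impressions (domain A).
# build_wide_and_deep is the toy model from the earlier sketch in this post.
model = build_wide_and_deep(num_wide_features=20, vocab_size=100_000)
# model.fit(ranked_ads_dataset, epochs=...)   # ranked_ads_dataset is illustrative

# Stage 2: transfer learning on hold-back traffic (domain B).
# Freeze the lower layers (e.g. the embeddings) so they are not updated
# during backpropagation.
for layer in model.layers:
    if layer.name in ("item_embedding", "deep_1"):
        layer.trainable = False

# Recompile (required for `trainable` changes to take effect), typically with a
# smaller learning rate, then fine-tune only the remaining layers. Because the
# last layer is a sigmoid trained with binary cross-entropy, this step
# recalibrates the predicted probabilities.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
# model.fit(hold_back_dataset, epochs=...)    # hold_back_dataset is illustrative
```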

Results

Comparison with other calibration methods

We first compare the model’s performance and calibration with and without transfer learning.

The baseline model is our Wide & Deep model trained on ranked ads (excluding hold-back traffic) with no calibration. Four calibration methods are applied on top of it: Platt scaling, Platt scaling with features, Isotonic Regression, and our transfer learning approach.

Cross-entropy and AUC are relative to the no-calibration baseline

Transfer learning outperforms all three other calibration methods. It shows the best calibration score (closest to 1) and the best model metrics.

In contrast, Isotonic Regression achieves the second-best calibration, but at the cost of model AUC (even worse than no calibration). Transfer learning also outperforms Platt Scaling with or without extra features. Our intuition is that while Platt Scaling essentially fine-tunes a single-layer model, the proposed approach fine-tunes all the feed-forward layers (except the embeddings) and can learn more unbiased information beyond recalibrating the final score.

Comparison with different date ranges

We also compared the effect of different date ranges on calibration. The baseline model is trained with a fixed X days of ranked data. We then use transfer learning to continue training the model with 1, 2, 5, 10, 20, 30, and 60 days of hold-back data. The results are as follows:

ECE and calibration wrt. different days of data used

In the left plot, the Expected Calibration Error (ECE) goes down when the transfer learning date range increases. Notably, even transfer learning on a short date range gives a lower ECE than baseline. The same is true for the calibration score (right plot). When the date range increases, the calibration is closer to 1.0.

In conclusion, transfer learning greatly improves the calibration of pCTR, leading to more stable production models. What’s even better is that no additional calibration model is needed.

Acknowledgments

This project is a multi-quarter collaboration of several teams at Instacart. It wouldn’t have been possible without the support and help from Vik Gupta and Sharath Rao from Instacart Ads Engineering. We also want to thank Chuanwei Ruan, David Pal, Jagannath Putrevu, Shishir Kumar Prasad, Li Tan, and Haixun Wang for their review and suggestions on Machine Learning and design details, and Deniz Gültekin and Erin Fan for their great efforts in polishing and making this post more enjoyable and clear to readers.

References

  1. Guo, Chuan, et al. “On calibration of modern neural networks.” International Conference on Machine Learning. PMLR, 2017.
  2. Cheng, Heng-Tze, et al. “Wide & deep learning for recommender systems.” Proceedings of the 1st workshop on deep learning for recommender systems. 2016.
  3. Nixon, Jeremy, et al. “Measuring Calibration in Deep Learning.” CVPR Workshops. Vol. 2. No. 7. 2019.
  4. Craswell, Nick, et al. “An experimental comparison of click position-bias models.” Proceedings of the 2008 international conference on web search and data mining. 2008.
  5. Kirkpatrick, James, et al. “Overcoming catastrophic forgetting in neural networks.” Proceedings of the national academy of sciences 114.13 (2017): 3521–3526.
  6. Naeini, Mahdi Pakdaman, Gregory Cooper, and Milos Hauskrecht. “Obtaining well-calibrated probabilities using bayesian binning.” Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
  7. Platt, John. “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.” Advances in large margin classifiers 10.3 (1999): 61–74.
  8. Ling, Xiaoliang, et al. “Model ensemble for click prediction in bing search ads.” Proceedings of the 26th international conference on world wide web companion. 2017.
  9. Huang, Jianqiang, et al. “Deep Position-wise Interaction Network for CTR Prediction.” Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021.
