Ad Click Prediction: a View from the Trenches


KDD '13, August 11-14, 2013

H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, Jeremy Kubica
Google, Inc.
mcmahan@google.com, gholt@google.com, dsculley@google.com

Abstract: Predicting ad click-through rates (CTR) is a massive-scale learning problem central to the multi-billion dollar online advertising industry. We present case studies and topics from recent experiments in a deployed CTR prediction system. These include improvements in traditional supervised learning based on an FTRL-Proximal online learning algorithm, which has excellent sparsity and convergence properties, and the use of per-coordinate learning rates. We also explore challenges that arise in real-world systems: memory savings, performance assessment, confidence estimates for predicted probabilities, calibration, and automated feature management. Finally, we detail directions that did not yield significant benefit. The goal is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting.

Keywords: online advertising, data mining, large-scale learning

Introduction: Online advertising is a multi-billion dollar industry that has relied heavily on learned models to predict ad click-through rates accurately, quickly, and reliably. This problem has pushed the field to address issues of scale: a typical industrial model may provide predictions on billions of events per day, using a correspondingly large feature space, and then learn from the resulting mass of data.

System Overview: When a user does a search, an initial set of candidate ads is matched to the query based on advertiser-chosen keywords. An auction mechanism then determines whether these ads are shown to the user, in what order they are shown, and what prices the advertisers pay if their ads are clicked. The features used in our system are drawn from a variety of sources, including the query, the text of the ad creative, and various ad-related metadata. Data tends to be extremely sparse, with typically only a tiny fraction of nonzero feature values per example; methods such as regularized logistic regression are a natural fit for this setting. It is necessary to make predictions many billions of times per day and to update the model quickly as new clicks and non-clicks are observed, so training data sets are enormous. Data is provided by a streaming service based on the Photon system. Because large-scale learning has been so well studied in recent years, we do not devote significant space to describing our system architecture in detail. We note, however, that our training methods bear resemblance to the Downpour SGD method described by the Google Brain team, with the difference that we train a single-layer model rather than a deep network of many layers. This allows us to handle much larger data sets and models.
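To make the learning algorithm concrete: the FTRL-Proximal update highlighted in the abstract chooses each new weight vector by solving the following problem (written in the paper's notation, where g_{1:t} = Σ_{s=1}^t g_s is the sum of gradients so far and σ_s is chosen so that σ_{1:t} = 1/η_t):

```latex
% FTRL-Proximal: trade off the linearized loss so far, proximity to
% past iterates, and an explicit L1 term that induces sparsity.
w_{t+1} = \operatorname*{arg\,min}_{w}\Big( g_{1:t}\cdot w
          + \tfrac{1}{2}\sum_{s=1}^{t}\sigma_s \lVert w - w_s\rVert_2^2
          + \lambda_1 \lVert w\rVert_1 \Big)

% Per-coordinate learning rate for coordinate i at round t:
\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^2}}
```

The L1 term produces exact zeros in the stored model, and the per-coordinate rate lets coordinates that appear in many examples anneal their step size quickly while rare coordinates retain larger steps.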
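A per-coordinate implementation needs to store only two values, z_i and n_i, per coordinate. The Python sketch below follows the form of the paper's per-coordinate FTRL-Proximal algorithm for logistic regression; the default hyperparameter values are illustrative placeholders, not tuned production settings:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse data.

    Stores two numbers per coordinate (z_i, n_i); coordinates whose |z_i|
    stays below lambda1 have weight exactly zero, keeping the model sparse.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        # alpha, beta, l1, l2 are illustrative defaults, not production values.
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # z_i: gradient sums adjusted for past weight movement
        self.n = {}  # n_i: sum of squared gradients (drives per-coordinate rates)

    def _weight(self, i):
        """Closed-form lazy weight for coordinate i given current z_i, n_i."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold keeps this weight at exactly zero
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x: sparse dict {feature index: value}. Returns predicted P(click)."""
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, y):
        """One online step on example (x, y), with label y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of log-loss for coordinate i
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            # Use the pre-update weight when adjusting z_i.
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n_new
```

With sparse examples represented as dicts of nonzero features, e.g. x = {17: 1.0, 4093: 1.0}, model.update(x, y) performs one online step and model.predict(x) returns the estimated click probability; only coordinates that actually appear in examples are ever touched, which is what makes the bookkeeping cheap on extremely sparse data.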