Add new papers
shagunsodhani committed Aug 2, 2021
1 parent 3333218 commit 0a83d52
Showing 5 changed files with 399 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -4,6 +4,10 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [Deep Neural Networks for YouTube Recommendations](https://shagunsodhani.com/papers-I-read/Deep-Neural-Networks-for-YouTube-Recommendations)
* [The Tail at Scale](https://shagunsodhani.com/papers-I-read/The-Tail-at-Scale)
* [Practical Lessons from Predicting Clicks on Ads at Facebook](https://shagunsodhani.com/papers-I-read/Practical-Lessons-from-Predicting-Clicks-on-Ads-at-Facebook)
* [Ad Click Prediction - a View from the Trenches](https://shagunsodhani.com/papers-I-read/Ad-Click-Prediction-a-View-from-the-Trenches)
* [Anatomy of Catastrophic Forgetting - Hidden Representations and Task Semantics](https://shagunsodhani.com/papers-I-read/Anatomy-of-Catastrophic-Forgetting-Hidden-Representations-and-Task-Semantics)
* [When Do Curricula Work?](https://shagunsodhani.com/papers-I-read/When-Do-Curricula-Work)
* [Continual learning with hypernetworks](https://shagunsodhani.com/papers-I-read/Continual-learning-with-hypernetworks)
@@ -0,0 +1,86 @@
---
layout: post
title: Ad Click Prediction - a View from the Trenches
comments: True
excerpt:
tags: ['2013', 'Click-Through Rate', 'Data Mining', 'Empirical Advice', 'KDD 2013', 'Machine Learning', Ads, CTR, Engineering, KDD, ML, Scale, Systems]

---

## Introduction

* The paper presents case studies from the experience of deploying an ad click-through rate (CTR) prediction model at Google.

* The paper focuses on themes related to memory footprint, performance analysis, calibration, confidence in the predictions, and feature engineering.

* [Link to the paper](https://research.google/pubs/pub41159/)

## System Overview

* Features (corresponding to a given ad) include the search query and the ad's metadata. The features are very sparse.

* A single-layer, regularized Logistic Regression model is trained with Online Gradient Descent (the same as Stochastic Gradient Descent, but in the online setting).

* From a memory perspective, it is important to minimize the size of the final model.

* With Online Gradient Descent, adding just the subgradient of the L1 penalty is not sufficient to produce weights that are exactly equal to 0.

* The ["Follow The (Proximally) Regularized Leader" algorithm or FTRL-Proximal algorithm](http://proceedings.mlr.press/v15/mcmahan11b.html) is used to learn sparse models without losing accuracy.
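The per-coordinate FTRL-Proximal update can be sketched as follows. This is a minimal illustration for binary (0/1) features, not the production implementation; the class name `FTRLProximal` and the default hyperparameter values are assumptions made for this example.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Sketch of per-coordinate FTRL-Proximal for logistic regression.

    Each example is a collection of distinct active feature ids (value 1).
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-feature adjusted gradient sums
        self.n = defaultdict(float)  # per-feature sums of squared gradients

    def weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold: the weight is exactly zero
        n = self.n.get(i, 0.0)
        step = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / step

    def predict(self, features):
        score = sum(self.weight(i) for i in features)
        score = max(min(score, 35.0), -35.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        p = self.predict(features)
        g = p - label  # log-loss gradient for a feature with value 1
        for i in features:
            w = self.weight(i)  # weight before this coordinate is updated
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p
```

Weights are derived lazily from `z` and `n`, so a feature whose `|z|` stays below the L1 threshold never needs a stored weight.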

* Using per-coordinate learning rates improves the performance at the cost of memory as both the sum of gradients and the sum of the square of gradients are tracked for each feature.

* In practice, some of the cost can be alleviated by assuming that all the events containing a given feature have the same click probability.

* In such a case, the sum of the square of gradients can be approximated using the counts of positive and negative events alone.
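To see why counts are enough, assume every event containing the feature has the same click probability $p = P / (N + P)$, where $P$ and $N$ are the numbers of positive and negative events. The log-loss gradient is $p - 1$ for a positive event and $p$ for a negative one, so the sum of squared gradients is $P(1-p)^2 + Np^2 = PN / (N + P)$. A small sketch (the function name is hypothetical):

```python
def approx_sum_squared_gradients(positives, negatives):
    """Approximate the sum of squared gradients for one feature from counts."""
    total = positives + negatives
    if total == 0:
        return 0.0
    # P * (1 - p)^2 + N * p^2 with p = P / (N + P) simplifies to P * N / (N + P).
    return positives * negatives / total
```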

* Some memory overhead can be reduced based on the following observation: the vast majority of features are extremely rare. Hence, it is not necessary to track the statistics for such rare features.

* However, in an online setting, it is not known upfront which features will be rare.

* The paper proposes to use probabilistic feature inclusion - a feature is added to the model with probability $p$. Once it is added, the feature is not removed.

* An alternative approach is to use a rolling set of counting Bloom filters to check if a feature has appeared at least $n$ times in training. Bloom filters are probabilistic data structures and can return false positives.
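Probabilistic feature inclusion can be sketched in a few lines. The class and method names below are hypothetical; the only behavior taken from the notes is that an unseen feature is admitted with probability $p$ and is never removed afterwards. Under this scheme, a feature needs $1/p$ sightings on average before it enters the model.

```python
import random


class ProbabilisticInclusion:
    """Admit each new feature with probability p; admitted features stay."""

    def __init__(self, p, seed=0):
        self.p = p
        self.rng = random.Random(seed)
        self.included = set()

    def admit(self, feature):
        """Return True if the feature is (now) part of the model."""
        if feature in self.included:
            return True
        if self.rng.random() < self.p:
            self.included.add(feature)
            return True
        return False  # statistics for this sighting are simply not tracked
```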

* Memory can also be saved by using fewer bits for encoding weights.

* Most of the weight coefficients lie in the range $(-2, 2)$, so a $16$-bit fixed-point encoding is used in place of a $32$- or $64$-bit floating-point encoding.

* This quantization needs to account for roundoff errors, which would otherwise accumulate over many small updates. The fix, a simple randomized rounding scheme, is easy to implement.
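A minimal sketch of a 16-bit fixed-point encoding with randomized rounding, assuming 13 fractional bits (one sign bit, two integer bits). The function names are hypothetical. Adding a uniform random offset before taking the floor makes the rounding unbiased in expectation, so small updates are not systematically lost.

```python
import math
import random

FRACTION_BITS = 13  # assumed layout: 1 sign bit, 2 integer bits, 13 fractional bits
SCALE = 1 << FRACTION_BITS


def encode(weight, rng):
    """Quantize a float to a signed 16-bit integer with randomized rounding."""
    q = math.floor(weight * SCALE + rng.random())
    return max(-(1 << 15), min((1 << 15) - 1, q))  # clamp to the 16-bit range


def decode(q):
    """Recover the approximate float value."""
    return q / SCALE
```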

* When training many models with similar hyperparameters, per-model learning rate counters can be replaced by statistics shared by all the models, thus reducing memory footprint.

* A Single Value Structure is used to reduce the memory footprint when evaluating a very large set of model variants that differ only in addition/removal of a small subset of features.

* All the models that use a feature share a single value structure corresponding to the feature. This reduces the memory overhead by an order of magnitude.

* During the update, each model computes the weight updates corresponding to all the features that it is using. The updated weight is averaged across all the models and used to update the single value structure.

* Since CTR datasets are generally highly imbalanced, the training data (for the negative class) can be subsampled to reduce the amount of data to train over. The loss component (corresponding to negative class) can be appropriately scaled up.
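The subsampling step can be sketched as below, assuming negatives are kept with a sampling rate `rate` and each kept negative receives an importance weight of `1 / rate` so that the expected loss matches the loss on the full data. The function name and the example tuple layout are hypothetical.

```python
import random


def subsample_negatives(examples, rate, rng):
    """Keep all positives; keep each negative with probability `rate`.

    `examples` is an iterable of (features, label) pairs.
    Returns a list of (features, label, importance_weight) triples.
    """
    kept = []
    for features, label in examples:
        if label == 1:
            kept.append((features, label, 1.0))
        elif rng.random() < rate:
            # Scale up the loss of a kept negative to compensate for the dropped ones.
            kept.append((features, label, 1.0 / rate))
    return kept
```

The importance weight then multiplies the loss (and hence the gradient) of each example during training.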

* Metrics

* Offline metrics like AucLoss (1 - AUC), Log Loss, Squared Error

* Online loss is computed on the new training data (new incoming traffic) before training on it.

* The confidence in the model's prediction is estimated using a heuristic called the *uncertainty score*. It can be measured using the dot product of the feature vector and the vector of per-coordinate learning rates.

* The idea is that the learning rates already maintain a notion of uncertainty.

* Features for which the learning rate is high are the features for which uncertainty is also high.

* Calibrating Predictions

* The calibration can be improved by applying correction functions $\tau_d(p)$, where $p$ is the predicted CTR and $d$ is an element of a partition of the training data.

* $\tau$ can be modeled as $\tau(p) = \gamma p^{\kappa}$, where $\gamma$ and $\kappa$ are learned using Poisson regression.

* Unsuccessful Experiments

* Aggressive feature hashing was tried to reduce the memory overhead. However, it led to a significant loss in performance.

* Using dropout did not help, probably because the features are sparse.

* Using feature bagging hurt the AucLoss.

* Feature vector normalization did not improve performance, probably because of per-coordinate learning rates and regularization.
@@ -0,0 +1,108 @@
---
layout: post
title: Practical Lessons from Predicting Clicks on Ads at Facebook
comments: True
excerpt:
tags: ['2014', 'Click-Through Rate', 'Data Mining', 'Empirical Advice', 'KDD 2014', 'Machine Learning', Ads, CTR, Engineering, KDD, ML, Scale, Systems]

---

## Introduction

* The paper describes several design choices for developing a model for predicting user response (clicks) on ads.

* [Link to the paper](https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/)

## Experimental Setup

* The model is trained/evaluated on offline data.

* Evaluation metrics:

* Normalized Cross-Entropy (or Normalized Entropy, NE)

* Defined as the predictive log-loss per impression, divided by the entropy of the background CTR (click-through rate).

* Background CTR is the average empirical CTR of the training data.

* Lower normalized cross-entropy is better.

* The normalization term is important to make the metric insensitive to the background CTR. Otherwise, the log loss can easily be made low when background CTR is close to 0 or 1.

* NE can also be written as $1 - RIG$, where $RIG$ is the Relative Information Gain.

* Calibration

* Ratio of average estimated CTR and empirical CTR.

* Area-Under-ROC (AUC) is a good metric for measuring ranking quality (among ads). However, it ignores calibration, and well-calibrated predictions are needed to avoid over-delivery or under-delivery of ads, so AUC is **not used** as a metric.
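Normalized Entropy and calibration, as defined above, can be computed as in the following sketch. It assumes labels in $\{0, 1\}$, predictions strictly between 0 and 1, and a background CTR that is neither 0 nor 1; the function names are hypothetical.

```python
import math


def normalized_entropy(labels, predictions):
    """Average log loss divided by the entropy of the background CTR."""
    n = len(labels)
    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, predictions)
    ) / n
    ctr = sum(labels) / n  # background (average empirical) CTR
    background_entropy = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return log_loss / background_entropy


def calibration(labels, predictions):
    """Ratio of the average estimated CTR to the empirical CTR."""
    return sum(predictions) / sum(labels)
```

A model that always predicts the background CTR gets an NE of 1, so any useful model should score below 1.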

## Implementation Details

* Feature Transformation

* A given ad impression, $e$, is transformed into an $n$-dimensional vector, $x$, where the $i^{th}$ index denotes the value of the $i^{th}$ categorical feature.

* Continuous features are binned, and the bin index is used as a categorical feature, thus applying a non-linear transformation to the features.

* Categorical features that are tuple-like (i.e., have a tuple of values) can be converted into new categorical features by taking a cartesian product.

* Boosted decision trees can be used to implement the previous two transformations in one go.

* Each tree is used as a categorical feature that takes the value of the index of the leaf node that an ad maps to.

* The paper used the Gradient Boosting Machine with the $L_2$-TreeBoost algorithm.

* Using the tree feature transformation improves (i.e., reduces) the Normalized Cross-Entropy by $3.4\%$.

* Model

* Either Logistic Regression (LR) or the Bayesian online learning scheme for probit regression (BOPR) is used for training a linear classifier model.

* While both LR and BOPR models provide similar performance, the LR model is half the BOPR model's size and faster for performing training/inference.
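The tree-based feature transformation described above can be illustrated with hand-written toy trees. The tuple encoding of the trees, the thresholds, and the function names are all assumptions for this sketch; in practice, the trees would come from a trained boosting model.

```python
# An internal node is (feature_index, threshold, left_subtree, right_subtree);
# a leaf is the integer index of that leaf within its tree.
EXAMPLE_TREES = [
    (0, 0.5, 0, (1, 2.0, 1, 2)),  # toy tree with 3 leaves
    (1, 3.0, 0, 1),               # toy tree with 2 leaves
]
EXAMPLE_LEAF_COUNTS = [3, 2]


def leaf_index(tree, x):
    """Follow the splits until a leaf is reached and return its index."""
    node = tree
    while not isinstance(node, int):
        feature, threshold, left, right = node
        node = left if x[feature] <= threshold else right
    return node


def tree_transform(trees, leaf_counts, x):
    """Concatenate the one-hot encodings of the leaf reached in each tree."""
    encoded = []
    for tree, count in zip(trees, leaf_counts):
        one_hot = [0] * count
        one_hot[leaf_index(tree, x)] = 1
        encoded.extend(one_hot)
    return encoded
```

The resulting binary vector is what the linear classifier consumes.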

## Role of Data Freshness

* When a model is trained on the data from a particular day and evaluated on data from the subsequent days, the model's performance degrades as the delay between training and test set increases.

* This highlights the importance of the freshness of the training data.

* One straightforward approach can be to train the model every day.

* Alternatively, the linear classifier can be trained using online learning, while the boosted decision tree can still be trained daily.

* Different choices for setting the learning rate (for online training of linear classifier) are compared, and the [per-coordinate learning rate](https://research.google/pubs/pub41159/) is found to perform best in practice.

## Generating Real-Time Training Data

* An "online joiner" system is used to generate real-time training data for the linear classifier.

* The challenge is that, while there are data points with a "positive" label (i.e., the user clicked on the ad), there are no data points with an explicit "negative" label (since there is no "no-click" button that the user can click).

* An impression is assigned the "no-click" label if the user does not click on the ad within a sufficiently long time window after seeing the ad.

* Too short a time window could mislabel some impressions, while too long a time window will delay the real-time training data.

* The online joiner performs a distributed stream-to-stream join on the stream of ad impressions and stream of ad clicks using a HashQueue.

* A HashQueue:

* comprises a First-In-First-Out queue as a buffer window and a hash map for fast random access to label impressions.

* supports three operations on key-value pairs: enqueue, dequeue, and lookup.
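A single-process sketch of the HashQueue idea follows. It uses Python's `OrderedDict` as a stand-in for the queue-plus-hash-map pair, which is a simplification chosen for this example; the method signatures, the `record_click` helper, and the timestamp handling are likewise assumptions, and the distributed aspects of the real joiner are omitted.

```python
from collections import OrderedDict


class HashQueue:
    """FIFO buffer of impressions with hash-map lookup by request id."""

    def __init__(self, window):
        self.window = window        # how long to wait for a click
        self.items = OrderedDict()  # request_id -> [timestamp, features, clicked]

    def enqueue(self, request_id, timestamp, features):
        self.items[request_id] = [timestamp, features, False]

    def lookup(self, request_id):
        return self.items.get(request_id)

    def dequeue(self, now):
        """Emit (features, label) for impressions whose window has expired."""
        labelled = []
        while self.items:
            request_id, (timestamp, features, clicked) = next(iter(self.items.items()))
            if now - timestamp < self.window:
                break  # the oldest impression is still waiting for a click
            del self.items[request_id]
            labelled.append((features, int(clicked)))
        return labelled


def record_click(queue, request_id):
    """Mark a buffered impression as clicked; late or unknown clicks are ignored."""
    entry = queue.lookup(request_id)
    if entry is not None:
        entry[2] = True
```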

## Memory and Latency

* Increasing the number of boosting trees shows diminishing returns, and most of the improvements come from the first 500 trees.

* The top 10 features account for about half of the total feature importance, while the last 300 features contribute less than 1% of the feature importance.

* Features in the boosting model can be broadly classified as contextual or historical.

* Historical features provide much more explanatory power than contextual features, though contextual features are helpful for handling the cold start problem.

* Models trained with just the contextual features rely more heavily on data freshness than models trained with just the historical features.

* Uniform subsampling and negative downsampling techniques are used to limit the amount of training data.

* In the case of negative downsampling, the model needs to be re-calibrated as well.
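The re-calibration after negative downsampling can be written as $q = p / (p + (1 - p) / w)$, where $p$ is the prediction in the downsampled space and $w$ is the negative downsampling rate. A small sketch (the function name is hypothetical):

```python
def recalibrate(p, w):
    """Map a prediction made on negatively downsampled data back to the
    original distribution; w is the fraction of negatives that were kept."""
    return p / (p + (1.0 - p) / w)
```

For example, a prediction of 0.1 made with a downsampling rate of 0.01 maps back to roughly 0.1%.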
109 changes: 109 additions & 0 deletions site/_posts/2021-03-15-The Tail at Scale.md
@@ -0,0 +1,109 @@
---
layout: post
title: The Tail at Scale
comments: True
excerpt:
tags: ['2013', 'Distributed Systems', ACM, Engineering, Latency, Scale, Systems]

---

## Introduction

* The paper presents some causes for (temporary) high-latency episodes in large-scale online systems and techniques to mitigate their impact so that the tail of latency distribution remains short.

* [Link to the paper](https://research.google/pubs/pub40801/)

## Why does variability in response time exist

* Shared resources between processes on the same node

* Background processes (daemons) could cause a momentary spike in resource usage.

* Processes running on different nodes may contend for global resources like shared file systems.

* Maintenance activities like disk compaction or garbage collection.

* Others like queueing, power limits, or energy management.

* In the case of large-scale systems, the component-level variability is further amplified.
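The amplification can be quantified with a short calculation. Assuming servers are slow independently of each other, a request that fans out to $n$ servers and waits for all of them is slow whenever at least one server is slow. The function name below is hypothetical.

```python
def slow_request_probability(per_server_slow_probability, fan_out):
    """Probability that at least one of `fan_out` independent servers is slow."""
    return 1.0 - (1.0 - per_server_slow_probability) ** fan_out
```

With a 1% chance of a slow response per server and a fan-out of 100, about 63% of requests are slow. Even with a one-in-10,000 chance per server, a fan-out of 2,000 makes roughly 18% of requests slow.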

## Reducing Component Variability

* Use differentiated service classes to prioritize user requests over non-interactive requests.

* Reduce head-of-line blocking by breaking long-running requests into smaller requests.

* Synchronize maintenance jobs across nodes to minimize the window for high latency.

* Caching generally does not help to address tail latency.

## Adapting to Latency Variability

* Two categories of adaptation approaches

* Within Request Short-Term Adaptations

* These approaches are more relevant for services that perform many read queries on loosely consistent datasets.

* Hedged Request

* Send the request to multiple replicas, and once one of the replicas returns the result, cancel the other requests.

* In practice, start by sending the request to only one replica. Send the secondary request only if the first request has been outstanding for longer than the $95^{th}$ percentile of the expected latency.

* This introduces roughly $5\%$ additional load while substantially shortening the latency tail.

* This approach works because the cause of latency is often not the query itself but other factors like overloaded nodes.
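A hedged request can be sketched with `asyncio`. This is a toy illustration with two replicas modeled as coroutine functions; `hedged_request`, `slow_replica`, `fast_replica`, and the delay values are hypothetical, and `hedge_delay` plays the role of the $95^{th}$-percentile latency.

```python
import asyncio


async def hedged_request(replicas, request, hedge_delay):
    """Query replicas[0]; if it is still outstanding after hedge_delay
    seconds, also query replicas[1] and return whichever answers first."""
    primary = asyncio.ensure_future(replicas[0](request))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()  # the common case: no extra load
    backup = asyncio.ensure_future(replicas[1](request))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the request that lost the race
    return done.pop().result()


async def slow_replica(request):
    await asyncio.sleep(0.5)  # simulates an overloaded node
    return ("slow", request)


async def fast_replica(request):
    await asyncio.sleep(0.01)
    return ("fast", request)
```

Calling `asyncio.run(hedged_request([slow_replica, fast_replica], "q", 0.05))` returns the fast replica's answer without waiting for the slow one.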

* Tied Request

* The hedged request approach involves a tradeoff over how long to wait before sending requests to other replicas. The sooner the secondary request is sent, the lower the latency of serving the request, but the higher the overall load on the system.

* The load on the system can be reduced by "tying" requests (sent to different replicas) so that as soon as one replica starts processing the request, it can notify the other replicas, which could drop the request or deprioritize it.

* In practice, "tying" requests means that each replica knows the identity of the other replicas that may execute the request.

* Note that there is a short window (about the average network message delay) during which multiple replicas could start executing the request. This can be mitigated if the client (issuing the requests) delays the second request by twice the average network message delay.

* Submit the request to the least loaded replica

* This is less effective because, for example, the load on a replica can change after the request is made but before it is executed.


* Cross-Request Long-Term Adaptations

* These approaches are more relevant for situations where different services have different throughput.

* Micro-partitions

* Generate more partitions than the number of nodes.

* The partitions can be dynamically assigned to machines to ensure proper load balancing.

* In case of machine failure, many nodes can be used to quickly re-create the micro-partitions instead of waiting on one machine to read one single large partition.
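The micro-partitioning idea can be illustrated with a toy assignment scheme. The round-robin placement and the function names are assumptions for this sketch; a real system would balance on measured load rather than on partition counts.

```python
def assign(partitions, nodes):
    """Spread many small partitions over the nodes in round-robin order."""
    assignment = {node: [] for node in nodes}
    for index, partition in enumerate(partitions):
        assignment[nodes[index % len(nodes)]].append(partition)
    return assignment


def recover(assignment, failed_node):
    """Redistribute a failed node's partitions across all surviving nodes."""
    orphaned = assignment.pop(failed_node)
    # Hand partitions to the least-loaded survivors first.
    survivors = sorted(assignment, key=lambda node: len(assignment[node]))
    for index, partition in enumerate(orphaned):
        assignment[survivors[index % len(survivors)]].append(partition)
    return assignment
```

Because the orphaned partitions are small and spread over every survivor, no single machine has to absorb the whole load of the failed one.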

* Selective Replication

* With micro-partitioning, replicas for micro-partitions can be created ahead of time to achieve good load balancing.

* Latency induced probation

* In some cases, removing a slow node can improve the overall latency of the system. The probated node can be re-incorporated when its latency improves.

* Large Information Retrieval Systems

* In such systems, speed can be more critical than the quality of the result.

* The system should return a "good enough" result that is available with low latency instead of waiting for the "best result" that is available with high latency.

* In some cases, a request could trigger an unexpected code path or cause some other exception that could slow down the entire system.

* In such cases, the *canary request* technique can be used where the system sends the request initially to only 1 or 2 nodes. The request is sent over to the other nodes only after receiving a successful response from the initial nodes.
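A canary request can be sketched as a two-phase fan-out. The `send` callable, the node list, and the use of an exception to signal a failed canary are assumptions for this example.

```python
def fan_out_with_canary(nodes, request, send, canary_count=1):
    """Send the request to a few canary nodes first; contact the remaining
    nodes only if every canary responds successfully.

    `send(node, request)` is assumed to raise an exception on failure.
    """
    canaries, rest = nodes[:canary_count], nodes[canary_count:]
    results = []
    for node in canaries:
        # An exception here aborts the fan-out before the other nodes are hit.
        results.append(send(node, request))
    results.extend(send(node, request) for node in rest)
    return results
```

The extra latency is the round trip to the canary nodes, paid in exchange for protecting the other nodes from a request that fails or misbehaves on the canaries.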

* Requests that update state are easier to handle for several reasons:

* The scale of latency-critical modifications is generally small.

* The update can be performed asynchronously after responding to the user.

* Quorum-based approaches (often used for ensuring consistent updates) are inherently tail-tolerant.