diff --git a/README.md b/README.md index ec9dbba6..dea746ba 100755 --- a/README.md +++ b/README.md @@ -4,6 +4,10 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho ## List of papers +* [Deep Neural Networks for YouTube Recommendations](https://shagunsodhani.com/papers-I-read/Deep-Neural-Networks-for-YouTube-Recommendations) +* [The Tail at Scale](https://shagunsodhani.com/papers-I-read/The-Tail-at-Scale) +* [Practical Lessons from Predicting Clicks on Ads at Facebook](https://shagunsodhani.com/papers-I-read/Practical-Lessons-from-Predicting-Clicks-on-Ads-at-Facebook) +* [Ad Click Prediction - a View from the Trenches](https://shagunsodhani.com/papers-I-read/Ad-Click-Prediction-a-View-from-the-Trenches) * [Anatomy of Catastrophic Forgetting - Hidden Representations and Task Semantics](https://shagunsodhani.com/papers-I-read/Anatomy-of-Catastrophic-Forgetting-Hidden-Representations-and-Task-Semantics) * [When Do Curricula Work?](https://shagunsodhani.com/papers-I-read/When-Do-Curricula-Work) * [Continual learning with hypernetworks](https://shagunsodhani.com/papers-I-read/Continual-learning-with-hypernetworks) diff --git a/site/_posts/2021-03-01-Ad Click Prediction - a View from the Trenches.md b/site/_posts/2021-03-01-Ad Click Prediction - a View from the Trenches.md new file mode 100755 index 00000000..278279bd --- /dev/null +++ b/site/_posts/2021-03-01-Ad Click Prediction - a View from the Trenches.md @@ -0,0 +1,86 @@ +--- +layout: post +title: Ad Click Prediction - a View from the Trenches +comments: True +excerpt: +tags: ['2013', 'Click-Through Rate', 'Data Mining', 'Empirical Advice', 'KDD 2013', 'Machine Learning', Ads, CTR, Engineering, KDD, ML, Scale, Systems] + +--- + +## Introduction + +* The paper presents case studies from the experience of deploying an ad click-through rate (CTR) prediction model at Google. + +* The paper focuses on themes related to memory footprint, performance analysis, calibration, confidence in the predictions, and feature engineering. + +* [Link to the paper](https://research.google/pubs/pub41159/) + +## System Overview + +* Features (corresponding to a given ad) include search query and the metadata in the ad. The features are very sparse. + +* Single layer, regularized Logistic Regression model is trained with Online Gradient Descent (same as Stochastic Gradient Descent, but in the online setting). + +* From a memory perspective, it is important to minimize the size of the final model. + +* Adding just the L1 penalty is not sufficient to produce weights that are precisely equal to 0. + +* ["Follow The (Proximally) Regularized Leader" algorithm or FTRL-Proximal algorithm](http://proceedings.mlr.press/v15/mcmahan11b.html) is used to learn sparse models without losing on the accuracy. + +* Using per-coordinate learning rates improves the performance at the cost of memory as both the sum of gradients and the sum of the square of gradients are tracked for each feature. + + * In practice, some of the cost can be alleviated by approximating that all the events containing a given feature have the same probability. + + * In such a case, the sum of the square of gradients can be approximated using the counts of positive and negative events alone. + +* Some memory overhead can be reduced based on the following observation: the vast majority of features are extremely rare. Hence, it is not necessary to track the statistics for such rare features. 
+ + * However, in an online setting, it is not known upfront which features will be rare. + + * The paper proposes to use probabilistic feature inclusion - a feature is added to the model with probability $p$. Once it is added, the feature is not removed. + + * An alternative approach is to use a rolling set of counting Bloom filters to check if a feature has appeared at least $n$ times in training. Bloom filters are probabilistic data structures and can return false positives. + +* Memory can also be saved by using fewer bits for encoding weights. + + * Most of the weight coefficients lie in the range $(-2, 2)$, and a $16$-bit encoding is used in place of a $32$- or $64$-bit encoding. + + * This quantization approach needs to account for roundoff problems. The fix is easy to implement. + +* When training many models with similar hyperparameters, per-model learning rate counters can be replaced by statistics shared by all the models, thus reducing the memory footprint. + +* A Single Value Structure is used to reduce the memory footprint when evaluating a very large set of model variants that differ only in the addition/removal of a small subset of features. + + * All the models that use a feature share a single value structure corresponding to that feature. This reduces the memory overhead by an order of magnitude. + + * During the update, each model computes the weight updates corresponding to all the features that it is using. The updated weight is averaged across all the models and used to update the single value structure. + +* Since CTR datasets are generally highly imbalanced, the training data (for the negative class) can be subsampled to reduce the amount of data to train over. The loss component (corresponding to the negative class) can be appropriately scaled up. + +* Metrics + + * Offline metrics like AucLoss (1 - AUC), Log Loss, Squared Error + + * Online loss is computed on the new training data (new incoming traffic) before training on it. + +* The confidence in the model's prediction is estimated using a heuristic called *uncertainty score*. It can be measured using the dot product of the feature vector and the vector of learning rates. + + * The idea is that the learning rates already maintain a notion of uncertainty. + + * Features for which the learning rate is high are the features for which uncertainty is also high. + +* Calibrating Predictions + + * The calibration can be improved by applying a correction function $\tau_d(p)$, where $p$ is the predicted CTR and $d$ is an element of a partition of the training data. + + * $\tau(p)$ can be modeled as $\gamma p^{\kappa}$, where $\gamma$ and $\kappa$ are learned using Poisson regression. + +* Unsuccessful Experiments + + * Aggressive feature hashing was tried to reduce the memory overhead. However, it led to a significant loss in performance. + + * Using dropout did not help, probably because the features are sparse. + + * Using feature bagging hurt the AucLoss. + + * Feature vector normalization did not improve performance, probably because of the per-coordinate learning rates and regularization.
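* Illustrative sketch (not from the paper's text): a minimal per-coordinate FTRL-Proximal update for the sparse logistic regression setting described in the System Overview, following the paper's pseudocode. The class name, hyperparameter values, and the assumption of binary features are placeholders.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Per-coordinate FTRL-Proximal logistic regression for sparse, binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-coordinate accumulated (shifted) gradients
        self.n = defaultdict(float)  # per-coordinate accumulated squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # the L1 threshold keeps weights of rarely-useful features exactly zero
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        )

    def predict(self, features):
        """features: iterable of active (binary) feature ids."""
        score = sum(self._weight(i) for i in features)
        return 1.0 / (1.0 + math.exp(-max(min(score, 35.0), -35.0)))

    def update(self, features, label):
        p = self.predict(features)
        g = p - label  # gradient of the log loss w.r.t. the score (x_i = 1 for active features)
        for i in features:
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
```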
diff --git a/site/_posts/2021-03-08-Practical Lessons from Predicting Clicks on Ads at Facebook.md b/site/_posts/2021-03-08-Practical Lessons from Predicting Clicks on Ads at Facebook.md new file mode 100755 index 00000000..908c7491 --- /dev/null +++ b/site/_posts/2021-03-08-Practical Lessons from Predicting Clicks on Ads at Facebook.md @@ -0,0 +1,108 @@ +--- +layout: post +title: Practical Lessons from Predicting Clicks on Ads at Facebook +comments: True +excerpt: +tags: ['2014', 'Click-Through Rate', 'Data Mining', 'Empirical Advice', 'KDD 2014', 'Machine Learning', Ads, CTR, Engineering, KDD, ML, Scale, Systems] + +--- + +## Introduction + +* The paper describes several design choices for developing a model for predicting user response (clicks) on ads. + +* [Link to the paper](https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/) + +## Experimental Setup + +* The model is trained/evaluated on offline data. + +* Evaluation metrics: + + * Normalized Cross-Entropy (or Normalized Entropy, NE) + + * Defined as the predictive log-loss per impression, divided by the entropy of the background CTR (click-through rate). + + * Background CTR is the average empirical CTR of the training data. + + * Lower normalized cross-entropy is better. + + * The normalization term is important to make the metric insensitive to the background CTR. Otherwise, the log loss can easily be made low when the background CTR is close to 0 or 1. + + * NE can also be written as $1 - RIG$, where $RIG$ is the Relative Information Gain. + + * Calibration + + * Ratio of the average estimated CTR to the empirical CTR. + + * Area-Under-ROC (AUC) is a good metric for measuring ranking quality (among ads). However, it is **not used** as the main metric since it cannot capture over-delivery or under-delivery of ads. + +## Implementation Details + +* Feature Transformation + + * A given ad impression, $e$, is transformed into an $n$-dimensional vector, $x$, where the $i^{th}$ index denotes the value of the $i^{th}$ categorical feature. + + * Continuous features are binned, and the bin index is used as a categorical feature, thus applying a non-linear transformation to the features. + + * Categorical features that are tuple-like (i.e., have a tuple of values) can be converted into new categorical features by taking a Cartesian product. + + * Boosted decision trees can be used to implement the previous two transformations in one go. + + * Each tree is used as a categorical feature that takes the value of the index of the leaf node that an ad maps to. + + * The paper used the Gradient Boosting Machine with the $L_2$-TreeBoost algorithm. + + * Using the tree feature transformation improves the Normalized Cross-Entropy by $3.4\%$. + +* Model + + * Logistic Regression (LR) or Bayesian online learning scheme for probit regression (BOPR) algorithms are used for training a linear classifier model. + + * While both LR and BOPR models provide similar performance, the LR model is half the BOPR model's size and faster at training/inference. + +## Role of Data Freshness + +* When a model is trained on the data from a particular day and evaluated on data from the subsequent days, the model's performance degrades as the delay between the training and test sets increases. + +* This highlights the importance of the freshness of the training data. + +* One straightforward approach can be to retrain the model every day.
+ +* Alternatively, the linear classifier can be trained using online learning, while the boosted decision tree can still be trained daily. + +* Different choices for setting the learning rate (for online training of the linear classifier) are compared, and the [per-coordinate learning rate](https://research.google/pubs/pub41159/) is found to perform best in practice. + +## Generating Real-Time Training Data + +* An "online joiner" system is used to generate real-time training data for the linear classifier. + +* The challenging part is that while there are data points with a "positive" label (i.e., the user clicked on the ad), there are no data points with a "negative" label (since there is no "no-click" button that the user can click). + +* An impression is considered to have the "no-click" label if the user does not click on the ad within a (long) time window of seeing the ad. + +* Too short a time window could mislabel some impressions, while too long a time window will delay the real-time training data. + +* The online joiner performs a distributed stream-to-stream join on the stream of ad impressions and the stream of ad clicks using a HashQueue. + +* A HashQueue: + + * comprises a First-In-First-Out queue as a buffer window and a hash map for fast random access to label impressions. + + * supports three operations on key-value pairs: enqueue, dequeue, and lookup. + +## Memory and Latency + +* Increasing the number of boosting trees shows diminishing returns, and most of the improvement comes from the first 500 trees. + +* The top 10 features account for half of the total feature importance, while the last 300 features add less than 1% of the feature importance. + +* Features in the boosting model can be broadly classified as contextual or historical. + +* Historical features provide much more explanatory power than contextual features, though contextual features are helpful for handling the cold-start problem. + +* Models trained with just the contextual features rely more heavily on data freshness than models trained with just the historical features. + +* Uniform subsampling and negative downsampling techniques are used to limit the amount of training data. + +* In the case of negative downsampling, the model needs to be re-calibrated as well. \ No newline at end of file diff --git a/site/_posts/2021-03-15-The Tail at Scale.md b/site/_posts/2021-03-15-The Tail at Scale.md new file mode 100755 index 00000000..3ba83f09 --- /dev/null +++ b/site/_posts/2021-03-15-The Tail at Scale.md @@ -0,0 +1,109 @@ +--- +layout: post +title: The Tail at Scale +comments: True +excerpt: +tags: ['2013', 'Distributed Systems', ACM, Engineering, Latency, Scale, Systems] + +--- + +## Introduction + +* The paper presents some causes for (temporary) high-latency episodes in large-scale online systems and techniques to mitigate their impact so that the tail of the latency distribution remains short. + +* [Link to the paper](https://research.google/pubs/pub40801/) + +## Why does variability in response time exist + +* Shared resources between processes on the same node. + +* Background processes (daemons) could cause a momentary spike in resource usage. + +* Processes running on different nodes may contend for global resources like shared file systems. + +* Maintenance activities like disk compaction or garbage collection. + +* Others like queueing, power limits, or energy management. + +* In the case of large-scale systems, the component-level variability is further amplified, as the quick calculation below illustrates.
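* A back-of-the-envelope sketch of that amplification (the $1\%$ per-server slow-response rate and the fan-out of $100$ are illustrative numbers): even if each server is rarely slow, a request that must wait for the slowest of many servers is very likely to be slow.

```python
# Probability that a fan-out request is slow, assuming each of the `n` servers
# it touches independently exceeds the latency threshold with probability `p`.
def prob_request_is_slow(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n


print(prob_request_is_slow(0.01, 1))    # ~0.01 -> 1 in 100 single-server requests is slow
print(prob_request_is_slow(0.01, 100))  # ~0.63 -> most 100-server fan-out requests are slow
```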
+ +## Reducing Component Variability + +* Use differentiated service classes to prioritize user requests over non-interactive requests. + +* Reduce head-of-line blocking by breaking long-running requests into smaller requests. + +* Synchronize maintenance jobs across nodes to minimize the window for high latency. + +* Caching generally does not help to address tail latency. + +## Adapting to Latency Variability + +* Two categories of adaptation approaches: + + * Within-Request Short-Term Adaptations + + * These approaches are more relevant for services that perform many read queries on loosely consistent datasets. + + * Hedged Request + + * Send the request to multiple replicas, and once one of the replicas returns the result, cancel the other requests. + + * In practice, start by sending the request to only one replica. Send the secondary requests if the first request is outstanding for more than the $95^{th}$-percentile expected latency. + + * This introduces an additional $5\%$ load while substantially shortening the latency tail. + + * This approach works because, often, the cause of latency is not the query itself but other factors like overloaded nodes. + + * Tied Request + + * The hedged request approach makes a tradeoff regarding how long to wait before initiating requests to other replicas: the sooner the secondary requests are sent, the lower the latency of serving the request, but the higher the overall load on the system. + + * The load on the system can be reduced by "tying" requests (sent to different replicas) so that as soon as one replica starts processing the request, it can notify the other replicas, which could drop the request or deprioritize it. + + * In practice, "tying" requests means that each replica has the identity of the other replicas which may execute the request. + + * Note that there is a short window (of the average network message delay) when multiple replicas could start executing the request. This can be mitigated if the client (issuing the requests) introduces a delay of twice the average network message delay before sending the second request. + + * Submit the request to the least-loaded replica + + * This is less effective because the load on a replica can change after the request is made but before it is executed. + + + * Cross-Request Long-Term Adaptations + + * These approaches are more relevant for situations where different services have different throughput. + + * Micro-partitions + + * Generate more partitions than the number of nodes. + + * The partitions can be dynamically assigned to machines to ensure proper load balancing. + + * In case of machine failure, many nodes can be used to quickly re-create the micro-partitions instead of waiting on a single machine to read one large partition. + + * Selective Replication + + * With micro-partitioning, replicas for micro-partitions can be created ahead of time to achieve good load balancing. + + * Latency-induced probation + + * In some cases, removing a slow node can improve the overall latency of the system. The probated node can be re-incorporated when its latency improves. + +* Large Information Retrieval Systems + + * In such systems, speed can be more critical than the quality of the result. + + * The system should return a "good enough" result that is available with low latency instead of waiting for the "best result" that is available with high latency. + + * In some cases, a request could trigger an unexpected code path or cause some other exception that could slow down the entire system.
+ + * In such cases, the *canary request* technique can be used where the system sends the request initially to only 1 or 2 nodes. The request is sent over to the other nodes only after receiving a successful response from the initial nodes. + +* Requests that update state are easier to handle for several reasons: + + * The scale of latency-critical modifications is generally small. + + * The update can be performed asynchronously after responding to the user. + + * Quorum-based approaches (often used for ensuring consistent updates) are inherently tail-tolerant. \ No newline at end of file diff --git a/site/_posts/2021-03-22-Deep Neural Networks for YouTube Recommendations.md b/site/_posts/2021-03-22-Deep Neural Networks for YouTube Recommendations.md new file mode 100755 index 00000000..47adb3d4 --- /dev/null +++ b/site/_posts/2021-03-22-Deep Neural Networks for YouTube Recommendations.md @@ -0,0 +1,92 @@ +--- +layout: post +title: Deep Neural Networks for YouTube Recommendations +comments: True +excerpt: +tags: ['2016', 'Machine Learning', 'Recommender Systems', ACM, Engineering, Latency, ML, Ranking, Recommender, Scale, Systems] + +--- + +## Introduction + +* The paper describes YouTube's deep learning-based recommendation system. + +* [Link to the paper](https://research.google/pubs/pub45530/) + +## Challenges + +* Scale - Very large number of users and videos. + +* Freshness - Very large number of videos uploaded every hour. The recommendation system should take these new videos into account as well. + +* Noise - User satisfaction needs to be modeled from noisy implicit feedback signal as the explicit signal is very sparse. + +## System Overview + +* Two neural networks: one for candidate generation and another one for ranking. + +* Metrics + + * Offline metrics like precision, recall, ranking loss + + * A/B testing via live experiments + +### Candidate Generation + +* Input: events from a user's YouTube activity history. + +* Output: small subset (hundreds) of videos. + +* Approach: + + * Recommendation is modeled as extreme multiclass classification. + + * Predict the video (from a corpus) that a user will watch at a given time. + + * The neural network's task is to learn useful user embeddings, given the user's context and history. + + * For each positive class (relevant video), negative classes (non-relevant videos) are sampled from the video corpus. + +* Model Architecture + + * A feedforward network with input as user embeddings and context embeddings (watch history). + + * Watch history is a variable-length sequence of video ids, where each video id is mapped to an embedding. + + * The sequence of video ids is mapped to a sequence of embeddings, and this sequence is averaged to obtain fixed-sized embedding. + + * Additional signals like demographic features and search query embeddings can be added along with the context embeddings. + + * The age of a video is also used as a feature during training to account for the freshness of the content. This feature is set to zero (or slightly negative) during inference. + +* Other Insights + + * Training examples are generated from all YouTube watches, including the watches from the videos embedded on other sites, to surface new content. + + * Generating the same number of training examples per user is important to avoid a small set of active users from dominating the model training. + + * Predicting a user's next watch leads to better results than predicting a randomly held-out watch. 
This can be attributed to the general consumption pattern of videos (e.g., episodes are usually watched in order). + +### Ranking + +* Input: list of candidate videos to rank. + +* Output: score for each video. + +* Approach + + * A feedforward network (similar to the candidate generation model) trained using a logistic regression loss. + +* Feature representation + + * Different types of features: categorical vs. continuous, univalent vs. multivalent, describes video vs. describes user or context. + + * Important signals include the user's interaction with the video (or similar videos) and which source/channel added the video to the candidate set. + + * Embeddings are shared across features. For example, the representation for a video id remains the same, irrespective of whether it is being used for representing the "video to recommend" or the "last seen video." + + * Feature normalization and transformations like exponents (square or square root) for continuous variables improve the performance. + +* To model the expected watch time, the logistic regression loss is weighted by the observed watch time. For example, if a video was watched, its weight is given by the observed watch time, and if the video was not watched, its weight is set to 1. + +* In practice, this means that the logistic regression model learns odds that approximate the expected watch time of the video. \ No newline at end of file
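* As a rough illustration of the last two points (not code from the paper): a minimal sketch of the watch-time-weighted logistic loss and the serving-time estimate, where the scalar `score` stands in for the final-layer output $Wx + b$ and the function names are placeholders.

```python
import math


def weighted_logistic_loss(score: float, watched: bool, watch_time: float) -> float:
    """Per-impression loss: positive (watched) impressions are weighted by the
    observed watch time; negative impressions keep unit weight."""
    p = 1.0 / (1.0 + math.exp(-score))
    if watched:
        return -watch_time * math.log(p)
    return -math.log(1.0 - p)


def expected_watch_time(score: float) -> float:
    # The learned odds e^{score} then serve as an approximation of expected watch time.
    return math.exp(score)
```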