diff --git a/README.md b/README.md
index 66f5ee08..262d72c4 100755
--- a/README.md
+++ b/README.md
@@ -5,6 +5,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 
 ## List of papers
 
+* [ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators](https://shagunsodhani.com/papers-I-read/ELECTRA-Pre-training-Text-Encoders-as-Discriminators-Rather-Than-Generators)
+* [Gradient based sample selection for online continual learning](https://shagunsodhani.com/papers-I-read/Gradient-based-sample-selection-for-online-continual-learning)
 * [Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One](https://shagunsodhani.com/papers-I-read/Your-Classifier-is-Secretly-an-Energy-Based-Model,-and-You-Should-Treat-it-Like-One)
 * [Massively Multilingual Neural Machine Translation in the Wild - Findings and Challenges](https://shagunsodhani.com/papers-I-read/Massively-Multilingual-Neural-Machine-Translation-in-the-Wild-Findings-and-Challenges)
 * [Observational Overfitting in Reinforcement Learning](https://shagunsodhani.com/papers-I-read/Observational-Overfitting-in-Reinforcement-Learning)
diff --git a/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md b/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md
index a48bb7ab..0a11b4dd 100755
--- a/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md
+++ b/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md
@@ -3,7 +3,7 @@ layout: post
 title: Gradient based sample selection for online continual learning
 comments: True
 excerpt:
-tags: ['2019', 'Catastrophic Forgetting', Continual Learning', 'Lifelong Learning', 'NeurIPS 2019', 'Replay Buffer', AI, CL, LL]
+tags: ['2019', 'Catastrophic Forgetting', 'Continual Learning', 'Lifelong Learning', 'NeurIPS 2019', 'Replay Buffer', AI, CL, LL, NeurIPS]
 ---
 
 
diff --git a/site/_posts/2020-02-20-ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.md b/site/_posts/2020-02-20-ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.md
new file mode 100755
index 00000000..46944517
--- /dev/null
+++ b/site/_posts/2020-02-20-ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.md
@@ -0,0 +1,93 @@
+---
+layout: post
+title: ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators
+comments: True
+excerpt:
+tags: ['2019', 'ICLR 2020', 'Natural Language Processing', AI, Attention, Finetuning, ICLR, NLP, Pretraining, Transformer]
+
+
+---
+
+## Introduction
+
+* Masked Language Modeling (MLM) is a common technique for pre-training language models. The idea is to "corrupt" some of the tokens in the input text (around 15%) by replacing them with the [MASK] token and then training the network to reconstruct (or predict) the corrupted tokens.
+
+* Since the network learns from only about 15% of the tokens, the computational cost of training with MLM can be quite high.
+
+* The paper proposes a "replaced token detection" task instead, where some tokens in the input text are replaced by other plausible tokens.
+
+* For each token in the modified text, the network has to predict whether the token has been replaced or not.
+
+* The replacement tokens are produced by a small generator network.
+
+* Unlike the MLM setup, the proposed task is defined over all the input tokens, thus utilizing the training data more efficiently.
+
+* [Link to the paper](https://openreview.net/forum?id=r1xMH1BtvB)
+
+## Approach
+
+* The proposed approach is called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately).
+
+* Two neural networks are trained - a Generator (G) and a Discriminator (D).
+
+* Each network has a Transformer-based text encoder that maps a sequence of words into a sequence of vectors.
+
+* Given an input sequence x (of length N), k indices are chosen for replacement.
+
+* For each chosen index, the generator produces a distribution over tokens, and a token sampled from this distribution replaces the original one. The resulting sequence is referred to as the corrupted sequence.
+
+* Given the corrupted sequence, the Discriminator predicts, for every token, whether it comes from the data distribution or from the generator.
+
+* The generator is trained using the MLM setup, and the Discriminator is trained using the discriminative loss (see the sketch at the end of this post).
+
+* After pre-training, only the Discriminator is finetuned on the downstream tasks.
+
+## Experiments
+
+* Datasets
+
+  * GLUE Benchmark
+
+  * Stanford QA dataset (SQuAD)
+
+* Architecture Choices
+
+  * Sharing word embeddings between the generator and the Discriminator helps.
+
+  * Tying all the encoder weights leads to only a marginal improvement but forces the generator and the Discriminator to be of the same size. Hence, only the embeddings are shared.
+
+  * The generator is kept smaller than the Discriminator, as a strong generator can make training difficult for the Discriminator.
+
+  * A two-stage training procedure was explored where only the generator is trained for n steps. The weights of the generator are then used to initialize the Discriminator, which is trained for n steps while keeping the generator fixed.
+
+  * This two-stage setup provides a nice curriculum for the Discriminator but does not outperform the joint training setup.
+
+  * An adversarial-loss-based setup was also explored, but it does not work well, probably for the following reasons:
+
+    * The adversarially trained generator is not as good as the MLM generator.
+
+    * The adversarially trained generator produces a low-entropy output distribution.
+
+* Results
+
+  * Both the small and the large ELECTRA models outperform baseline models like [BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692), [ELMo](https://arxiv.org/abs/1802.05365) and [GPT](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf).
+
+* Ablations
+
+  * ELECTRA-15 is a variant of ELECTRA where the Discriminator is trained on only 15% of the tokens (similar to the MLM setup). This reduces performance significantly.
+
+  * Replace MLM setup
+
+    * Perform MLM training, but instead of using [MASK], use a token sampled from the generator.
+
+    * This improves the performance marginally.
+
+  * All-token MLM
+
+    * In the MLM setup, replace the [MASK] tokens with the sampled tokens and train the MLM model to predict all the words.
+
+    * In practice, the MLM model can either generate a word or copy the existing word.
+
+    * This approach closes much of the gap between BERT and ELECTRA.
+
+* Interestingly, ELECTRA outperforms All-token MLM BERT, suggesting that ELECTRA may be benefiting from parameter efficiency since it does not have to learn a distribution over all the words.
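+
+## Illustrative Sketch
+
+* The snippet below is a minimal, toy PyTorch sketch of the joint objective described above, not the authors' implementation. The module and function names (TinyEncoder, Generator, Discriminator, training_step), the encoder sizes, the vocabulary, and the up-weighting of the discriminator loss are all placeholders chosen for illustration.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+# Toy sizes for illustration only (the paper uses full Transformer encoders).
+VOCAB, GEN_DIM, DISC_DIM, MASK_ID, MASK_PROB = 100, 16, 32, 0, 0.15
+
+
+class TinyEncoder(nn.Module):
+    """Stand-in for the Transformer text encoders used in the paper."""
+
+    def __init__(self, dim):
+        super().__init__()
+        self.embed = nn.Embedding(VOCAB, dim)
+        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
+        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
+
+    def forward(self, tokens):  # (B, T) -> (B, T, dim)
+        return self.encoder(self.embed(tokens))
+
+
+class Generator(nn.Module):
+    """Small MLM generator: produces a token distribution at every position."""
+
+    def __init__(self):
+        super().__init__()
+        self.encoder, self.head = TinyEncoder(GEN_DIM), nn.Linear(GEN_DIM, VOCAB)
+
+    def forward(self, tokens):  # (B, T) -> (B, T, VOCAB)
+        return self.head(self.encoder(tokens))
+
+
+class Discriminator(nn.Module):
+    """Per-token binary logit: was this token replaced?"""
+
+    def __init__(self):
+        super().__init__()
+        self.encoder, self.head = TinyEncoder(DISC_DIM), nn.Linear(DISC_DIM, 1)
+
+    def forward(self, tokens):  # (B, T) -> (B, T)
+        return self.head(self.encoder(tokens)).squeeze(-1)
+
+
+gen, disc = Generator(), Discriminator()
+opt = torch.optim.Adam(list(gen.parameters()) + list(disc.parameters()), lr=1e-3)
+
+
+def training_step(x, disc_weight=50.0):
+    """One joint update on a batch of token ids x of shape (B, T)."""
+    # 1) Mask ~15% of the positions and compute the generator's MLM loss.
+    mask = torch.rand(x.shape) < MASK_PROB
+    gen_logits = gen(x.masked_fill(mask, MASK_ID))
+    mlm_loss = F.cross_entropy(gen_logits[mask], x[mask])
+
+    # 2) Build the corrupted sequence by sampling replacements from the generator.
+    #    The samples are detached, so the generator is not trained adversarially.
+    with torch.no_grad():
+        sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
+    corrupted = x.clone()
+    corrupted[mask] = sampled
+
+    # 3) The Discriminator labels every token as replaced (1) or original (0).
+    #    A sampled token that happens to equal the original counts as original.
+    is_replaced = (corrupted != x).float()
+    disc_loss = F.binary_cross_entropy_with_logits(disc(corrupted), is_replaced)
+
+    # Joint loss with an up-weighted discriminator term (weight chosen here for illustration).
+    loss = mlm_loss + disc_weight * disc_loss
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+    return loss.item()
+
+
+# Example usage with a random toy batch of 8 sequences of length 16.
+print(training_step(torch.randint(1, VOCAB, (8, 16))))
+```
+
+* After pre-training in this fashion, one would discard the generator and finetune only the Discriminator on the downstream tasks, as described above.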
\ No newline at end of file
diff --git a/site/_site b/site/_site
index 51d5d556..a00804e9 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit 51d5d55606404f5999d05a9a47010440ad57260f
+Subproject commit a00804e9672abd5ee580b623ea6efef4b869e9aa