diff --git a/README.md b/README.md
index 66f5ee08..262d72c4 100755
--- a/README.md
+++ b/README.md
@@ -5,6 +5,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 
 ## List of papers
 
+* [ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators](https://shagunsodhani.com/papers-I-read/ELECTRA-Pre-training-Text-Encoders-as-Discriminators-Rather-Than-Generators)
+* [Gradient based sample selection for online continual learning](https://shagunsodhani.com/papers-I-read/Gradient-based-sample-selection-for-online-continual-learning)
 * [Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One](https://shagunsodhani.com/papers-I-read/Your-Classifier-is-Secretly-an-Energy-Based-Model,-and-You-Should-Treat-it-Like-One)
 * [Massively Multilingual Neural Machine Translation in the Wild - Findings and Challenges](https://shagunsodhani.com/papers-I-read/Massively-Multilingual-Neural-Machine-Translation-in-the-Wild-Findings-and-Challenges)
 * [Observational Overfitting in Reinforcement Learning](https://shagunsodhani.com/papers-I-read/Observational-Overfitting-in-Reinforcement-Learning)
diff --git a/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md b/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md
index a48bb7ab..0a11b4dd 100755
--- a/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md
+++ b/site/_posts/2020-02-13-Gradient based sample selection for online continual learning.md
@@ -3,7 +3,7 @@ layout: post
 title: Gradient based sample selection for online continual learning
 comments: True
 excerpt:
-tags: ['2019', 'Catastrophic Forgetting', Continual Learning', 'Lifelong Learning', 'NeurIPS 2019', 'Replay Buffer', AI, CL, LL]
+tags: ['2019', 'Catastrophic Forgetting', 'Continual Learning', 'Lifelong Learning', 'NeurIPS 2019', 'Replay Buffer', AI, CL, LL, NeurIPS]
 ---
 
 
diff --git a/site/_posts/2020-02-20-ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.md b/site/_posts/2020-02-20-ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.md
new file mode 100755
index 00000000..46944517
--- /dev/null
+++ b/site/_posts/2020-02-20-ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.md
@@ -0,0 +1,93 @@
+---
+layout: post
+title: ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators
+comments: True
+excerpt:
+tags: ['2019', 'ICLR 2020', 'Natural Language Processing', AI, Attention, Finetuning, ICLR, NLP, Pretraining, Transformer]
+
+
+---
+
+## Introduction
+
+* Masked Language Modeling (MLM) is a common technique for pre-training language models. The idea is to "corrupt" some of the tokens in the input text (around 15%) by replacing them with the [MASK] token and then training the network to reconstruct (or predict) the corrupted tokens.
+
+* Since the network learns from only about 15% of the tokens, the computational cost of training with MLM can be quite high.
+
+* The paper proposes a "replaced token detection" task instead, where some tokens in the input text are replaced by other plausible tokens.
+
+* For each token in the modified text, the network has to predict whether the token has been replaced or not.
+
+* The replacement tokens are produced by a small generator network.
+
+* Unlike the MLM setup, the proposed task is defined over all the input tokens, thus utilizing the training data more efficiently.
+
+* [Link to the paper](https://openreview.net/forum?id=r1xMH1BtvB)
+
+## Approach
+
+* The proposed approach is called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately).
+
+* Two neural networks are trained - a Generator (G) and a Discriminator (D).
+
+* Each network has a Transformer-based text encoder that maps a sequence of words into a sequence of vectors.
+
+* Given an input sequence x (of length N), k indices are chosen for replacement.
+
+* For each chosen index, the generator produces a distribution over tokens, and a token sampled from this distribution replaces the original one. The resulting sequence is referred to as the corrupted sequence.
+
+* Given the corrupted sequence, the Discriminator predicts, for every token, whether it comes from the data distribution or from the generator.
+
+* The generator is trained using the MLM setup, and the Discriminator is trained using the discriminative loss (see the sketch at the end of this post).
+
+* After pre-training, only the Discriminator is finetuned on the downstream tasks.
+
+## Experiments
+
+* Datasets
+
+  * GLUE Benchmark
+
+  * Stanford QA dataset (SQuAD)
+
+* Architecture Choices
+
+  * Sharing word embeddings between the generator and the Discriminator helps.
+
+  * Tying all the encoder weights leads to only a marginal improvement but forces the generator and the Discriminator to be of the same size. Hence, only the embeddings are shared.
+
+  * The generator is kept smaller than the Discriminator, as a strong generator can make training difficult for the Discriminator.
+
+  * A two-stage training procedure was explored where only the generator is trained for n steps. The weights of the generator are then used to initialize the Discriminator, which is trained for n steps while keeping the generator fixed.
+
+  * This two-stage setup provides a nice curriculum for the Discriminator but does not outperform the joint training setup.
+
+  * An adversarial-loss-based setup was also explored, but it does not work well, probably for the following reasons:
+
+    * The adversarially trained generator is not as good as the MLM generator.
+
+    * The adversarially trained generator produces a low-entropy output distribution.
+
+* Results
+
+  * Both the small and the large ELECTRA models outperform baseline models like [BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692), [ELMo](https://arxiv.org/abs/1802.05365) and [GPT](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf).
+
+* Ablations
+
+  * ELECTRA-15 is a variant of ELECTRA where the Discriminator is trained on only 15% of the tokens (similar to the MLM setup). This reduces performance significantly.
+
+  * Replace MLM setup
+
+    * Perform MLM training, but instead of using [MASK], use a token sampled from the generator.
+
+    * This improves the performance marginally.
+
+  * All-token MLM
+
+    * In the MLM setup, replace the [MASK] tokens with the sampled tokens and train the MLM model to predict all the words.
+
+    * In practice, the MLM model can either generate a word or copy the existing word.
+
+    * This approach closes much of the gap between BERT and ELECTRA.
+
+* Interestingly, ELECTRA outperforms All-token MLM BERT, suggesting that ELECTRA may be benefiting from parameter efficiency since it does not have to learn a distribution over all the words.
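+
+## Illustrative Sketch
+
+* The snippet below is a minimal, toy PyTorch sketch of the joint objective described above, not the authors' implementation. The module and function names (TinyEncoder, Generator, Discriminator, training_step), the encoder sizes, the vocabulary, and the up-weighting of the discriminator loss are all placeholders chosen for illustration.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+# Toy sizes for illustration only (the paper uses full Transformer encoders).
+VOCAB, GEN_DIM, DISC_DIM, MASK_ID, MASK_PROB = 100, 16, 32, 0, 0.15
+
+
+class TinyEncoder(nn.Module):
+    """Stand-in for the Transformer text encoders used in the paper."""
+
+    def __init__(self, dim):
+        super().__init__()
+        self.embed = nn.Embedding(VOCAB, dim)
+        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
+        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
+
+    def forward(self, tokens):  # (B, T) -> (B, T, dim)
+        return self.encoder(self.embed(tokens))
+
+
+class Generator(nn.Module):
+    """Small MLM generator: produces a token distribution at every position."""
+
+    def __init__(self):
+        super().__init__()
+        self.encoder, self.head = TinyEncoder(GEN_DIM), nn.Linear(GEN_DIM, VOCAB)
+
+    def forward(self, tokens):  # (B, T) -> (B, T, VOCAB)
+        return self.head(self.encoder(tokens))
+
+
+class Discriminator(nn.Module):
+    """Per-token binary logit: was this token replaced?"""
+
+    def __init__(self):
+        super().__init__()
+        self.encoder, self.head = TinyEncoder(DISC_DIM), nn.Linear(DISC_DIM, 1)
+
+    def forward(self, tokens):  # (B, T) -> (B, T)
+        return self.head(self.encoder(tokens)).squeeze(-1)
+
+
+gen, disc = Generator(), Discriminator()
+opt = torch.optim.Adam(list(gen.parameters()) + list(disc.parameters()), lr=1e-3)
+
+
+def training_step(x, disc_weight=50.0):
+    """One joint update on a batch of token ids x of shape (B, T)."""
+    # 1) Mask ~15% of the positions and compute the generator's MLM loss.
+    mask = torch.rand(x.shape) < MASK_PROB
+    gen_logits = gen(x.masked_fill(mask, MASK_ID))
+    mlm_loss = F.cross_entropy(gen_logits[mask], x[mask])
+
+    # 2) Build the corrupted sequence by sampling replacements from the generator.
+    #    The samples are detached, so the generator is not trained adversarially.
+    with torch.no_grad():
+        sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
+    corrupted = x.clone()
+    corrupted[mask] = sampled
+
+    # 3) The Discriminator labels every token as replaced (1) or original (0).
+    #    A sampled token that happens to equal the original counts as original.
+    is_replaced = (corrupted != x).float()
+    disc_loss = F.binary_cross_entropy_with_logits(disc(corrupted), is_replaced)
+
+    # Joint loss with an up-weighted discriminator term (weight chosen here for illustration).
+    loss = mlm_loss + disc_weight * disc_loss
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+    return loss.item()
+
+
+# Example usage with a random toy batch of 8 sequences of length 16.
+print(training_step(torch.randint(1, VOCAB, (8, 16))))
+```
+
+* After pre-training in this fashion, one would discard the generator and finetune only the Discriminator on the downstream tasks, as described above.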
\ No newline at end of file
diff --git a/site/_site b/site/_site
index 51d5d556..a00804e9 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit 51d5d55606404f5999d05a9a47010440ad57260f
+Subproject commit a00804e9672abd5ee580b623ea6efef4b869e9aa