From b77494352c90fa3929a418a738e05fa238eb15f9 Mon Sep 17 00:00:00 2001
From: Shagun Sodhani
Date: Sun, 6 Sep 2020 12:32:49 -0400
Subject: [PATCH] Add papers

---
 README.md                                     |  4 +
 ...ss Balancing in Deep Multitask Networks.md |  2 +-
 ...radient Surgery for Multi-Task Learning.md | 72 +++++++++++++++
 ...Sparsely-Gated Mixture-of-Experts Layer.md | 91 +++++++++++++++++++
 ...forcement Learning and the Deadly Triad.md | 91 +++++++++++++++++++
 ...iting Fundamentals of Experience Replay.md | 89 ++++++++++++++++++
 site/_site                                    |  2 +-
 7 files changed, 349 insertions(+), 2 deletions(-)
 create mode 100755 site/_posts/2020-08-06-Gradient Surgery for Multi-Task Learning.md
 create mode 100755 site/_posts/2020-08-14-Outrageously Large Neural Networks--The Sparsely-Gated Mixture-of-Experts Layer.md
 create mode 100755 site/_posts/2020-08-31-Deep Reinforcement Learning and the Deadly Triad.md
 create mode 100755 site/_posts/2020-09-07-Revisiting Fundamentals of Experience Replay.md

diff --git a/README.md b/README.md
index 5174056d..136d6152 100755
--- a/README.md
+++ b/README.md
@@ -4,7 +4,11 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 
 ## List of papers
 
+* [Deep Reinforcement Learning and the Deadly Triad](https://shagunsodhani.com/papers-I-read/Deep-Reinforcement-Learning-and-the-Deadly-Triad)
 * [Alpha Net: Adaptation with Composition in Classifier Space](https://shagunsodhani.com/papers-I-read/Alpha-Net-Adaptation-with-Composition-in-Classifier-Space)
+* [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://shagunsodhani.com/papers-I-read/Outrageously-Large-Neural-Networks-The-Sparsely-Gated-Mixture-of-Experts-Layer)
+* [Gradient Surgery for Multi-Task Learning](https://shagunsodhani.com/papers-I-read/Gradient-Surgery-for-Multi-Task-Learning)
+* [GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks](https://shagunsodhani.com/papers-I-read/GradNorm-Gradient-Normalization-for-Adaptive-Loss-Balancing-in-Deep-Multitask-Networks)
 * [TaskNorm: Rethinking Batch Normalization for Meta-Learning](https://shagunsodhani.com/papers-I-read/TASKNORM-Rethinking-Batch-Normalization-for-Meta-Learning)
 * [Averaging Weights leads to Wider Optima and Better Generalization](https://shagunsodhani.com/papers-I-read/Averaging-Weights-leads-to-Wider-Optima-and-Better-Generalization)
 * [Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions](https://shagunsodhani.com/papers-I-read/Decentralized-Reinforcement-Learning-Global-Decision-Making-via-Local-Economic-Transactions)
diff --git a/site/_posts/2020-07-30-GradNorm--Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.md b/site/_posts/2020-07-30-GradNorm--Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.md
index ffc7bcc7..2a5c19f2 100755
--- a/site/_posts/2020-07-30-GradNorm--Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.md
+++ b/site/_posts/2020-07-30-GradNorm--Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.md
@@ -3,7 +3,7 @@ layout: post
 title: GradNorm--Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks
 comments: True
 excerpt:
-tags: ['1027', 'Gradient Manipulation', 'Gradient Normalization', 'ICML 2018', 'Multi Task', AI, ICML]
+tags: ['2017', 'Gradient Manipulation', 'Gradient Normalization', 'ICML 2018', 'Multi Task', AI, ICML]
 
 
 ---
diff --git a/site/_posts/2020-08-06-Gradient Surgery for Multi-Task Learning.md b/site/_posts/2020-08-06-Gradient Surgery for Multi-Task Learning.md
new file mode 100755
index 00000000..b5b9aebf
--- /dev/null
+++ b/site/_posts/2020-08-06-Gradient Surgery for Multi-Task Learning.md
@@ -0,0 +1,72 @@
+---
+layout: post
+title: Gradient Surgery for Multi-Task Learning
+comments: True
+excerpt:
+tags: ['2019', 'Gradient Manipulation', 'Multi Task', AI]
+
+
+---
+
+
+* The paper hypothesizes that the main optimization challenges in multi-task learning arise because of negative interference between different tasks' gradients.
+
+* It hypothesizes that negative interference happens when:
+
+  * The gradients are conflicting (i.e., have a negative cosine similarity).
+
+  * There is high positive curvature along the direction of the multi-task gradient.
+
+  * The difference in gradient magnitudes is large.
+
+* The paper proposes to work around this problem by performing "gradient surgery."
+
+* If two gradients are conflicting, modify the gradients by projecting each onto the other's normal plane.
+
+* This modification is equivalent to removing the conflicting component of the gradient.
+
+* This approach is referred to as *projecting conflicting gradients* (PCGrad).
+
+* [Link to the paper](https://arxiv.org/abs/2001.06782)
+
+* Theoretical Analysis
+
+  * The paper proves the local conditions under which PCGrad improves upon standard multi-task gradient descent in the two-task setup.
+
+  * The conditions are:
+
+    * The angle between the task gradients is not too small.
+
+    * The difference in the magnitude of the gradients is sufficiently large.
+
+    * The curvature along the multi-task gradient is large.
+
+    * The learning rate is large enough.
+
+* Experimental Setup
+
+  * Multi-task supervised learning
+
+    * MultiMNIST, Multi-task CIFAR-100, NYUv2.
+
+    * For Multi-task CIFAR-100, PCGrad is applied to the shared parameters of the routing networks.
+
+    * For NYUv2, PCGrad is combined with MTAN.
+
+    * In all the cases, using PCGrad improves the performance.
+
+  * Multi-task Reinforcement Learning
+
+    * Meta-World Benchmark
+
+    * PCGrad + SAC outperforms all the other baselines.
+
+    * In the context of SAC, the paper suggests learning the temperature $\alpha$ on a per-task basis.
+
+  * Goal-conditioned Reinforcement Learning
+
+    * Goal-conditioned robotic pushing task with a Sawyer robot.
+
+    * PCGrad + SAC outperforms vanilla SAC.
\ No newline at end of file
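For readers who want the projection step summarized above in concrete form, here is a minimal NumPy sketch of PCGrad; the flattened-gradient representation, the random task ordering, and the function name are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def pcgrad(task_grads, rng=None):
    """Sketch of projecting conflicting gradients (PCGrad) for flattened gradients.

    task_grads: list of 1-D arrays, one (flattened) gradient per task.
    Returns the modified per-task gradients; summing them gives the update
    direction for the shared parameters.
    """
    rng = np.random.default_rng() if rng is None else rng
    projected = [g.astype(float).copy() for g in task_grads]
    for i in range(len(task_grads)):
        # Compare task i against the other tasks in a random order.
        others = np.array([j for j in range(len(task_grads)) if j != i])
        rng.shuffle(others)
        for j in others:
            g_j = task_grads[j]
            dot = float(projected[i] @ g_j)
            if dot < 0.0:  # negative cosine similarity => conflicting gradients
                # Remove the conflicting component: project onto g_j's normal plane.
                projected[i] = projected[i] - (dot / (g_j @ g_j + 1e-12)) * g_j
    return projected

# Tiny example with two conflicting task gradients.
g1, g2 = np.array([1.0, 1.0]), np.array([-1.0, 0.5])
shared_update = sum(pcgrad([g1, g2]))
```

In practice, the same projection would be applied to the gradients of the shared parameters at every training step.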
diff --git a/site/_posts/2020-08-14-Outrageously Large Neural Networks--The Sparsely-Gated Mixture-of-Experts Layer.md b/site/_posts/2020-08-14-Outrageously Large Neural Networks--The Sparsely-Gated Mixture-of-Experts Layer.md
new file mode 100755
index 00000000..37cdae93
--- /dev/null
+++ b/site/_posts/2020-08-14-Outrageously Large Neural Networks--The Sparsely-Gated Mixture-of-Experts Layer.md
@@ -0,0 +1,91 @@
+---
+layout: post
+title: Outrageously Large Neural Networks--The Sparsely-Gated Mixture-of-Experts Layer
+comments: True
+excerpt:
+tags: ['2017', 'Conditional Computation', 'Distributed Computing', 'ICLR 2017', 'Mixture of Experts', AI, Gating, ICLR]
+
+
+---
+
+
+## Introduction
+
+* Conditional computation is a technique to increase a model's capacity (without a proportional increase in computation) by activating parts of the network on a per-example basis.
+
+* The paper describes (and addresses) the computational and algorithmic challenges in conditional computation. It introduces a sparsely-gated Mixture-of-Experts (MoE) layer with thousands of feed-forward sub-networks.
+
+* [Link to the paper](https://arxiv.org/abs/1701.06538)
+
+## Practical Challenges
+
+* GPUs are fast at matrix arithmetic but slow at branching.
+
+* Large batch sizes amortize the cost of parameter updates. Conditional computation reduces the effective batch size for different components of the model.
+
+* Network bandwidth can be a bottleneck, with the network demand overshadowing the computational demand.
+
+* Additional losses may be needed to achieve the desired level of sparsity.
+
+* Conditional computation is most useful for large datasets.
+
+## Architecture
+
+* *n* Expert Networks - $E_1$, ..., $E_n$.
+
+* Gating Network $G$ to select a sparse combination of experts.
+
+* The output of the MoE module is the weighted sum of the experts' predictions (weighted by the output of the gate).
+
+* If the gating network's output is sparse, then some of the experts' outputs do not have to be computed.
+
+* In theory, one could use a hierarchical mixture of experts where a mixture of experts is trained at each level.
+
+### Choices for the Gating Network
+
+* Softmax Gating
+
+* Noisy top-k gating - Add tunable Gaussian noise to the gating logits, retain only the top-k values (setting the rest to $-\infty$), and apply the softmax. A second trainable weight matrix controls the amount of noise per component.
+
+## Addressing Performance Challenges
+
+* Shrinking Batch Problem
+
+  * If the MoE selects *k* out of *n* experts, the effective batch size reduces by a factor of *k* / *n*.
+
+  * This reduction in batch size is accounted for by combining data parallelism (for the standard layers and the gating network) and model parallelism (for the experts in the MoE). Thus, with *d* devices, the effective batch size changes by a factor of (*k* x *d*) / *n*.
+
+  * For hierarchical MoE, the primary gating network uses data parallelism while the secondary MoEs use model parallelism.
+
+  * The paper considers LSTM models where the MoE is applied once the previous layer has finished processing all the timesteps. This increases the batch size (for the current MoE layer) by a factor equal to the number of unrolled timesteps.
+
+  * Network bandwidth limitations can be overcome by ensuring that the ratio of an expert's computation to its input and output size is greater than (or equal to) the ratio of computational capacity to network capacity.
+
+  * Computational efficiency can be improved by using larger hidden layers (or more hidden layers).
+
+* Balancing Expert Utilization
+
+  * The importance of an expert (relative to a batch of training examples) is defined as the batchwise sum of the expert's gate values.
+
+  * An additional loss, called the importance loss, is added to encourage the experts to have equal importance.
+
+  * The importance loss is defined as the square of the coefficient of variation (of the set of importance values) multiplied by a (hand-tuned) scaling factor $w_{importance}$.
+
+  * In practice, an additional loss called $L_{load}$ might be needed to ensure that the different experts get equal load (along with equal importance).
+
+## Experiments
+
+* Datasets
+
+  * Billion Word Language Modeling Benchmark
+
+  * 100 Billion Word Google News Corpus
+
+  * Machine Translation datasets
+
+    * Single Language Pairs - WMT'14 En to Fr (36M sentence pairs) and En to De (5M sentence pairs).
+
+    * Multilingual Machine Translation - a large combined dataset of twelve language pairs.
+
+* In all the setups, the proposed MoE models achieve significantly better results than the baseline models, at a lower computational cost.
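To make the noisy top-k gating and the importance loss described above concrete, here is a small NumPy sketch; the tensor shapes, the value of `w_importance`, and the helper names are assumptions for illustration, not the paper's reference code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def noisy_top_k_gating(x, w_gate, w_noise, k, rng):
    """Noisy top-k gating: keep the k largest (noised) logits, softmax over them."""
    clean_logits = x @ w_gate                        # [batch, n_experts]
    noise_std = np.log1p(np.exp(x @ w_noise))        # softplus controls per-component noise
    noisy_logits = clean_logits + rng.standard_normal(clean_logits.shape) * noise_std
    # Mask everything outside the top-k to -inf so its gate value becomes 0.
    kth_largest = np.sort(noisy_logits, axis=-1)[:, -k][:, None]
    masked = np.where(noisy_logits >= kth_largest, noisy_logits, -np.inf)
    return softmax(masked)                           # sparse gate weights per example

def importance_loss(gates, w_importance=0.1):
    """Squared coefficient of variation of per-expert importance, scaled by w_importance."""
    importance = gates.sum(axis=0)                   # batchwise sum of gate values per expert
    cv = importance.std() / (importance.mean() + 1e-12)
    return w_importance * cv ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))                     # batch of 8 inputs, 16 features
w_gate, w_noise = rng.standard_normal((16, 4)), rng.standard_normal((16, 4))
gates = noisy_top_k_gating(x, w_gate, w_noise, k=2, rng=rng)
loss = importance_loss(gates)
```

Setting the non-top-k logits to negative infinity before the softmax is what makes the gate output sparse, so the corresponding experts never need to be evaluated.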
diff --git a/site/_posts/2020-08-31-Deep Reinforcement Learning and the Deadly Triad.md b/site/_posts/2020-08-31-Deep Reinforcement Learning and the Deadly Triad.md
new file mode 100755
index 00000000..9752067c
--- /dev/null
+++ b/site/_posts/2020-08-31-Deep Reinforcement Learning and the Deadly Triad.md
@@ -0,0 +1,91 @@
+---
+layout: post
+title: Deep Reinforcement Learning and the Deadly Triad
+comments: True
+excerpt:
+tags: ['2018', 'Deep Reinforcement Learning', 'Empirical Advice', 'Off policy RL', 'Reinforcement Learning', AI, DRL, Empirical, RL]
+
+
+---
+
+## Introduction
+
+* The paper investigates the practical impact of the deadly triad (function approximation, bootstrapping, and off-policy learning) in deep Q-networks (trained with experience replay).
+
+* The deadly triad is called so because when all three components are combined, TD learning can diverge, and value estimates can become unbounded.
+
+* However, in practice, the components of the deadly triad have been combined successfully. An example is training DQN agents to play Atari.
+
+* [Link to the paper](https://arxiv.org/abs/1812.02648)
+
+## Setup
+
+* The effect of each component of the triad can be regulated with some design choices:
+
+  * Bootstrapping - by controlling the number of steps before bootstrapping.
+
+  * Function approximation - by controlling the size of the neural network.
+
+  * Off-policy learning - by controlling how data points are sampled from the replay buffer (i.e., using different prioritization approaches).
+
+* The problem is studied in two contexts: a toy example and Atari 2600 games.
+
+* The paper makes several hypotheses about how the different components of the triad may interact and evaluates these hypotheses by training DQN with different hyperparameters:
+
+  * Number of steps before bootstrapping - 1, 3, 10.
+
+  * Four levels of prioritization (for sampling data from the replay buffer).
+
+  * Bootstrap target - Q-learning, target Q-learning, inverse double Q-learning, and double Q-learning.
+
+  * Network sizes - small, medium, large, and extra-large.
+
+* Each experiment was run with three different seeds.
+
+* The paper formulates a series of hypotheses and designs experiments to support/reject them.
+
+## Hypothesis 1: Combining Q-learning with conventional deep RL function spaces does not commonly lead to divergence
+
+* Rewards are clipped between -1 and 1, and the discount factor is set to 0.99. Hence, the maximum absolute action value is bounded by 1 / (1 - 0.99) = 100. This upper bound is used to detect soft-divergence in the value estimates.
+
+* The paper reports that while soft-divergence does occur, the values do not become unbounded, thus supporting the hypothesis.
+
+## Hypothesis 2: There is less divergence when correcting for overestimation bias or when bootstrapping on separate networks
+
+* One manifestation of bootstrapping on separate networks is target Q-learning. While using a separate target network helps on Atari, it does not entirely solve the problem in the toy setup.
+
+* One manifestation of correcting for the overestimation bias is double Q-learning.
+
+* In its standard form, double Q-learning benefits from bootstrapping on a separate network. To isolate the gains from each component, the paper uses an inverse double Q-learning update, which corrects for overestimation without bootstrapping on a separate target network.
+
+* Experimentally, Q-learning is the most unstable, while target Q-learning and double Q-learning are the most stable. This observation supports the hypothesis.
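For reference, the four bootstrap targets compared in this setup can be written side by side as below; the exact form of inverse double Q-learning (select the action with the target network, bootstrap on the online network) follows my reading of the paper, so treat the details as an assumption.

```python
import numpy as np

def bootstrap_target(variant, r, gamma, q_online_next, q_target_next, done):
    """Compute r + gamma * (bootstrap value) for the four update rules compared above.

    q_online_next, q_target_next: action values at the next state under the
    online and target networks, each of shape [n_actions].
    """
    if done:
        return r
    if variant == "q_learning":            # bootstrap and select with the online network
        boot = q_online_next.max()
    elif variant == "target_q_learning":   # bootstrap and select with the target network
        boot = q_target_next.max()
    elif variant == "double_q_learning":   # select with online, bootstrap with target
        boot = q_target_next[q_online_next.argmax()]
    elif variant == "inverse_double_q":    # select with target, bootstrap with online
        boot = q_online_next[q_target_next.argmax()]
    else:
        raise ValueError(variant)
    return r + gamma * boot

q_online_next = np.array([0.2, 0.9, 0.5])
q_target_next = np.array([0.4, 0.3, 0.8])
targets = {
    v: bootstrap_target(v, r=1.0, gamma=0.99, q_online_next=q_online_next,
                        q_target_next=q_target_next, done=False)
    for v in ["q_learning", "target_q_learning", "double_q_learning", "inverse_double_q"]
}
```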
+## Hypothesis 3: Longer multi-step returns will diverge less easily
+
+* This hypothesis is intuitive, as the dependence on bootstrapping is reduced with multi-step returns.
+
+* Experimental results support this hypothesis.
+
+## Hypothesis 4: Larger, higher-capacity networks will diverge less easily
+
+* This hypothesis is based on the assumption that more flexible value function approximations may behave more like the tabular case.
+
+* In practice, smaller networks show fewer instances of instability than the larger networks.
+
+* The hypothesis is not supported by the experiments.
+
+## Hypothesis 5: Stronger prioritization of updates will diverge more easily
+
+* This hypothesis is supported by the experiments for all four update rules.
+
+## Effect of the deadly triad on the agent's performance
+
+* Generally, soft-divergence correlates with poor control performance.
+
+* For example, longer multi-step returns lead to fewer instances of instability and better performance.
+
+* The trend is more interesting in terms of network capacity: larger networks tend to diverge more but also perform the best.
+
+* While action-value estimates can grow to large values, they can recover to plausible values as training progresses.
\ No newline at end of file
diff --git a/site/_posts/2020-09-07-Revisiting Fundamentals of Experience Replay.md b/site/_posts/2020-09-07-Revisiting Fundamentals of Experience Replay.md
new file mode 100755
index 00000000..eca49db6
--- /dev/null
+++ b/site/_posts/2020-09-07-Revisiting Fundamentals of Experience Replay.md
@@ -0,0 +1,89 @@
+---
+layout: post
+title: Revisiting Fundamentals of Experience Replay
+comments: True
+excerpt:
+tags: ['2020', 'Deep Reinforcement Learning', 'ICML 2020', 'Off policy RL', 'Reinforcement Learning', 'Replay Buffer', AI, DRL, Empirical, ICML, RL]
+
+
+---
+
+## Introduction
+
+* The paper presents an extensive study of the effects of experience replay in Q-learning based methods.
+
+* It focuses specifically on the replay capacity and the replay ratio (the ratio of learning updates to experience collected).
+
+* [Link to the paper](https://arxiv.org/abs/2007.06700)
+
+## Setup
+
+* Replay capacity is defined as the total number of transitions stored in the replay buffer.
+
+* The age of a transition (stored in the replay buffer) is defined as the number of gradient steps taken by the agent since the transition was stored.
+
+* The larger the replay capacity, the greater the age of the oldest transition (also referred to as the age of the oldest policy).
+
+* The larger the replay capacity, the greater the degree of "off-policyness" of the transitions in the buffer (with everything else held constant).
+
+* The replay ratio is the number of gradient updates per environment transition. This ratio can be used as a proxy for how often the agent uses old data (vs. collecting new data) and is related to off-policyness.
+
+* In the [DQN paper](https://www.nature.com/articles/nature14236), the replay ratio is set to 0.25.
+
+* For the experiments, a subset of 14 games is selected from the Atari ALE (Arcade Learning Environment), with sticky actions enabled.
+
+* Each experiment is repeated with three seeds.
+
+* Rainbow is used as the base algorithm.
+
+* The total number of gradient updates and the batch size (per gradient update) are fixed for all the experiments.
+
+* Rainbow uses a replay capacity of 1M and an oldest policy of age 250K.
+
+* In the experiments, the replay capacity varies from 0.1M to 10M (5 values), and the age of the oldest policy varies from 25K to 25M (4 values).
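The quantities above are tied together by a simple bookkeeping relation: with a fixed replay ratio, the oldest policy's age (in gradient steps) is roughly the replay capacity times the replay ratio. The snippet below is a back-of-the-envelope sketch of that relation (the helper name is illustrative), consistent with the 1M-capacity, 0.25-ratio, 250K-age numbers quoted above.

```python
def oldest_policy_age(replay_capacity, replay_ratio):
    """Approximate age (in gradient steps) of the oldest transition in the buffer.

    A transition is evicted after `replay_capacity` environment steps, during which
    roughly replay_capacity * replay_ratio gradient updates have been applied.
    """
    return replay_capacity * replay_ratio

# DQN/Rainbow-style defaults discussed above: 1M transitions, 0.25 updates per transition.
assert oldest_policy_age(1_000_000, 0.25) == 250_000

# Growing the buffer while fixing the replay ratio also makes the oldest policy older.
for capacity in [100_000, 1_000_000, 10_000_000]:
    print(capacity, oldest_policy_age(capacity, 0.25))
```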
+## Observations
+
+* With the age of the oldest policy fixed, performance improves with higher replay capacity, probably due to increased state-action coverage.
+
+* With fixed replay capacity, reducing the oldest policy's age improves performance, probably due to the reduced off-policyness of the data in the replay buffer.
+
+* However, in some specific instances (sparse-reward, hard-exploration setups), performance can drop when the oldest policy's age is reduced.
+
+* Increasing the replay capacity while keeping the replay ratio fixed provides varying improvements, depending on the particular values of the replay capacity and the replay ratio.
+
+* The paper reports the effect of these choices for DQN as well.
+
+* Unlike Rainbow, DQN does not improve with larger replay capacity, irrespective of whether the replay ratio or the age of the oldest policy is kept fixed.
+
+* Given that the Rainbow agent is a DQN agent with additional components, the paper explores which of these components leads to an improvement in Rainbow's performance as the replay capacity increases.
+
+## Additive Experiments
+
+* Four new DQN variants are created by adding each of Rainbow's four components to the base DQN agent.
+
+* DQN with n-step returns is the only variant that benefits from increased replay capacity.
+
+* The usefulness of n-step returns is further validated by verifying that a Rainbow agent without n-step returns does not benefit from increased replay capacity, while a Rainbow agent without any of the other components still benefits from the increased capacity.
+
+* Prioritized experience replay does not significantly affect the performance with increased replay capacity.
+
+* The observation that n-step returns are critical for taking advantage of larger replay sizes is surprising because uncorrected n-step returns are theoretically not suitable for off-policy learning.
+
+* The paper tests the limits of increasing replay capacity (with n-step returns) by performing experiments in the offline RL setup: the agent collects a dataset of about 200M frames, and these frames are used to train another agent.
+
+* Even in this extreme setup, n-step returns improve the learning agent's performance.
+
+## Why do n-step returns help?
+
+* Hypothesis 1: n-step returns help to counter the increased off-policyness produced by a larger replay buffer.
+
+  * This hypothesis does not seem to hold, as keeping the oldest policy fixed or using the same contraction factor as an n-step update does not improve the 1-step update's performance.
+
+* Hypothesis 2: Increasing the replay buffer's capacity may reduce the variance of the n-step returns.
+
+  * This hypothesis is evaluated by training on environments with lower variance or by turning off the sticky actions in the Atari domain.
+
+  * While the hypothesis explains the gains from n-step returns to some extent, n-step gains are observed even in environments with low variance.
\ No newline at end of file
diff --git a/site/_site b/site/_site
index 0f81ea84..623ce9d6 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit 0f81ea843098a26808cdbb3d072fd217adf62428
+Subproject commit 623ce9d663fa823837f7168bc35f61e145ee3baf
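As a footnote to the experience-replay summary above, here is a minimal sketch of the uncorrected n-step target that the additive experiments single out; the function and argument names are illustrative, and the bootstrap value is assumed to come from a target network without any off-policy correction.

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """Uncorrected n-step return: sum of n discounted rewards plus a bootstrapped tail.

    rewards: the n rewards following the transition, in order.
    bootstrap_value: e.g., max_a Q(s_{t+n}, a) from a target network (no correction
    for the data being generated by older policies).
    """
    n = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value

# Example: 3-step target as used in Rainbow-style agents.
target = n_step_target(rewards=[0.0, 1.0, 0.0], gamma=0.99, bootstrap_value=2.5)
```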