Commit feea037 (1 parent: 7eb0f0c). Showing 10 changed files with 637 additions and 1 deletion.
56 changes: 56 additions & 0 deletions
...embering for the Right Reasons - Explanations Reduce Catastrophic Forgetting.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
--- | ||
layout: post | ||
title: Remembering for the Right Reasons - Explanations Reduce Catastrophic Forgetting | ||
comments: True | ||
excerpt: | ||
tags: ['2020', 'Catastrophic Forgetting', 'Continual Learning', 'Lifelong Learning', 'Replay Buffer', AI, CL, LL] | ||
|
||
--- | ||
|
||
## Introduction | ||
|
||
* The paper hypothesizes that catastrophic forgetting can happen when the model can no longer rely on the "reasoning" it used for an old datapoint. If so, catastrophic forgetting may be alleviated when the model "remembers" why it made a previous prediction. | ||
* The paper presents a simple instantiation of this hypothesis, in the form of a technique called Remembering for the Right Reasons (RRR). | ||
* The idea is to store model explanations, along with previous examples in the replay buffer. During replay, an additional *explanation loss* is used, along with the regular replay loss. | ||
* [Link to the paper](https://arxiv.org/abs/2010.01528) | ||
* [Link to the code](https://github.com/SaynaEbrahimi/Remembering-for-the-Right-Reasons) | ||
|
||
## Setup | ||
|
||
* The model is trained over a sequence of data distributions in the class-incremental learning setup. A single-head architecture is used so that the task ID is not required during inference. | ||
* Along with the standard replay buffer ($$M^{rep}$$) for the raw input examples (from different tasks), another replay buffer ($$M^{RRR}$$) is maintained for storing the "explanations" (in the form of saliency maps), corresponding to examples in $$M^{rep}$$. | ||
* RRR is implemented as an L1 loss on the error between the saliency map generated after training on the current task and the saliency map in $$M^{RRR}$$. | ||
* Saliency maps need to be generated while the model is training. This requirement rules out black-box saliency methods, which can be used only after training. | ||
* The gradient-based white-box explainability techniques that are used include: | ||
* Vanilla backpropagation - Perform a forward pass through the model and take the gradient of the given output class with respect to the input. | ||
* Backpropagation with SmoothGrad - Saliency maps generated using Vanilla backpropagation can be visually noisy. These maps can be improved by adding pixel-wise Gaussian noise to *n* copies of the image and averaging the resulting gradients. The paper used *n=40*. | ||
* Gradient-weighted Class Activation Mapping (Grad-CAM) - Uses gradients to determine the importance of feature map activations on a given prediction. | ||
* RRR can be easily used with memory and regularization based approaches. | ||
* The paper combined RRR with the following standard Class Incremental Learning (CIL) models: | ||
* [iTAML : An incremental task-agnostic meta-learning approach](https://arxiv.org/abs/2003.11652) | ||
* [End-to-end incremental learning (EEIL)](https://arxiv.org/abs/1807.09536) | ||
* [Large scale incremental learning (BiC)](https://arxiv.org/abs/1905.13260) | ||
* [TOpology-Preserving knowledge InCrementer (TOPIC)](https://arxiv.org/abs/2004.10956) | ||
* [iCaRL: Incremental Classifier and Representation Learning](https://arxiv.org/abs/1611.07725) | ||
* [Elastic Weight Consolidation](https://arxiv.org/abs/1612.00796) | ||
* [Learning without forgetting](https://arxiv.org/abs/1606.09282) | ||
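The explanation loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `weight` coefficient, and the choice to average (rather than sum) the absolute differences are assumptions made for the example.

```python
import numpy as np


def rrr_explanation_loss(current_maps, stored_maps):
    """L1 error between saliency maps produced by the current model and the
    saliency maps stored in the explanation buffer (M^RRR).

    Both arguments are arrays of shape (num_examples, height, width).
    Averaging over all pixels is an assumption of this sketch.
    """
    current_maps = np.asarray(current_maps, dtype=float)
    stored_maps = np.asarray(stored_maps, dtype=float)
    return float(np.mean(np.abs(current_maps - stored_maps)))


def total_replay_loss(replay_loss, current_maps, stored_maps, weight=1.0):
    """Regular replay loss plus the weighted explanation loss.

    `weight` is a hypothetical trade-off coefficient, not a value from the paper.
    """
    return replay_loss + weight * rrr_explanation_loss(current_maps, stored_maps)
```

In an actual training loop the saliency maps would have to be computed with a differentiable method (such as Grad-CAM) so that the explanation loss can be backpropagated through the model.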
|
||
## Experiments | ||
|
||
### Few-Shot Class Incremental Learning | ||
|
||
* The setup is C-way K-shot class incremental learning: each incremental task has C classes with K training samples per class, and the first task consists of b base classes. | ||
* Caltech-UCSD Birds dataset with 100 base classes and remaining 100 classes divided into ten tasks, with three samples per class. The test set is not changed. | ||
* In terms of saliency maps, Grad-CAM is better than Vanilla Backpropagation, which in turn is comparable to SmoothGrad. The same trend is seen in terms of memory overhead, with Grad-CAM having the least memory overhead. | ||
* Adding the RRR loss improves the performance of all the baselines. | ||
|
||
### Standard Class Incremental Learning | ||
|
||
* CIFAR100 and ImageNet100 with a memory budget of 2000 samples. | ||
* Adding the RRR loss improves all the baselines' performance, and the gains for ImageNet100 are more significant than the gains for CIFAR100. | ||
|
||
### How often does the model remember its decision for the right reason? | ||
|
||
* The paper uses the Pointing Game (PG) experiment, which uses the ground truth image segmentation to define the true object region. | ||
* If the maximum attention location (in the predicted saliency map) falls inside the objects, it is considered a *hit*, else a *miss*. A *hit* on a previous example is considered a proxy for the model remembering its decision for the right reason. | ||
* The precision and recall are reported for the *hit* metric. Using RRR increases both precision (i.e., the model less often makes the correct decision without looking at the right evidence) and recall (i.e., the model less often makes an incorrect decision despite looking at the proper evidence).
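The hit/miss rule of the Pointing Game can be illustrated with a short sketch. It assumes a 2D saliency map and a binary segmentation mask of the same shape; the function name is hypothetical, and ties in the maximum are broken by taking the first maximal location.

```python
import numpy as np


def pointing_game_hit(saliency_map, object_mask):
    """Return True if the most salient location falls inside the object region.

    saliency_map: 2D array of attention values.
    object_mask:  2D boolean array derived from the ground-truth segmentation.
    """
    saliency_map = np.asarray(saliency_map)
    object_mask = np.asarray(object_mask, dtype=bool)
    if saliency_map.shape != object_mask.shape:
        raise ValueError("saliency map and mask must have the same shape")
    # Location of the maximum attention value (first one if there are ties).
    row, col = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
    return bool(object_mask[row, col])
```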
72 changes: 72 additions & 0 deletions
site/_posts/2020-10-19-Learning Explanations That Are Hard To Vary.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
--- | ||
layout: post | ||
title: Learning Explanations That Are Hard To Vary | ||
comments: True | ||
excerpt: | ||
tags: ['2020', AI, Invariance] | ||
|
||
--- | ||
|
||
## Introduction | ||
|
||
* The paper builds on the principle "good explanations are hard to vary" to propose that *invariant mechanisms* can be identified by finding explanations (say model parameters) that are hard to vary across examples. | ||
* [Link to the paper](https://arxiv.org/abs/2009.00329) | ||
* [Link to the code](https://github.com/gibipara92/learning-explanations-hard-to-vary) | ||
|
||
## Setup | ||
|
||
* Collection of *d* different datasets (from different environments). Each dataset is a collection of input-target tuples. | ||
* Objective is to learn a function *f* (also called *mechanism*) to map the input to the target (for all the environments). | ||
* The standard approach is to pool the loss for examples corresponding to the different environments and perform gradient updates on this average-pooled loss. | ||
* In this standard gradient-based setup, the model may not learn invariances due to the following reasons: | ||
* Model learned the spurious features first, and now the training loss is too small. | ||
* The pooled loss is generally computed by summing (or averaging) the loss corresponding to individual examples. Thus the gradient for each example is calculated independently. Each sample can be thought of as a dataset of size 1, for which all the features are relevant. | ||
* Gradient descent with averaging (of gradients across the environments) greedily maximizes learning speed, not invariance. | ||
* Performing arithmetic mean can be seen as performing an OR operation (i.e., the sum can be high if any one of the constituents is high), whereas performing geometric mean can be seen as performing an AND operation (i.e., the product can be high only if all the constituents are high). | ||
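The OR/AND analogy in the last bullet can be checked numerically. The sketch below compares the two means on non-negative per-environment signal strengths; the helper names are illustrative and the inputs are made-up numbers, not values from the paper.

```python
import numpy as np


def arithmetic_mean(values):
    """Behaves like OR: one large entry is enough to give a large result."""
    return float(np.mean(np.asarray(values, dtype=float)))


def geometric_mean(values):
    """Behaves like AND: a single zero entry forces the result to zero.

    Assumes all entries are non-negative.
    """
    values = np.asarray(values, dtype=float)
    return float(np.prod(values) ** (1.0 / values.size))
```

For example, for the strengths `[4.0, 0.0]` the arithmetic mean is still 2.0, while the geometric mean is 0.0 because one environment provides no signal.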
|
||
### Invariant Learning Consistency (ILC) | ||
|
||
* Given an algorithm $$A$$, let $$\theta_{A}^{*}$$ denote the set of convergence points of $$A$$ when trained on all the environments. | ||
* Each convergence point is associated with a consistency score. | ||
* Intuitively, given a convergence point and an environment *e*, find the set of parameters equivalent to the convergence point (in terms of loss) with respect to *e*. Call this set *S*. | ||
* Evaluate the points in this set on all the remaining environments. For the given convergence point, an environment *e'* is consistent with *e* if the maximum difference in the loss between the two environments is small for all points belonging to *S*. | ||
* This idea is used to define the invariant learning consistency score for algorithm $$A$$, which measures the expected consistency of the converged points (on the pooled data) across all the environments. | ||
* The paper shows that the converged points' consistency is linked to the Hessians' geometric mean and that for the convex quadratic case, using the elementwise geometric mean of gradients improves consistency. | ||
* However, there are some practical challenges: | ||
* Geometric mean is defined only when all signs are consistent. This issue can potentially be handled by treating different signs as 0. | ||
* There is very little flexibility in "partial" agreement, and even a single zero gradient component can stop optimization for that component. This can probably be handled by not masking if many environments have a gradient for that component. | ||
* The geometric mean needs to be computed in the log-domain (for numerical stability), but that can be computationally more expensive. | ||
* When using adaptive optimizers like Adam, the exact magnitude of geometric mean will be ignored because of rescaling for the local curvature adaptation. | ||
* Some of these challenges can be handled using average gradients when the geometric mean would be 0 and masking out components based on the sign. | ||
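A minimal sketch of an elementwise geometric mean of gradients, computed in the log-domain and set to zero wherever the signs disagree, is given below. It only illustrates the workarounds listed above; the function name and the exact handling of zero gradients are assumptions, not the paper's procedure.

```python
import numpy as np


def signed_geometric_mean(env_grads):
    """Elementwise geometric mean of per-environment gradients.

    env_grads: array of shape (num_envs, num_params).
    Components whose signs are not identical across all environments
    (including components with a zero gradient in some environment)
    are set to 0.
    """
    env_grads = np.asarray(env_grads, dtype=float)
    signs = np.sign(env_grads)
    agree = np.all(signs > 0, axis=0) | np.all(signs < 0, axis=0)
    # Replace disagreeing entries by 1 so that the logarithm is well defined.
    safe_magnitudes = np.where(agree, np.abs(env_grads), 1.0)
    # Geometric mean of magnitudes via the mean of logs (log-domain).
    magnitudes = np.exp(np.mean(np.log(safe_magnitudes), axis=0))
    return np.where(agree, signs[0] * magnitudes, 0.0)
```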
|
||
### AND-mask | ||
|
||
* The ideas from the previous section can be used to develop a practical algorithm called AND-mask. | ||
* Zero-out gradients that have inconsistent signs across some threshold number (hyper-parameter) of environments. | ||
* In the presence of purely random gradient patterns, the AND-mask decreases the signals' strength exponentially fast. | ||
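The masking rule can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions: agreement is measured as the absolute value of the mean gradient sign across environments, `agreement_threshold` is a hypothetical hyper-parameter name, and surviving components use the plain average gradient.

```python
import numpy as np


def and_mask_gradient(env_grads, agreement_threshold=1.0):
    """Average per-environment gradients, zeroing components with inconsistent signs.

    env_grads: array of shape (num_envs, num_params).
    agreement_threshold: value in [0, 1]; 1.0 requires all environments to
        agree on the sign of a component for it to be kept.
    """
    env_grads = np.asarray(env_grads, dtype=float)
    # |mean of signs| is 1 when all environments agree and near 0 when they split evenly.
    sign_agreement = np.abs(np.mean(np.sign(env_grads), axis=0))
    mask = sign_agreement >= agreement_threshold
    return mask * np.mean(env_grads, axis=0)
```

With `agreement_threshold=1.0`, the gradients `[[1.0, -1.0], [3.0, 1.0]]` produce `[2.0, 0.0]`: the first component is kept because both environments agree on its sign, and the second is masked out.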
|
||
## Experiments | ||
|
||
### Synthetic Memorization Dataset | ||
|
||
* This is a binary classification task with two kinds of features: (i) "meaningful" features that are shared across environments but harder for the model to learn and (ii) "shortcut" features that are easy to learn but not shared across environments. | ||
* While the dataset may look simple, it is difficult to find the invariant mechanism because the "shortcut" features allow for a simple, linear decision boundary with a large margin that is fast to learn, has perfect accuracy, is robust to input noise, and has no i.i.d. generalization gap. | ||
* Baselines: | ||
* MLPs trained with regularizers like dropout, L1, L2, and batch norm. | ||
* Domain Adversarial Neural Networks (DANN) | ||
* Invariant Risk Minimization (IRM) | ||
* In terms of results, AND-mask with L1/L2 regularizers gives the best results. | ||
* Empirically, the paper shows that the signal from the "meaningful" features is present when the gradients are averaged, but their magnitude is much smaller than the signal from the "shortcut" features. | ||
|
||
### Experiments on CIFAR-10 | ||
|
||
* A ResNet model is trained on the CIFAR-10 dataset with random labels, with and without the AND-mask. | ||
* The model with the AND-mask did not memorize the data, whereas the model without the AND-mask did. As a sanity check, the paper ensured that both models generalize well when trained with the original labels. | ||
* Note that for this experiment, every example was treated as coming from its own environment. | ||
|
||
### Behavioral Cloning on CoinRun | ||
|
||
* Train an expert policy using PPO for 400M steps on the full distribution of levels. | ||
* Generate a dataset of state-action pairs. Training data consists of 1000 states from each of the 64 levels, while the test data comes from 2000 levels. | ||
* A ResNet18 model is used as an imitation learning policy. | ||
* The exact implementation of the AND-mask is a little more involved, but the key takeaway is that the model trained with the AND-mask identifies invariant mechanisms across different levels.
79 changes: 79 additions & 0 deletions
...olution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
--- | ||
layout: post | ||
title: One Solution is Not All You Need - Few-Shot Extrapolation via Structured MaxEnt RL | ||
comments: True | ||
excerpt: | ||
tags: ['2020', 'Deep Reinforcement Learning', 'Latent Variable', 'NeurIPS 2020', 'Reinforcement Learning', AI, DRL, Generalization, NeurIPS, RL] | ||
|
||
|
||
--- | ||
|
||
## Introduction | ||
|
||
* Key idea: Practicing and remembering diverse solutions to a task can lead to robustness to that task's variations. | ||
|
||
* The paper proposes a framework to implement this idea - train multiple policies such that they are *collectively* robust to a new distribution over environments while using a single training environment. | ||
|
||
* [Link to the paper](https://arxiv.org/abs/2010.14484) | ||
|
||
## Setup | ||
|
||
* During training, the agent has access to only one MDP. | ||
|
||
* During the evaluation, the agent encounters a new MDP which has the same state and action space but may have a different reward and transition function. | ||
|
||
* The agent is allowed some interactions (say *k*) with the test MDP and is then evaluated on the test MDP. The setup is referred to as *few-shot robustness*. | ||
|
||
## Structured Maximum Entropy Reinforcement Learning (SMERL) | ||
|
||
* Represent a set of policies using a latent variable policy (i.e., a policy conditioned on a latent variable *z*). | ||
|
||
* This has two benefits: (i) multiple policies can be represented by the same object, and (ii) diverse behaviors can be learned by encouraging the trajectories corresponding to different *z* to be different, while still solving the task. | ||
|
||
* A diversity-inducing objective is used to encourage the agent to learn different trajectories for different *z*. | ||
|
||
* Specifically, the mutual information between *p(Z)* and marginal trajectory distribution for the latent variable policy is maximized, subject to the constraint that each policy achieves close to optimal returns in the train MDP. | ||
|
||
* The mutual information between *p(Z)* and marginal trajectory distribution for the latent variable policy is lower bounded by the sum of mutual information terms over individual states (appearing in the trajectory). | ||
|
||
* An unsupervised reward function is defined using the mutual information between states and latent variables. | ||
|
||
* $$r(s, a) = \log q_{\phi}(z\|s) - \log p(z)$$ where $$q_{\phi}$$ is a learned discriminator. | ||
|
||
* This unsupervised reward is optimized for only when the policy achieves close to an optimal return, i.e., the environment return is close to the optimal return. Otherwise, the agent optimizes only for the environment return. | ||
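The gating of the unsupervised reward can be sketched as follows. This is an illustrative reading of the bullets above, with several assumptions: "close to optimal" is modeled as the return being within a margin `epsilon` of the optimal return, `alpha` is a hypothetical weight on the unsupervised term, and $$p(z)$$ is uniform over `num_latents` values.

```python
import math


def smerl_reward(env_reward, policy_return, optimal_return, epsilon,
                 log_q_z_given_s, num_latents, alpha=1.0):
    """Environment reward plus a gated diversity bonus.

    The bonus log q(z|s) - log p(z) is added only when the policy's return
    is within `epsilon` of the optimal return on the training MDP.
    """
    log_p_z = -math.log(num_latents)  # uniform prior over the latent values
    diversity_bonus = log_q_z_given_s - log_p_z
    near_optimal = policy_return >= optimal_return - epsilon
    if near_optimal:
        return env_reward + alpha * diversity_bonus
    return env_reward
```

When the policy is far from optimal, the function returns the environment reward unchanged, which matches the "otherwise" case in the last bullet.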
|
||
### Implementation | ||
|
||
* SMERL is implemented using SAC with a latent variable maximum entropy policy. | ||
|
||
* The set of latent variables is a fixed discrete set $$Z$$ and $$p(z)$$ is set to be a uniform distribution over this set. | ||
|
||
* At the start of an episode, a $$z$$ is sampled and used throughout the episode. | ||
|
||
* Discriminator $$q_{\phi}(z\|s)$$ is trained to infer $$z$$ from the visited states. | ||
|
||
* A baseline SAC agent is trained beforehand to evaluate if the current training policy achieves close to optimal environment return. | ||
|
||
* During the evaluation, the policy corresponding to each latent variable is executed in the test MDP, and the policy with the maximum return is returned. | ||
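The evaluation step in the last bullet amounts to a simple selection loop. In the sketch below, `rollout_return` is a hypothetical callable that runs the latent-conditioned policy for one episode in the test MDP and returns the episode return; `episodes_per_latent` is likewise an assumed parameter.

```python
def select_best_latent(latents, rollout_return, episodes_per_latent=1):
    """Run the policy for every latent value and keep the best-performing one.

    latents: iterable of latent values z.
    rollout_return: callable z -> return of one episode in the test MDP.
    Returns a tuple (best latent, its mean return).
    """
    best_latent, best_return = None, float("-inf")
    for z in latents:
        returns = [rollout_return(z) for _ in range(episodes_per_latent)]
        mean_return = sum(returns) / len(returns)
        if mean_return > best_return:
            best_latent, best_return = z, mean_return
    if best_latent is None:
        raise ValueError("at least one latent value is required")
    return best_latent, best_return
```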
|
||
## Theoretical Analysis | ||
|
||
* Given an MDP $$M$$ and $$\epsilon>0$$, the MDP robustness set is defined as the set of all MDPs $$M'$$ where the optimal policy of $$M'$$ produces the same trajectory distribution in $$M'$$ as $$M$$. Moreover, on the training MDP $$M$$, the optimal policies (corresponding to $$M$$ and $$M'$$) obtain similar returns. | ||
|
||
* The paper shows that SMERL generalizes to MDPs belonging to the robustness set. | ||
|
||
* It also provides a simplified view of the optimization objective and shows how it naturally leads to a trajectory-centric mutual information objective. | ||
|
||
## Experiments | ||
|
||
* Environments | ||
|
||
* 2D navigation environments with point mass. | ||
|
||
* MuJoCo environments: HalfCheetah-Goal, Walker2d-Velocity, Hopper-Velocity. | ||
|
||
* On the 2D navigation environment, the paper shows that SMERL learns to use different trajectories to reach the goal. | ||
|
||
* On the MuJoCo setup, the evaluation shows that SMERL either outperforms the best-performing baseline or is close to it on the different tasks. | ||
|
||
* Generally, higher train performance does not correlate with higher test performance, and there is no single policy that performs the best across all the tasks. Thus, it should be beneficial to learn multiple diverse policies that can be selected from during testing. |