1 parent 58ff681 · commit 5439837
Showing 5 changed files with 223 additions and 1 deletion.
93 changes: 93 additions & 0 deletions
..._posts/2020-06-18-On the Difficulty of Warm-Starting Neural Network Training.md
@@ -0,0 +1,93 @@

---
layout: post
title: On the Difficulty of Warm-Starting Neural Network Training
comments: True
excerpt:
tags: ['2019', 'Incremental Learning', 'Online Learning', 'Transfer Learning', AI, Empirical]

---

## Introduction

* The paper considers learning scenarios where the training data becomes available incrementally (rather than all at once).

* For example, in some applications new data arrives periodically (e.g., the latest news articles come out every day).

* The paper highlights that, in such scenarios, the conventional wisdom of "warm starting" does not apply.

* When new data becomes available, it is better to train a new model from scratch than to update a model trained on the previously available data.

* While the two setups reach similar training performance, the randomly initialized model generalizes much better.

* [Link to the paper](https://arxiv.org/abs/1910.08475)

## Basic Batch Updating

* Create two random, equally sized partitions of the training data.

* Train the model till convergence on the first half of the data, then train it on the entire dataset (a rough sketch of this protocol follows the list).

* Models: ResNet18, MLPs, Logistic Regression (LR)

* Datasets: CIFAR10, CIFAR100, SVHN

* Optimizers: Adam, SGD

* Warm starting hurts generalization in all the cases.

* The effect is more pronounced for ResNets and MLPs (compared to LR) and on the harder CIFAR10 dataset (compared to SVHN).
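
A rough sketch of this two-phase comparison, assuming placeholder `make_model` and data loaders; the epoch budget and learning rate below are arbitrary choices of mine, not values from the paper:

```python
from torch import nn, optim

def train(model, loader, epochs=50, lr=1e-3):
    # "till convergence" is approximated here by a fixed epoch budget
    opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def warm_vs_fresh(make_model, first_half_loader, full_loader):
    # Phase 1: train on a random half of the data.
    warm = train(make_model(), first_half_loader)
    # Phase 2a: warm start -- keep the weights and continue on the full dataset.
    warm = train(warm, full_loader)
    # Phase 2b: fresh start -- train a newly initialized model on the full dataset.
    fresh = train(make_model(), full_loader)
    return warm, fresh  # compare test accuracy to see the generalization gap
```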

## Online Learning

### Passive Online Learning

* The model is given access to k new training examples at each iteration.

* A warm-started model reuses the previously trained weights and trains (till convergence) on the new batch of k examples.

* A "randomly initialized" model is trained from scratch on all the examples seen so far (a minimal sketch of this loop follows the list).

* Dataset: CIFAR10

* Model: ResNet18

* As more training data becomes available, the generalization gap between the two setups grows, and warm starting hurts generalization.
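
A minimal sketch of that comparison loop, following the description above; `make_model`, `train`, and `stream_of_batches` are placeholder names I am assuming, not names from the paper:

```python
def online_comparison(make_model, train, stream_of_batches):
    warm = make_model()
    seen = []
    for new_batch in stream_of_batches:  # k new labelled examples per round
        seen.append(new_batch)
        # Warm start: keep the previous weights and fit the newly arrived batch.
        warm = train(warm, [new_batch])
        # Random init: retrain from scratch on everything seen so far.
        fresh = train(make_model(), seen)
        yield warm, fresh  # evaluate both on a held-out test set each round
```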

### Active Online Learning

* In this setup, the learner samples the k new examples to add to the training dataset (using margin-based sampling).

* As in the previous setup, the warm-start strategy still hurts generalization.

## Transfer Learning

* Train a ResNet18 model on the CIFAR10 dataset and use it to warm-start training on the SVHN dataset.

* When a small percentage of the SVHN dataset is used, the setup resembles pretraining / transfer learning and performs better than training from scratch.

* As the percentage of the SVHN dataset increases, the warm-start approach starts underperforming.

## Overcoming the warm start problem

* Setup: ResNet18 model on the CIFAR10 dataset.

* With a hyperparameter sweep over the learning rate and batch size, warm-started models can be trained to reach the same generalization performance as training from scratch.

* However, in that case there are no computational savings, as the warm-started models take about as long to converge as the randomly initialized ones.

* The increased training time suggests that the warm-started model probably needs to forget the knowledge acquired in previous training rounds.

* Warm-started ResNet models that generalize well show a low correlation to their initialization (measured via the Pearson correlation coefficient between model weights).

* Generalization is damaged even when the model used for warm starting was trained on the incomplete data for only a few epochs.

* For warm-started models, the gradient (corresponding to the "new" data) is larger than for randomly initialized models. This hints that regularization might close the generalization gap, but in practice regularization helps both the warm-started and the randomly initialized models.

* Warm-starting only a few layers also does not close the gap.

* Adding some noise to the warm-started model (with the motivation of having a partially random initialization) helps somewhat but also increases the training time (a minimal sketch of this idea follows at the end of this post).

* Framing the problem as an instance of catastrophic forgetting, the authors apply the EWC algorithm but report that it hurts model performance.

* The paper does not propose a solution to the problem, but it provides a thorough analysis of the problem setup, which is quite useful for understanding the phenomenon itself.
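
A minimal sketch of the noise idea mentioned above, assuming a PyTorch model; the noise scale is an arbitrary choice of mine, not a value from the paper:

```python
import torch

@torch.no_grad()
def noisy_warm_start(model, noise_std=0.01):
    # Perturb the warm-started weights with Gaussian noise so that the
    # initialization for the next round of training is partially random.
    for p in model.parameters():
        p.add_(noise_std * torch.randn_like(p))
    return model
```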
54 changes: 54 additions & 0 deletions
...ization-A Simple Technique for Generalization in Deep Reinforcement Learning.md
@@ -0,0 +1,54 @@

---
layout: post
title: Network Randomization - A Simple Technique for Generalization in Deep Reinforcement Learning
comments: True
excerpt:
tags: ['2019', 'Deep Reinforcement Learning', 'ICLR 2020', 'Reinforcement Learning', AI, DRL, Generalization, ICLR, RL]

---

## Introduction

* The paper proposes a technique for improving the generalization ability of RL agents when they are evaluated on unseen environments that are similar to the training environment.

* [Link to the paper](https://openreview.net/forum?id=HJgcvJBFvB)

* [Link to the code](https://github.com/pokaxpoka/netrand)

## Approach

* The key idea is to learn features that are invariant across environments by using a randomized CNN (*f*) that randomly perturbs the inputs.

* The policy is trained on the randomized observations produced by *f*.

* Invariant features are learned using a feature matching (FM) loss that matches the feature representations of the original and randomized observations.

* The random network's parameters are initialized as $\alpha I + (1 - \alpha) N\left(0, \sqrt{\frac{2}{n_{in} + n_{out}}}\right)$, where $\alpha \in \[0, 1\]$, $N$ denotes the Gaussian distribution, and $n_{in}, n_{out}$ denote the number of input and output channels respectively.

* The Xavier normal distribution is used for randomization to maintain the variance between the input and the randomized input.

* *f* is re-randomized at every iteration.

* During inference, the expected action is computed by averaging over *M* samples (i.e., randomizing the input *M* times). A minimal sketch of these pieces follows the list.
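
A minimal PyTorch sketch of how I read this recipe; the layer shapes, the value of $\alpha$, and the interpretation of $I$ as an identity (pass-through) kernel are my assumptions rather than details confirmed by the summary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomConv(nn.Module):
    """A conv layer whose weights are re-drawn (identity/Xavier mixture) every iteration."""

    def __init__(self, channels, alpha=0.9):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.alpha = alpha
        self.randomize()

    @torch.no_grad()
    def randomize(self):
        c = self.conv.weight.shape[0]
        identity = torch.zeros_like(self.conv.weight)
        identity[torch.arange(c), torch.arange(c), 1, 1] = 1.0   # pass-through kernel
        rand = torch.empty_like(self.conv.weight)
        nn.init.xavier_normal_(rand)                             # N(0, sqrt(2/(n_in + n_out)))
        self.conv.weight.copy_(self.alpha * identity + (1 - self.alpha) * rand)

    def forward(self, x):
        return self.conv(x)

def fm_loss(encoder, rand_layer, obs):
    # Feature matching: keep representations of clean and randomized inputs close.
    return F.mse_loss(encoder(rand_layer(obs)), encoder(obs).detach())

@torch.no_grad()
def expected_action(policy, encoder, rand_layer, obs, M=10):
    # Average the policy output over M random perturbations of the input.
    total = 0.0
    for _ in range(M):
        rand_layer.randomize()
        total = total + policy(encoder(rand_layer(obs)))
    return (total / M).argmax(dim=-1)
```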

## Environments

* 2D CoinRun, 3D DeepMind Lab, 3D Robotics Control Task

* The evaluation environments consist of different styles of backgrounds, objects, and floors.

## Baselines

* Regularization methods: Dropout, L2 regularization, Batch Normalization

* Dataset augmentation methods: Cutout, Gray out, Inversion, Color Jitter

## Results

* On CoinRun, the proposed approach significantly outperforms the other baselines during evaluation. The performance improvement saturates at around *M* = 10 samples.

* Cycle consistency is used to measure the similarity between two trajectories. The proposed method improves cycle consistency compared to the vanilla PPO baseline and also produces sharper activation maps in the evaluation environments.

* In the large-scale experiments, when evaluated on 500 levels of CoinRun, the proposed method improves the success rate from 39.8% to 58.7%.

* On DeepMind Lab and the Surreal robotics control tasks, the proposed method leads to agents that generalize better to the unseen (evaluation) environments.
70 changes: 70 additions & 0 deletions
site/_posts/2020-07-02-When to use parametric models in reinforcement learning.md
@@ -0,0 +1,70 @@

---
layout: post
title: When to use parametric models in reinforcement learning?
comments: True
excerpt:
tags: ['2019', 'Deep Reinforcement Learning', 'Model-Based', 'Model-Free', 'Neurips 2019', 'Reinforcement Learning', AI, DRL, Neurips, Planning, RL]

---

## Introduction

* The paper compares replay-based approaches with model-based approaches in Reinforcement Learning (RL).

* It hypothesizes that if the parametric model is only used for generating transitions for the update rule, then, under certain conditions, replay-based approaches will be as good as model-based approaches.

* [Link to the paper](https://arxiv.org/abs/1906.05243)

## Terminology

* Planning: any algorithm that uses additional computation (but not additional experience) to improve its performance.

* Learning: any algorithm that uses additional experience to improve its performance.

* In some cases, a replay buffer can be seen as a model. For example, querying the buffer with an observed state-action pair is similar to querying a model for the (expected) next state and reward. In general, a model is more flexible, since it can be queried with arbitrary state-action pairs (a minimal sketch of this view follows the list).
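
A minimal sketch of the "replay buffer as a model" view, in plain Python; the class and method names are mine:

```python
import random
from collections import defaultdict

class ReplayAsModel:
    """Query a replay buffer like a model: (s, a) -> (r, s'), but only at observed pairs."""

    def __init__(self):
        self.outcomes = defaultdict(list)  # (s, a) -> list of (r, s') actually seen

    def add(self, s, a, r, s_next):
        self.outcomes[(s, a)].append((r, s_next))

    def query(self, s, a):
        # A learned model could answer for arbitrary (s, a); the buffer cannot.
        if (s, a) not in self.outcomes:
            raise KeyError("replay can only be queried at observed (s, a) pairs")
        return random.choice(self.outcomes[(s, a)])  # sampling approximates the expected (r, s')
```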

## Computation Properties

* Parametric models require more computation than sampling from a replay buffer, while the cost of maintaining a replay buffer scales linearly with its capacity.

* Parametric models are useful for planning multiple steps into the future, while doing so with a replay buffer is much harder (even more so with pixel observations).

* An imperfect model may be more suitable for selecting actions (rather than for updating the policy), because the chosen action, when executed in the environment, produces transitions that improve the model.

* When planning with an imperfect model, it is better to plan backward, as the update is applied to an imaginary state (which would likely never be encountered if the model is poor, so the error does less harm).

* If the model is accurate, forward and backward planning are equivalent. This distinction between forward and backward updates does not apply to replay buffers (a minimal sketch of the distinction follows the list).
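
A minimal tabular sketch of that forward/backward distinction on a toy deterministic chain; the function names and the toy "models" are my own illustration, not the paper's algorithms:

```python
import numpy as np

gamma, alpha = 0.9, 0.5
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))

def forward_planning_update(Q, model, s, a):
    # Forward: from a real state s, let the (possibly wrong) model imagine the
    # outcome; the update lands on the value of the *real* pair (s, a).
    r_hat, s_next_hat = model(s, a)
    Q[s, a] += alpha * (r_hat + gamma * Q[s_next_hat].max() - Q[s, a])

def backward_planning_update(Q, inverse_model, s):
    # Backward: imagine a predecessor (s_prev, a_prev) of a real state s; the
    # update lands on the *imagined* pair, which may never be visited if the
    # model is poor, so the error does less damage.
    s_prev_hat, a_prev_hat, r_hat = inverse_model(s)
    Q[s_prev_hat, a_prev_hat] += alpha * (r_hat + gamma * Q[s].max() - Q[s_prev_hat, a_prev_hat])

# Toy deterministic chain "models", purely for illustration.
model = lambda s, a: (0.0, min(s + 1, n_states - 1))
inverse_model = lambda s: (max(s - 1, 0), 1, 0.0)
forward_planning_update(Q, model, s=2, a=1)
backward_planning_update(Q, inverse_model, s=2)
```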

## Failure to learn

* When using a replay buffer and (i) uniformly replaying transitions, (ii) from a buffer containing only full episodes, and (iii) using TD updates, the algorithm is stable.

* When using a replay buffer and (i) uniformly replaying transitions, (ii) generating transitions using a model, and (iii) using TD updates, the algorithm can diverge.

* This case can be fixed by:
    * Repeatedly iterating over the model and sampling transitions *to* and *from* the states the model generates (not a satisfactory solution).
    * Using multiple-step returns, which can increase the variance (a minimal sketch of an n-step target follows this list).
    * Using algorithms designed specifically for stable off-policy learning (not a definitive solution).
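
A minimal sketch of the multi-step return mentioned above, in my own notation; `rewards` and `values` are assumed to be per-step arrays from a trajectory:

```python
def n_step_target(rewards, values, t, n, gamma=0.99):
    """Bootstrapped n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * V(s_{t+n})."""
    g = sum((gamma ** k) * rewards[t + k] for k in range(n))
    return g + (gamma ** n) * values[t + n]

# e.g., with n = 20 (the bootstrapping horizon used for Rainbow DQN below),
# the target for step t sums 20 rewards before bootstrapping on V.
```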

## Model-based algorithms at scale

* The paper compares SimPLe (model-based) with Rainbow DQN (replay-based).

* It shows that, for a similar number of real environment interactions, Rainbow DQN needs fewer replay samples than SimPLe needs model samples, making it more efficient computation-wise.

* Changes made to Rainbow DQN:
    * Increase the number of steps used for bootstrapping (n-step returns) from 3 to 20.
    * Reduce the number of steps taken before sampling from the replay buffer begins, from 20K to 1600.

* With these changes, Rainbow DQN outperforms SimPLe in 17 out of 26 games.

## Conclusion

* When a parametric model is used in a replay-like setting (sampling observed states from the past), model-based learning can be unstable (in theory), and using a replay buffer is likely the better strategy under that state-sampling distribution.

* Parametric models are likely more useful when:
    * planning backward for credit assignment: even if the model is inaccurate, backward planning only updates fictional states.
    * planning forward for behavior: the resulting plan is only used to collect real *experience* in the environment (and not to directly update the policy).
Submodule _site updated from df8644 to dabee5