site/_posts/2021-02-08-Continual learning with hypernetworks.md

---
layout: post
title: Continual learning with hypernetworks
comments: True
excerpt:
tags: ['2019', 'Continual Learning', 'ICLR 2020', 'Lifelong Learning', AI, CL, HyperNetwork, ICLR, LL]

---

## Introduction

* The paper proposes the use of task-conditioned [HyperNetworks](https://shagunsodhani.com/papers-I-read/HyperNetworks) for lifelong learning / continual learning setups.

* The idea is that the HyperNetwork only needs to remember the task-conditioned weights, not the input-output mapping for all the data points.

* [Link to the paper](https://arxiv.org/abs/1906.00695)

* [Author's Implementation](https://github.com/chrhenning/hypercl)

## Terminology

* $f$ denotes the network for the given $t^{th}$ task.

* $h$ denotes the HyperNetwork that generates the weights for $f$.

* $\Theta_{h}$ denotes the parameters of $h$.

* $e^{t}$ denotes the input task-embedding for the $t^{th}$ task.

## Approach

* When training on the $t^{th}$ task, the HyperNetwork generates the weights for the network $f$.

* The current task loss is computed using the generated weights, and the candidate weight update ($\Delta \Theta_{h}$) is computed for $h$.

* The actual parameter change is computed by minimizing the following expression (a rough sketch of this objective is given at the end of this section):

$L_{total} = L_{task}(\Theta_{h}, e^{T}, X^{T}, Y^{T}) + \frac{\beta_{output}}{T-1} \sum_{t=1}^{T-1} \| f_{h}(e^{t}, \Theta_{h}^*) - f_{h}(e^{t}, \Theta_{h} + \Delta \Theta_{h}) \|^2$

* $L_{task}$ is the loss for the current task.

* $(X^{T}, Y^{T})$ denotes the training datapoints for the $T^{th}$ task.

* $\beta_{output}$ is a hyperparameter that controls the regularizer's strength.

* $\Theta_{h}^*$ denotes the optimal parameters after training on the first $T-1$ tasks.

* $\Theta_{h} + \Delta \Theta_{h}$ denotes the one-step update on the current $h$ model.

* In practice, the task encoding $e^{t}$ is chunked into smaller vectors, and these vectors are fed as input to the HyperNetwork.

* This enables the HyperNetwork to produce weights iteratively, instead of all at once, thus helping to scale to larger models (see the chunking sketch at the end of this section).

* The paper also considers the problem of inferring the task embedding from a given input pattern.

* Specifically, the paper uses task-dependent uncertainty, where the task embedding with the least predictive uncertainty is chosen for the given unknown input. This approach is referred to as HNET+ENT.

* The paper also considers using HyperNetworks to learn the weights for a task-specific generative model. This generative model is used to generate pseudo samples for rehearsal-based approaches. The paper considers two cases:

  * HNET+R, where the replay model (i.e., the generative model) is parameterized using a HyperNetwork.

  * HNET+TIR, where an auxiliary task inference classifier is used to predict the task identity.
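
A minimal PyTorch-style sketch of the two-term objective above. The names (`hnet`, `forward_fn`, `task_loss_fn`, `beta_output`) are placeholders rather than the authors' API, and, for simplicity, the regularizer below is applied to the current parameters $\Theta_{h}$ instead of the one-step lookahead $\Theta_{h} + \Delta \Theta_{h}$ used in the paper.

```python
import torch

def total_loss(hnet, forward_fn, task_loss_fn, task_embs, x, y,
               stored_outputs, beta_output):
    """Current-task loss plus the hypernetwork output regularizer.

    task_embs      : list of task embeddings e^1 ... e^T (the last one is the current task)
    stored_outputs : for each old task t < T, the list of weight tensors h(e^t, Theta_h^*)
                     produced *before* training on task T (detached snapshots)
    hnet(e)        : returns a list of weight tensors for the target network
    """
    T = len(task_embs)

    # L_task: loss of the target network on the current task, using weights
    # generated by the hypernetwork for the current task embedding.
    weights_T = hnet(task_embs[-1])
    l_task = task_loss_fn(forward_fn(x, weights_T), y)

    # Output regularizer: keep the weights generated for old task embeddings
    # close to the snapshots taken before training on the new task.
    reg = 0.0
    for t in range(T - 1):
        w_old = stored_outputs[t]          # fixed targets, no gradient
        w_new = hnet(task_embs[t])         # depends on the current Theta_h
        reg = reg + sum(((a - b.detach()) ** 2).sum() for a, b in zip(w_new, w_old))

    if T > 1:
        reg = beta_output / (T - 1) * reg
    return l_task + reg
```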
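
The chunking trick mentioned above can be sketched as follows; this is a rough illustration with made-up dimensions and learned chunk embeddings, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChunkedHyperNetwork(nn.Module):
    """Produce a large weight vector chunk by chunk instead of all at once."""

    def __init__(self, task_emb_dim, chunk_emb_dim, chunk_size, n_chunks, hidden_dim=100):
        super().__init__()
        # One learned embedding per chunk, shared across all tasks.
        self.chunk_embs = nn.Parameter(torch.randn(n_chunks, chunk_emb_dim))
        self.net = nn.Sequential(
            nn.Linear(task_emb_dim + chunk_emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, chunk_size),
        )

    def forward(self, task_emb):
        chunks = []
        for chunk_emb in self.chunk_embs:
            # Condition the (small) shared network on both the task and the chunk.
            inp = torch.cat([task_emb, chunk_emb], dim=-1)
            chunks.append(self.net(inp))
        # Flat vector that is later split and reshaped into the target network's layers.
        return torch.cat(chunks)
```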

## Experiments

* Three setups are considered:

  * CL1 - Task identity is given to the model.

  * CL2 - Task identity is not given, but task-specific heads are used.

  * CL3 - Task identity needs to be explicitly inferred.

* On the permuted MNIST task, the proposed approach outperforms baselines like Synaptic Intelligence and Online EWC, and the performance gap is more significant for longer task sequences.

* Forward knowledge transfer is observed with the CIFAR datasets.

* One potential limitation (more a limitation of HyperNetworks in general) is that HyperNetworks may be harder to scale to larger models like ResNet50 or transformers, limiting their usefulness for lifelong learning use cases.

---
layout: post
title: When Do Curricula Work?
comments: True
excerpt:
tags: ['2020', 'Curriculum Learning', 'ICLR 2021', AI, Empirical, ICLR]

---

## Introduction

* The paper systematically investigates when curriculum learning helps.

* [Link to the paper](https://arxiv.org/abs/2012.03107)

## Implicit Curricula

* Implicit curricula refer to the order in which a network learns data points when trained using stochastic gradient descent with iid sampling of the data.

* Say that, during training, the model first predicts a given datapoint correctly in the $i^{th}$ epoch and keeps predicting it correctly in all subsequent epochs. The $i^{th}$ epoch is then referred to as the *learned iteration* of the datapoint (the iteration in which the datapoint was learned); a small sketch of this computation follows at the end of this section.

* The paper studied multiple models (VGG, ResNet, WideResNet, DenseNet, and EfficientNet) with different optimizers (Adam and SGD with momentum).

* The resulting implicit curricula are broadly consistent within the model families, making the following discussion less dependent on the model architecture.
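
A small sketch of how the *learned iteration* could be computed from per-epoch correctness records; the array layout is an assumption for illustration, not the paper's code.

```python
import numpy as np

def learned_iteration(correct):
    """correct: boolean array of shape (n_epochs, n_examples), where
    correct[i, j] is True if example j was predicted correctly at epoch i.

    Returns, per example, the first epoch i such that the prediction is correct
    at epoch i and at every later epoch (n_epochs if the example is never learned).
    """
    n_epochs, n_examples = correct.shape
    # suffix_correct[i, j] is True iff example j is predicted correctly from epoch i onwards.
    suffix_correct = np.flip(np.cumprod(np.flip(correct, axis=0), axis=0), axis=0).astype(bool)
    learned = np.full(n_examples, n_epochs)
    for j in range(n_examples):
        hits = np.nonzero(suffix_correct[:, j])[0]
        if hits.size > 0:
            learned[j] = hits[0]
    return learned
```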

## Explicit Curricula

* When defining an explicit curriculum, three important components stand out.

### Scoring Function

* Maps a data point to a numerical score of *difficulty*.

* Choices:

  * Loss function for a model

  * *learned iteration*

  * Estimated c-score - it captures how consistently a given model predicts a given datapoint's label correctly when trained on iid samples of the dataset that do not contain that datapoint (a rough estimation sketch follows at the end of this subsection).

* The three scoring functions are computed for two models on the CIFAR dataset.

* The resulting six scores have a high Spearman rank correlation. Hence, for the rest of the discussion, only the c-score is used.
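
A rough sketch of one way to estimate a c-score by repeatedly training on random subsets that exclude the point being scored; `train_model` and `predict` are hypothetical helpers standing in for an actual training loop.

```python
import numpy as np

def estimate_cscore(x, y, train_model, predict, n_runs=10, subset_frac=0.9, seed=0):
    """For each point, average how often models trained on iid subsets
    that exclude the point predict its label correctly."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hits = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(n_runs):
        in_train = rng.random(n) < subset_frac      # points used for training this run
        model = train_model(x[in_train], y[in_train])
        held_out = ~in_train
        preds = predict(model, x[held_out])
        hits[held_out] += (preds == y[held_out])
        counts[held_out] += 1
    # Points that were never held out have a count of 0; avoid dividing by zero.
    return hits / np.maximum(counts, 1)
```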

### Pacing Function

* This function, denoted by $g(t)$, controls the size of the training dataset at step $t$.

* At step $t$, the model is trained on the first $g(t)$ examples (as per the ordering).

* Choices: logarithmic, exponential, step, linear, quadratic, and root (sketched in code at the end of this subsection).
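
The pacing-function families listed above could look like the following sketch, which maps a training step $t$ to a dataset size $g(t)$; the exact parameterization (starting fraction, ramp length) is an assumption, while the paper sweeps over such parameters.

```python
import numpy as np

def pacing(kind, t, total_steps, n_examples, start_frac=0.1, ramp_frac=0.8):
    """Return g(t): how many of the (ordered) examples are available at step t."""
    g0 = start_frac * n_examples          # initial dataset size
    ramp_steps = ramp_frac * total_steps  # step at which the full dataset is reached
    p = min(t / ramp_steps, 1.0)          # progress through the ramp, in [0, 1]
    if kind == "linear":
        g = g0 + (n_examples - g0) * p
    elif kind == "quadratic":
        g = g0 + (n_examples - g0) * p ** 2
    elif kind == "root":
        g = g0 + (n_examples - g0) * np.sqrt(p)
    elif kind == "exponential":
        g = g0 * (n_examples / g0) ** p   # geometric ramp from g0 up to n_examples
    elif kind == "step":
        g = g0 if p < 1.0 else n_examples
    elif kind == "logarithmic":
        g = g0 + (n_examples - g0) * np.log1p(9 * p) / np.log(10)
    else:
        raise ValueError(f"unknown pacing function: {kind}")
    return int(min(max(g, 1), n_examples))
```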

### Order

* Order in which the data points are picked (a sketch combining the ordering and the pacing function follows below):

  * *Curriculum* - ordering points from lowest score to highest and training on the easiest data points first.

  * *Anti-curriculum* - ordering points from highest score to lowest and training on the hardest data points first.

  * *Random* - randomly selecting the data points to train on.
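
Putting the pieces together, here is a minimal sketch of how the ordering and the pacing function jointly determine what each batch is sampled from; it reuses the hypothetical `pacing` helper sketched above.

```python
import numpy as np

def make_order(scores, kind, seed=0):
    """Return an ordering of example indices given a NumPy array of difficulty scores."""
    if kind == "curriculum":            # easiest (lowest score) first
        return np.argsort(scores)
    if kind == "anti-curriculum":       # hardest (highest score) first
        return np.argsort(-scores)
    if kind == "random":                # ignore the scores entirely
        return np.random.default_rng(seed).permutation(len(scores))
    raise ValueError(f"unknown order: {kind}")

def sample_batch(order, t, total_steps, batch_size, pacing_kind="linear"):
    """Sample a batch uniformly from the first g(t) examples of the ordering."""
    g_t = pacing(pacing_kind, t, total_steps, n_examples=len(order))
    pool = order[:g_t]
    rng = np.random.default_rng(t)
    return rng.choice(pool, size=min(batch_size, g_t), replace=False)
```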

## Observations

* The paper performed a hyperparameter sweep over 180 pacing functions and three orderings, with three random seeds, over the CIFAR10 and CIFAR100 datasets. For both datasets, the best performance is obtained with random ordering, indicating that curricula did not give any benefits.

* However, curricula are useful when the number of training iterations is small.

* Curricula also help when training with noisy data (simulated by randomly permuting the labels).

* The observations for the smaller CIFAR10/100 datasets generalize to slightly larger datasets like FOOD101 and FOOD101N.