From c967563c8dd765e629e395c3add2061cbb3353f7 Mon Sep 17 00:00:00 2001
From: Shagun Sodhani
Date: Sun, 11 Apr 2021 11:47:19 -0400
Subject: [PATCH] Update papers

---
 README.md                                      |  2 +
 ...8-Continual learning with hypernetworks.md  | 78 +++++++++++++++++++
 .../2021-02-15-When Do Curricula Work.md       | 74 ++++++++++++++++++
 site/_site                                     |  2 +-
 4 files changed, 155 insertions(+), 1 deletion(-)
 create mode 100755 site/_posts/2021-02-08-Continual learning with hypernetworks.md
 create mode 100755 site/_posts/2021-02-15-When Do Curricula Work.md

diff --git a/README.md b/README.md
index 21417f41..04e7621b 100755
--- a/README.md
+++ b/README.md
@@ -4,6 +4,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 ## List of papers
+* [When Do Curricula Work?](https://shagunsodhani.com/papers-I-read/When-Do-Curricula-Work)
+* [Continual learning with hypernetworks](https://shagunsodhani.com/papers-I-read/Continual-learning-with-hypernetworks)
 * [Zero-shot Learning by Generating Task-specific Adapters](https://shagunsodhani.com/papers-I-read/Zero-shot-Learning-by-Generating-Task-specific-Adapters)
 * [HyperNetworks](https://shagunsodhani.com/papers-I-read/HyperNetworks)
 * [Energy-based Models for Continual Learning](https://shagunsodhani.com/papers-I-read/Energy-based-Models-for-Continual-Learning)
diff --git a/site/_posts/2021-02-08-Continual learning with hypernetworks.md b/site/_posts/2021-02-08-Continual learning with hypernetworks.md
new file mode 100755
index 00000000..650e2122
--- /dev/null
+++ b/site/_posts/2021-02-08-Continual learning with hypernetworks.md
@@ -0,0 +1,78 @@
+---
+layout: post
+title: Continual learning with hypernetworks
+comments: True
+excerpt:
+tags: ['2019', 'Continual Learning', 'ICLR 2020', 'Lifelong Learning', AI, CL, HyperNetwork, ICLR, LL]
+
+---
+
+## Introduction
+
+* The paper proposes the use of task-conditioned [HyperNetworks](https://shagunsodhani.com/papers-I-read/HyperNetworks) for lifelong learning / continual learning setups.
+
+* The idea is that the HyperNetwork only needs to remember the task-conditioned weights, and not the input-output mapping for all the data points.
+
+* [Link to the paper](https://arxiv.org/abs/1906.00695)
+
+* [Author's Implementation](https://github.com/chrhenning/hypercl)
+
+## Terminology
+
+* $f$ denotes the network for the given $t^{th}$ task.
+
+* $h$ denotes the HyperNetwork that generates the weights for $f$.
+
+* $\Theta_{h}$ denotes the parameters of $h$.
+
+* $e^{t}$ denotes the input task embedding for the $t^{th}$ task.
+
+## Approach
+
+* When training on the $t^{th}$ task, the HyperNetwork generates the weights for the network $f$.
+
+* The current task loss is computed using the generated weights, and the candidate weight update ($\Delta \Theta_{h}$) for $h$ is computed from it.
+
+* The actual parameter update is obtained by minimizing the following regularized loss:
+
+$L_{total} = L_{task}(\Theta_{h}, e^{T}, X^{T}, Y^{T}) + \frac{\beta_{output}}{T-1} \sum_{t=1}^{T-1} \| f_{h}(e^{t}, \Theta_{h}^*) - f_{h}(e^{t}, \Theta_{h} + \Delta \Theta_{h})\|^2$
+
+* $L_{task}$ is the loss for the current task.
+
+* $(X^{T}, Y^{T})$ denotes the training datapoints for the $T^{th}$ task.
+
+* $\beta_{output}$ is a hyperparameter that controls the regularizer's strength.
+
+* $\Theta_{h}^*$ denotes the hypernetwork parameters obtained after training on the first $T-1$ tasks.
+
+* $\Theta_{h} + \Delta \Theta_{h}$ denotes the one-step update on the current model $h$.
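+
+* The snippet below is a minimal PyTorch-style sketch of this objective, not the authors' implementation. The callables `hnet`, `hnet_star`, and `hnet_lookahead` are illustrative stand-ins for the HyperNetwork evaluated at $\Theta_{h}$, $\Theta_{h}^*$, and $\Theta_{h} + \Delta \Theta_{h}$ respectively; each maps a task embedding to a flat weight vector for $f$.
+
+```python
+def hypernet_cl_loss(target_net, task_loss_fn,
+                     hnet, hnet_star, hnet_lookahead,
+                     e_T, x_T, y_T, prev_embeddings, beta_output):
+    # Current-task loss on (X^T, Y^T), using weights generated for embedding e^T.
+    w_T = hnet(e_T)
+    loss = task_loss_fn(target_net(x_T, w_T), y_T)
+
+    # Output regularizer over the T-1 previous task embeddings: the weights the
+    # updated HyperNetwork would generate should stay close to the stored outputs.
+    reg = 0.0
+    for e_t in prev_embeddings:
+        w_old = hnet_star(e_t).detach()   # f_h(e^t, Theta_h^*), fixed target
+        w_new = hnet_lookahead(e_t)       # f_h(e^t, Theta_h + Delta Theta_h)
+        reg = reg + (w_old - w_new).pow(2).sum()
+
+    if prev_embeddings:
+        loss = loss + beta_output / len(prev_embeddings) * reg
+    return loss
+```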
+
+* In practice, the task embedding $e^{t}$ is chunked into smaller vectors, and these vectors are fed as input to the HyperNetwork.
+
+* This enables the HyperNetwork to produce the weights iteratively, instead of all at once, thus helping it scale to larger models.
+
+* The paper also considers the problem of inferring the task embedding from a given input pattern.
+
+* Specifically, the paper uses task-dependent uncertainty, where the task embedding with the least predictive uncertainty is chosen for the given (unknown) input. This approach is referred to as HNET+ENT.
+
+* The paper also considers using HyperNetworks to learn the weights of a task-specific generative model. This generative model is used to generate pseudo-samples for rehearsal-based approaches. The paper considers two cases:
+
+  * HNET+R, where the replay model (i.e., the generative model) is parameterized using a HyperNetwork.
+
+  * HNET+TIR, where an auxiliary task-inference classifier is used to predict the task identity.
+
+## Experiments
+
+* Three setups are considered:
+
+  * CL1 - Task identity is given to the model.
+
+  * CL2 - Task identity is not given, but task-specific heads are used.
+
+  * CL3 - Task identity needs to be explicitly inferred.
+
+* On the permuted MNIST task, the proposed approach outperforms baselines like Synaptic Intelligence and Online EWC, and the performance gap is more significant for longer task sequences.
+
+* Forward knowledge transfer is observed on the CIFAR datasets.
+
+* One potential limitation (which is more a limitation of HyperNetworks in general) is that HyperNetworks may be harder to scale to larger models like ResNet50 or transformers, thus limiting their usefulness for lifelong learning use cases.
\ No newline at end of file
diff --git a/site/_posts/2021-02-15-When Do Curricula Work.md b/site/_posts/2021-02-15-When Do Curricula Work.md
new file mode 100755
index 00000000..2d29f701
--- /dev/null
+++ b/site/_posts/2021-02-15-When Do Curricula Work.md
@@ -0,0 +1,74 @@
+---
+layout: post
+title: When Do Curricula Work?
+comments: True
+excerpt:
+tags: ['2020', 'Curriculum Learning', 'ICLR 2021', AI, Empirical, ICLR]
+
+---
+
+## Introduction
+
+* The paper systematically investigates when curriculum learning helps.
+
+* [Link to the paper](https://arxiv.org/abs/2012.03107)
+
+## Implicit Curricula
+
+* Implicit curricula refer to the order in which a network learns data points when trained using stochastic gradient descent with iid sampling of the data.
+
+* During training, suppose the model first makes a correct prediction for a given datapoint in the $i^{th}$ epoch (and keeps predicting it correctly in all subsequent epochs). The $i^{th}$ epoch is referred to as the *learned iteration* of the datapoint (the iteration in which the datapoint was learned).
+
+* The paper studied multiple models (VGG, ResNet, WideResNet, DenseNet, and EfficientNet) with different optimizers (Adam and SGD with momentum).
+
+* The resulting implicit curricula are broadly consistent within the model families, making the following discussion less dependent on the model architecture.
+
+## Explicit Curricula
+
+* When defining an explicit curriculum, three components stand out.
+
+### Scoring Function
+
+* Maps a data point to a numerical score of *difficulty*.
+
+* Choices:
+
+  * Loss function for a model
+
+  * *learned iteration*
+
+  * Estimated c-score - captures how consistently a given model predicts a given datapoint's label correctly when trained on an iid dataset that does not contain the datapoint.
+
+* The three scoring functions are computed for two models on the CIFAR dataset.
+
+* The resulting six scores have a high Spearman rank correlation. Hence, only the c-score is used for the rest of the discussion.
+
+### Pacing Function
+
+* This function, denoted by $g(t)$, controls the size of the training dataset at step $t$.
+
+* At step $t$, the model is trained on the first $g(t)$ examples (as per the ordering).
+
+* Choices: logarithmic, exponential, step, linear, quadratic, and root.
+
+### Order
+
+* Order in which the data points are picked:
+
+  * *Curriculum* - Ordering points from lowest score to highest and training on the easiest data points first.
+
+  * *Anti-curriculum* - Ordering points from highest score to lowest and training on the hardest data points first.
+
+  * *Random* - Randomly selecting the data points to train on.
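+
+* The sketch below shows one way the three components could fit together; the function names and the linear pacing schedule are illustrative choices, not the paper's exact setup.
+
+```python
+import numpy as np
+
+def pacing_linear(t, total_steps, n_examples, start_frac=0.1):
+    """g(t): number of (ordered) examples available at step t; linear family.
+    The other families (log, exp, step, quadratic, root) only change the growth shape."""
+    frac = min(1.0, start_frac + (1.0 - start_frac) * t / total_steps)
+    return max(1, int(frac * n_examples))
+
+def make_order(difficulty_scores, mode, seed=0):
+    """Order indices by the scoring function (e.g., the c-score treated as difficulty)."""
+    order = np.argsort(difficulty_scores)          # curriculum: easiest first
+    if mode == "anti_curriculum":
+        order = order[::-1]                        # hardest first
+    elif mode == "random":
+        order = np.random.default_rng(seed).permutation(len(difficulty_scores))
+    return order
+
+def sample_batch(order, t, total_steps, batch_size, rng):
+    """At step t, draw a batch uniformly from the first g(t) ordered examples."""
+    pool = order[: pacing_linear(t, total_steps, len(order))]
+    return rng.choice(pool, size=min(batch_size, len(pool)), replace=False)
+```
+
+* With `mode="random"`, batches are drawn iid from a growing random pool; as noted in the observations below, this ordering turns out to be the hardest to beat in the standard setting.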
+
+## Observations
+
+* The paper performs a hyperparameter sweep over 180 pacing functions and three orderings, with three random seeds, on the CIFAR10 and CIFAR100 datasets. For both datasets, the best performance is obtained with the random ordering, indicating that curricula provide no benefit in the standard setting.
+
+* However, curricula are useful when the number of training iterations is small.
+
+* They also help when training with noisy data (simulated by randomly permuting the labels).
+
+* The observations on the smaller CIFAR10/100 datasets generalize to slightly larger datasets like FOOD101 and FOOD101N.
+
diff --git a/site/_site b/site/_site
index f898116a..64e043b0 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit f898116a395b873ddda09d3b3e1480ed91031d5d
+Subproject commit 64e043b0a3f315f3dc73f1c086c7485a66676384