Commit

Update papers
shagunsodhani committed Apr 11, 2021
1 parent eef99dc commit c967563
Showing 4 changed files with 155 additions and 1 deletion.
2 changes: 2 additions & 0 deletions README.md
@@ -4,6 +4,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [When Do Curricula Work?](https://shagunsodhani.com/papers-I-read/When-Do-Curricula-Work)
* [Continual learning with hypernetworks](https://shagunsodhani.com/papers-I-read/Continual-learning-with-hypernetworks)
* [Zero-shot Learning by Generating Task-specific Adapters](https://shagunsodhani.com/papers-I-read/Zero-shot-Learning-by-Generating-Task-specific-Adapters)
* [HyperNetworks](https://shagunsodhani.com/papers-I-read/HyperNetworks)
* [Energy-based Models for Continual Learning](https://shagunsodhani.com/papers-I-read/Energy-based-Models-for-Continual-Learning)
78 changes: 78 additions & 0 deletions site/_posts/2021-02-08-Continual learning with hypernetworks.md
@@ -0,0 +1,78 @@
---
layout: post
title: Continual learning with hypernetworks
comments: True
excerpt:
tags: ['2019', 'Continual Learning', 'ICLR 2020', 'Lifelong Learning', AI, CL, HyperNetwork, ICLR, LL]

---

## Introduction

* The paper proposes the use of task-conditioned [HyperNetworks](https://shagunsodhani.com/papers-I-read/HyperNetworks) for lifelong learning / continual learning setups.

* The idea is that the HyperNetwork only needs to remember the task-conditioned weights, not the input-output mapping for every data point.

* [Link to the paper](https://arxiv.org/abs/1906.00695)

* [Authors' Implementation](https://github.com/chrhenning/hypercl)

## Terminology

* $f$ denotes the network for the given $t^{th}$ task.

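* $h$ denotes the HyperNetwork that generates the weights for $f$. Its output for a task embedding $e$ and parameters $\Theta_{h}$ is written as $f_{h}(e, \Theta_{h})$.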

* $\Theta_{h}$ denotes the parameters of $h$.

* $e^{t}$ denotes the input task-embedding for the $t^{th}$ task.

## Approach

* When training on the $t^{th}$ task, the HyperNetwork generates the weights for the network $f$.

* The current task loss is computed using the generated weights, and the candidate weight update ($\Delta \Theta_{h}$) is computed for $h$.

* The actual parameter change is then computed by minimizing the following total loss:

$L_{total} = L_{task}(\Theta_{h}, e^{T}, X^{T}, Y^{T}) + \frac{\beta_{output}}{T-1} \sum_{t=1}^{T-1} \| f_{h}(e^{t}, \Theta_{h}^*) - f_{h}(e^{t}, \Theta_{h} + \Delta \Theta_{h}) \|^2$

* $L_{task}$ is the loss for the current task.

* $(X^{T}, Y^{T})$ denotes the training datapoints for the $T^{th}$ task.

* $\beta_{output}$ is a hyperparameter to control the regularizer's strength.

* $\Theta_{h}^*$ denotes the HyperNetwork parameters obtained after training on the first $T-1$ tasks.

* $\Theta_{h} + \Delta \Theta_{h}$ denotes the one-step update on the current $h$ model.

* In practice, the HyperNetwork receives an additional learned chunk embedding alongside the task embedding $e^{t}$ and is invoked once per chunk, each time producing only a part of the weights of $f$.

* This enables the HyperNetwork to produce weights iteratively, instead of all at once, thus helping to scale to larger models.

* The paper also considers the problem of inferring the task embedding from a given input pattern.

* Specifically, the paper uses task-dependent uncertainty, where the task embedding with the least predictive uncertainty is chosen as the task embedding for the given unknown task. This approach is referred to as HNET+ENT.

* The paper also considers using HyperNetworks to learn the weights for a task-specific generative model. This generative model will be used to generate pseudo samples for rehearsal-based approaches. The paper considers two cases:

* HNET+R where the replay model (i.e., the generative model) is parameterized using a HyperNetwork.

* HNET+TIR, where an auxiliary task inference classifier is used to predict the task identity.
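
The output regularizer in $L_{total}$ can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: `linear_hypernetwork` is a hypothetical stand-in for $h$ (a single matrix product), and the names `theta_star`, `theta_candidate`, and `previous_embeddings` are assumptions chosen to mirror $\Theta_{h}^*$, $\Theta_{h} + \Delta \Theta_{h}$, and $e^{1}, \dots, e^{T-1}$.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def linear_hypernetwork(embedding: Sequence[float], theta: Matrix) -> Vector:
    """Toy stand-in for h: maps a task embedding to a flat weight vector."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in theta]


def output_regularizer(
    hnet: Callable[[Sequence[float], Matrix], Vector],
    previous_embeddings: Sequence[Sequence[float]],
    theta_star: Matrix,
    theta_candidate: Matrix,
    beta_output: float,
) -> float:
    """beta_output / (T - 1) * sum_t ||h(e_t, theta_star) - h(e_t, theta_candidate)||^2."""
    if not previous_embeddings:
        return 0.0  # first task: there is nothing to protect yet
    total = 0.0
    for embedding in previous_embeddings:
        old_weights = hnet(embedding, theta_star)
        new_weights = hnet(embedding, theta_candidate)
        total += sum((o - n) ** 2 for o, n in zip(old_weights, new_weights))
    return beta_output * total / len(previous_embeddings)


def total_loss(task_loss: float, regularizer: float) -> float:
    """L_total = L_task + output regularizer."""
    return task_loss + regularizer


theta_star = [[1.0, 0.0], [0.0, 1.0]]
theta_candidate = [[1.0, 0.0], [0.0, 2.0]]
previous_embeddings = [[1.0, 1.0], [0.0, 2.0]]
penalty = output_regularizer(
    linear_hypernetwork, previous_embeddings, theta_star, theta_candidate, beta_output=0.5
)
print(total_loss(task_loss=0.3, regularizer=penalty))
```

The penalty is zero when the candidate parameters reproduce the stored weights for every earlier task embedding, which is the sense in which only the task-conditioned weights need to be remembered.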

## Experiments

* Three setups are considered:

* CL1 - Task identity is given to the model.

* CL2 - Task identity is not given and does not need to be inferred explicitly; a single output head is shared across tasks.

* CL3 - Task identity needs to be explicitly inferred.

* On the permuted MNIST task, the proposed approach outperforms baselines like Synaptic Intelligence and Online EWC, and the performance gap widens for longer task sequences.

* Forward knowledge transfer is observed with the CIFAR datasets.

* One potential limitation (which is more of a limitation of HyperNetworks) is that HyperNetworks may be harder to scale for larger models like ResNet50 or transformers, thus limiting their usefulness for lifelong learning use cases.
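
For setups where the task identity has to be inferred, the entropy-based selection behind HNET+ENT can be sketched as follows. This is a hedged illustration, not the paper's code: each entry of `predict_fns` is a hypothetical callable standing for the target network $f$ loaded with the weights generated from one task embedding, and it is assumed to return class probabilities.

```python
import math
from typing import Callable, List, Sequence


def entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def infer_task(
    predict_fns: Sequence[Callable[[Sequence[float]], List[float]]],
    x: Sequence[float],
) -> int:
    """Return the index of the task whose conditioned model is least uncertain on x."""
    entropies = [entropy(predict(x)) for predict in predict_fns]
    return min(range(len(entropies)), key=entropies.__getitem__)


# Two hypothetical task-conditioned models: the first is unsure, the second is confident.
unsure_model = lambda x: [0.5, 0.5]
confident_model = lambda x: [0.9, 0.1]
print(infer_task([unsure_model, confident_model], x=[0.0, 1.0]))
```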
74 changes: 74 additions & 0 deletions site/_posts/2021-02-15-When Do Curricula Work.md
@@ -0,0 +1,74 @@
---
layout: post
title: When Do Curricula Work?
comments: True
excerpt:
tags: ['2020', 'Curriculum Learning', 'ICLR 2021', AI, Empirical, ICLR]

---

## Introduction

* The paper systematically investigates when curriculum learning helps.

* [Link to the paper](https://arxiv.org/abs/2012.03107)

## Implicit Curricula

* Implicit curricula refer to the order in which a network learns data points when trained using stochastic gradient descent, with iid sampling of data.

* During training, suppose the model predicts a given datapoint correctly in the $i^{th}$ epoch and in every subsequent epoch, and $i$ is the earliest such epoch. The $i^{th}$ epoch is referred to as the *learned iteration* of the datapoint (the iteration in which the datapoint was learned).

* The paper studied multiple models (VGG, ResNet, WideResNet, DenseNet, and EfficientNet) with different optimizers (Adam and SGD with momentum).

* The resulting implicit curricula are broadly consistent within the model families, making the following discussion less dependent on the model architecture.
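
The definition of the *learned iteration* can be made concrete with a small sketch. The input format is an assumption: `correct_per_epoch` is a hypothetical list with one boolean per epoch, recording whether the model predicted the datapoint correctly in that epoch.

```python
from typing import Optional, Sequence


def learned_iteration(correct_per_epoch: Sequence[bool]) -> Optional[int]:
    """Earliest 1-indexed epoch from which every prediction is correct.

    Returns None if the datapoint is misclassified in the final epoch,
    i.e. it was never (stably) learned within the recorded epochs.
    """
    learned = None
    for epoch, correct in enumerate(correct_per_epoch, start=1):
        if correct:
            if learned is None:
                learned = epoch  # start of a new run of correct predictions
        else:
            learned = None  # the datapoint was forgotten; reset
    return learned


# Correct in epoch 2, forgotten in epoch 3, then correct from epoch 4 onwards.
print(learned_iteration([False, True, False, True, True]))
```

Sorting datapoints by this value gives the order in which the network picked them up, which is the implicit curriculum discussed above.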

## Explicit Curricula

* When defining an explicit curriculum, three important components stand out.

### Scoring Function

* Maps a data point to a numerical score of *difficulty*.

* Choices:

* Loss function for a model

* *learned iteration*

* Estimated c-score - It captures how consistently a given model predicts a datapoint's label correctly when trained on iid samples of the dataset that do not contain the datapoint.

* The three scoring functions are computed for two models on the CIFAR dataset.

* The resulting six scores have a high Spearman Rank correlation. Hence for the rest of the discussion, only the c-score is used.
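
As a reminder of what the Spearman rank correlation measures, here is a minimal sketch that compares two scorings of the same datapoints by their ranks. It assumes there are no tied scores (the closed-form formula below is only exact in that case), and the example scores are made up for illustration.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank of each value (1 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    squared_rank_diffs = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * squared_rank_diffs / (n * (n ** 2 - 1))


# Two hypothetical difficulty scorings that order four datapoints identically.
score_a = [0.1, 0.4, 0.2, 0.9]
score_b = [1.0, 5.0, 2.0, 8.0]
print(spearman(score_a, score_b))
```

A value close to 1 (or to -1, for scores that grow in opposite directions) means the two scoring functions induce nearly the same ordering, so choosing one of them loses little.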

### Pacing Function

* This function, denoted by $g(t)$, controls the size of the training dataset at step $t$.

* At step $t$, the model would be trained on the first $g(t)$ examples (as per the ordering).

* Choices: logarithmic, exponential, step, linear, quadratic, and root.
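
A pacing function can be sketched as a map from the training step to a pool size. The parameterization below is an assumption made for illustration (it is not taken from the post): `start_fraction` is the fraction of the data available at step 0, and `full_at` is the fraction of training after which the full dataset is used. Only three of the listed shapes are shown.

```python
import math
from typing import Callable, Dict

# Shape functions map training progress in [0, 1] to a growth factor in [0, 1].
SHAPES: Dict[str, Callable[[float], float]] = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    "root": math.sqrt,
}


def pacing(
    t: int,
    total_steps: int,
    num_examples: int,
    kind: str = "linear",
    start_fraction: float = 0.25,
    full_at: float = 0.5,
) -> int:
    """g(t): number of (ordered) examples available at step t."""
    progress = min(1.0, t / (full_at * total_steps))
    fraction = start_fraction + (1.0 - start_fraction) * SHAPES[kind](progress)
    return max(1, min(num_examples, int(fraction * num_examples)))


for kind in ("quadratic", "linear", "root"):
    print(kind, [pacing(t, 100, 1000, kind) for t in (0, 25, 50, 100)])
```

With these assumed settings, the quadratic shape releases data more slowly than the linear one, and the root shape releases it faster, while all three start and end at the same pool sizes.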

### Order

* Order in which the data points are picked:

* *Curriculum* - Ordering points from lowest score to highest and training on the easiest data points first.

* *Anti Curriculum* - Ordering points from highest score to lowest and training on the hardest data points first.

* *Random* - Randomly selecting the data points to train on.
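
The three orderings, combined with a pool size $g(t)$, can be sketched as below. The function names and the `seed` argument are hypothetical; `scores` is assumed to hold one difficulty score per datapoint, with lower meaning easier.

```python
import random
from typing import List, Sequence


def order_indices(scores: Sequence[float], ordering: str, seed: int = 0) -> List[int]:
    """Return datapoint indices arranged according to the chosen ordering."""
    indices = list(range(len(scores)))
    if ordering == "curriculum":
        return sorted(indices, key=lambda i: scores[i])  # easiest first
    if ordering == "anti-curriculum":
        return sorted(indices, key=lambda i: scores[i], reverse=True)  # hardest first
    if ordering == "random":
        random.Random(seed).shuffle(indices)  # scores are ignored
        return indices
    raise ValueError(f"unknown ordering: {ordering}")


def training_pool(ordered_indices: Sequence[int], pool_size: int) -> List[int]:
    """The first g(t) examples of the ordering; batches are sampled from this pool."""
    return list(ordered_indices[:pool_size])


scores = [0.9, 0.1, 0.5]
print(training_pool(order_indices(scores, "curriculum"), pool_size=2))
print(training_pool(order_indices(scores, "anti-curriculum"), pool_size=2))
```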


## Observations

* The paper performed a hyperparameter sweep over 180 pacing functions and three orderings for three random seeds over the CIFAR10 and CIFAR100 datasets. For both datasets, the best performance is obtained with random ordering, indicating that curricula did not give any benefits.

* However, the curriculum is useful when the number of training iterations is small.

* It also helps when training with noisy labels (simulated by randomly permuting the labels).

* The observations for the smaller CIFAR10/100 datasets generalize to slightly larger datasets like FOOD101 and FOOD101N.

2 changes: 1 addition & 1 deletion site/_site
Submodule _site updated from f89811 to 64e043
