diff --git a/README.md b/README.md
index 31de5a09..c6c656e4 100755
--- a/README.md
+++ b/README.md
@@ -5,6 +5,7 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 ## List of papers
 
+* [Efficient Lifelong Learning with A-GEM](https://shagunsodhani.in/papers-I-read/Efficient-Lifelong-Learning-with-A-GEM)
 * [Pre-training Graph Neural Networks with Kernels](https://shagunsodhani.in/papers-I-read/Pre-training-Graph-Neural-Networks-with-Kernels)
 * [Smooth Loss Functions for Deep Top-k Classification](https://shagunsodhani.in/papers-I-read/Smooth-Loss-Functions-for-Deep-Top-k-Classification)
 * [Hindsight Experience Replay](https://shagunsodhani.in/papers-I-read/Hindsight-Experience-Replay)
diff --git a/site/_posts/2018-05-21-Net2Net - Accelerating Learning via Knowledge Transfer.md b/site/_posts/2018-05-21-Net2Net - Accelerating Learning via Knowledge Transfer.md
index df658f51..407060ca 100755
--- a/site/_posts/2018-05-21-Net2Net - Accelerating Learning via Knowledge Transfer.md
+++ b/site/_posts/2018-05-21-Net2Net - Accelerating Learning via Knowledge Transfer.md
@@ -3,7 +3,7 @@ layout: post
 title: Net2Net-Accelerating Learning via Knowledge Transfer
 comments: True
 excerpt: The paper presents a simple yet effective approach for transferring knowledge from a trained neural network to a large, untrained network
-tags: ['2016', 'ICLR 2016', 'Knowledge Transfer', 'Life Long Learning', AI, CV]
+tags: ['2016', 'ICLR 2016', 'Knowledge Transfer', 'Lifelong Learning', AI, CV]
 ---
 
 ## Notes
diff --git a/site/_posts/2019-01-08-Efficient Lifelong Learning with A-GEM.md b/site/_posts/2019-01-08-Efficient Lifelong Learning with A-GEM.md
new file mode 100755
index 00000000..2e8f2c56
--- /dev/null
+++ b/site/_posts/2019-01-08-Efficient Lifelong Learning with A-GEM.md
@@ -0,0 +1,105 @@
+---
+layout: post
+title: Efficient Lifelong Learning with A-GEM
+comments: True
+excerpt: The paper proposes A-GEM, a more computationally efficient version of Gradient Episodic Memory, along with a more realistic evaluation protocol and metrics for lifelong learning
+tags: ['2019', 'Catastrophic Forgetting', 'Continual Learning', 'ICLR 2019', 'Lifelong Learning', AI, CV, CL, ICLR]
+---
+
+## Contributions
+
+* A new (and more realistic) evaluation protocol for lifelong learning, where each data point is observed just once and disjoint sets of tasks are used for training and validation.
+
+* A new metric that focuses on the efficiency of the models - in terms of sample complexity and computational (and memory) costs.
+
+* A modification of [Gradient Episodic Memory (GEM)](https://arxiv.org/abs/1706.08840) that reduces the computational overhead of GEM without compromising on the results.
+
+* Empirical validation that using task descriptors helps lifelong learning models and improves their few-shot learning capabilities.
+
+* [Link to the paper](https://arxiv.org/abs/1812.00420)
+
+* [Link to the code](https://github.com/facebookresearch/agem/)
+
+## Learning Protocol
+
+* Two groups of datasets - one for training and evaluation (D^{EV}) and the other for cross-validation (D^{CV}).
+
+* Data can be sampled multiple times from the cross-validation datasets but only once from the training datasets.
+
+* Each group of datasets (D^{EV} or D^{CV}) is a list of task-specific datasets D_k, where k is the task index.
+
+* Each sample in D_k is of the form (x, t, y), where x is the data, t is the task descriptor and y is the output.
+
+* D_k contains B_k minibatches of data.
+
+## Metrics
+
+### Accuracy
+
+* a_{k,i,j} = accuracy on the test set of task j after the model has been trained on the i-th minibatch of task k.
+
+* A_k = (1/k) * sum over j = 1 to k of a_{k,B_k,j}, i.e., train the model on all the data for task k and then evaluate it on all the tasks seen so far.
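+
+A minimal sketch of how A_k could be computed from a matrix of recorded accuracies (the names here are illustrative, not from the paper's code):
+
+```python
+import numpy as np
+
+# acc[k-1][i-1][j-1] stores a_{k,i,j}: the accuracy on the test set of task j,
+# measured after training on the i-th minibatch of task k (tasks are 1-indexed).
+def average_accuracy(acc, k):
+    """A_k: mean test accuracy over tasks 1..k after all of task k's minibatches."""
+    last = acc[k - 1][-1]  # accuracies recorded after the B_k-th minibatch of task k
+    return float(np.mean([last[j] for j in range(k)]))
+
+# e.g., 2 tasks with 3 minibatches each:
+# acc = [[[.90, .10], [.92, .10], [.95, .10]],
+#        [[.90, .50], [.88, .70], [.85, .90]]]
+# average_accuracy(acc, 2) -> (0.85 + 0.90) / 2 = 0.875
+```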
+
+### Forgetting Measure
+
+* f_j^k = forgetting on task j after the model has been trained on all the minibatches up to task k.
+
+* f_j^k = max over l = 1 to k-1 of (a_{l,B_l,j} - a_{k,B_k,j})
+
+* Forgetting: F_k = (1/(k-1)) * sum over j = 1 to k-1 of f_j^k
+
+### LCA - Learning Curve Area
+
+* Z_b = average b-shot performance, where b is the number of minibatches seen so far.
+
+* Z_b = (1/T) * sum over k = 1 to T of a_{k,b,k}
+
+* LCA_β = (1/(β+1)) * sum over b = 0 to β of Z_b
+
+* One special case is LCA_0, which measures forward transfer - the performance on a task before any of its data has been seen.
+
+* In the experiments, β is kept small, since the model is expected to learn from just a few examples.
+
+## Model
+
+* GEM has been shown to be very effective in the single-epoch setting, but it introduces a very high computational overhead.
+
+* Averaged GEM (A-GEM) reduces this overhead by sampling (and using) only some examples from the episodic memory instead of using all of them.
+
+* While GEM provides better guarantees in terms of worst-case forgetting, A-GEM provides better guarantees in terms of average accuracy (see the sketch below).
+
+## Joint Embedding Model Using Compositional Task Descriptors
+
+* Compositional task descriptors are used to speed up training on subsequent tasks.
+
+* A matrix specifying the attribute values of the objects (to be recognized in the task) is used.
+
+* A joint embedding space between image features and attribute embeddings is learned.
+
+## Experiments
+
+### Datasets
+
+* [Permuted MNIST](https://arxiv.org/abs/1612.00796)
+
+* [Split CIFAR](https://arxiv.org/abs/1703.04200)
+
+* [Split CUB](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html)
+
+* [Split AWA](http://cvml.ist.ac.at/papers/lampert-cvpr2009.pdf)
+
+### Setup
+
+* Integer task descriptors are used for MNIST and CIFAR, while class attributes are used as descriptors for CUB and AWA.
+
+* Baselines include [GEM](https://arxiv.org/abs/1706.08840), [iCaRL](https://arxiv.org/abs/1611.07725), [Elastic Weight Consolidation](https://arxiv.org/pdf/1612.00796.pdf), [Progressive Neural Networks](https://arxiv.org/abs/1606.04671), etc.
+
+## Results
+
+* A-GEM outperforms the other models on all the datasets except MNIST, where Progressive Neural Networks lead. One reason could be that MNIST has a large number of training examples per task. However, Progressive Neural Networks also make poor use of capacity.
+
+* While A-GEM and GEM achieve similar performance, GEM has a much higher computational and memory overhead.
+
+* Using task descriptors improves the accuracy of all the models.
+
+* Overall, A-GEM seems to offer a good tradeoff between average accuracy and efficiency - in terms of sample complexity, memory requirements and computational costs.
diff --git a/site/_site b/site/_site
index 1bb47d3b..197e7692 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit 1bb47d3be348374758763f448cdf58d78fdd52ac
+Subproject commit 197e76927f70727a5a3ca3b29575f90b66d26909
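
For reference, the A-GEM update described in the Model section of the new post reduces to a single gradient projection: the gradient g on the current minibatch is kept as-is when it does not conflict with a reference gradient g_ref computed on a minibatch sampled from episodic memory, and is otherwise projected so that the average loss on the memory does not increase. A minimal numpy sketch (variable names are mine, not from the official implementation):

```python
import numpy as np

def agem_update_direction(g, g_ref):
    """A-GEM update direction for flattened gradient vectors.

    g:     gradient of the loss on the current task's minibatch.
    g_ref: gradient of the loss on a minibatch drawn from episodic memory.
    """
    dot = g @ g_ref
    if dot >= 0:
        # g does not increase the average loss on the memory; keep it.
        return g
    # Otherwise, remove the component of g that conflicts with g_ref.
    return g - (dot / (g_ref @ g_ref)) * g_ref
```

Compared to GEM, which solves a quadratic program against one reference gradient per previous task, this needs only a couple of dot products per step - which is where A-GEM's computational savings come from.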