Commit 4ff09a0
1 parent 3210413
Showing 6 changed files with 314 additions and 1 deletion.
67 changes: 67 additions & 0 deletions
...2020-07-16-Averaging Weights leads to Wider Optima and Better Generalization.md
@@ -0,0 +1,67 @@
---
layout: post
title: Averaging Weights leads to Wider Optima and Better Generalization
comments: True
excerpt:
tags: ['2018', 'Stochastic Gradient Descent', 'UAI 2018', AI, Generalization, SGD, SWA, UAI]
---
## Introduction

* The paper proposes the Stochastic Weight Averaging (SWA) procedure for improving the generalization performance of models trained with SGD (with a cyclic or constant learning rate).

* Specifically, the model is checkpointed at several points along the training trajectory, and these checkpoints are averaged (in parameter space) to obtain a single model.

* [Link to the paper](https://arxiv.org/abs/1803.05407)
## Idea

* "Stochastic" in the name refers to the idea that, with a cyclical or constant learning rate, the SGD proposals are approximately samples from the neural network's loss surface and are hence stochastic.

* SWA uses a learning rate schedule that allows exploration in the weight space.

* SGD with cyclical and constant learning rates explores points (model instances) on the periphery of the set of high-performing networks.

* With different initializations, SGD finds different points (of low training loss) on this boundary, but it does not move inside it.

* Averaging these points provides a mechanism to move inside the periphery.

* The train and test error surfaces, while similar, are not perfectly aligned. Hence, averaging several models (along the optimization trajectory) can lead to a more robust model.
## Algorithm

* Given a model $w$ and some training budget $B$, train the model in the conventional way for approximately 75% of the budget.

* Starting from that point, continue training with the remaining budget, using a constant or cyclical learning rate.

* For a fixed learning rate, checkpoint the model at each epoch. For a cyclical learning rate, checkpoint the model at the lowest learning rate in each cycle.

* Average all the checkpoints to get the SWA model (see the sketch below).

* If the model has Batch Normalization layers, run an additional pass over the training data to compute the SWA model's running mean and standard deviation.

* The computational and space overhead of computing the SWA model is relatively low.

* The paper highlights the ensembling-like effect of SWA by showing that if the model checkpoints ($w_i$) are generated by training with Fast Geometric Ensembling (FGE), the difference between averaging the weights and averaging the predictions is of the order $O(\Delta)$, where $\Delta = \max \|\|w_i - w_{SWA}\|\|$.

* Note that SWA does not have the overhead of an extra forward pass during inference.
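A minimal PyTorch-style sketch of this procedure, assuming a standard training loop already exists: `update_swa` maintains a running average of the checkpointed weights, and `recompute_bn_stats` performs the extra pass that refreshes the BatchNorm statistics. The function and variable names are illustrative, not taken from the paper's code.

```python
import copy
import torch

def update_swa(swa_model, model, n_averaged):
    """Running average of checkpoints: w_SWA <- (w_SWA * n + w) / (n + 1)."""
    for p_swa, p in zip(swa_model.parameters(), model.parameters()):
        p_swa.data.mul_(n_averaged / (n_averaged + 1)).add_(p.data / (n_averaged + 1))
    return n_averaged + 1

def recompute_bn_stats(swa_model, loader, device="cpu"):
    """Extra pass over the training data to recompute BatchNorm running statistics."""
    for m in swa_model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
    swa_model.train()
    with torch.no_grad():
        for x, _ in loader:
            swa_model(x.to(device))

# Usage sketch (model, optimizer, train_one_epoch, loader are assumed to exist):
# swa_model, n = copy.deepcopy(model), 0
# for epoch in range(start_swa_epoch, total_epochs):
#     train_one_epoch(model, optimizer, loader)   # constant or cyclical learning rate
#     n = update_swa(swa_model, model, n)         # checkpoint at end of epoch / cycle
# recompute_bn_stats(swa_model, loader)
```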
## Experiments

* Datasets: CIFAR-10, CIFAR-100, ImageNet.

* Models: VGG16, WideResNet, 164-layer preactivation ResNet, Shake-Shake, and PyramidNet.

* Baselines: conventional SGD, an exponentially decaying running average of the SGD iterates, and FGE.

* In all the CIFAR experiments, SWA consistently outperforms SGD within a single training budget and keeps improving with additional training.

* SWA also achieves performance comparable to FGE, despite FGE being an ensemble method.

* On ImageNet, SWA is run on top of a pre-trained model and improves performance in all cases.

* An ablation experiment (on CIFAR-100) shows that it is possible to train a network with SWA using a fixed learning rate. In that setup, using SWA improves performance by 16%.
111 changes: 111 additions & 0 deletions
..._posts/2020-07-23-TASKNORM--Rethinking Batch Normalization for Meta-Learning.md
@@ -0,0 +1,111 @@
---
layout: post
title: TaskNorm--Rethinking Batch Normalization for Meta-Learning
comments: True
excerpt:
tags: ['2020', 'Batch Normalisation', 'ICML 2020', 'Meta Learning', AI, BatchNorm, BN, ICML, MAML, Normalization]
---
## Introduction

* Meta-learning techniques have been shown to benefit from the use of deep neural networks.

* BatchNorm is a commonly used component when training deep networks, especially for vision tasks.

* However, BatchNorm and meta-learning make contradictory assumptions, and their combination may not work well in practice.

* The paper proposes TaskNorm, a normalization method designed explicitly for meta-learning.

* [Link to the paper](https://arxiv.org/abs/2003.03284)
## Setup

* Standard meta-learning setup with $k$ tasks, each with its own context and target set.

* Two sets of parameters are considered during meta-learning: (i) global parameters and (ii) task-specific parameters.

* The meta-learning setup can be viewed as an inference task, where the task-specific parameters are inferred using a context set and some additional (trainable) parameters.

* Normalization layers are commonly used to accelerate the training of neural networks. The general approach is to normalize using moments (statistics) along with some learned parameters.

* BatchNorm is a well-known and widely used normalization approach. It relies on the implicit assumption that the dataset comprises iid samples from some underlying distribution.

* However, in meta-learning, data points are assumed to be iid only within a specific task.

* This leaves open the question of which moments to use at meta-train and meta-test time.
## Variants of BatchNorm

### Conventional BatchNorm (CBN)

* Compute running moments at meta-train time and use them at meta-test time.

* This is equivalent to lumping the moments with the global parameters, i.e., the running moments are shared globally, while the data is iid only locally.

* Using CBN with MAML leads to poor results.

* Moreover, the meta-learning setup can sometimes require the use of a very small batch size (e.g., 1-shot learning). In those cases, the computed statistics are likely to be inaccurate.

### Transductive BatchNorm (TBN)

* Use context/target set statistics at both meta-train and meta-test time.

* This is the default BatchNorm mode used in MAML.

### Instance-based normalization

* Moments are computed separately for each instance.

* This mode corresponds to treating the statistics as local at the observation level.

* These methods provide only limited improvement in performance and can sometimes have a large overhead.
## Task Normalization (Proposed)

* The normalization statistics are local at the task level, and the statistics for a given data point should depend only on the context set. They should not depend on the other elements of the target set.

* Meta-Batch Normalisation (MetaBN) is a precursor to TaskNorm, where the context set alone is used to compute the normalization statistics for both the context and the target set (during both meta-train and meta-test time).

* MetaBN does not perform well when used with small context sets.

* TaskNorm overcomes this limitation by using a set of non-transductive, secondary moments (computed from the input being normalized), as sketched below.

* When the context set is small, the additional moments help to improve the moment estimates.

* In the general case, a trainable blending factor, $\alpha$, is used to combine the two sets of moments.

* While the computational cost of TaskNorm is slightly higher than that of CBN, it converges faster than CBN in practice.

* The normalization mechanism in Reptile can be interpreted as a particular case of TaskNorm.
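A hedged PyTorch sketch of this blending idea, using per-instance moments as the secondary, non-transductive estimate. The class name, the caching of context moments, and the exact form of the blend (including the context-size-dependent $\alpha$) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskNormISketch(nn.Module):
    """Sketch of TaskNorm with an Instance-Norm augmentation for 4-D inputs (B, C, H, W)."""

    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.scale = nn.Parameter(torch.zeros(1))   # parameters of the learned blend factor
        self.offset = nn.Parameter(torch.zeros(1))
        self.eps = eps
        self.context_moments = None  # (mean, var, alpha) cached from the context pass

    def forward(self, x, is_context):
        # Secondary, non-transductive moments computed from the input being normalized.
        inst_mean = x.mean(dim=(2, 3), keepdim=True)
        inst_var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        if is_context:
            # Pooled moments over the whole context set; alpha depends on context size.
            ctx_mean = x.mean(dim=(0, 2, 3), keepdim=True)
            ctx_var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
            alpha = torch.sigmoid(self.scale * x.shape[0] + self.offset)
            self.context_moments = (ctx_mean, ctx_var, alpha)
        ctx_mean, ctx_var, alpha = self.context_moments  # context pass must run first
        mean = alpha * ctx_mean + (1 - alpha) * inst_mean
        var = (alpha * ctx_var + (1 - alpha) * inst_var
               + alpha * (1 - alpha) * (ctx_mean - inst_mean) ** 2)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```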
## Experiments

* Small-scale few-shot classification experiments:

    * Omniglot and mini-ImageNet datasets.

    * First-order MAML, with different normalization schemes.

    * Transductive BatchNorm performs the best.

    * Among the non-transductive approaches, TaskNorm with the Instance Normalisation augmentation performs the best.

    * A similar trend holds for the speed of convergence as well.

* Large-scale few-shot classification experiments:

    * Meta-Dataset benchmark.

    * CNAPs model.

    * The context set's size varies across tasks in this setup and can be as small as 5.

    * TaskNorm with Instance Normalisation ranks first on 10 (out of 13) datasets and is also the fastest to train.

    * While the instance-based methods (Instance Normalisation and Layer Normalisation) are the slowest to converge, they still outperform the running-average-based methods (conventional BatchNorm).

* The results demonstrate that designing meta-learning-specific normalization methods can significantly improve performance and that Transductive BatchNorm may not always be the optimal choice.
68 changes: 68 additions & 0 deletions
...radient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.md
@@ -0,0 +1,68 @@
---
layout: post
title: GradNorm--Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks
comments: True
excerpt:
tags: ['2017', 'Gradient Manipulation', 'Gradient Normalization', 'ICML 2018', 'Multi Task', AI, ICML]
---
## Introduction

* The paper proposes GradNorm, a gradient normalization algorithm that improves multi-task training by dynamically tuning the magnitude of the gradients corresponding to different tasks.

* [Link to the paper](https://arxiv.org/abs/1711.02257)
## Motivation

* During multi-task training, some tasks can dominate the training at the expense of others.

* It is common to define the multi-task loss as a linearly weighted combination of the individual task losses.

* The paper proposes two changes to this setup:

    * Adapt the weight coefficients assigned to each loss term at each training step.

    * Directly modify the gradient magnitudes corresponding to different tasks so that all the tasks learn at similar rates.

* The proposed GradNorm algorithm is similar in spirit to BatchNorm, but it performs normalization across tasks, not across data batches.
## Algorithm

* The target gradient norm at timestep $t$, for the $i^{th}$ task, is computed as the product of the average gradient norm (across all tasks at timestep $t$) and $r_i(t)^{\alpha}$.

* $r_i$ is the relative inverse training rate of task $i$. It is defined as the ratio between the loss ratio of task $i$ and the average loss ratio (across all the tasks).

* $\alpha$ is a hyperparameter.

* This computed per-task gradient norm is treated as the target value for the actual gradient norms.

* An additional $L_1$ loss between the actual and the target gradient norms, summed over all the tasks, is optimized with respect to the weight coefficients only.

* After every step, the weight coefficients are renormalized to decouple the gradient normalization from the global learning rate.

* Note that all the gradient norm computations are performed only for the layers to which GradNorm is applied. Generally, GradNorm is used with only the last shared layer of weights (to save on computational costs), as in the sketch below.
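A hedged PyTorch sketch of one GradNorm update: it computes per-task gradient norms at the last shared layer, builds the target norms from the relative inverse training rates, and returns the $L_1$ loss that is optimized with respect to the loss weights only. All function and variable names (and the renormalization detail in the trailing comment) are illustrative.

```python
import torch

def gradnorm_loss(task_losses, initial_losses, loss_weights, shared_weight, alpha=1.5):
    """One GradNorm step (sketch).

    task_losses:    list of per-task loss tensors at the current step
    initial_losses: per-task losses recorded at step 0 (detached), used for the loss ratios
    loss_weights:   1-D trainable tensor of weight coefficients, one per task
    shared_weight:  weight tensor of the last shared layer
    """
    # Actual per-task gradient norms G_i = || grad_W (w_i * L_i) ||.
    grad_norms = []
    for w_i, loss_i in zip(loss_weights, task_losses):
        g = torch.autograd.grad(w_i * loss_i, shared_weight,
                                retain_graph=True, create_graph=True)[0]
        grad_norms.append(g.norm())
    grad_norms = torch.stack(grad_norms)

    # Relative inverse training rates r_i = (L_i / L_i(0)) / mean_j (L_j / L_j(0)).
    loss_ratios = torch.stack([l.detach() / l0 for l, l0 in zip(task_losses, initial_losses)])
    r = loss_ratios / loss_ratios.mean()

    # Target norms: average gradient norm scaled by r_i ** alpha, treated as a constant.
    target = (grad_norms.mean() * r ** alpha).detach()

    # L1 loss between actual and target norms; backprop only into loss_weights.
    return torch.abs(grad_norms - target).sum()

# After the optimizer step, renormalize so the weights sum to the number of tasks:
# with torch.no_grad():
#     loss_weights *= len(task_losses) / loss_weights.sum()
```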
## Experiments

* Two variants of the NYUv2 dataset: NYUv2+seg (small dataset) and NYUv2+kpts (large dataset).

* Both regression and classification setups were used.

* Models:

    * SegNet with a symmetric VGG16 encoder/decoder.

    * FCN with a modified ResNet-50 as the encoder and a shallow ResNet as the decoder.

* Standard pixel-wise losses for each task.
### Results

* GradNorm with $\alpha = 1.5$ outperforms the equal-weight baseline and either surpasses or matches the best performance of the single-task networks for each task.

* Almost any value of $0 < \alpha < 3$ improves the network's performance over the equal-weight baseline.
64 changes: 64 additions & 0 deletions
..._posts/2020-08-24-Alpha Net--Adaptation with Composition in Classifier Space.md
@@ -0,0 +1,64 @@
---
layout: post
title: Alpha Net--Adaptation with Composition in Classifier Space
comments: True
excerpt:
tags: ['2020', 'Long-tailed Dataset', 'Transfer Learning', AI, Classifier, Compositionality]
---
## Introduction

* Common transfer learning methods focus on transferring knowledge in the model's feature space.

* In contrast, the paper argues that the learned knowledge is more concisely captured in the "classifier space", as a classifier is fitted to all the samples of a given class, while a feature representation is specific to each sample.

* Building on this intuition, the paper proposes to combine strong classifiers (trained on large datasets) with weak classifiers (trained on smaller datasets) to improve the weak classifiers' performance.

* [Link to the paper](https://arxiv.org/abs/2008.07073)
## High-Level Idea

* Given $n$ classifiers, $C_1, ..., C_n$, trained with a large amount of data, and a weak classifier $a$ trained for a class with few samples:

* Find the nearest neighbors of $a$ (among the strong classifiers).

* Train a new classifier by linearly combining $a$ with its nearest classifiers.

* The coefficients (for linearly combining the classifiers) are learned using another network, called AlphaNet.

* In theory, this approach can be used with any set of classifiers.
## Setup

* A long-tailed dataset is one where some classes (referred to as the tail classes) have very few examples, for example, ImageNet-LT and Places-LT.

* Split the long-tailed dataset into two splits: "base" classes, with $B$ (number of) classes, and "few" classes, with $F$ (number of) classes.

* The total number of classes is $N = B + F$.

* Start with a pre-trained model, with classifiers $w_j$ and biases $b_j$ for $j \in (1, N)$.

* For a given target class $j$, find its top-$k$ nearest-neighbor classifiers and concatenate their outputs.

* For each "few" class, learn a feedforward network that takes the concatenated representation (of classifiers) as input and returns a vector of $k$ $\alpha$ values.

* These $\alpha$ values are interpreted as the classifier's strength (or confidence) in its nearest neighbors.

* The (normalized) $\alpha$ values are used to define the weight and bias of the classifier for the given "few" class (a sketch of this composition step is given below).

* The collection of all the "few" classifiers is referred to as AlphaNet.

* The paper outlines a degenerate case, where the confidence in the predictions of all the strong classifiers goes to 0. The paper proposes to counter this case by clamping the $\alpha$ values.

* The entire setup is trained end-to-end using a cross-entropy loss on AlphaNet.
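A hedged PyTorch sketch of how one "few" classifier could be composed from its nearest strong classifiers. The module name, dimensions, and especially the normalization and clamping details are illustrative guesses at the mechanism, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AlphaModuleSketch(nn.Module):
    """Compose a 'few' classifier from the weak classifier and its k nearest strong classifiers."""

    def __init__(self, k, feat_dim, hidden_dim=256, alpha_min=0.1):
        super().__init__()
        self.alpha_min = alpha_min  # clamp value, guarding against the degenerate all-zero case
        self.net = nn.Sequential(   # maps concatenated neighbor weights to k alpha values
            nn.Linear(k * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, k),
        )

    def forward(self, weak_w, weak_b, neighbor_w, neighbor_b):
        # weak_w: (feat_dim,), weak_b: scalar
        # neighbor_w: (k, feat_dim), neighbor_b: (k,) from the k nearest strong classifiers
        alphas = self.net(neighbor_w.reshape(1, -1)).squeeze(0)  # (k,) raw coefficients
        alphas = alphas.clamp(min=self.alpha_min)                # clamp, then normalize
        alphas = alphas / alphas.abs().sum()
        # Composed classifier: weak classifier plus alpha-weighted combination of neighbors.
        new_w = weak_w + (alphas.unsqueeze(1) * neighbor_w).sum(dim=0)
        new_b = weak_b + (alphas * neighbor_b).sum()
        return new_w, new_b
```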
## Results

* Given the proposed approach's flexibility, it is used to combine state-of-the-art models on ImageNet-LT, namely classifiers retrained on class-balanced samples and models trained with weight normalization. The combined setup outperforms the individual models.

* One interesting observation is that it is useful to include the weak classifiers along with the strong classifiers, as AlphaNet adjusts the position of a weak classifier towards the appropriate strong classifiers.

* While the idea is described in the context of long-tailed data distributions, it is useful in the general context of non-stationary data distributions. One instantiation could be lifelong class-incremental learning, where the model encounters new classes during training. Until sufficient data points have been seen, the newly encountered classes are the "few" classes, and this approach can help with faster adaptation while the model has yet to see enough examples of them.
Submodule _site updated from 25320b to 0f81ea