Commit: Add papers to readme

shagunsodhani committed Feb 24, 2020

1 parent 465d4a5 commit 038f118
Showing 5 changed files with 204 additions and 2 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -5,6 +5,9 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [Massively Multilingual Neural Machine Translation in the Wild - Findings and Challenges](https://shagunsodhani.com/papers-I-read/Massively-Multilingual-Neural-Machine-Translation-in-the-Wild-Findings-and-Challenges)
* [Observational Overfitting in Reinforcement Learning](https://shagunsodhani.com/papers-I-read/Observational-Overfitting-in-Reinforcement-Learning)
* [Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML](https://shagunsodhani.com/papers-I-read/Rapid-Learning-or-Feature-Reuse-Towards-Understanding-the-Effectiveness-of-MAML)
* [Accurate, Large Minibatch SGD - Training ImageNet in 1 Hour](https://shagunsodhani.com/papers-I-read/Accurate-Large-Minibatch-SGD-Training-ImageNet-in-1-Hour)
* [Superposition of many models into one](https://shagunsodhani.com/papers-I-read/Superposition-of-many-models-into-one)
* [Towards a Unified Theory of State Abstraction for MDPs](https://shagunsodhani.com/papers-I-read/Towards-a-Unified-Theory-of-State-Abstraction-for-MDPs)
@@ -3,7 +3,7 @@ layout: post
title: One-shot Learning with Memory-Augmented Neural Networks
comments: True
excerpt:
-tags: ['2016', Memory Augmented Neural Network', 'Meta-Learning', 'One shot learning', AI, MANN, Memory]
+tags: ['2016', 'Memory Augmented Neural Network', 'Meta Learning', 'One shot learning', AI, MANN, Memory]
---

## Introduction
@@ -0,0 +1,89 @@
---
layout: post
title: Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
comments: True
excerpt:
tags: ['2019', 'ICLR 2020', 'Meta Learning', AI, ICLR, MAML]

---


## Introduction

* The paper investigates two possible explanations for the effectiveness of the MAML algorithm:

* **Rapid Learning** - Does MAML learn features that are amenable to rapid learning?

* **Feature Reuse** - Does the MAML initialization provide high-quality features that are useful for unseen tasks?

* This leads to a follow-up question: how much task-specific adaptation is needed in the inner loop?

* [Link to the paper](https://arxiv.org/abs/1909.09157)

## Approach

* In a standard few-shot learning setup, different tasks have different classes. Hence, the top-most layer (or the head) of the learning model should be different for different tasks.

* The subsequent discussion applies only to the body of the network (i.e., the network minus the head).

* **Freezing Layer Representations**

* In this setup, a subset (or all) of the parameters are frozen (after MAML training) and are not adapted at test time (i.e., during the inner loop update).

* Even when the entire network is frozen, the performance drops only marginally.

* This indicates that the representation learned by the meta-initialization is good enough to be useful on the test tasks (without requiring any adaptation step).

* Note that the head of the network is still adapted during testing.

* **Representational Similarity**

* In this setup, the paper measures how much the latent representations (learned by the network) change during the inner loop update, using a fully trained model.

* Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA) metrics are used to measure the similarity between the representations (a minimal CKA sketch is included at the end of this section).

* The main finding is that the representations in the body of the network are very similar before and after the inner loop updates while the representations in the head of the network are very different.

* The above two observations indicate that feature reuse is the primary driving factor for the success of MAML.

* **When does feature reuse happen?**

* The paper considers the model at different stages of training and compares the similarity in the representation (before and after the inner loop update).

* Even early in training, the CCA similarity between the representations (before and after the inner loop update) is quite high. Similarly, freezing the layers for the test-time update early in training does not degrade the test-time performance much. This hints that feature reuse happens early in the learning process.
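
* As a rough illustration of the similarity analysis, here is a minimal sketch of linear Centered Kernel Alignment between two sets of layer activations (say, before and after the inner loop update). The use of plain linear CKA and the toy activation matrices are assumptions for illustration, not the paper's exact measurement setup.

```python
import numpy as np

def linear_cka(x, y):
    """Linear Centered Kernel Alignment between two activation matrices.

    x, y: arrays of shape (num_examples, num_features); the feature
    dimensions of x and y may differ.
    """
    # Center the features of each representation.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denominator = (np.linalg.norm(x.T @ x, ord="fro")
                   * np.linalg.norm(y.T @ y, ord="fro"))
    return numerator / denominator

# Hypothetical activations of a body layer before/after the inner loop update.
before = np.random.randn(500, 64)
after = before + 0.01 * np.random.randn(500, 64)  # layer barely changes
print(linear_cka(before, after))  # close to 1.0 -> high representational similarity
```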

## The ANIL (Almost No Inner Loop) Algorithm

* The empirical evidence suggests that the success of MAML lies in feature reuse.

* The authors build on this observation and propose a simplification of the MAML algorithm: ANIL, or the Almost No Inner Loop algorithm.

* In this algorithm, the inner loop updates are applied only to the head of the network (see the sketch after this list).

* Despite being much more straightforward, the performance of ANIL is close to the performance of MAML for both few-shot image classification and RL tasks.

* Removing most of the inner loop parameters speeds up the computation by a factor of 1.7 (during training) and 4.1 (during inference).
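
* A minimal sketch contrasting the MAML and ANIL inner loops, assuming the model parameters are split into a `body` and a `head` group; the `grad_fn` helper, the parameter layout, and the learning rate are illustrative assumptions, not the paper's implementation.

```python
def inner_loop_update(params, task_batch, grad_fn, lr=0.01):
    """One task-specific adaptation step.

    params: dict with 'body' and 'head' sub-dicts of (name -> array) entries.
    grad_fn(params, task_batch): returns gradients with the same structure.
    """
    grads = grad_fn(params, task_batch)

    # MAML: every parameter of the network is adapted in the inner loop.
    maml_params = {group: {name: p - lr * grads[group][name]
                           for name, p in group_params.items()}
                   for group, group_params in params.items()}

    # ANIL: only the head is adapted; the body features are reused as-is.
    anil_params = {
        "body": params["body"],
        "head": {name: p - lr * grads["head"][name]
                 for name, p in params["head"].items()},
    }
    return maml_params, anil_params
```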

## Removing the Inner Loop Update

* Given that it is possible to remove most of the parameters from the inner loop update (without affecting the performance), the next step is to check if the inner loop update can be removed entirely.

* This leads to the NIL (No Inner Loop) algorithm, which does not involve any inner loop adaptation steps.

### Algorithm

* A few-shot learning model is trained - either with MAML or ANIL.

* During testing, the head is removed.

* For each task, the K training examples are fed to the body to obtain class representations.

* For a given test data point, its representation is compared with the different class representations to obtain the target class (see the sketch after this list).

* The NIL algorithm performs similarly to the MAML and ANIL algorithms on the few-shot image classification task.

* Note that it is still important to use MAML/ANIL during training, even though the learned head is not used during evaluation.
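
* A rough sketch of the NIL evaluation step described above, assuming the class representations are mean features of the support examples and that cosine similarity is used for the comparison; the function names and shapes are illustrative.

```python
import numpy as np

def nil_predict(body, support_x, support_y, query_x, num_classes):
    """Classify query points using only the (frozen) body of a MAML/ANIL-trained model.

    body: callable mapping a batch of inputs to feature vectors (the head is discarded).
    support_x: the K labelled examples per class for this task.
    support_y: integer class labels (numpy array) for support_x.
    """
    feats = body(support_x)                               # (K * num_classes, d)
    # Class representation = mean feature of that class's support examples.
    prototypes = np.stack([feats[support_y == c].mean(axis=0)
                           for c in range(num_classes)])  # (num_classes, d)
    q = body(query_x)                                     # (num_queries, d)
    # Cosine similarity between each query and each class representation.
    q_n = q / np.linalg.norm(q, axis=1, keepdims=True)
    p_n = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    scores = q_n @ p_n.T                                  # (num_queries, num_classes)
    return scores.argmax(axis=1)
```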

## Conclusion

* The paper discusses the different classes of meta-learning approaches. It concludes with the observation that feature reuse (and not rapid adaptation) seems to be the common mode of operation for both optimization-based meta-learning (e.g., MAML) and model-based meta-learning.
@@ -0,0 +1,110 @@
---
layout: post
title: Massively Multilingual Neural Machine Translation in the Wild - Findings and Challenges
comments: True
excerpt:
tags: ['2019', 'Multi Domain', 'Multi Task', 'Natural Language Processing', 'Neural Machine Translation', AI, NLP, NMT, Scale]

---

## Introduction

* The paper proposes to build a universal neural machine translation system that can translate between any pair of languages.

* As a concrete instance, the paper prototypes a system that handles 103 languages (25 billion translation pairs).

* [Link to the paper](https://arxiv.org/abs/1907.05019)

## Why universal Machine Translation

* Hypothesis: *The learning signal from one language should benefit the quality of other languages*[1](https://link.springer.com/article/10.1023/A:1007379606734)

* This positive transfer is evident for the low resource languages, but multilingual training tends to hurt the performance of the high resource languages (negative interference).

* In practice, adding new languages reduces the effective per-task capacity of the model.

## Desiderata for Multilingual Translation Model

* Maximize the number of languages within one model.

* Maximize the positive transfer to low resource languages.

* Minimize the negative interference to high resource languages.

* Perform well in realistic, multi-domain settings.

## Datasets

* In-house corpus generated by crawling and extracting parallel sentences from the web.

* 102 languages (to and from English), with 25 billion sentence pairs.

* Compared with the existing datasets, this dataset is much larger, spans more domains, has a good variation in the amount of data available for different language pairs, and is noisier. These factors bring additional challenges to the universal NMT setup.

## Baselines

* Dedicated Bilingual models (variants of Transformers).

* Most bilingual experiments used Transformer Big and a shared source-target SentencePiece model (SPM).

* For medium and low resource languages, the Transformer Base was also considered.

* A batch size of 1M tokens per batch is used. Increasing the batch size improves model quality and speeds up convergence.

## Effect of Transfer and Interference

* The paper compares the following two setups with the baseline:

* Combine all the datasets and train over them as if it is a single dataset.

* Combine all the datasets but upsample the low resource languages so that all the languages are equally likely to appear in the combined dataset.

* A target-language token is prepended to every input sentence to indicate which language it should be translated into (see the example after this list).

* A shared encoder and decoder are used across all the language pairs.

* The two setups use a batch size of 4M tokens.
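
* For concreteness, a toy example of prepending the target-language token to the source sentence; the `<2xx>` token format follows a common multilingual NMT convention, and the exact token strings here are an assumption.

```python
def add_target_token(source_sentence, target_lang):
    """Prepend a token telling the model which language to translate into."""
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("How are you?", "fr"))  # "<2fr> How are you?"
print(add_target_token("How are you?", "de"))  # "<2de> How are you?"
```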

### Results

* When all the languages are equally sampled, the performance on the low resource languages increases, at the cost of performance on high resource languages.

* Training over all the data at once reverses this trend.

### Countering Interference

* A temperature-based sampling strategy is used to control the ratio of samples from different language pairs (see the sketch after this list).

* A balanced sampling strategy improves the performance of the high resource languages (though it still lags behind the bilingual baselines) while retaining the high transfer performance on the low resource languages.

* Another reason behind the lagging performance (as compared to bilingual baselines) is the capacity of the multilingual models.

* Some open problems to consider:

* Task Scheduling - How to decide the order in which different language pairs should be trained.

* Optimization for multitask learning - How to design optimizers, loss functions, etc., that can exploit task similarity.

* Understanding Transfer:

* For the low resource languages, translating multiple languages to English leads to better performance than translating English to multiple languages.

* This can be explained as follows: In the first case (many-to-one), the setup is that of a multi-domain model (each source language is a domain). In the second case (one-to-many), the setup is that of multitasking.

* NMT models seem to be more amenable to transfer across multiple domains than transfer across tasks (since the decoder distribution does not change much).

* In terms of zero-shot performance, the performance for most language pairs increases as the number of languages grows from 10 to 102.
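
* A minimal sketch of temperature-based sampling over language pairs, following the usual formulation where the sampling probability is proportional to the data fraction raised to the power 1/T; the dataset sizes below are made up for illustration.

```python
import numpy as np

def sampling_probs(dataset_sizes, temperature=1.0):
    """Probability of drawing a training example from each language pair.

    temperature=1.0 samples proportionally to data size (favours high resource
    pairs); larger temperatures flatten the distribution towards uniform
    (favours low resource pairs).
    """
    sizes = np.asarray(dataset_sizes, dtype=np.float64)
    p = sizes / sizes.sum()
    p = p ** (1.0 / temperature)
    return p / p.sum()

sizes = [2_000_000_000, 50_000_000, 100_000]  # hypothetical pair sizes
print(sampling_probs(sizes, temperature=1.0))  # heavily skewed to the largest pair
print(sampling_probs(sizes, temperature=5.0))  # much closer to uniform
```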

## Effect of preprocessing and vocabulary

* Sentence Piece Model (SPM) is used.

* Temperature sampling is used to sample vocabulary from different languages.

* Using a smaller vocabulary (and hence smaller sub-word tokens) performs better for the low resource languages, probably due to improved generalization (see the snippet after this list).

* Low and medium resource languages tend to perform better with higher temperatures.
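
* As a concrete example of this preprocessing step, the snippet below trains a SentencePiece model with a chosen vocabulary size; the corpus file, model prefix, and vocabulary size are placeholders, not the paper's actual configuration.

```python
import sentencepiece as spm

# Train a shared sub-word vocabulary; a smaller vocab_size yields smaller
# sub-word units, which the paper found helpful for low resource languages.
spm.SentencePieceTrainer.Train(
    "--input=multilingual_corpus.txt --model_prefix=shared_spm "
    "--vocab_size=32000 --character_coverage=0.9995"
)

sp = spm.SentencePieceProcessor()
sp.Load("shared_spm.model")
print(sp.EncodeAsPieces("<2fr> How are you?"))
```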

## Effect of Capacity

* Using deeper models improves performance on most language pairs (as compared to wider models with the same number of parameters).
2 changes: 1 addition & 1 deletion site/_site
Submodule _site updated from 4766c2 to 8c10e0
