---
layout: post
title: Modular meta-learning
comments: True
excerpt:
tags: ['2018', 'Meta Learning', 'Modular Meta Learning', 'Modular ML', 'Modular Network', Module]
---

## Introduction

* The paper proposes an approach for learning neural networks (modules) that can be combined in different ways to solve different tasks (combinatorial generalization).

* The proposed model is called BOUNCEGRAD.

* [Link to the paper](https://arxiv.org/abs/1806.10166)

* [Link to the code](https://github.com/FerranAlet/modular-metalearning)

## Setup

* Focuses on supervised learning.

* Tasks are drawn from a task distribution *p(T)*.

* Each task is a joint distribution *p<sub>T</sub>(x, y)* over *(x, y)* data pairs.

* Given data from *m* meta-training tasks and a meta-test task, find a hypothesis *h* that performs well on unseen data drawn from the meta-test task.

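As a concrete picture of this setup, here is a minimal sketch that uses the sine-prediction tasks from the experiments below as a stand-in for *p(T)*; the amplitude, phase, and input ranges are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sample_task(rng):
    """Sample a task T ~ p(T): here, a random sine function."""
    amplitude = rng.uniform(0.1, 5.0)  # assumed range, for illustration only
    phase = rng.uniform(0.0, np.pi)

    def sample_pairs(n):
        """Draw n (x, y) pairs from p_T(x, y)."""
        x = rng.uniform(-5.0, 5.0, size=n)
        y = amplitude * np.sin(x + phase)
        return x, y

    return sample_pairs

rng = np.random.default_rng(0)
meta_train = [sample_task(rng) for _ in range(10)]  # data from m = 10 meta-training tasks
meta_test = sample_task(rng)                        # held-out meta-test task
```

The meta-learner sees *(x, y)* pairs from the meta-training tasks and must produce a hypothesis *h* that fits new pairs drawn from the meta-test task.
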
## Structured Hypothesis

* Given a compositional scheme *C*, a set of modules *F<sub>1</sub>, ..., F<sub>k</sub>* (represented as a whole by *F*), and the set of their respective parameters θ<sub>1</sub>, ..., θ<sub>k</sub> (represented as a whole by θ), the triple *(C, F, θ)* represents the set of possible functional input-output mappings. These mappings form the hypothesis space.

* A structured hypothesis model specifies which modules to use and their parametric forms, but not the values of their parameters.

### Examples of compositional schemes

* Choosing a single module for the task at hand.

* A fixed compositional structure, with different modules selected every time.

* A weighted ensemble (possibly using an attention mechanism).

* A general function-composition tree.

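As an illustration of the weighted-ensemble scheme, here is a minimal sketch in which *k* modules with a fixed architecture are mixed by softmax attention weights; the module architecture and the free attention logits are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

class Module:
    """One module F_i: a tiny one-hidden-layer network with parameters theta_i."""

    def __init__(self, rng, in_dim=1, hidden=16, out_dim=1):
        self.W1 = rng.normal(0.0, 0.5, (hidden, in_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.5, (out_dim, hidden))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return self.W2 @ h + self.b2

def ensemble_hypothesis(modules, attention_logits, x):
    """Compositional scheme C: attention-weighted mixture of module outputs."""
    weights = softmax(attention_logits)          # the structure chosen per task
    outputs = np.stack([m(x) for m in modules])  # shape (k, out_dim)
    return (weights[:, None] * outputs).sum(axis=0)

rng = np.random.default_rng(0)
F = [Module(rng) for _ in range(4)]  # modules F_1, ..., F_k with parameters theta
y_hat = ensemble_hypothesis(F, attention_logits=np.zeros(4), x=np.array([0.5]))
```

Here the per-task structural choice is the vector of attention logits; in the single-module scheme it would instead be a discrete module index.
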
### Phases

* Offline meta-learning phase:

    * Take the training and validation datasets of the meta-training tasks and learn a parameterization θ<sub>1</sub>, ..., θ<sub>k</sub> for the modules.

    * The hypothesis (or composition) to use comes from the online meta-test learning procedure.

    * In this stage, find the best θ given a structure.

* Online meta-test learning phase:

    * Given a hypothesis space and θ, the output is a compositional form (or hypothesis) that specifies how to compose the modules.

    * In this stage, find the best structure, given a hypothesis space and θ.

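A minimal sketch of how the two phases divide the work; the helpers `find_structure` and `grad_step` and the task objects are hypothetical placeholders, and the structure search itself is sketched in the next section:

```python
def offline_meta_learning(tasks, theta, find_structure, grad_step, epochs=10):
    """Offline phase: alternate the structure search (run per task, with a
    temperature annealed across epochs) with gradient updates to theta,
    i.e. find the best theta given the currently chosen structures."""
    temperature = 1.0
    for _ in range(epochs):
        structures = [find_structure(t.train_data, theta, temperature) for t in tasks]
        for task, structure in zip(tasks, structures):
            theta = grad_step(theta, structure, task.train_data)
        temperature *= 0.9  # decrease temperature over time
    return theta

def online_meta_test(task, theta, find_structure):
    """Online phase: with theta fixed, return the composition (hypothesis)
    that best fits the meta-test task's training data."""
    return find_structure(task.train_data, theta, temperature=0.0)
```
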
## Learning Algorithm

* During the meta-test learning phase, simulated annealing is used to find the optimal structure, with the temperature *T* decreased over time.

* During the meta-learning phase, the actual objective function is replaced by a smooth surrogate objective (during the search step) to avoid local minima.

* Once a structure has been picked, any gradient-descent-based approach can be used to optimize the modules.

* The state of the optimization process comprises the parameters and the temperature; together, they induce a distribution over structures. Given a structure, θ is optimized, and *T* is annealed over time.

* The learning procedure can be improved by tuning the parameters during the online (meta-test learning) phase as well. The resulting approach is referred to as MOMA (MOdular MAml).

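A minimal sketch of the annealed structure search in the spirit of BOUNCEGRAD; the proposal move, loss, and cooling schedule are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

def mse(h, data):
    x, y = data
    return float(np.mean((np.array([h(xi) for xi in x]) - y) ** 2))

def anneal_structure(structure, propose, hypothesis, theta, data,
                     T0=1.0, cooling=0.95, steps=200, seed=0):
    """Simulated annealing over structures, with module parameters theta fixed."""
    rng = np.random.default_rng(seed)
    T, loss = T0, mse(hypothesis(structure, theta), data)
    for _ in range(steps):
        candidate = propose(structure, rng)  # local move, e.g. swap one module
        cand_loss = mse(hypothesis(candidate, theta), data)
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if cand_loss < loss or rng.random() < np.exp((loss - cand_loss) / max(T, 1e-9)):
            structure, loss = candidate, cand_loss
        T *= cooling  # decrease the temperature over time
    return structure, loss

# Toy usage with the "choose a single module" scheme: the structure is an index.
modules = [lambda x, a=a: a * x for a in (0.5, 1.0, 2.0)]
hypothesis = lambda s, theta: modules[s]
propose = lambda s, rng: int(rng.integers(len(modules)))
xs = np.linspace(-1.0, 1.0, 20)
best, best_loss = anneal_structure(0, propose, hypothesis, None, (xs, 2.0 * xs))
```

MOMA would additionally take a few MAML-style gradient steps on θ for the chosen structure during this online phase.
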
## Experiments

### Approaches

* Pooled - a single network trained on the combined data of all the tasks.

* MAML - a single network trained with MAML.

* BOUNCEGRAD - modular networks without MAML adaptation in online learning.

* MOMA - BOUNCEGRAD with MAML adaptation in online learning.

### Domains

#### Simple Functional Relationships

* Sine-function prediction problem.

* In general, MOMA outperforms the other models.

* With a small amount of online training data, BOUNCEGRAD outperforms the other models, as it has a better structural prior.

#### Predicting the results of pushing objects

* 11 different objects (with different shapes) on 4 surfaces with different friction properties.

* 2 meta-learning scenarios are considered: in the first, the object-surface combination of the meta-test task was present in some meta-training tasks; in the second, it was not.

* For previously seen combinations, MOMA performs best, followed by BOUNCEGRAD and MAML.

* For unseen combinations, all 3 are equally good.

* The compositional scheme used here is the attention mechanism (weighted ensemble).

* An interesting result is that the modules seem to specialize (and activate more often) based on the shape of the object.

#### Predicting the next frame of a kinematic skeleton (motion capture data)

* Compositional structure - kinematic subtrees are generated for each body part (2 legs, 2 arms, 2 torso parts).

* Again, 2 setups are used: one where all the activities in the meta-training and meta-test tasks are shared, and one where they are not.

* For known activities, MOMA and BOUNCEGRAD perform best, while for unknown activities, MOMA performs best.

## Notes

* While the approach is interesting, evaluating it on a set of tasks that is more clearly compositional would make the results more convincing.

* It would be useful to see the computational tradeoffs between MAML, BOUNCEGRAD, and MOMA.