From eef99dc7f68694adfb8b5fcb97e04b8aaaf83ed3 Mon Sep 17 00:00:00 2001 From: Shagun Sodhani <sshagunsodhani@gmail.com> Date: Sun, 11 Apr 2021 10:09:29 -0400 Subject: [PATCH] Add new papers --- README.md | 5 + site/_config_local.yml | 27 ++++ site/_config_server.yml | 27 ++++ site/_oldposts/2013-12-31-whats-jekyll.md | 13 ++ site/_oldposts/2014-01-01-example-content.md | 123 ++++++++++++++++++ .../2014-01-02-introducing-lanyon.md | 41 ++++++ ...4-Compositional Explanations of Neurons.md | 94 +++++++++++++ ...g with Micro-Batch Pipeline Parallelism.md | 50 +++++++ ...rgy-based Models for Continual Learning.md | 92 +++++++++++++ site/_posts/2021-01-25-HyperNetworks.md | 59 +++++++++ ...ng by Generating Task-specific Adapters.md | 70 ++++++++++ site/_site | 2 +- 12 files changed, 602 insertions(+), 1 deletion(-) create mode 100755 site/_config_local.yml create mode 100755 site/_config_server.yml create mode 100755 site/_oldposts/2013-12-31-whats-jekyll.md create mode 100755 site/_oldposts/2014-01-01-example-content.md create mode 100755 site/_oldposts/2014-01-02-introducing-lanyon.md create mode 100755 site/_posts/2021-01-04-Compositional Explanations of Neurons.md create mode 100755 site/_posts/2021-01-11-GPipe - Easy Scaling with Micro-Batch Pipeline Parallelism.md create mode 100755 site/_posts/2021-01-18-Energy-based Models for Continual Learning.md create mode 100755 site/_posts/2021-01-25-HyperNetworks.md create mode 100755 site/_posts/2021-02-01-Zero-shot Learning by Generating Task-specific Adapters.md diff --git a/README.md b/README.md index f85fe24f..21417f41 100755 --- a/README.md +++ b/README.md @@ -4,6 +4,11 @@ I am trying a new initiative - a-paper-a-week. 
This repository will hold all tho ## List of papers +* [Zero-shot Learning by Generating Task-specific Adapters](https://shagunsodhani.com/papers-I-read/Zero-shot-Learning-by-Generating-Task-specific-Adapters) +* [HyperNetworks](https://shagunsodhani.com/papers-I-read/HyperNetworks) +* [Energy-based Models for Continual Learning](https://shagunsodhani.com/papers-I-read/Energy-based-Models-for-Continual-Learning) +* [GPipe - Easy Scaling with Micro-Batch Pipeline Parallelism](https://shagunsodhani.com/papers-I-read/GPipe-Easy-Scaling-with-Micro-Batch-Pipeline-Parallelism) +* [Compositional Explanations of Neurons](https://shagunsodhani.com/papers-I-read/Compositional-Explanations-of-Neurons) * [Design patterns for container-based distributed systems](https://shagunsodhani.com/papers-I-read/Design-patterns-for-container-based-distributed-systems) * [Cassandra - a decentralized structured storage system](https://shagunsodhani.com/papers-I-read/Cassandra-a-decentralized-structured-storage-system) * [CAP twelve years later - How the rules have changed](https://shagunsodhani.com/papers-I-read/CAP-twelve-years-later-How-the-rules-have-changed) diff --git a/site/_config_local.yml b/site/_config_local.yml new file mode 100755 index 00000000..e30bc0de --- /dev/null +++ b/site/_config_local.yml @@ -0,0 +1,27 @@ +# Permalinks +# +# Use of `relative_permalinks` ensures post links from the index work properly. +permalink: '/:title' +# relative_permalinks: true + +# Setup +title: 'Papers I Read' +tagline: 'Notes and Summaries' +description: 'I am trying a new initiative - <i>A Paper A Week</i>. This blog will hold all the notes and summaries.' 
+# url: 'https://shagunsodhani.in/test' +baseurl: '' +paginate: 5 +gems: [jekyll-paginate] + +# About/contact +author: + name: Shagun Sodhani + url: https://shagunsodhani.in + email: sshagunsodhani@gmail.com + +# Custom vars +version: 1.0.0 +str_continue_reading: " Continue reading" + +github: + repo: https://github.com/shagunsodhani/papers-I-read diff --git a/site/_config_server.yml b/site/_config_server.yml new file mode 100755 index 00000000..c101ba71 --- /dev/null +++ b/site/_config_server.yml @@ -0,0 +1,27 @@ +# Permalinks +# +# Use of `relative_permalinks` ensures post links from the index work properly. +permalink: '/:title' +# relative_permalinks: true + +# Setup +title: 'Papers I Read' +tagline: 'Notes and Summaries' +description: 'I am trying a new initiative - <i>A Paper A Week</i>. This blog will hold all the notes and summaries.' +# url: 'https://shagunsodhani.in/test' +baseurl: 'https://shagunsodhani.in/papers-I-read' +paginate: 5 +gems: [jekyll-paginate] + +# About/contact +author: + name: Shagun Sodhani + url: https://shagunsodhani.in + email: sshagunsodhani@gmail.com + +# Custom vars +version: 1.0.0 +str_continue_reading: " Continue reading" + +github: + repo: https://github.com/shagunsodhani/papers-I-read diff --git a/site/_oldposts/2013-12-31-whats-jekyll.md b/site/_oldposts/2013-12-31-whats-jekyll.md new file mode 100755 index 00000000..7aadde76 --- /dev/null +++ b/site/_oldposts/2013-12-31-whats-jekyll.md @@ -0,0 +1,13 @@ +--- +layout: post +title: What's Jekyll? +comments: True +--- + +[Jekyll](http://jekyllrb.com) is a static site generator, an open-source tool for creating simple yet powerful websites of all shapes and sizes. From [the project's readme](https://github.com/mojombo/jekyll/blob/master/README.markdown): + + > Jekyll is a simple, blog aware, static site generator. It takes a template directory [...] and spits out a complete, static website suitable for serving with Apache or your favorite web server. 
This is also the engine behind GitHub Pages, which you can use to host your project’s page or blog right here from GitHub. + +It's an immensely useful tool and one we encourage you to use here with Lanyon. + +Find out more by [visiting the project on GitHub](https://github.com/mojombo/jekyll). diff --git a/site/_oldposts/2014-01-01-example-content.md b/site/_oldposts/2014-01-01-example-content.md new file mode 100755 index 00000000..d97de435 --- /dev/null +++ b/site/_oldposts/2014-01-01-example-content.md @@ -0,0 +1,123 @@ +--- +layout: post +title: Example content +comments: True +--- + + +<div class="message"> + Howdy! This is an example blog post that shows several types of HTML content supported in this theme. +</div> + +Cum sociis natoque penatibus et magnis <a href="#">dis parturient montes</a>, nascetur ridiculus mus. *Aenean eu leo quam.* Pellentesque ornare sem lacinia quam venenatis vestibulum. Sed posuere consectetur est at lobortis. Cras mattis consectetur purus sit amet fermentum. + +> Curabitur blandit tempus porttitor. Nullam quis risus eget urna mollis ornare vel eu leo. Nullam id dolor id nibh ultricies vehicula ut id elit. + +Etiam porta **sem malesuada magna** mollis euismod. Cras mattis consectetur purus sit amet fermentum. Aenean lacinia bibendum nulla sed consectetur. + +## Inline HTML elements + +HTML defines a long list of available inline tags, a complete list of which can be found on the [Mozilla Developer Network](https://developer.mozilla.org/en-US/docs/Web/HTML/Element). + +- **To bold text**, use `<strong>`. +- *To italicize text*, use `<em>`. +- Abbreviations, like <abbr title="HyperText Markup Langage">HTML</abbr> should use `<abbr>`, with an optional `title` attribute for the full phrase. +- Citations, like <cite>— Mark otto</cite>, should use `<cite>`. +- <del>Deleted</del> text should use `<del>` and <ins>inserted</ins> text should use `<ins>`. +- Superscript <sup>text</sup> uses `<sup>` and subscript <sub>text</sub> uses `<sub>`. 
+ +Most of these elements are styled by browsers with few modifications on our part. + +## Heading + +Vivamus sagittis lacus vel augue rutrum faucibus dolor auctor. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Morbi leo risus, porta ac consectetur ac, vestibulum at eros. + +### Code + +Cum sociis natoque penatibus et magnis dis `code element` montes, nascetur ridiculus mus. + +{% highlight js %} +// Example can be run directly in your JavaScript console + +// Create a function that takes two arguments and returns the sum of those arguments +var adder = new Function("a", "b", "return a + b"); + +// Call the function +adder(2, 6); +// > 8 +{% endhighlight %} + +Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa. + +### Lists + +Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus. + +* Praesent commodo cursus magna, vel scelerisque nisl consectetur et. +* Donec id elit non mi porta gravida at eget metus. +* Nulla vitae elit libero, a pharetra augue. + +Donec ullamcorper nulla non metus auctor fringilla. Nulla vitae elit libero, a pharetra augue. + +1. Vestibulum id ligula porta felis euismod semper. +2. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. +3. Maecenas sed diam eget risus varius blandit sit amet non magna. + +Cras mattis consectetur purus sit amet fermentum. Sed posuere consectetur est at lobortis. 
+ +<dl> + <dt>HyperText Markup Language (HTML)</dt> + <dd>The language used to describe and define the content of a Web page</dd> + + <dt>Cascading Style Sheets (CSS)</dt> + <dd>Used to describe the appearance of Web content</dd> + + <dt>JavaScript (JS)</dt> + <dd>The programming language used to build advanced Web sites and applications</dd> +</dl> + +Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Nullam quis risus eget urna mollis ornare vel eu leo. + +### Tables + +Aenean lacinia bibendum nulla sed consectetur. Lorem ipsum dolor sit amet, consectetur adipiscing elit. + +<table> + <thead> + <tr> + <th>Name</th> + <th>Upvotes</th> + <th>Downvotes</th> + </tr> + </thead> + <tfoot> + <tr> + <td>Totals</td> + <td>21</td> + <td>23</td> + </tr> + </tfoot> + <tbody> + <tr> + <td>Alice</td> + <td>10</td> + <td>11</td> + </tr> + <tr> + <td>Bob</td> + <td>4</td> + <td>3</td> + </tr> + <tr> + <td>Charlie</td> + <td>7</td> + <td>9</td> + </tr> + </tbody> +</table> + +Nullam id dolor id nibh ultricies vehicula ut id elit. Sed posuere consectetur est at lobortis. Nullam quis risus eget urna mollis ornare vel eu leo. + +----- + +Want to see something else added? <a href="https://github.com/poole/poole/issues/new">Open an issue.</a> diff --git a/site/_oldposts/2014-01-02-introducing-lanyon.md b/site/_oldposts/2014-01-02-introducing-lanyon.md new file mode 100755 index 00000000..3ef956f3 --- /dev/null +++ b/site/_oldposts/2014-01-02-introducing-lanyon.md @@ -0,0 +1,41 @@ +--- +layout: post +title: Introducing Lanyon +comments: True +excerpt: Lanyon is an unassuming Jekyll theme that places content first by tucking +tags: [Hello] +--- + +Lanyon is an unassuming [Jekyll](http://jekyllrb.com) theme that places content first by tucking away navigation in a hidden drawer. It's based on [Poole](http://getpoole.com), the Jekyll butler. 
+ +### Built on Poole + +Poole is the Jekyll Butler, serving as an upstanding and effective foundation for Jekyll themes by [@mdo](https://twitter.com/mdo). Poole, and every theme built on it (like Lanyon here) includes the following: + +* Complete Jekyll setup included (layouts, config, [404](/404), [RSS feed](/atom.xml), posts, and [example page](/about)) +* Mobile friendly design and development +* Easily scalable text and component sizing with `rem` units in the CSS +* Support for a wide gamut of HTML elements +* Related posts (time-based, because Jekyll) below each post +* Syntax highlighting, courtesy Pygments (the Python-based code snippet highlighter) + +### Lanyon features + +In addition to the features of Poole, Lanyon adds the following: + +* Toggleable sliding sidebar (built with only CSS) via **☰** link in top corner +* Sidebar includes support for textual modules and a dynamically generated navigation with active link support +* Two orientations for content and sidebar, default (left sidebar) and [reverse](https://github.com/poole/lanyon#reverse-layout) (right sidebar), available via `<body>` classes +* [Eight optional color schemes](https://github.com/poole/lanyon#themes), available via `<body>` classes + +[Head to the readme](https://github.com/poole/lanyon#readme) to learn more. + +### Browser support + +Lanyon is by preference a forward-thinking project. In addition to the latest versions of Chrome, Safari (mobile and desktop), and Firefox, it is only compatible with Internet Explorer 9 and above. + +### Download + +Lanyon is developed on and hosted with GitHub. Head to the <a href="https://github.com/poole/lanyon">GitHub repository</a> for downloads, bug reports, and features requests. + +Thanks! 
diff --git a/site/_posts/2021-01-04-Compositional Explanations of Neurons.md b/site/_posts/2021-01-04-Compositional Explanations of Neurons.md new file mode 100755 index 00000000..385487ab --- /dev/null +++ b/site/_posts/2021-01-04-Compositional Explanations of Neurons.md @@ -0,0 +1,94 @@ +--- +layout: post +title: Compositional Explanations of Neurons +comments: True +excerpt: +tags: ['2020', 'Natural Language Inference', 'NeurIPS 2020', AI, Compositionality, Explainability, Interpretability, NeurIPS, NLI] + +--- + +## Introduction + +* The paper describes a method to explain/interpret the representations learned by individual neurons in deep neural networks. + +* The explanations are generated by searching for logical forms defined by a set of composition operators (like OR, AND, NOT) over primitive concepts (like water). + +* [Link to the paper](https://arxiv.org/abs/2006.14032) + +## Generating compositional explanations + +* Given a neural network *f*, the goal is to explain a neuron's behavior (of this network) in human-understandable terms. + +* [Previous work](http://netdissect.csail.mit.edu/) builds on the idea that a good explanation is a description that identifies the inputs for which the neuron activates. + +* Given a set of pre-defined atomic concepts $c \in C$ and a similarity measure $\delta(n, c)$ where $n$ represents the activation of the $n^{th}$ neuron, the explanation, for the $n^{th}$ neuron, is the concept most similar to $n$. + +* For images, a concept could be represented as an image segmentation map. For example, the water concept can be represented by the segments of the images that show water. + +* The similarity can be measured by first thresholding the neuron activations (to get a neuron mask) and then computing the IoU score (or Jaccard Similarity) between the neuron mask and the concept. + +* One limitation of this approach is that the explanations are restricted to pre-defined concepts. 
+ +* The paper expands the set of candidate concepts by considering logical compositions of the atomic concepts. + +* In theory, the search space would explode exponentially. In practice, it is restricted to explanations with at most $N$ atomic concepts, and beam search is performed (instead of exhaustive search). + +## Setup + +* **Image Classification Setup** + + * Neurons from the final 512-unit convolutional layer of a ResNet-18 trained on the [Places365 dataset](https://ieeexplore.ieee.org/abstract/document/7968387). + + * Probing for concepts from the [ADE20k scenes dataset](https://openaccess.thecvf.com/content_cvpr_2017/html/Zhou_Scene_Parsing_Through_CVPR_2017_paper.html) with atomic concepts defined by annotations in the [Broden dataset](http://netdissect.csail.mit.edu/). + +* **NLI Setup** + + * BiLSTM baseline followed by MLP layers trained on the [Stanford Natural Language Inference (SNLI) corpus](https://nlp.stanford.edu/projects/snli/). + + * Probing the penultimate hidden layer (of the MLP component) for sentence-level explanations. + + * Concepts are created using the 2000 most common words in the validation split of the SNLI dataset. + + * Additional concepts are created based on the lexical overlap between premise and hypothesis. + +## Do neurons learn compositional concepts? + +* **Image Classification Setup** + + * As $N$ increases, the mean IoU increases (i.e., the explanation quality increases), though the returns diminish beyond $N=10$. + + * Manual inspection of 128 neurons and their length-10 explanations shows that 69% of the neurons learned some meaningful combination of concepts, while 31% learned some unrelated concepts. + + * The meaningful combinations of concepts include: + + * perceptual abstraction that is also lexically coherent (e.g., "skyscraper OR lighthouse OR water tower"). + + * perceptual abstraction that is not lexically coherent (e.g., "cradle OR autobus OR fire escape"). 
+ + * specialized abstraction of the form L1 AND NOT L2 (e.g., (water OR river) AND NOT blue). + +* **NLI Setup** + + * As $N$ increases, the mean IoU increases (as in the image classification setup), but here the IoU keeps increasing even past $N=30$. + + * Many neurons correspond to lexical features. For example, some neurons are gender-sensitive or activate for verbs like sitting, eating or sleeping. Some neurons are activated when the lexical overlap between premise and hypothesis is high. + +## Do interpretable neurons contribute to model accuracy? + +* In the image classification setup, the more interpretable a neuron is, the more accurate the model is (when the neuron is active). + +* However, the opposite trend is seen in the NLI model, i.e., the model is less accurate when the more interpretable neurons are active. + +* Key takeaway - interpretability (as measured by the paper) is not consistently correlated with performance. Given a concept space, the identified behaviors may be correlated or anti-correlated with the model's performance. + +## Targeting explanations to change model behavior + +* The idea is to construct examples that activate (or inhibit) certain neurons, causing a change in the model's predictions. + +* These adversarial examples are referred to as "copy-paste" adversarial examples. + +* For example, the neuron corresponding to "(water OR river) AND (NOT blue)" is a major contributor for detecting the "swimming hole" class. An adversarial example is created by making the water blue. This prompts the model to predict "grotto" instead of "swimming hole." + +* Similarly, in the NLI model, a neuron detects the word "nobody" in the hypothesis as highly indicative of contradiction. An adversarial example can be created by adding the word "nobody" to the hypothesis, prompting the model to predict contradiction while the true label should be neutral. + +* These observations support the hypothesis that one can use explanations to create adversarial examples.
\ No newline at end of file diff --git a/site/_posts/2021-01-11-GPipe - Easy Scaling with Micro-Batch Pipeline Parallelism.md b/site/_posts/2021-01-11-GPipe - Easy Scaling with Micro-Batch Pipeline Parallelism.md new file mode 100755 index 00000000..aba27e03 --- /dev/null +++ b/site/_posts/2021-01-11-GPipe - Easy Scaling with Micro-Batch Pipeline Parallelism.md @@ -0,0 +1,50 @@ +--- +layout: post +title: GPipe - Easy Scaling with Micro-Batch Pipeline Parallelism +comments: True +excerpt: +tags: ['2018', 'Distributed Computing', 'Model Parallelism', 'NeurIPS 2019', AI, Engineering, NeurIPS, Scale, Systems] + +--- + +## Introduction + +* The paper introduces GPipe, a pipeline parallelism library for scaling networks that can be expressed as a sequence of layers. + +* [Link to the paper](https://arxiv.org/abs/1811.06965) + +## Design + +* Consider training a deep neural network with *L* layers using *K* accelerators (say GPUs). + +* The *i<sup>th</sup>* layer has a *forward* function *f<sub>i</sub>*, a *backward* function *b<sub>i</sub>*, weights *w<sub>i</sub>* and a cost *c<sub>i</sub>* (say the memory footprint or computational time). + +* GPipe partitions this network into *K* cells and places the *i<sup>th</sup>* cell on the *i<sup>th</sup>* accelerator. Output from the *i<sup>th</sup>* accelerator is passed to the *(i+1)<sup>th</sup>* accelerator as input. + +* During the forward pass, the input mini-batch (of size *N*) is divided into *M* equal micro-batches. These micro-batches are pipelined through the *K* accelerators one after another. + +* During the backward pass, gradients are computed for each micro-batch. The gradients are accumulated and applied at the end of each mini-batch. + +* For batch normalization, the statistics used during training are computed over each micro-batch, while moving averages over the entire mini-batch are tracked for use during evaluation. 
+ +* Micro-batching improves over the naive model parallelism approach by reducing the underutilization of resources (due to the network's sequential dependencies). + +## Performance Optimization + +* GPipe supports re-materialization (or checkpointing), i.e., during the forward pass, only the output activations (at partition boundaries) are stored. + +* During the backward pass, each accelerator recomputes its forward function. This trades extra computation time for a lower memory requirement. + +* One potential downside is that partitioning can introduce some idle time per accelerator (referred to as the bubble overhead). However, with a sufficiently large number of micro-batches (more than 4 times the number of partitions), the bubble overhead is negligible. + +## Performance Analysis + +* Two different types of model architectures are compared: the AmoebaNet convolutional model and the Transformer sequence-to-sequence model. + +* For AmoebaNet, the size of the largest trainable model (on a single 8GB Cloud TPU v2) increases from 82M to 318M parameters. Further, a 1.8 billion parameter model can be trained on 8 accelerators (25x improvement in size using GPipe). + +* For Transformers, GPipe scales the model size to 83.9B parameters with 128 partitions (298x improvement in size compared to a single accelerator). + +* Since the computation is evenly distributed across Transformer layers, the training throughput scales almost linearly with the number of devices. + +* Quantitative experiments on ImageNet and multilingual machine translation show that models can be effectively trained using GPipe.
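The bubble overhead noted under Performance Optimization can be made concrete with a small calculation. The GPipe paper gives the amortized bubble time as O((K - 1) / (M + K - 1)) for *K* partitions and *M* micro-batches. The sketch below evaluates that fraction under the idealized assumption that every partition takes the same time per micro-batch. The function name `bubble_fraction` is hypothetical and is not part of GPipe.

```python
def bubble_fraction(num_partitions: int, num_micro_batches: int) -> float:
    """Idle fraction of an idealized pipeline: (K - 1) / (M + K - 1).

    Assumes (for illustration only) that every partition spends the same
    time on every micro-batch, so only the pipeline fill/drain is idle.
    """
    if num_partitions < 1 or num_micro_batches < 1:
        raise ValueError("both arguments must be positive integers")
    k, m = num_partitions, num_micro_batches
    return (k - 1) / (m + k - 1)


if __name__ == "__main__":
    k = 4  # number of partitions (accelerators)
    for m in (1, 4, 16, 32):
        print(f"K={k}, M={m}: bubble fraction = {bubble_fraction(k, m):.3f}")
```

With a single micro-batch the fraction is at its largest (this is the naive model-parallel case), and it shrinks steadily as *M* grows relative to *K*.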
diff --git a/site/_posts/2021-01-18-Energy-based Models for Continual Learning.md b/site/_posts/2021-01-18-Energy-based Models for Continual Learning.md new file mode 100755 index 00000000..225ed8dd --- /dev/null +++ b/site/_posts/2021-01-18-Energy-based Models for Continual Learning.md @@ -0,0 +1,92 @@ +--- +layout: post +title: Energy-based Models for Continual Learning +comments: True +excerpt: +tags: ['2020', 'Catastrophic Forgetting', 'Continual Learning', 'Energy-Based Models', 'Lifelong Learning', 'Replay Buffer', AI, CL, EBM, LL] + +--- + +## Introduction + +* The paper proposes to use Energy-based Models (EBMs) for Continual Learning. + +* In classification tasks, the standard approach uses a cross-entropy objective function along with a normalized probability distribution. + +* However, cross-entropy reduces all negative classes' likelihood when updating the model for a given sample, potentially leading to catastrophic forgetting. + +* Classification can be seen as learning an EBM across separate classes. + +* During an update, the energy of the pair of a sample and its ground truth class decreases, while the energy of the pairs of the sample and the negative classes increases. + +* Unlike the cross-entropy loss, EBMs allow choosing the negative classes to update. + +* [Link to the paper](https://arxiv.org/abs/2011.12216) + + +## Applications of EBMs for Continual Learning + +* EBMs can be used for class-incremental learning without requiring a replay-buffer or generative model for replay. + +* EBMs can be used for continual learning in setups without task boundaries, i.e., setups where the data distribution can change without a clear separation between tasks. + +## EBMs + +* The Boltzmann distribution is used to define the conditional likelihood of label $y$, given an input $x$, i.e., $p(y\|x) = \frac{\exp(-E(x, y))}{Z(x)}$ where $Z(x) = \sum_{y' \in Y} \exp(-E(x, y'))$. Here $E$ is the learnt energy function that maps an input-label pair to a scalar energy value. 
+ +* During training, the contrastive divergence loss is used. + +* During inference, the class whose input-class pair has the least energy is selected as the predicted class. + +## EBMs for Continual Learning + +### Selection of Negative Samples + +* The paper considers several strategies for the selection of negative samples: + + * one negative class per sample. The negative class is sampled from the current batch of data. This selection approach performs best. + + * all the negative classes in a batch are used for creating the negative samples. + + * all the classes seen so far in training are used as the negative samples. This approach performs worst in practice. + +* Given the flexibility of sampling the negative classes, EBMs can be used in boundary-agnostic setups (where the data distribution can change smoothly without an explicit task boundary). + +### Energy Network + +* EBMs take both the sample and the class as the input. The class can be treated as an attention filter to select the most relevant information between the sample and the class. + +* In theory, EBMs can train for any number of classes without knowing the number of classes beforehand. This is an advantage over the softmax-based approaches, where adding new classes requires changing the size of the softmax output layer. + +### Inference + +* During inference, all the classes seen so far are evaluated via the energy function. The class whose sample-class pair has the least energy is returned as the predicted class. + + +## Experiments + +### Datasets + +* Split MNIST + +* Permuted MNIST + +* CIFAR-10 + +* CIFAR-100 + + +### Results in Boundary-Aware Setting + + +* The proposed approach outperforms the standard continual learning approaches that use neither a replay-buffer nor a generative model. + +* Additionally, the paper shows that for the same number of parameters, the effective capacity of EBMs is higher than the effective capacity of standard classification models. 
+ +* The paper also shows that standard classification models tend to assign a high probability to new classes for both old and new data. EBMs assign the probability more uniformly (and correctly) across the classes. + +* In an ablation study, the paper shows that both label conditioning and the contrastive divergence loss help improve the performance of EBMs. + +### Results in Boundary-Agnostic Setting + +* The performance gains in the boundary-agnostic setting are even more significant than the improvements in the boundary-aware setting. \ No newline at end of file diff --git a/site/_posts/2021-01-25-HyperNetworks.md b/site/_posts/2021-01-25-HyperNetworks.md new file mode 100755 index 00000000..6cd4c7ae --- /dev/null +++ b/site/_posts/2021-01-25-HyperNetworks.md @@ -0,0 +1,59 @@ +--- +layout: post +title: HyperNetworks +comments: True +excerpt: +tags: ['2016', 'ICLR 2017', AI, HyperNetwork, ICLR] + +--- + +## Introduction + +* The paper explores HyperNetworks. The idea is to use one network (HyperNetwork) to generate the weights for another network. + +* [Link to the paper](https://arxiv.org/abs/1609.09106) + +* [Author's implementation](https://github.com/hardmaru/supercell/blob/master/supercell.py) + + +## Approach + +### Static HyperNetworks - HyperNetworks for CNNs + +* Consider a $D$-layer CNN where the parameters for the $j^{th}$ layer are stored in a matrix $K^j$ of the shape $N_{in}f_{size} \times N_{out}f_{size}$. + +* The HyperNetwork is implemented as a two-layer linear network where the input is a layer embedding $z^j$, and the output is $K^j$. + +* The first layer (of the HyperNetwork) maps the input to $N_{in}$ different outputs using $N_{in}$ weight matrices. + +* The second layer maps each of the $N_{in}$ outputs of the first layer to a matrix $K_{i}$ using a shared matrix. The resulting $N_{in}$ (number of) $K_{i}$ matrices are concatenated to obtain $K^j$. + +* As a side note, a HyperNetwork has far fewer parameters than the network for which it produces weights. 
+ +* In the general case, the kernel dimensions (across layers) are not of the same size but are integer multiples of some basic size. In that case, the HyperNetwork can generate kernels for the basic size, which can be concatenated to form larger kernels. This requires additional input embeddings but no change in the architecture of the HyperNetwork. + +### Dynamic HyperNetworks - HyperNetworks for RNNs + +* HyperRNNs/HyperLSTMs denote HyperNetworks that generate weights for RNNs/LSTMs. + +* HyperRNNs implement a form of relaxed weight sharing - an alternative to the full weight sharing of the traditional RNNs. + +* At any timestep $t$, the input to the HyperRNN is the concatenation of $x_{t}$ (input to the RNN at time $t$) and the hidden state $h_{t-1}$ of the RNN. The output is the set of weights for the main RNN at timestep $t$. + +* In practice, a *weight scaling vector* $d$ is used to reduce the memory footprint, which would otherwise be $dim$ times the memory of a standard RNN. $dim$ is the dimensionality of the embedding vector $z_j$. + +## Experiments + +* HyperNetworks are used to train standard CNNs for MNIST and ResNets for CIFAR-10. In these experiments, HyperNetworks slightly underperform the best performing models but use far fewer parameters. + +* HyperLSTMs trained on the Penn Treebank dataset and the Hutter Prize Wikipedia dataset outperform the stacked LSTMs and perform similarly to layer-norm LSTMs. Interestingly, using HyperLSTMs with layer-norm improves performance over HyperLSTMs. + +* Given the similar performance of HyperLSTMs and layer-norm LSTMs, the paper conducted an ablation study to understand if HyperLSTMs learned a weight adjustment policy similar to the statistics-based approach used by layer-norm LSTMs. + + * However, the analysis of the histogram of the hidden states suggests that using layer-norm reduces the saturation effect, while in HyperLSTMs the cell is saturated most of the time. 
This indicates that the two models are learning different policies. + +* HyperLSTMs are also evaluated for handwriting sequence generation by training on the IAM online handwriting dataset. + + * While HyperLSTMs are quite effective on this task, combining them with layer-norm degrades the performance. + +* On the WMT'14 En-to-Fr machine translation task, HyperLSTMs outperform LSTM-based approaches. \ No newline at end of file diff --git a/site/_posts/2021-02-01-Zero-shot Learning by Generating Task-specific Adapters.md b/site/_posts/2021-02-01-Zero-shot Learning by Generating Task-specific Adapters.md new file mode 100755 index 00000000..bb2340da --- /dev/null +++ b/site/_posts/2021-02-01-Zero-shot Learning by Generating Task-specific Adapters.md @@ -0,0 +1,70 @@ +--- +layout: post +title: Zero-shot Learning by Generating Task-specific Adapters +comments: True +excerpt: +tags: ['2021', 'Natural Language Processing', 'Text-to-Text Transformer', 'Zero-Shot', 'Zero Shot Generalization', Adapter, AI, HyperNetwork, NLP, Transformer] + +--- + +## Introduction + +* The paper introduces HYPTER - a framework for zero-shot learning (ZSL) in text-to-text transformer models by training a [**Hyp**erNetwork](https://shagunsodhani.com/papers-I-read/HyperNetworks) to generate task-specific [adap**ter**s](https://arxiv.org/abs/1902.00751) from task descriptions. + +* The focus is on *in-task* zero-shot learning (e.g., learning to predict an unseen class or relation) and not on *cross-task* learning (e.g., training on sentiment analysis and evaluating on a question-answering task). + +* [Link to the paper](https://arxiv.org/abs/2101.00420) + + +## Terminology + +* *Task* - an NLP task, like classification or question answering. + +* *Sub-task* + + * A class/relation/question within a task. + + * Denoted by a tuple $(d, D)$ where $d$ is the language description while $D$ represents the sub-task's dataset. 
+ +## Setup + +* Develop a ZSL approach for transfer to new sub-tasks within a task, using the task description available for each sub-task. + +## Approach + +* HYPTER has two main parts: + + * Main network + + * A pretrained text-to-text network + + * Instantiated as a BART-Base/Large + + * HyperNetwork + + * Generates the weights for adapter networks that will be plugged into the main network. + + +* HyperNetwork has two parts: + + * Encoder + + * Encodes the task description + + * Instantiated as a RoBERTa-Base model + + * Decoder + + * Decodes the encoding into weights for multiple adapters (in parallel) + + * Instantiated as a Feedforward Network + +* The model trains in two phases: + + * Main network is trained on all the data by concatenating the task description with the input. + + * Adapters are trained (via the HyperNetwork that generates them) by sampling a sub-task from the train set while keeping the main network frozen. + +## Experiments + +* While the idea is very promising and interesting, the evaluation felt quite limited. It uses just two datasets, [Zero-shot learning from Task Descriptions](https://leaderboard.allenai.org/zest/submissions/public) and [Zero-shot Relation Extraction](https://eval.ai/web/challenges/challenge-page/689/overview), and shows some improvements over the baseline of directly finetuning with task descriptions as the prompt. \ No newline at end of file diff --git a/site/_site b/site/_site index 61fbdfd6..f898116a 160000 --- a/site/_site +++ b/site/_site @@ -1 +1 @@ -Subproject commit 61fbdfd6d42554525bec6603bfc2278112c66a8a +Subproject commit f898116a395b873ddda09d3b3e1480ed91031d5d