Add large memory layer paper
shagunsodhani committed Sep 13, 2019
1 parent 30d952d commit ca9ded1
Showing 3 changed files with 75 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -5,6 +5,7 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [Large Memory Layers with Product Keys](https://shagunsodhani.com/papers-I-read/Large-Memory-Layers-with-Product-Keys)
* [Abductive Commonsense Reasoning](https://shagunsodhani.com/papers-I-read/Abductive-Commonsense-Reasoning)
* [Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models](https://shagunsodhani.com/papers-I-read/Deep-Reinforcement-Learning-in-a-Handful-of-Trials-using-Probabilistic-Dynamics-Models)
* [Assessing Generalization in Deep Reinforcement Learning](https://shagunsodhani.com/papers-I-read/Assessing-Generalization-in-Deep-Reinforcement-Learning)
73 changes: 73 additions & 0 deletions site/_posts/2019-08-22-Large Memory Layers with Product Keys.md
@@ -0,0 +1,73 @@
---
layout: post
title: Large Memory Layers with Product Keys
comments: True
excerpt:
tags: ['2019', 'Key Value', 'Natural Language Processing', AI, Attention, Memory, NLP]

---

## Introduction

* The paper proposes a structured key-value memory layer that:
* Can scale to a very large size (and capacity).
* Has very low computational overhead.
* Supports exact search in the keyspace.
* Can be easily integrated with neural networks.

* [Link to the paper](https://arxiv.org/abs/1907.05242)

## Architecture

* The memory layer is composed of 3 components:

* **Query Network**

* Maps input to a latent space.
* Can be implemented as a feed-forward network.
* Adding batch-norm on top of the query network helps to spread out the queries over the key space, improving key usage.

* **Key selection module**

* Let's say there are a total of *K* keys of dimensionality *d<sub>q</sub>*, of which we want to select the top *k* keys.
* Instead of storing *K* flat keys, maintain two sets of *subkeys* (say *Q<sub>1</sub>* and *Q<sub>2</sub>*), where each set has $\sqrt{K}$ subkeys of dimensionality *d<sub>q</sub>/2*. Every full key is the concatenation of one subkey from each set (the product-key structure).
* The query is split into two subqueries (say *q<sub>1</sub>* and *q<sub>2</sub>*).
* Each of these two subqueries is compared with every subkey in its corresponding set of *subkeys*.
* For example, *q<sub>1</sub>* is compared with every subkey in *Q<sub>1</sub>*.
* The top *k* ranked subkeys are selected from each set to create two candidate sets *C<sub>1</sub>* and *C<sub>2</sub>*.
* The subkeys from these two sets are combined under the concatenation operator to obtain *k<sup>2</sup>* candidate keys.
* The final top *k* (concatenated) keys are selected from these *k<sup>2</sup>* candidates.
* The overall complexity is $O((\sqrt{K} + k^2) \times d_q)$, compared to $O(K \times d_q)$ for an exhaustive search over all *K* keys.

* **Value lookup table**

* The values corresponding to the *k* selected keys are aggregated using a weighted sum, with the softmax-normalized key scores as weights, to obtain the output.

* All the parameters are trainable, though, in practice, only the selected *k* memory slots are updated.

* Using a multi-head attention mechanism (multiple query heads per input, sharing the same values) helps to improve the performance further. A minimal sketch of the full memory layer is given below.
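
Below is a minimal PyTorch sketch that combines the three components (query network with batch-norm, product-key selection, and softmax-weighted value lookup). The class name, dimensions, and initialization are illustrative assumptions, not the authors' implementation.

```python
# A minimal, self-contained sketch of a product-key memory layer as described
# above. Hyperparameters and module names are illustrative.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProductKeyMemory(nn.Module):
    def __init__(self, input_dim=512, query_dim=256, n_subkeys=512, k=32, value_dim=512):
        super().__init__()
        self.k = k
        self.n_subkeys = n_subkeys                   # sqrt(K); total keys K = n_subkeys ** 2
        # Query network: linear projection + batch-norm to spread out the queries.
        self.query_net = nn.Sequential(
            nn.Linear(input_dim, query_dim),
            nn.BatchNorm1d(query_dim),
        )
        # Two sets of subkeys, each of dimensionality query_dim / 2.
        half = query_dim // 2
        self.subkeys1 = nn.Parameter(torch.randn(n_subkeys, half) / math.sqrt(half))
        self.subkeys2 = nn.Parameter(torch.randn(n_subkeys, half) / math.sqrt(half))
        # One value per (subkey1, subkey2) pair, i.e. K = n_subkeys ** 2 values.
        self.values = nn.Embedding(n_subkeys ** 2, value_dim, sparse=True)

    def forward(self, x):                            # x: (batch, input_dim)
        q = self.query_net(x)                        # (batch, query_dim)
        q1, q2 = q.chunk(2, dim=-1)                  # two subqueries

        # Compare each subquery with every subkey in its own set: O(sqrt(K) * d_q).
        s1 = q1 @ self.subkeys1.t()                  # (batch, n_subkeys)
        s2 = q2 @ self.subkeys2.t()
        top1, idx1 = s1.topk(self.k, dim=-1)         # candidate set C1
        top2, idx2 = s2.topk(self.k, dim=-1)         # candidate set C2

        # Scores of the k^2 concatenated candidate keys are sums of subkey scores.
        cand_scores = top1.unsqueeze(-1) + top2.unsqueeze(-2)           # (batch, k, k)
        cand_ids = idx1.unsqueeze(-1) * self.n_subkeys + idx2.unsqueeze(-2)
        cand_scores = cand_scores.view(x.size(0), -1)
        cand_ids = cand_ids.view(x.size(0), -1)

        # Final top-k among the k^2 candidates: O(k^2 * d_q) overall.
        scores, best = cand_scores.topk(self.k, dim=-1)                 # (batch, k)
        ids = cand_ids.gather(-1, best)

        # Value lookup: softmax-weighted sum of the k selected values.
        weights = F.softmax(scores, dim=-1)                             # (batch, k)
        out = (weights.unsqueeze(-1) * self.values(ids)).sum(dim=1)
        return out                                                      # (batch, value_dim)
```

In the multi-head variant, several such query heads would run in parallel over a shared value table, with their outputs summed.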

## Experiments

* One or more feed-forward layers in the transformer are replaced by memory layers (as sketched below).
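
As a rough illustration of that substitution, the hypothetical block below replaces the position-wise feed-forward sublayer with the `ProductKeyMemory` sketch from the previous section; the block structure and dimensions are assumptions for illustration.

```python
# Hypothetical transformer block where the feed-forward sublayer is swapped
# for the memory layer sketched above.
import torch.nn as nn


class TransformerBlockWithMemory(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # The usual position-wise feed-forward network is replaced by the
        # product-key memory layer (same input/output width, d_model).
        self.memory = ProductKeyMemory(input_dim=d_model, value_dim=d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        b, s, d = x.shape
        # The memory layer is applied position-wise, like the FFN it replaces.
        m = self.memory(x.reshape(b * s, d)).reshape(b, s, d)
        return self.norm2(x + m)
```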

* The model is evaluated on large-scale language modeling tasks with 140 GB of data from the Common Crawl corpus (28 billion words).

* Evaluation metrics

* Perplexity on the test set.

* Fraction of accessed values.

* KL divergence between the (normalized) distribution of key access weights and the uniform distribution.

* The last two metrics are used together to determine how well the keys are utilized (a sketch of computing them is given below).
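
A small sketch of how these two usage metrics could be computed, assuming a hypothetical `access_counts` tensor that records how often each of the *K* values was selected during evaluation:

```python
# Key-usage metrics: fraction of accessed values and one reasonable
# formulation of KL(access distribution || uniform). Variable names are
# illustrative.
import torch


def usage_metrics(access_counts: torch.Tensor):
    K = access_counts.numel()
    # Fraction of values that were accessed at least once.
    used_fraction = (access_counts > 0).float().mean().item()
    # KL divergence between the normalized access distribution and uniform.
    p = access_counts.float() / access_counts.sum().clamp(min=1)
    uniform = 1.0 / K
    kl = (p * (p / uniform).clamp(min=1e-12).log()).sum().item()
    return used_fraction, kl
```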

## Results

* Given the large size of the training dataset, adding more layers to the transformer model helps.

* The effect of adding a memory layer is stronger than the effect of adding new layers to the transformer. For example, a 12-layer transformer with a memory layer outperforms a 24-layer transformer while being almost twice as fast.

* The best position for the memory is at an intermediate layer; placing it right after the input or just before the softmax layer does not work well in practice.

2 changes: 1 addition & 1 deletion site/_site
Submodule _site updated from 5dc50d to d8120a
