Add new papers

shagunsodhani · Jan 6, 2020 · ac47383 · ac47383
1 parent 5c94a7f
commit ac47383
Showing 7 changed files with 392 additions and 49 deletions.
diff --git a/README.md b/README.md
@@ -5,6 +5,9 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 
 ## List of papers
 
+* [Superposition of many models into one](https://shagunsodhani.com/papers-I-read/Superposition-of-many-models-into-one)
+* [Towards a Unified Theory of State Abstraction for MDPs](https://shagunsodhani.com/papers-I-read/Towards-a-Unified-Theory-of-State-Abstraction-for-MDPs)
+* [ALBERT - A Lite BERT for Self-supervised Learning of Language Representations](https://shagunsodhani.com/papers-I-read/ALBERT-A-Lite-BERT-for-Self-supervised-Learning-of-Language-Representations)
 * [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://shagunsodhani.com/papers-I-read/Mastering-Atari,-Go,-Chess-and-Shogi-by-Planning-with-a-Learned-Model)
 * [Contrastive Learning of Structured World Models](https://shagunsodhani.com/papers-I-read/Contrastive-Learning-of-Structured-World-Models)
 * [Gossip based Actor-Learner Architectures for Deep RL](https://shagunsodhani.com/papers-I-read/Gossip-based-Actor-Learner-Architectures-for-Deep-RL)

diff --git a/site/_posts/2019-02-12-Gossip based Actor-Learner Architectures for Deep RL.md b/site/_posts/2019-02-12-Gossip based Actor-Learner Architectures for Deep RL.md
diff --git a/...Happens for a Reason - Discovering the Purpose of Actions in Procedural Text.md b/...Happens for a Reason - Discovering the Purpose of Actions in Procedural Text.md
@@ -0,0 +1,116 @@
+---
+layout: post
+title: Everything Happens for a Reason - Discovering the Purpose of Actions in Procedural Text
+comments: True
+excerpt: 
+tags: ['2019', 'EMNLP 2019', 'Procedural Text', 'Relation Learning', 'Relational Learning', AI, Dataset, EMNLP, Graph, NLP, Reasoning]
+
+---
+
+## Introduction
+
+* Procedural text comprehension tasks focus on modeling the effect of actions and predicting what happens next.
+
+* But they do not consider *why* some actions need to happen before other actions.
+
+* The paper proposes a new model called XPAD (eXPlainable Action Dependency) that considers the *purpose* of actions while predicting their effect.
+
+* The model favors *effects* that:
+
+   * explain more of the actions in the text.
+
+   * are more plausible given the context.
+
+* An existing procedural text benchmark dataset (Propara) is expanded by adding the task of explaining actions by predicting their dependencies.
+
+* [Link to the paper](https://arxiv.org/abs/1909.04745)
+
+* [Link to the dataset](http://data.allenai.org/propara/)
+
+## Setup
+
+* Input 
+
+   * Procedural text, i.e., a chronologically ordered sequence of *T* sentences.
+
+   * List of *N* participant entities, whose state changes at some step.
+
+* Output
+
+   * State change matrix $\pi(T \times N)$ with four possible state changes - move, create, destroy, none.
+
+   * This matrix tracks how the properties of each entity change after each step.
+
+* Dependency Explanation Graph
+
+   * Identify what steps are necessary to execute a given step (say *s<sub>i</sub>*) and represent this dependency in the form of a dependency explanation graph *G = <S, E>*.
+
+   * In this graph, each node is a step and the direction of each edge describes the order of dependency.
+
+## Dependency Graph Dataset
+
+* [Propara dataset](https://arxiv.org/abs/1805.06975) is expanded to extract the dependency graph using both heuristic and automated methods.
+
+* The automated method is based on the coherence assumption: if step *s<sub>j</sub>* changes the state of entity *e<sub>k</sub>*, then *s<sub>j</sub>* is a precondition for the first subsequent step that changes the state of *e<sub>k</sub>*.
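The coherence assumption above can be illustrated with a minimal sketch. The encoding of the state change matrix as a list of rows, the function name `dependency_edges`, and the toy process are assumptions made for illustration; they are not taken from the paper or the dataset tooling.

```python
from typing import List, Set, Tuple

# Hypothetical encoding: state_changes[t][k] is the change that entity k
# undergoes at step t, one of "move", "create", "destroy" or "none".
def dependency_edges(state_changes: List[List[str]]) -> Set[Tuple[int, int]]:
    """Return edges (j, i) meaning step j is a precondition for step i."""
    edges: Set[Tuple[int, int]] = set()
    num_entities = len(state_changes[0]) if state_changes else 0
    for k in range(num_entities):
        last_step = None  # most recent step that changed entity k
        for t, row in enumerate(state_changes):
            if row[k] != "none":
                if last_step is not None:
                    # Step t is the first later step that changes entity k.
                    edges.add((last_step, t))
                last_step = t
    return edges

# Toy process with T = 3 steps and N = 2 entities.
toy = [
    ["move", "none"],     # step 0 changes entity 0
    ["none", "create"],   # step 1 changes entity 1
    ["destroy", "move"],  # step 2 changes both entities
]
edges = dependency_edges(toy)
```

Each entity contributes a chain of edges between consecutive steps that change it, so a step with no incoming or outgoing edge is one whose purpose the graph cannot explain.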
+
+## XPAD Model
+
+* The model is based on the ProStruct system and uses an encoder-decoder based architecture.
+
+* Encoder
+
+   * Input: Sentence *s<sub>t</sub>* and entity *e<sub>j</sub>*.
+
+   * Sentence is encoded using the GloVe vectors and a BiLSTM model and the entity is encoded as an indicator variable.
+
+   * The combined representation is denoted as *c<sub>tj</sub>*.
+
+   * This representation is passed through an MLP to generate *k* logits that encode the probability of each entity *j* undergoing a state change at step *t*.
+
+* Decoder
+
+   * Beam search is performed to decode the encoder representation into the state change matrix and dependency graph using a score function that ensures global consistency.
+
+   * Score function has two components:
+
+     * State change score - depends on the likelihood of the selected state change at step *t* given the text and the state change history from steps *s<sub>1</sub>* to *s<sub>t-1</sub>*.
+
+     * Dependency graph score
+
+       * This is based on the connectivity and likelihood of the resulting dependency explanation graph. 
+
+       * This score is used to bias the graph search towards:
+
+         * predictions that have an identifiable purpose ie checking if a particular state change prediction leads to a connection in the dependency explanation graph.
+
+         * graphs that are more likely according to the background knowledge to distinguish likely dependency links from the unlikely ones.
+
+* During training, XPAD has access to the correct path (in the search space) and learns to minimize the joint loss corresponding to predicting the state change and the dependency explanation graph.
+
+* During testing, XPAD performs beam search to predict the most likely state change and dependency explanation graph.
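A rough sketch of how a beam search can combine a local state change score with a dependency bonus is given below. Everything in it is a toy assumption: a single entity, the hand-written `LOCAL_SCORES` table, the constant `EDGE_BONUS`, and the function names. It is not the XPAD scoring function, only an illustration of why a graph-aware score can change which path wins.

```python
import heapq
from typing import List, Tuple

CANDIDATES = ["none", "create", "move", "destroy"]

# Hypothetical per-step scores for one entity (stand-in for encoder logits).
LOCAL_SCORES = [
    {"create": 1.0, "none": 0.5, "move": 0.0, "destroy": 0.0},  # step 0
    {"none": 0.6, "move": 0.5, "create": 0.0, "destroy": 0.0},  # step 1
]
EDGE_BONUS = 0.3  # assumed reward when a change links to an earlier change

def step_score(t: int, history: List[str], cand: str) -> float:
    """State change score plus a bonus if the change adds a graph edge."""
    score = LOCAL_SCORES[t][cand]
    if cand != "none" and any(h != "none" for h in history):
        score += EDGE_BONUS
    return score

def beam_search(num_steps: int, beam_width: int = 2) -> Tuple[float, List[str]]:
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for t in range(num_steps):
        expanded = [
            (total + step_score(t, history, cand), history + [cand])
            for total, history in beams
            for cand in CANDIDATES
        ]
        # Keep only the highest scoring partial paths.
        beams = heapq.nlargest(beam_width, expanded, key=lambda b: b[0])
    return beams[0]

best_score, best_history = beam_search(num_steps=2)
```

Without the bonus, the best path would pick "none" at step 1; with it, the search prefers "move" because that change connects to the earlier "create" step.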
+
+## Experiments
+
+* Tasks:
+
+   * State change prediction
+
+   * Dependency explanation prediction
+
+* Baselines:
+
+   * [Recurrent Entity Networks](https://arxiv.org/abs/1612.03969)
+
+   * [Query-Reduction Networks](https://arxiv.org/abs/1606.04582)
+
+   * [ProLocal and ProGlobal](https://arxiv.org/abs/1805.06975)
+
+   * [ProStruct](https://arxiv.org/abs/1808.10012)
+
+* XPAD significantly outperforms all the baseline models on the dependency explanation task.
+
+* Improvements on the state change prediction task are less significant.
+
+* Removing dependency graph scores from XPAD leads to a drop in the F1 score.
+
+* The paper provides an elaborate discussion on the different types of errors that the XPAD system makes.
diff --git a/...LBERT - A Lite BERT for Self-supervised Learning of Language Representations.md b/...LBERT - A Lite BERT for Self-supervised Learning of Language Representations.md
@@ -0,0 +1,77 @@
+---
+layout: post
+title: ALBERT - A Lite BERT for Self-supervised Learning of Language Representations
+comments: True
+excerpt: 
+tags: ['2019', 'ICLR 2020', 'Natural Language Processing', 'Representation Learning', AI, Attention, ICLR, NLP, Transformer, SOTA]
+
+---
+
+
+## Introduction
+
+* The paper proposes parameter-reduction techniques to lower the memory consumption (and improve training speed) of BERT.
+
+* It also proposes to use a self-supervised loss (based on inter-sentence coherence) and argues that this loss is better than the NSP loss used by BERT.
+
+* [Link to the paper](https://arxiv.org/abs/1909.11942)
+
+## Architecture
+
+* ALBERT architecture is similar to that of BERT with three major differences.
+
+* Factorized Embedding Parameterization
+
+    * In BERT and followup works, the embedding size was tied to the size of the context vector. 
+
+    * Since the context vector is expected to encode the entire context, it needs to have a large dimensionality.
+
+    * One consequence of this choice is that even the embedding layer (which encodes the representation for each token) has a large size. This increases the overall memory footprint of the model.
+
+    * The paper proposes to factorize the embedding parameters into two smaller matrices.
+
+    * The embedding layer learns a low dimensional representation of the tokens and this representation is projected into a high dimensional space.
+
+* Cross-layer parameter sharing
+
+    * ALBERT shares all the parameters across the layers.
+
+* Inter-sentence coherence loss
+
+    * BERT uses two losses - Masked Language Modeling loss (MLM) and Next Sentence Prediction (NSP).
+
+    * In the NSP task, the model is provided a pair of sentences and it has to predict if the two sentences appear consecutively in the same document or not. Negative samples are created by sampling sentences from different documents.
+
+    * The paper argues that NSP is not effective as a loss function as it merges topic prediction and coherence prediction into one task (since the two sentences in a negative sample come from different documents). Topic prediction is an easier task than coherence prediction.
+
+    * Hence the paper proposes to use the Sentence Order Prediction task where the model has to predict which of the two sentences comes first in a document. The negative samples are created by simply swapping the order in the positive samples. Hence both the sentences come from the same document and topic prediction alone can not be used to solve the task.
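The saving from the factorized embedding parameterization can be checked with simple arithmetic. The sketch below assumes a vocabulary of 30000 tokens, a context (hidden) size of 4096 and an embedding size of 128; these numbers and the helper name `embedding_params` are illustrative assumptions.

```python
from typing import Optional

def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count embedding parameters with or without factorization."""
    if embedding_size is None:
        # Tied sizes: a single vocab_size x hidden_size matrix.
        return vocab_size * hidden_size
    # Factorized: vocab_size x embedding_size lookup table followed by
    # an embedding_size x hidden_size projection.
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30000, 4096, 128  # assumed illustrative sizes
tied = embedding_params(V, H)
factorized = embedding_params(V, H, E)
```

With these sizes the tied layer has 122,880,000 parameters and the factorized one has 4,364,288, so the saving grows with the gap between the embedding size and the context size.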
+
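The difference between the two sentence-level tasks shows up in how negative samples are built. The following is a minimal sketch that treats a document as a list of sentences; the function names and the label convention (1 for positive, 0 for negative) are assumptions for illustration.

```python
from typing import List, Tuple

Example = Tuple[str, str, int]

def sop_examples(doc: List[str]) -> List[Example]:
    """Sentence Order Prediction: negatives swap the order of a positive."""
    examples: List[Example] = []
    for first, second in zip(doc, doc[1:]):
        examples.append((first, second, 1))  # correct order
        examples.append((second, first, 0))  # same sentences, swapped
    return examples

def nsp_negative(doc: List[str], other_doc: List[str], i: int, j: int) -> Example:
    """Next Sentence Prediction: the second sentence is from another document."""
    return (doc[i], other_doc[j], 0)

doc = ["a", "b", "c"]
sop = sop_examples(doc)
neg = nsp_negative(doc, ["x", "y"], 0, 1)
```

In the SOP examples, both labels are built from the same pair of sentences, so topic cues carry no information about the label. The NSP negative mixes two documents, so topic alone can often give it away.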
+## Setup
+
+* Different variants (in terms of size) of ALBERT and BERT models are compared (eg ALBERT, ALBERT-x, BERT-x, etc).
+
+* In general, ALBERT models have many times fewer parameters than the corresponding BERT models.
+
+* Datasets - BookCorpus, English Wikipedia.
+
+## Observations
+
+* ALBERT-xxlarge significantly outperforms the BERT-large model even though it has only around 70% as many parameters as the BERT-large model.
+
+* BERT-xlarge performs worse than BERT-base, hinting that it is difficult to train such large models.
+
+* ALBERT models also have better data throughput as compared to BERT models.
+
+* For the ALBERT models, an embedding size of 128 performs the best.
+
+* As the hidden dimension is increased, the model obtains better performance, but with diminishing returns.
+
+* Very wide ALBERT models (say with a context size of 1024) do not benefit much from depth.
+
+* Using additional training data boosts the performance for most of the downstream tasks.
+
+* The paper empirically shows that using dropout could hurt the performance of the ALBERT models. This observation may not hold for BERT as it does not share parameters across layers and hence may need regularization via dropout.
+
+* ALBERT also improves the state of the art performance on GLUE, SQuAD and RACE benchmarks, for both single-model and ensemble setup.
+
+
diff --git a/site/_posts/2019-12-26-Towards a Unified Theory of State Abstraction for MDPs.md b/site/_posts/2019-12-26-Towards a Unified Theory of State Abstraction for MDPs.md
@@ -0,0 +1,82 @@
+---
+layout: post
+title: Towards a Unified Theory of State Abstraction for MDPs
+comments: True
+excerpt: 
+tags: ['2006', 'Markov Decision Process', 'Reinforcement Learning', 'State Abstraction', AI, MDP, RL]
+
+---
+
+
+## Introduction
+
+* The paper studies five different techniques for state abstraction in MDPs (Markov Decision Processes) and evaluates their usefulness for planning and learning.
+
+* The general idea behind abstraction is to map the actual (or observed) state to an abstract state that should be more amenable for learning.
+
+* It can be thought of as a mapping from one representation to another representation while preserving some useful properties.
+
+* [Link to the paper](https://pdfs.semanticscholar.org/ca9a/2d326b9de48c095a6cb5912e1990d2c5ab46.pdf)
+
+
+## General Definition
+
+* Consider an MDP $$M = \langle S, A, P, R, \gamma \rangle$$ where $$S$$ is the finite set of states, $$A$$ is the finite set of actions, $$P$$ is the transition function, $$R$$ is the bounded reward function and $$\gamma$$ is the discount factor.
+
+* The abstract version of the MDP is $$\widetilde{M} = \langle \widetilde{S}, A, \widetilde{P}, \widetilde{R}, \gamma \rangle$$ where $$\widetilde{S}$$ is the finite set of abstract states, $$\widetilde{P}$$ is the transition function in the abstract state space and $$\widetilde{R}$$ is the bounded reward function in the abstract state space.
+
+* Abstraction function $$\phi$$ is a function that maps a given state $$s$$ to its abstract counterpart $$\widetilde{s}$$.
+
+* The inverse image $$\phi^{-1}(\widetilde{s})$$ is the set of ground states that map to $$\widetilde{s}$$ under the abstraction function $$\phi$$.
+
+* A weighting function $$w(s)$$ is used to measure how much a state $$s$$ contributes to the abstract state $$\phi(s)$$.
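One way to read these definitions is as a recipe for building the abstract MDP from the ground one. The sketch below assumes the common weighted-sum construction, in which the abstract reward and transition are averages over the inverse image using weights $$w(s)$$ that sum to 1 within each abstract state. That construction, the dictionary-based encoding, and all names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

def inverse_image(phi: Dict[int, Hashable], abstract_state: Hashable) -> List[int]:
    """Ground states that phi maps to the given abstract state."""
    return [s for s, block in phi.items() if block == abstract_state]

def build_abstract_mdp(states, actions, P, R, phi, w):
    """P[(s, a)] is {s_next: prob}, R[(s, a)] is a float, w[s] is a weight."""
    abs_R = defaultdict(float)
    abs_P = defaultdict(lambda: defaultdict(float))
    for s in states:
        for a in actions:
            # Each ground state contributes in proportion to its weight.
            abs_R[(phi[s], a)] += w[s] * R[(s, a)]
            for s_next, prob in P[(s, a)].items():
                abs_P[(phi[s], a)][phi[s_next]] += w[s] * prob
    return abs_R, abs_P

# Toy ground MDP: states 0 and 1 behave alike and are merged into "A".
states, actions = [0, 1, 2], ["go"]
P = {(0, "go"): {2: 1.0}, (1, "go"): {2: 1.0}, (2, "go"): {2: 1.0}}
R = {(0, "go"): 1.0, (1, "go"): 1.0, (2, "go"): 0.0}
phi = {0: "A", 1: "A", 2: "B"}
w = {0: 0.5, 1: 0.5, 2: 1.0}
abs_R, abs_P = build_abstract_mdp(states, actions, P, R, phi, w)
```

Here the abstract state "A" inherits reward 1.0 and moves to "B" with probability 1.0, matching both of the ground states it replaces.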
+
+## Topology of Abstraction Space
+
+* Given two abstraction functions $$\phi_{1}$$ and $$\phi_{2}$$, $$\phi_{1}$$ is said to be *finer* than $$\phi_{2}$$ iff for any states $$s_{1}, s_{2}$$ if $$\phi_{1}(s_{1}) = \phi_{1}(s_{2})$$ then $$\phi_{2}(s_{1}) = \phi_{2}(s_{2})$$.
+
+* This *finer* relation is reflexive, antisymmetric and transitive, and hence defines a partial order over abstraction functions.
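The *finer* relation can be checked directly from its definition on a finite state set. The sketch below represents each abstraction function as a dictionary from ground states to abstract states; the helper name `is_finer` and the toy mappings are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable

def is_finer(phi1: Dict[int, Hashable], phi2: Dict[int, Hashable],
             states: Iterable[int]) -> bool:
    """True if states merged by phi1 are always merged by phi2 as well."""
    return all(
        phi2[s1] == phi2[s2]
        for s1, s2 in combinations(list(states), 2)
        if phi1[s1] == phi1[s2]
    )

states = [0, 1, 2]
identity = {0: 0, 1: 1, 2: 2}      # merges nothing
pairs = {0: "A", 1: "A", 2: "B"}   # merges states 0 and 1
single = {0: "X", 1: "X", 2: "X"}  # merges everything
```

The identity mapping is finer than every other abstraction because it never merges two states, which makes the premise of the definition vacuous.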
+
+## Five Types of Abstraction
+
+* While many abstractions are possible, not all abstractions are equally important.
+
+* Model-irrelevance abstraction $$\phi_{model}$$:
+
+    * If two states $$s_{1}$$ and $$s_{2}$$ have the same abstracted state, then their one-step model is preserved.
+
+    * For any action $$a$$ and any abstract state $$\widetilde{s}$$, if $$\phi_{model}(s_{1}) = \phi_{model}(s_{2})$$ then $$R(s_1, a) = R(s_2, a)$$ and $$\sum_{s' \in \phi_{model}^{-1}(\widetilde{s})}P_{s_1, s'}^{a} = \sum_{s' \in \phi_{model}^{-1}(\widetilde{s})}P_{s_2, s'}^{a}$$.
+
+* $$Q^{\pi}$$-irrelevance abstraction:
+
+    * It preserves the state-action value function for all the states.
+
+    * $$\phi_{Q^{\pi}}(s_1) = \phi_{Q^{\pi}}(s_2)$$ implies $$Q^{\pi}(s_1, a) = Q^{\pi}(s_2, a)$$ for every action $$a$$.
+
+* $$Q^{*}$$-irrelevance abstraction:
+
+    * It preserves the optimal state-action value function.
+
+* $$a^{*}$$-irrelevance abstraction:
+
+    * It preserves the optimal action and its value function.
+
+* $$\pi^{*}$$-irrelevance abstraction:
+
+    * It preserves the optimal action.
+
+* In terms of *fineness*, $$\phi_0 \geq \phi_{model} \geq \phi_{Q^{\pi}} \geq \phi_{Q^*} \geq \phi_{a^*} \geq \phi_{\pi^*} $$. Here $$\phi_0$$ is the identity mapping ie $$\phi_0(s) = s$$.
+
+* If a property applies to any abstraction, it also applies to all the finer abstractions.
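The model-irrelevance condition is the easiest of the five to test mechanically, since it only involves one-step rewards and transitions. The sketch below is a brute-force check on a toy MDP; the dictionary encoding, the tolerance `tol`, and the function name are assumptions for illustration.

```python
from typing import Dict, Hashable, List, Tuple

def is_model_irrelevant(states: List[int], actions: List[str],
                        P: Dict[Tuple[int, str], Dict[int, float]],
                        R: Dict[Tuple[int, str], float],
                        phi: Dict[int, Hashable], tol: float = 1e-9) -> bool:
    """Check that merged states agree on rewards and block transitions."""
    blocks = set(phi.values())

    def block_prob(s: int, a: str, block: Hashable) -> float:
        # Probability of landing anywhere in the inverse image of block.
        return sum(p for s_next, p in P[(s, a)].items() if phi[s_next] == block)

    for s1 in states:
        for s2 in states:
            if phi[s1] != phi[s2]:
                continue
            for a in actions:
                if abs(R[(s1, a)] - R[(s2, a)]) > tol:
                    return False
                for block in blocks:
                    if abs(block_prob(s1, a, block) - block_prob(s2, a, block)) > tol:
                        return False
    return True

states, actions = [0, 1, 2], ["go"]
P = {(0, "go"): {0: 0.5, 2: 0.5}, (1, "go"): {1: 0.5, 2: 0.5}, (2, "go"): {2: 1.0}}
R = {(0, "go"): 1.0, (1, "go"): 1.0, (2, "go"): 0.0}
good_phi = {0: "A", 1: "A", 2: "B"}
bad_phi = {0: "A", 1: "B", 2: "B"}
```

States 0 and 1 transition to different ground states, yet `good_phi` passes because the condition compares probabilities summed over abstract states, not individual ground transitions. `bad_phi` fails because states 1 and 2 have different rewards.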
+
+## Key Theorems
+
+* As we go from finer to coarser abstractions, the information loss increases (ie fewer components can be recovered) while the state-space reduces (ie the efficiency of solving the problem increases). This leads to a tradeoff when selecting abstractions.
+
+* For example, with abstractions $$\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}, \phi_{a^*}$$, the optimal abstract policy $$\widetilde{\pi}^*$$ is optimal in the ground MDP.
+
+* Similarly, if each state-action pair is visited infinitely often and the step-size decays properly, Q-learning with $$\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}$$ converges to the optimal state-action value functions in the MDP. More conditions are needed for convergence in the case of the remaining two abstractions.
+
+* For $$\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}, \phi_{a^*}$$, the model built from experience converges to the true abstract model with infinite experience if the weighting function $$w(s)$$ is fixed.
+
+