diff --git a/README.md b/README.md index f8a8fa73..58bd52c4 100755 --- a/README.md +++ b/README.md @@ -5,6 +5,9 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho ## List of papers +* [Superposition of many models into one](https://shagunsodhani.com/papers-I-read/Superposition-of-many-models-into-one) +* [Towards a Unified Theory of State Abstraction for MDPs](https://shagunsodhani.com/papers-I-read/Towards-a-Unified-Theory-of-State-Abstraction-for-MDPs) +* [ALBERT - A Lite BERT for Self-supervised Learning of Language Representations](https://shagunsodhani.com/papers-I-read/ALBERT-A-Lite-BERT-for-Self-supervised-Learning-of-Language-Representations) * [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://shagunsodhani.com/papers-I-read/Mastering-Atari,-Go,-Chess-and-Shogi-by-Planning-with-a-Learned-Model) * [Contrastive Learning of Structured World Models](https://shagunsodhani.com/papers-I-read/Contrastive-Learning-of-Structured-World-Models) * [Gossip based Actor-Learner Architectures for Deep RL](https://shagunsodhani.com/papers-I-read/Gossip-based-Actor-Learner-Architectures-for-Deep-RL) diff --git a/site/_posts/2019-02-12-Gossip based Actor-Learner Architectures for Deep RL.md b/site/_posts/2019-02-12-Gossip based Actor-Learner Architectures for Deep RL.md deleted file mode 100644 index 63e8fb6c..00000000 --- a/site/_posts/2019-02-12-Gossip based Actor-Learner Architectures for Deep RL.md +++ /dev/null @@ -1,48 +0,0 @@ ---- -layout: post -title: Gossip based Actor-Learner Architectures for Deep RL -comments: True -excerpt: -tags: ['2019', 'Deep Reinforcement Learning', 'Distributed Reinforcement Learning', 'Neurips 2019', 'Reinforcement Learning', AI, DRL, Neurips, RL] - ---- - -* [Link to the paper](https://arxiv.org/abs/1906.04585) - -* The paper considers the task of training an RL system by sampling data from multiple simulators (over parallel devices). 
- -* The setup is that of distributed RL setting with *n* agents or actor-learners (composed of a single learner and several actors). These agents are trying to maximize a common value function. - -* One (existing) approach is to perform on-policy updates with a shared policy. The policy could be updated in synchronous (does not scale well) or asynchronous manner (can be unstable due to stale gradients). - -* Off policy approaches allow for better computational efficiency but can be unstable during training. - -* The paper proposed Gossip based Actor-Learner Architecture (GALA) which uses asynchronous communication (gossip) between the *n* agents to improve the training of Deep RL models. - -* These agents are expected to converge to the same policy. - -* During training, the different agents are not required to share the same policy and it is sufficient that the agent's policies remain $\epsilon$-close to each other. This relaxation allows the policies to be trained asynchronously. - -* GALA approach is combined with A2C agents resulting in GALA-A2C agents. They have better computational efficiency and scalability (as compared to A2C) and similar in performance to A3C and Impala. - -* Training alternates between one local policy-gradient (and TD update) and asynchronous gossip between agents. - -* During the gossip step, the agents send their parameters to some of the other agents (referred to as the peers) and update their parameters based on the parameters received from the other agents (for which the given agent is a peer). - -* GALA agents are implemented using non-blocking communication so that they can operate asynchronously. - -* The paper includes the proof that the policies learned by the different agents are within $\epsilon$ distance of each other (ie all the policies lie within an $\epsilon$-distance ball) thus ensuring that the policies do not diverge much from each other. - -* Six games from the Ataru 2600 games suite are used for the experiments. 
- -* Baselines: A2C, A3C, Impala - -* GALA agents are configured in a directed ring graph topology. - -* With A2C, as the number of simulators increases, the number of convergent runs (runs with a threshold reward) decreases. - -* Using gossip algorithms increases or maintains the number of convergent runs. It also improves the performance, sample efficiency and compute efficiency of A2C across all the six games. - -* When compared to Impala and A3C, GALA-A2C generally outperforms (or performs as well as) those baselines. - -* Given that the learned policies remain within an $\epsilon$ ball, the agent's gradients are less correlated as compared to the A2C agents. \ No newline at end of file diff --git a/site/_posts/2019-12-12-Everything Happens for a Reason - Discovering the Purpose of Actions in Procedural Text.md b/site/_posts/2019-12-12-Everything Happens for a Reason - Discovering the Purpose of Actions in Procedural Text.md new file mode 100755 index 00000000..fb0528a1 --- /dev/null +++ b/site/_posts/2019-12-12-Everything Happens for a Reason - Discovering the Purpose of Actions in Procedural Text.md @@ -0,0 +1,116 @@ +--- +layout: post +title: Everything Happens for a Reason - Discovering the Purpose of Actions in Procedural Text +comments: True +excerpt: +tags: ['2019', 'EMNLP 2019', 'Procedural Text', 'Relation Learning', 'Relational Learning', AI, Dataset, EMNLP, Graph, NLP, Reasoning] + +--- + +## Introduction + +* Procedural text comprehension tasks focus on modeling the effect of actions and predicting what happens next. + +* But they do not consider *why* some actions need to happen before other actions. + +* The paper proposes a new model called XPAD (eXPlainable Action Dependency) that considers the *purpose* of actions while predicting their effect. + +* The model favors *effects* that: + + * explain more of the actions in the text. + + * are more plausible given the context. 
+ +* An existing procedural text benchmark dataset (Propara) is expanded by adding the task of explaining actions by predicting their dependencies. + +* [Link to the paper](https://arxiv.org/abs/1909.04745) + +* [Link to the dataset](http://data.allenai.org/propara/) + +## Setup + +* Input + + * Procedural text, ie a chronologically ordered sequence of *T* sentences. + + * List of *N* participant entities, whose state changes at some step. + +* Output + + * State change matrix $\pi(T \times N)$ with four possible state changes - move, create, destroy, none. + + * This matrix tracks how the state of each entity changes after each step. + +* Dependency Explanation Graph + + * Identify what steps are necessary to execute a given step (say *si*) and represent this dependency in the form of a dependency explanation graph *G*. + + * In this graph, each node is a step and the direction of an edge describes the order of dependency. + +## Dependency Graph Dataset + +* [Propara dataset](https://arxiv.org/abs/1805.06975) is expanded to extract the dependency graph using both heuristic and automated methods. + +* The automated method is based on the coherence assumption that if step *sj* changes the state of entity *ek* then *sj* is a precondition for the first subsequent step that changes the state of *ek*. + +## XPAD Model + +* The model is based on the ProStruct system and uses an encoder-decoder based architecture. + +* Encoder + + * Input: Sentence *st* and entity *ej*. + + * Sentence is encoded using the GloVe vectors and a BiLSTM model and the entity is encoded as an indicator variable. + + * The combined representation is denoted as *ctj*. + + * This representation is passed through an MLP to generate *k* logits that encode the probability of each entity *j* undergoing a state change at step *t*. + +* Decoder + + * Beam search is performed to decode the encoder representation into the state change matrix and dependency graph using a score function that ensures global consistency. 
+ + * Score function has two components: + + * State change score - depends on the likelihood that the selected state changes at step *t* given the text and state change history from steps *s1* to *st-1*. + + * Dependency graph score + + * This is based on the connectivity and likelihood of the resulting dependency explanation graph. + + * This score is used to bias the graph search towards: + + * predictions that have an identifiable purpose ie checking if a particular state change prediction leads to a connection in the dependency explanation graph. + + * graphs that are more likely according to the background knowledge to distinguish likely dependency links from the unlikely ones. + +* During training, XPAD has access to the correct path (in the search space) and learns to minimize the joint loss corresponding to predicting the state change and the dependency explanation graph. + +* During testing, XPAD performs beam search to predict the most likely state change and dependency explanation graph. + +## Experiments + +* Tasks: + + * State change prediction + + * Dependency explanation prediction + +* Baselines: + + * [Recurrent Entity Networks](https://arxiv.org/abs/1612.03969) + + * [Query-Reduction Networks](https://arxiv.org/abs/1606.04582) + + * [ProLocal and ProGlobal](https://arxiv.org/abs/1805.06975) + + * [ProStruct](https://arxiv.org/abs/1808.10012) + +* XPAD significantly outperforms all the baseline models on the dependency explanation task. + +* Improvements on the state change prediction task are less significant. + +* Removing dependency graph scores from XPAD leads to a drop in the F1 score. + +* The paper provides an elaborate discussion on the different types of errors that the XPAD system makes. 
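The coherence assumption from the "Dependency Graph Dataset" section can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' extraction code: the function name `coherence_edges` and the input format (a *T* x *N* list of state-change labels, with `"none"` meaning no change) are hypothetical choices made for this example.

```python
from typing import List, Set, Tuple


def coherence_edges(state_changes: List[List[str]]) -> Set[Tuple[int, int]]:
    """Return directed edges (j, i) meaning step j is a precondition of step i.

    state_changes[t][k] is the (hypothetical) label for entity k at step t:
    one of "move", "create", "destroy" or "none".
    """
    edges: Set[Tuple[int, int]] = set()
    if not state_changes:
        return edges
    num_entities = len(state_changes[0])
    for k in range(num_entities):
        last_step = None  # most recent step that changed entity k
        for t, row in enumerate(state_changes):
            if row[k] == "none":
                continue
            if last_step is not None:
                # The earlier change is a precondition for the first
                # subsequent step that changes the same entity.
                edges.add((last_step, t))
            last_step = t
    return edges


# Two entities tracked over four steps.
changes = [
    ["move", "none"],     # step 0 changes entity 0
    ["none", "create"],   # step 1 changes entity 1
    ["destroy", "none"],  # step 2 changes entity 0 again
    ["none", "move"],     # step 3 changes entity 1 again
]
print(sorted(coherence_edges(changes)))
```

Each entity contributes a chain of edges between consecutive steps that change it, so the resulting graph links a step only to the next step that touches the same entity.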
diff --git a/site/_posts/2019-12-19-ALBERT - A Lite BERT for Self-supervised Learning of Language Representations.md b/site/_posts/2019-12-19-ALBERT - A Lite BERT for Self-supervised Learning of Language Representations.md new file mode 100755 index 00000000..9254e584 --- /dev/null +++ b/site/_posts/2019-12-19-ALBERT - A Lite BERT for Self-supervised Learning of Language Representations.md @@ -0,0 +1,77 @@ +--- +layout: post +title: ALBERT - A Lite BERT for Self-supervised Learning of Language Representations +comments: True +excerpt: +tags: ['2019', 'ICLR 2019', 'Natural Language Processing', 'Representation Learning', AI, Attention, ICLR, NLP, Transformer, SOTA] + +--- + + +## Introduction + +* The paper proposes parameter-reduction techniques to lower the memory consumption (and improve training speed) of BERT. + +* It also proposes to use a self-supervised loss (based on inter-sentence coherence) and argues that this loss is better than the NSP loss used by BERT. + +* [Link to the paper](https://arxiv.org/abs/1909.11942) + +## Architecture + +* ALBERT architecture is similar to that of BERT with three major differences. + +* Factorized Embedding Parameterization + + * In BERT and followup works, the embedding size was tied to the size of the context vector. + + * Since the context vector is expected to encode the entire context, it needs to have a large dimensionality. + + * One consequence of this choice is that even the embedding layer (which encodes the representation for each token) has a large size. This increases the overall memory footprint of the model. + + * The paper proposes to factorize the embedding parameters into two smaller matrices. + + * The embedding layer learns a low dimensional representation of the tokens and this representation is projected into a high dimensional space. + +* Cross-layer parameter sharing + + * ALBERT shares all the parameters across the layers. 
+ +* Inter-sentence coherence loss + + * BERT uses two losses - Masked Language Modeling loss (MLM) and Next Sentence Prediction (NSP). + + * In the NSP task, the model is provided a pair of sentences and it has to predict if the two sentences appear consecutively in the same document or not. Negative samples are created by sampling sentences from different documents. + + * The paper argues that NSP is not effective as a loss function as it merges topic prediction and coherence prediction into one task (as the two sentences in a negative sample come from different documents). Topic prediction is an easier task than coherence prediction. + + * Hence the paper proposes to use the Sentence Order Prediction task where the model has to predict which of the two sentences comes first in a document. The negative samples are created by simply swapping the order in the positive samples. Hence both the sentences come from the same document and topic prediction alone cannot be used to solve the task. + +## Setup + +* Different variants (in terms of size) of ALBERT and BERT models are compared (eg ALBERT, ALBERT-x, BERT-x, etc). + +* In general, ALBERT models have many times fewer parameters than the BERT models. + +* Datasets - BookCorpus, English Wikipedia. + +## Observations + +* ALBERT-xxlarge significantly outperforms the BERT-large model even though it has only around 70% of the parameters of the BERT-large model. + +* BERT-xlarge performs worse than BERT-base, hinting that it is difficult to train such large models. + +* ALBERT models also have better data throughput as compared to BERT models. + +* For the ALBERT models, an embedding size of 128 performs the best. + +* As the hidden dimension is increased, the model obtains better performance, but with diminishing returns. + +* Very wide ALBERT models (say with a context size of 1024) do not benefit much from depth. + +* Using additional training data boosts the performance for most of the downstream tasks. 
+ +* The paper empirically shows that using dropout could hurt the performance of the ALBERT models. This observation may not hold for BERT as it does not share parameters across layers and hence may need regularization via dropout. + +* ALBERT also improves the state of the art performance on GLUE, SQuAD and RACE benchmarks, for both single-model and ensemble setup. + + diff --git a/site/_posts/2019-12-26-Towards a Unified Theory of State Abstraction for MDPs.md b/site/_posts/2019-12-26-Towards a Unified Theory of State Abstraction for MDPs.md new file mode 100755 index 00000000..8f66e341 --- /dev/null +++ b/site/_posts/2019-12-26-Towards a Unified Theory of State Abstraction for MDPs.md @@ -0,0 +1,82 @@ +--- +layout: post +title: Towards a Unified Theory of State Abstraction for MDPs +comments: True +excerpt: +tags: ['2006', 'Markov Decision Process', 'Reinforcement Learning', 'State Abstraction', AI, MDP, RL] + +--- + + +## Introduction + +* The paper studies five different techniques for state abstraction in MDPs (Markov Decision Processes) and evaluates their usefulness for planning and learning. + +* The general idea behind abstraction is to map the actual (or observed) state to an abstract state that should be more amenable for learning. + +* It can be thought of as a mapping from one representation to another representation while preserving some useful properties. + +* [Link to the paper](https://pdfs.semanticscholar.org/ca9a/2d326b9de48c095a6cb5912e1990d2c5ab46.pdf) + + +## General Definition + +* Consider an MDP $$M = <S, A, P, R, \gamma>$$ where $$S$$ is the finite set of states, $$A$$ is the finite set of actions, $$P$$ is the transition function, $$R$$ is the bounded reward function and $$\gamma$$ is the discount factor. 
+ +* The abstract version of the MDP is $$\widetilde{M} = <\widetilde{S}, A, \widetilde{P}, \widetilde{R}, \gamma>$$ where $$\widetilde{S}$$ is the finite set of abstract states, $$\widetilde{P}$$ is the transition function in the abstract state space and $$\widetilde{R}$$ is the bounded reward function in the abstract state space. + +* Abstraction function $$\phi$$ is a function that maps a given state $$s$$ to its abstract counterpart $$\widetilde{s}$$. + +* The inverse image $$\phi^{-1}(\widetilde{s})$$ is the set of ground states that map to $$\widetilde{s}$$ under the abstraction function $$\phi$$. + +* A weighting function $$w(s)$$ is used to measure how much a state $$s$$ contributes to the abstract state $$\phi(s)$$. + +## Topology of Abstraction Space + +* Given two abstraction functions $$\phi_{1}$$ and $$\phi_{2}$$, $$\phi_{1}$$ is said to be *finer* than $$\phi_{2}$$ iff for any states $$s_{1}, s_{2}$$ if $$\phi_{1}(s_{1}) = \phi_{1}(s_{2})$$ then $$\phi_{2}(s_{1}) = \phi_{2}(s_{2})$$. + +* This *finer* relation is reflexive, antisymmetric and transitive, ie it defines a partial order over abstractions. + +## Five Types of Abstraction + +* While many abstractions are possible, not all abstractions are equally important. + +* Model-irrelevance abstraction $$\phi_{model}$$: + + * If two states $$s_{1}$$ and $$s_{2}$$ have the same abstracted state, then their one-step model is preserved. + + * Consider any action $$a$$ and any abstract state $$\widetilde{s}$$, if $$\phi_{model}(s_{1}) = \phi_{model}(s_{2})$$ then $$R(s_1, a) = R(s_2, a)$$ and $$\sum_{s' \in \phi_{model}^{-1}(\widetilde{s})}P_{s_1, s'}^{a} = \sum_{s' \in \phi_{model}^{-1}(\widetilde{s})}P_{s_2, s'}^{a}$$. + +* $$Q^{\pi}$$-irrelevance abstraction: + + * It preserves the state-action value function for all policies. + + * For any policy $$\pi$$ and any action $$a$$, $$\phi_{Q^{\pi}}(s_1) = \phi_{Q^{\pi}}(s_2)$$ implies $$Q^{\pi}(s_1, a) = Q^{\pi}(s_2, a)$$. + +* $$Q^{*}$$-irrelevance abstraction: + + * It preserves the optimal state-action value function. 
+ +* $$a^{*}$$-irrelevance abstraction: + + * It preserves the optimal action and its value function. + +* $$\pi^{*}$$-irrelevance abstraction: + + * It preserves the optimal action. + +* In terms of *fineness*, $$\phi_0 \geq \phi_{model} \geq \phi_{Q^{\pi}} \geq \phi_{Q^*} \geq \phi_{a^*} \geq \phi_{\pi^*} $$. Here $$\phi_0$$ is the identity mapping ie $$\phi_0(s) = s$$. + +* If a property applies to any abstraction, it also applies to all the finer abstractions. + +## Key Theorems + +* As we go from finer to coarser abstractions, the information loss increases (ie fewer components can be recovered) while the state-space reduces (ie the efficiency of solving the problem increases). This leads to a tradeoff when selecting abstractions. + +* For example, with abstractions $$\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}, \phi_{a^*}$$, the optimal abstract policy $$\widetilde{\pi}^*$$ is optimal in the ground MDP. + +* Similarly, if each state-action pair is visited infinitely often and the step-size decays properly, Q-learning with $$\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}$$ converges to the optimal state-action value functions in the MDP. More conditions are needed for convergence in the case of the remaining two abstractions. + +* For $$\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}, \phi_{a^*}$$, the model built with the experience converges to the true abstract model with infinite experience if the weighting function $$w(s)$$ is fixed. 
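The *finer* relation from the "Topology of Abstraction Space" section can be checked mechanically on a small finite state space. The sketch below is illustrative only and is not taken from the paper: abstractions are represented as plain dictionaries from ground states to abstract labels, and the names `is_finer`, `phi_0`, `phi_a` and `phi_b` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable


def is_finer(phi_1: Dict[Hashable, Hashable], phi_2: Dict[Hashable, Hashable]) -> bool:
    """True iff phi_1(s1) == phi_1(s2) implies phi_2(s1) == phi_2(s2).

    Both abstractions are assumed to be defined on the same set of ground states.
    """
    return all(
        phi_2[s1] == phi_2[s2]
        for s1, s2 in combinations(list(phi_1), 2)
        if phi_1[s1] == phi_1[s2]
    )


# Four ground states and three abstractions of decreasing fineness.
phi_0 = {0: 0, 1: 1, 2: 2, 3: 3}          # identity mapping
phi_a = {0: "x", 1: "x", 2: "y", 3: "z"}  # merges states 0 and 1
phi_b = {0: "u", 1: "u", 2: "u", 3: "v"}  # merges states 0, 1 and 2

print(is_finer(phi_0, phi_a))  # identity is finer than any abstraction
print(is_finer(phi_a, phi_b))
print(is_finer(phi_b, phi_a))
```

In this toy setup the identity mapping is finer than both other abstractions, and `phi_a` is finer than `phi_b` but not the other way around, which mirrors the chain of abstractions ordered by fineness.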
+ + diff --git a/site/_posts/2020-01-02-Superposition of many models into one.md b/site/_posts/2020-01-02-Superposition of many models into one.md new file mode 100755 index 00000000..d5f49181 --- /dev/null +++ b/site/_posts/2020-01-02-Superposition of many models into one.md @@ -0,0 +1,113 @@ +--- +layout: post +title: Superposition of many models into one +comments: True +excerpt: +tags: ['2019', 'Continual Learning', 'Lifelong Learning', AI, CL, LL] + +--- + + +## Introduction + +* The paper proposes a technique (called Parameter Superposition or PSP) for training and storing multiple models within a single set (or instance) of parameters. + +* The different models exist in "superposition" and can be retrieved dynamically given task-specific context information. + +* [Link to the paper](https://arxiv.org/abs/1902.05522). + +## Parameter Superposition + +* Consider a task with input $$x \in R^N$$ and parameter $$W \in R^{M \times N}$$ where the output (target or features) is given as $$y=Wx$$. + +* Now consider $$K$$ such tasks with parameters $$W_1, W_2, \cdots W_K$$. + +* If each $$W_k$$ requires only a small subspace in $$R^N$$, then a linear transformation $$C_k^{-1}$$ can be used such that each $$W_kC_k^{-1}$$ occupies a mutually orthogonal subspace in $$R^N$$. + +* The set of parameters $$W_1, \cdots W_K$$ can be represented by a single $$W \in R^{M \times N}$$ by summing the transformed parameters, ie $$W = \sum_{k} W_kC_k^{-1}$$. + +* The parameter corresponding to the $$k^{th}$$ task can be retrieved (with some noise) using the context $$C_k$$ as $$\widetilde{W}_k = WC_k$$. + +* Even though the retrieval is noisy, the effect of noise is limited for the context vectors used in the paper. + +* Finally, $$\widetilde{y} = \widetilde{W}_{k}x = (WC_{k})x = W(C_{k}x)$$. + +* Instead of learning $$K$$ separate models, only $$K$$ context vectors (along with 1 superimposed model) need to be learned. 
+ +* The key assumption is that $$N$$ (in $$x \in R^N$$) is large enough such that each $$W_k$$ requires only a small subspace of $$R^N$$. + +* Since images and speech signals tend to occupy a low dimensional manifold, this requirement can be satisfied by over-parameterizing $$x$$. + +## Choice of Context C + +* Rotational Superposition (pspRotation) + + * Sample rotations uniformly from the orthogonal group $$O(M)$$. + + * Downside is that if $$M \sim N$$, it requires storing as many parameters as learning $$K$$ individual models (since $$C$$ is of the size of $$M \times M$$). + +* Complex Superposition (pspComplex) + + * The design of rotational superposition can be improved by choosing $$C_k$$ to be a diagonal matrix ie $$C_k = diag(c_k)$$ where $$c_k$$ is a vector of size $$M$$. + + * Choosing $$c_k$$ to be a vector of complex numbers (of the form $$c_{k}^{j} = e^{i\phi_{j}(k)}$$ where $$\phi_{j}(k)$$ or the phase is sampled uniformly from $$[-\pi, \pi]$$) leads to $$C_k$$ being a diagonal orthogonal matrix. + +* Powers of a single context + + * The memory footprint can be further reduced by choosing the context vectors to be integral powers of the first context vector. + +* Binary Superposition (pspBinary) + + * This is a special case of complex superposition where the context vectors are binary. + +## Neural Network Superposition + +* The parameter superposition principle can be applied to all the linear layers of a network. + +* For the convolutional layers, it makes more sense to apply superposition to the convolutional kernel and not to the input image (as the dimensionality of convolutional parameters is smaller than that of inputs). + +## Experiments + +* For all the experiments, the baseline is a standard supervised learning setup, unless mentioned otherwise. + +* The metric is the performance on the previous tasks when the model has been trained on the newer tasks. + +* Input Interference + + * The input distribution changes over time. 
+ + * Permuted MNIST dataset is used where each permutation of the pixels corresponds to a new task. + + * A new task is sampled every 1000 mini-batches. + + * As the network size increases, Parameter Superposition (psp) outperforms the baseline significantly. + + * pspRotation > pspComplex > pspBinary in terms of both performance and the number of additional parameters required for each new task. + + * Given that pspBinary is the easiest to implement while being comparable to more sophisticated baselines like Elastic Weight Consolidation (EWC) and Synaptic Intelligence, the paper presents most of the results with the pspBinary model. + +* Continuous Domain Shift + + * Rotating-MNIST and Rotating-FashionMNIST tasks are proposed to simulate continuous domain shift. + + * In these tasks, the input images are rotated in-plane by a small angle such that the rotation is complete after 1000 steps. + + * A new context is assigned every 100 steps, since the per-step changes in the angle would be very small. + + * The 10 context vectors used in the first 1000 steps are reused for the subsequent steps. + +* Randomly changing the context vector + + * The paper considers an ablation where the context vector is randomly changed at every step (of the 1000 step cycle). This requires the superposition model to store 1000 models. + + * This approach is better than the supervised learning baseline but not as good as the proposed psp* models. + +* Output Interference + + * This is the setup where the model transitions from one classification task to another. + + * Incremental CIFAR dataset is used with Resnet18 as the base model. + + * Baseline is a standard supervised learning model where a new classification head is used for each task (since the classes have a different meaning in each dataset). The model component before the classification layer is shared across the tasks. 
+ + * Even though the labels are different across the datasets, the pspBinary model, trained with a single output layer, outperforms the multi-headed baseline. \ No newline at end of file diff --git a/site/_site b/site/_site index e504b430..afdf6d73 160000 --- a/site/_site +++ b/site/_site @@ -1 +1 @@ -Subproject commit e504b430d86a958c12e6e2ee4529384e8e410658 +Subproject commit afdf6d73354aed30f37483f70cb681889c101241