---
layout: post
title: TuckER - Tensor Factorization for Knowledge Graph Completion
comments: True
excerpt:
tags: ['2019', 'Graph Representation', 'Linear Algebra', 'Linear Model', 'Matrix Factorization', 'Tucker Decomposition', AI, Embedding, Factorization, Graph, Matrix]
---

## Introduction

* TuckER is a simple, yet powerful linear model that uses Tucker decomposition for the task of link prediction in knowledge graphs.

* [Paper](https://arxiv.org/abs/1901.09590)

* [Implementation](https://github.com/ibalazevic/TuckER)

## Knowledge Graph as a Tensor

* Let E be the set of all the entities and R be the set of all the relations in a given knowledge graph (KG).

* The KG can be represented as a list of triples of the form (subject entity, relation, object entity) or (e<sub>s</sub>, r, e<sub>o</sub>).

* The list of triples can be represented as a third-order binary tensor in which each element corresponds to a possible triple and its value indicates whether that triple is present in the KG or not (a small sketch of this tensor view follows this list).

* The link prediction task can then be formulated as: given the set of all possible triples, learn a scoring function that assigns a score to each triple, indicating whether that triple is actually present in the KG.
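
A minimal sketch of this tensor view (the toy entities, relations and triples below are made up purely for illustration):

```python
import numpy as np

# Toy KG: 3 entities, 2 relations (made-up example, not from the paper).
entities = {"paris": 0, "france": 1, "europe": 2}
relations = {"capital_of": 0, "located_in": 1}
triples = [("paris", "capital_of", "france"),
           ("france", "located_in", "europe")]

# Third-order binary tensor X of shape |E| x |R| x |E|;
# X[s, r, o] = 1 iff the triple (e_s, r, e_o) is present in the KG.
X = np.zeros((len(entities), len(relations), len(entities)))
for s, r, o in triples:
    X[entities[s], relations[r], entities[o]] = 1.0
```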

## Tucker Decomposition

* Tucker decomposition factorizes a tensor into a set of factor matrices and a smaller core tensor.

* In the specific case of three-mode tensors (the alternate representation of a KG), the original tensor **X** (of shape *IxJxK*) can be factorized into a core tensor **W** (of shape *PxQxR*) and 3 factor matrices - **A** (of shape *IxP*), **B** (of shape *JxQ*) and **C** (of shape *KxR*) such that **X** is approximately **W** x<sub>1</sub> **A** x<sub>2</sub> **B** x<sub>3</sub> **C**, where x<sub>n</sub> denotes the tensor product along the n-th mode (see the numpy sketch after this list).

* Generally, *P, Q, R* are smaller than *I, J, K* (respectively), so **W** can be seen as a compressed version of **X**.
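
A small numpy sketch of the three-mode Tucker reconstruction (the shapes below are arbitrary illustrative choices):

```python
import numpy as np

# Arbitrary illustrative shapes: I, J, K for the data tensor, P, Q, R for the core.
I, J, K = 5, 4, 6
P, Q, R = 2, 3, 2

W = np.random.randn(P, Q, R)   # core tensor
A = np.random.randn(I, P)      # factor matrix along mode 1
B = np.random.randn(J, Q)      # factor matrix along mode 2
C = np.random.randn(K, R)      # factor matrix along mode 3

# X ~ W x_1 A x_2 B x_3 C: contract each mode of the core with its factor matrix.
X_approx = np.einsum('pqr,ip,jq,kr->ijk', W, A, B, C)
print(X_approx.shape)  # (5, 4, 6)
```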

## TuckER Decomposition for Link Prediction

* Two embedding matrices are used for embedding the entities and the relations respectively.

* The entity embedding matrix **E** is shared between the subject and the object positions, i.e., **E** = **A** = **C** (the factor matrices along the two entity modes).

* The scoring function is given as **W** x<sub>1</sub> **e<sub>s</sub>** x<sub>2</sub> **w<sub>r</sub>** x<sub>3</sub> **e<sub>o</sub>**, where **e<sub>s</sub>**, **w<sub>r</sub>** and **e<sub>o</sub>** are the embedding vectors corresponding to e<sub>s</sub>, r and e<sub>o</sub> respectively. Note that both the core tensor and the factor matrices are learnt.

* The model is trained with the standard binary cross-entropy (negative log-likelihood) loss, given for one triple as -(y * log(p) + (1-y) * log(1-p)), where p is obtained by applying the logistic sigmoid to the score and y is the binary label.

* To speed up training and increase accuracy, 1-N scoring is used: a given (e<sub>s</sub>, r) pair is simultaneously scored against all the entities, using the local closed-world assumption (the knowledge graph is assumed to be only locally complete). A rough sketch of the scoring and loss follows this list.

* Asymmetric relations are handled naturally: each relation has its own embedding, while the relation-agnostic core tensor enables knowledge sharing across relations.
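
A rough PyTorch-style sketch of the scoring function with 1-N scoring and the binary cross-entropy loss (the dimensions, variable names and toy label vector are my own illustrative assumptions, not the paper's hyperparameters; in practice the embeddings and core tensor would live inside a model class):

```python
import torch

# Illustrative sizes only.
n_entities, n_relations = 1000, 50
d_e, d_r = 200, 30                                      # entity / relation embedding dims

E = torch.randn(n_entities, d_e, requires_grad=True)    # shared entity embeddings (subjects and objects)
R = torch.randn(n_relations, d_r, requires_grad=True)   # relation embeddings
W = torch.randn(d_e, d_r, d_e, requires_grad=True)      # learned core tensor

def score_1_to_N(subject_idx, relation_idx):
    """Score a (e_s, r) pair against every entity as the object (1-N scoring)."""
    e_s, w_r = E[subject_idx], R[relation_idx]           # (d_e,), (d_r,)
    v = torch.einsum('pqr,p,q->r', W, e_s, w_r)          # W x_1 e_s x_2 w_r -> (d_e,)
    return torch.sigmoid(E @ v)                          # x_3 over all candidate objects -> (n_entities,)

# Binary cross-entropy against a label vector y marking which objects actually
# occur with (e_s, r) in the KG (made-up ground truth for illustration).
p = score_1_to_N(0, 0)
y = torch.zeros(n_entities)
y[42] = 1.0
loss = torch.nn.functional.binary_cross_entropy(p, y)
loss.backward()
```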

## Theoretical Analysis

* One important consideration would be the expressive power of TuckER models, especially in relation to other models like ComplEx and SimplE.

* It can be shown that TuckER is fully expressive, i.e., given any ground truth over E and R, there exists a TuckER model that can perfectly represent the data - using one-hot entity and relation embeddings (intuitively, with one-hot embeddings the core tensor can itself store the full binary ground-truth tensor, and the mode products simply index into it).

* For full expressiveness, the required dimensionality of the entity (relation) embeddings is n<sub>E</sub> (n<sub>R</sub>), where n<sub>E</sub> (n<sub>R</sub>) is the number of entities (relations). In comparison, the required dimensionality for ComplEx is n<sub>E</sub> * n<sub>R</sub> (for both entities and relations) and for SimplE, it is min(n<sub>E</sub> * n<sub>R</sub>, number of facts + 1) (for both entities and relations).

* Many existing models - RESCAL, DistMult, ComplEx, SimplE, etc. - can be seen as special cases of TuckER.

## Experiments

### Datasets

* FB15k, FB15k-237, WN18, WN18RR

* Across these datasets, the maximum number of entities is around 41K (WN18/WN18RR) and the maximum number of relations is around 1.3K (FB15k).

### Implementation

* BatchNorm, dropout and learning rate decay are used.

### Metrics

* Mean Reciprocal Rank (MRR) - the average of the inverse of the rank assigned to the true triple, over all n<sub>e</sub> generated candidate triples.

* hits@k (k = 1, 3, 10) - percentage of times the true triple is ranked in the top k of the n<sub>e</sub> generated triples.

* Higher is better for both metrics (a small sketch of both computations follows this list).
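
A small sketch of how both metrics can be computed from the ranks assigned to the true triples (the ranks below are made up for illustration):

```python
import numpy as np

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """ranks[i] is the rank of the true triple among the candidate triples
    generated for test example i (1 = best)."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))
    hits = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits

print(mrr_and_hits([1, 2, 5, 14]))   # illustrative ranks, not real results
```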

### Results

* TuckER outperforms all the baseline models on all but one task.

* Dropout is an important factor, with higher dropout rates (0.3, 0.4, 0.5) needed for datasets with fewer training examples per relation (hence more prone to overfitting).

* TuckER improves performance more significantly when the number of relations is large.

* Even with lower embedding dimensions, TuckER's performance does not deteriorate as much as other models.
---
layout: post
title: Model Primitive Hierarchical Lifelong Reinforcement Learning
comments: True
excerpt:
tags: ['2019', 'AAMAS 2019', 'Catastrophic Forgetting', 'Continual Learning', 'Hierarchical Reinforcement Learning', 'Lifelong Learning', 'Reinforcement Learning', AAMAS, AI, CL, HRL, RL]
---

## Introduction

* The paper presents a framework that uses a set of diverse, suboptimal world models to break complex policies into simpler, modular sub-policies.

* Given a task, both the sub-policies and the controller are simultaneously learned in a bottom-up manner.

* The framework is called Model Primitive Hierarchical Reinforcement Learning (MPHRL).

* [Link to the paper](https://arxiv.org/abs/1903.01567)

## Idea

* Instead of learning a single transition model of the environment (aka *world model*) that can model the transitions very well, it is sufficient to learn several (say *k*) suboptimal models (aka *model primitives*).

* Each *model primitive* will be good in only a small part of the state space (aka *region of specialization*).

* These *model primitives* can then be used to train a gating mechanism for selecting sub-policies to solve a given task.

* Since these *model primitives* are sub-optimal, they are not directly used with model-based RL but are used to obtain useful functional decompositions and sub-policies are trained with model-free approaches.

## Single Task Learning

* A gating controller is trained to choose the sub-policy whose *model primitive* makes the best prediction.

* This requires modeling *p(M<sub>k</sub> \| s<sub>t</sub>, a<sub>t</sub>, s<sub>t+1</sub>)* where *p(M<sub>k</sub>)* denotes the probability of selecting the *k<sup>th</sup> model primitive*. This is hard to compute as the system does not have access to *s<sub>t+1</sub>* and *a<sub>t</sub>* at time *t*, before it has chosen the sub-policy.

* Properly marginalizing *s<sub>t+1</sub>* and *a<sub>t</sub>* would require expensive MC sampling. Hence, an approximation is used and the gating controller is modeled as a categorical distribution that produces *p(M<sub>k</sub> \| s<sub>t</sub>)*. It is trained via a conditional cross-entropy loss where the ground-truth distribution is obtained from transitions sampled in a rollout (see the sketch at the end of this section).

* The paper notes that this technique is biased but reports that it still works for the downstream tasks.

* The gating controller composes the sub-policies as a mixture of Gaussians.

* For learning, the PPO algorithm is used, with each sub-policy's gradient weighted by the probability the gating controller assigns to its *model primitive*.
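
A rough PyTorch-style sketch of how such a gating controller could be trained (the Gaussian dynamics primitives, network sizes and all names below are illustrative assumptions, not the paper's actual architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, K = 8, 2, 3   # hypothetical sizes; K model primitives / sub-policies

class GaussianDynamicsPrimitive(nn.Module):
    """A (suboptimal) world model: predicts s_{t+1} from (s_t, a_t) with unit variance."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(state_dim + action_dim, state_dim)

    def log_prob(self, s, a, s_next):
        mean = self.net(torch.cat([s, a], dim=-1))
        return torch.distributions.Normal(mean, 1.0).log_prob(s_next).sum(dim=-1)

model_primitives = [GaussianDynamicsPrimitive() for _ in range(K)]
gate = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, K))

def gating_loss(s, a, s_next):
    """Cross-entropy between the gate's p(M_k | s_t) and targets derived from a rollout:
    each primitive is weighted by how well it explains the observed transition."""
    log_likes = torch.stack([m.log_prob(s, a, s_next) for m in model_primitives], dim=-1)
    target = F.softmax(log_likes, dim=-1).detach()   # approximates p(M_k | s_t, a_t, s_{t+1})
    log_gate = F.log_softmax(gate(s), dim=-1)        # the gate only sees s_t
    return -(target * log_gate).sum(dim=-1).mean()

# Toy batch of transitions for illustration.
s, a, s_next = torch.randn(32, state_dim), torch.randn(32, action_dim), torch.randn(32, state_dim)
gating_loss(s, a, s_next).backward()
```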

## Lifelong Learning

* Different tasks could share common subtasks but may require a different composition of subtasks. Hence, the learned sub-policies are transferred across tasks but not the gating controller or the baseline estimator (from PPO).

## Experiments

* Domains:

* Mujoco ant navigating different mazes.

* Stacker arm picking up and placing different boxes.

* Implementation Details:

* Gaussian subpolicies

* PPO as the baseline

* Model primitives are hand-crafted using the true next state provided by the environment simulator.

* Single Task

* Only the maze task is considered, with both the start position (of the ant) and the goal position fixed.

* Observation includes distance from the goal.

* Forcing the agent to decompose the problem, when a more direct solution may be available, increases the sample complexity on a single task.

* Lifelong Learning

* Maze

* 10 random Mujoco ant mazes used as the task distribution.

* MPHRL takes almost twice as many steps as the PPO baseline to solve the first task, but this cost is amortized over the task distribution: summed over the 10 tasks, MPHRL needs about half as many steps as the baseline.

* Pick and Place

* 8 pick-and-place tasks are created, with at most 3 goal locations.

* Observation includes the position of the goal.

* Ablations

* Overlapping *model primitives* can degrade the performance (to some extent). Similarly, the performance suffers when redundant primitives are introduced, indicating that the gating mechanism is not very robust.

* Sub-policies could quickly adapt to the previous tasks (on which they were trained initially) despite being finetuned on subsequent tasks.

* The order of tasks (in the 10-maze setting) does not degrade the performance.

* Transferring the gating controller leads to negative transfer.

* Notes

* I think the biggest strength of the work is that accurate dynamics models are not needed (they are hard to train anyway!), though the experimental results are not conclusive given the limited number of domains on which the approach is tested.
---
layout: post
title: To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
comments: True
excerpt:
tags: ['2019', 'Empirical Advice', 'Multi Task', 'Natural Language Processing', 'Transfer Learning', AI, NLP, QA]
---

* [Link to the paper](https://arxiv.org/abs/1903.05987)

* The paper provides useful empirical advice for adapting pretrained language models for a given target task.

* Pre-trained models considered

* ELMo

* BERT

* Tasks considered

* Named Entity Recognition (NER) - CoNLL 2003 dataset

* Sentiment Analysis (SA) - Stanford Sentiment Treebank (SST-2) dataset

* Natural Language Inference (NLI) - MultiNLI and Sentences Involving Compositional Knowledge (SICK-E) dataset

* Paraphrase Detection (PD) - Microsoft Research Paraphrase Corpus (MRPC)

* Semantic Textual Similarity (STS) - Semantic Textual Similarity Benchmark (STS-B) and SICK-R

* The last 3 tasks (NLI, PD, STS) are defined for sentence pairs.

* Adaptation Strategies

* Feature Extraction

* The pretrained model is only used for extracting features and its weights are kept fixed.

* For both ELMo and BERT, the contextual representations of the words from all the layers are extracted.

* A weighted combination of these layers is used as the input to the task-specific model (a minimal sketch of this appears after the adaptation-strategies list).

* Task-specific models

* NER - BiLSTM with CRF layer

* SA - bi-attentive classification network

* NLI, PD, STS - [Enhanced Sequential Inference Model (ESIM)](https://arxiv.org/abs/1609.06038)

* Fine-tuning

* The pretrained model is finetuned on the target task.

* Task-specific models for ELMo

* NER - CRF on top of LSTM states

* SA - Max-pool over the language model states followed by a softmax layer

* NLI, PD, STS - cross-sentence bi-attention between the language model states followed by pooling and a softmax layer.

* Task-specific models for BERT

* NER - Extract the representation of the first word-piece of each token, followed by a softmax layer

* SA, NLI, PD, STS - standard BERT training
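
A minimal sketch of the weighted layer combination used in the feature-extraction setting (an ELMo-style scalar mix; the layer count, dimensions and names below are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned weighted combination of the layers of a frozen pretrained encoder."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, seq_len, hidden)
        alphas = torch.softmax(self.weights, dim=0)
        return self.gamma * torch.einsum('l,lbsh->bsh', alphas, layer_states)

# Toy usage: e.g. 3 ELMo-like layers of 1024-dim states, extracted with frozen weights.
states = torch.randn(3, 2, 10, 1024)
features = ScalarMix(num_layers=3)(states)   # input to the task-specific model
```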

* Main observations

* Feature extraction and fine-tuning have comparable performance in most cases, unless the two tasks are highly similar (fine-tuning is better) or highly dissimilar (feature extraction is better).

* For ELMo, feature extraction consistently outperforms fine-tuning for the sentence pair tasks (NLI, PD, STS). The reverse trend is observed for BERT with fine-tuning being better on sentence pair tasks.

* Adding extra parameters is helpful for feature extraction but not fine-tuning.

* ELMo fine-tuning requires careful tuning and other tricks like triangular learning rates, gradual unfreezing and discriminative fine-tuning.

* For the tasks considered, there is no correlation observed between the distance of the source and target domains and adaptation performance.

* Training a diagnostic classifier (on the intermediate representations) suggests that fine-tuning improves the performance of the classifier at all the intermediate layers (which is sort of expected). A toy probing sketch follows this list.

* In terms of mutual information estimates, fine-tuned representations have a much higher mutual information as compared to the feature extraction based representations.

* Knowledge for single-sentence tasks seems to be mostly concentrated in the last layers, while for pair classification tasks, the knowledge seems to be gradually built up in the intermediate layers, all the way up to the last layer.
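
A toy sketch of a diagnostic (probing) classifier trained on frozen intermediate representations (the features and labels below are random placeholders, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
layer_reprs = rng.normal(size=(500, 768))   # e.g. 500 examples of 768-dim frozen layer activations
labels = rng.integers(0, 2, size=500)       # e.g. binary sentiment labels

# The diagnostic classifier is a simple linear probe on the frozen representations;
# its accuracy indicates how much task-relevant information the layer encodes.
probe = LogisticRegression(max_iter=1000).fit(layer_reprs, labels)
print("probe accuracy:", probe.score(layer_reprs, labels))
```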
