---
layout: post
title: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
comments: True
excerpt:
tags: ['2019', 'ICML 2019', 'Inverse Reinforcement Learning', 'Reinforcement Learning', AI, ICML, IRL, RL]

---

## Introduction

* The paper proposes a new inverse RL (IRL) algorithm, called Trajectory-ranked Reward EXtrapolation (T-REX), that learns a reward function from a collection of ranked trajectories.

* Standard IRL approaches aim to learn a reward function that "justifies" the demonstration policy and hence those approaches cannot outperform the demonstration policy.

* In contrast, T-REX aims to learn a reward function that "explains" the ranking over demonstrations and can learn a policy that outperforms the demonstration policy.

* [Link to the paper](https://arxiv.org/abs/1904.06387)

## Approach

* The input is a sequence of trajectories *T<sub>1</sub>, ... T<sub>m</sub>* which are ranked in the order of preference. That is, given any pair of trajectories, we know which of the two trajectories is better.

* The setup is learning from observations: the learning agent does not have access to the true reward function or to the actions taken by the demonstration policy.

* Reward Inference

    * A parameterized reward function *r<sub>θ</sub>* is trained with the ranking information, using a binary classification loss that predicts which of two given trajectories is ranked higher (a minimal sketch of this loss appears after this list).

    * Given a trajectory, the reward function predicts a reward for each state. The sum of predicted rewards along each of the two trajectories is used to decide which one is preferred.

    * T-REX uses partial trajectories instead of full trajectories as a data augmentation strategy.

* Policy Optimization

    * Once a reward function has been learned, standard RL approaches can be used to train a new policy.

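The following is a minimal sketch of the reward-inference step described above, assuming a PyTorch setup. The network architecture, snippet length, and the `sample_partial_trajectory` helper are illustrative placeholders, not the paper's exact configuration.

```python
import random
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Maps a single observation to a scalar reward (placeholder architecture)."""

    def __init__(self, obs_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs):                  # obs: (T, obs_dim)
        return self.net(obs).squeeze(-1)     # per-state rewards: (T,)


def sample_partial_trajectory(traj, length):
    """Data augmentation: take a random contiguous sub-trajectory."""
    start = random.randint(0, max(0, traj.shape[0] - length))
    return traj[start:start + length]


def trex_update(reward_net, optimizer, traj_i, traj_j, snippet_len=50):
    """One T-REX update on a pair of trajectories where traj_j is ranked higher.

    The predicted return of each (partial) trajectory is the sum of its
    per-state rewards, and the pairwise ranking is treated as a binary
    classification problem over the two predicted returns.
    """
    snip_i = sample_partial_trajectory(traj_i, snippet_len)
    snip_j = sample_partial_trajectory(traj_j, snippet_len)

    return_i = reward_net(snip_i).sum()
    return_j = reward_net(snip_j).sum()

    # Cross-entropy loss with the label "trajectory j is preferred".
    logits = torch.stack([return_i, return_j]).unsqueeze(0)   # shape (1, 2)
    loss = nn.functional.cross_entropy(logits, torch.tensor([1]))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example usage (shapes only; worse_traj / better_traj are (T, obs_dim) tensors):
# reward_net = RewardNet(obs_dim=11)
# optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
# loss = trex_update(reward_net, optimizer, worse_traj, better_traj)
```

In this framing, the ranking label acts as a classification target, so no ground-truth reward is ever needed; once *r<sub>θ</sub>* is trained, it replaces the environment reward for the policy-optimization step.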

## Results

* Environments: MuJoCo (Half Cheetah, Ant, Hopper) and Atari.

* Demonstrations are generated using PPO, checkpointed at different stages of training (see the sketch at the end of this section).

* An ensemble of networks is used to learn the reward function.

* The proposed approach outperforms the baselines [Behavioral Cloning from Observation](https://arxiv.org/abs/1805.01954) and [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476).

* In terms of reward extrapolation, T-REX can predict rewards for trajectories that are better than the demonstration trajectories.

* Ablation studies considered the effect of adding noise (randomly swapping the preferences between pairs of trajectories) and found that the model is robust to noise up to an extent.

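A rough sketch of how ranked demonstrations could be assembled from checkpointed PPO policies, and how an ensemble of reward networks could be combined, is shown below. The environment API, the `policy` callable, and the ensemble-averaging choice are assumptions made for illustration, not the paper's exact pipeline.

```python
import numpy as np
import torch


def collect_ranked_demos(env, checkpointed_policies, episodes_per_policy=1):
    """Roll out each checkpointed policy (ordered from early to late training).

    Later checkpoints are assumed to behave better, so the position of a
    trajectory in the returned list serves as its ranking.
    """
    ranked_trajectories = []
    for policy in checkpointed_policies:
        for _ in range(episodes_per_policy):
            obs = env.reset()                      # classic Gym API assumed
            observations, done = [obs], False
            while not done:
                action = policy(torch.as_tensor(obs, dtype=torch.float32))
                obs, _, done, _ = env.step(action.detach().numpy())
                observations.append(obs)
            ranked_trajectories.append(
                torch.as_tensor(np.stack(observations), dtype=torch.float32))
    return ranked_trajectories


def ensemble_reward(reward_nets, obs):
    """Average per-state reward predictions over an ensemble of reward networks
    (e.g. several independently trained copies of the RewardNet sketched above)."""
    with torch.no_grad():
        return torch.stack([net(obs) for net in reward_nets]).mean(dim=0)
```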