From f4730cdd0921f412ab57e8fb69b5b6fb03e07f34 Mon Sep 17 00:00:00 2001
From: Shagun Sodhani
Date: Thu, 13 Jun 2019 23:59:49 -0400
Subject: [PATCH] Added T-REX paper

---
 README.md                                     |  2 +
 ...einforcement Learning from Observations.md | 49 +++++++++++++++++++
 site/_site                                    |  2 +-
 3 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100755 site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md

diff --git a/README.md b/README.md
index 105f9e5f..3f9f0b07 100755
--- a/README.md
+++ b/README.md
@@ -5,6 +5,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 
 ## List of papers
 
+* [Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations](https://shagunsodhani.com/papers-I-read/Extrapolating-Beyond-Suboptimal-Demonstrations-via-Inverse-Reinforcement-Learning-from-Observations)
+* [Meta-Reinforcement Learning of Structured Exploration Strategies](https://shagunsodhani.com/papers-I-read/Meta-Reinforcement-Learning-of-Structured-Exploration-Strategies)
 * [Good-Enough Compositional Data Augmentation](https://shagunsodhani.com/papers-I-read/Good-Enough-Compositional-Data-Augmentation)
 * [Towards a natural benchmark for continual learning](https://shagunsodhani.com/papers-I-read/Towards-a-natural-benchmark-for-continual-learning)
 * [Meta-Learning Update Rules for Unsupervised Representation Learning](https://shagunsodhani.com/papers-I-read/Meta-Learning-Update-Rules-for-Unsupervised-Representation-Learning)
diff --git a/site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md b/site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md
new file mode 100755
index 00000000..3d42d497
--- /dev/null
+++ b/site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md
@@ -0,0 +1,49 @@
+---
+layout: post
+title: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
+comments: True
+excerpt:
+tags: ['2019', 'ICML 2019', 'Inverse Reinforcement Learning', 'Reinforcement Learning', AI, ICML, IRL, RL]
+
+---
+## Introduction
+
+* The paper proposes a new inverse RL (IRL) algorithm, called Trajectory-ranked Reward EXtrapolation (T-REX), that learns a reward function from a collection of ranked trajectories.
+
+* Standard IRL approaches aim to learn a reward function that "justifies" the demonstration policy, and hence they cannot outperform the demonstration policy.
+
+* In contrast, T-REX aims to learn a reward function that "explains" the ranking over demonstrations and can learn a policy that outperforms the demonstration policy.
+
+* [Link to the paper](https://arxiv.org/abs/1904.06387)
+
+## Approach
+
+* The input is a sequence of trajectories *T1, ..., Tm* that are ranked in order of preference. That is, given any pair of trajectories, we know which of the two is better.
+
+* The setup is learning from observations: the learning agent has access neither to the true reward function nor to the actions taken by the demonstration policy.
+
+* Reward Inference
+
+    * A parameterized reward function *rθ* is trained on the ranking information using a binary classification loss that predicts which of two given trajectories would be ranked higher (see the sketch after this list).
+
+    * Given a trajectory, the reward function predicts a reward for each state. The sums of predicted rewards over the two trajectories are used to predict the preferred trajectory.
+
+    * T-REX uses partial trajectories instead of full trajectories as a data augmentation strategy.
+
+* Policy Optimization
+
+    * Once a reward function has been learned, standard RL approaches can be used to train a new policy.
+
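+* Below is a minimal PyTorch-style sketch of the pairwise ranking loss described above. It is only an illustration under assumed names and shapes (`RewardNet`, `trex_loss`, the MLP architecture, the snippet lengths); it is not the authors' implementation.
+
+```python
+# Sketch of the T-REX reward-inference loss (illustrative assumptions throughout).
+import torch
+import torch.nn as nn
+
+class RewardNet(nn.Module):
+    """Maps a single observation to a scalar reward (architecture is a placeholder)."""
+    def __init__(self, obs_dim, hidden_dim=64):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
+            nn.Linear(hidden_dim, 1),
+        )
+
+    def forward(self, obs):           # obs: (T, obs_dim) observations of one trajectory
+        return self.net(obs).sum()    # predicted return = sum of per-state rewards
+
+def trex_loss(reward_net, traj_low, traj_high):
+    # Binary classification over the pair: the higher-ranked trajectory
+    # (index 1) should receive the larger predicted return.
+    returns = torch.stack([reward_net(traj_low), reward_net(traj_high)])
+    label = torch.tensor([1])         # index of the preferred trajectory
+    return nn.functional.cross_entropy(returns.unsqueeze(0), label)
+
+# One training step on a single ranked pair of partial trajectories (random stand-in data).
+obs_dim = 8
+reward_net = RewardNet(obs_dim)
+optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
+traj_low = torch.randn(50, obs_dim)   # snippet from the lower-ranked trajectory
+traj_high = torch.randn(60, obs_dim)  # snippet from the higher-ranked trajectory
+
+optimizer.zero_grad()
+loss = trex_loss(reward_net, traj_low, traj_high)
+loss.backward()
+optimizer.step()
+```
+
+* In practice, such a loss would be computed over many sampled pairs of partial trajectory snippets, and (as noted in the results) an ensemble of reward networks is trained.
+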
+## Results
+
+* Environments: MuJoCo (Half Cheetah, Ant, Hopper) and Atari.
+
+* Demonstrations are generated using PPO (checkpointed at different stages of training).
+
+* An ensemble of networks is used to learn the reward function.
+
+* The proposed approach outperforms the baselines [Behaviour Cloning from Observations](https://arxiv.org/abs/1805.01954) and [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476).
+
+* In terms of reward extrapolation, T-REX can predict rewards for trajectories that are better than the demonstration trajectories.
+
+* Ablation studies considered the effect of adding noise (randomly swapping the preference between pairs of trajectories) and found that the model is somewhat robust to moderate amounts of noise.
\ No newline at end of file
diff --git a/site/_site b/site/_site
index f926066a..826701a0 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit f926066ac659bf08fb252325869e5d4c9b486240
+Subproject commit 826701a05e5d3b066ee509705b9033ef5b4914e7