From f4730cdd0921f412ab57e8fb69b5b6fb03e07f34 Mon Sep 17 00:00:00 2001
From: Shagun Sodhani
Date: Thu, 13 Jun 2019 23:59:49 -0400
Subject: [PATCH] Added T-REX paper

---
 README.md                                     |  2 +
 ...einforcement Learning from Observations.md | 49 +++++++++++++++++++
 site/_site                                    |  2 +-
 3 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100755 site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md

diff --git a/README.md b/README.md
index 105f9e5f..3f9f0b07 100755
--- a/README.md
+++ b/README.md
@@ -5,6 +5,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 
 ## List of papers
 
+* [Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations](https://shagunsodhani.com/papers-I-read/Extrapolating-Beyond-Suboptimal-Demonstrations-via-Inverse-Reinforcement-Learning-from-Observations)
+* [Meta-Reinforcement Learning of Structured Exploration Strategies](https://shagunsodhani.com/papers-I-read/Meta-Reinforcement-Learning-of-Structured-Exploration-Strategies)
 * [Good-Enough Compositional Data Augmentation](https://shagunsodhani.com/papers-I-read/Good-Enough-Compositional-Data-Augmentation)
 * [Towards a natural benchmark for continual learning](https://shagunsodhani.com/papers-I-read/Towards-a-natural-benchmark-for-continual-learning)
 * [Meta-Learning Update Rules for Unsupervised Representation Learning](https://shagunsodhani.com/papers-I-read/Meta-Learning-Update-Rules-for-Unsupervised-Representation-Learning)
diff --git a/site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md b/site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md
new file mode 100755
index 00000000..3d42d497
--- /dev/null
+++ b/site/_posts/2019-06-13-Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations.md
@@ -0,0 +1,49 @@
+---
+layout: post
+title: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
+comments: True
+excerpt:
+tags: ['2019', 'ICML 2019', 'Inverse Reinforcement Learning', 'Reinforcement Learning', AI, ICML, IRL, RL]
+
+---
+## Introduction
+
+* The paper proposes a new inverse RL (IRL) algorithm, called Trajectory-ranked Reward EXtrapolation (T-REX), that learns a reward function from a collection of ranked trajectories.
+
+* Standard IRL approaches aim to learn a reward function that "justifies" the demonstration policy, and hence they cannot outperform the demonstration policy.
+
+* In contrast, T-REX aims to learn a reward function that "explains" the ranking over demonstrations and can learn a policy that outperforms the demonstration policy.
+
+* [Link to the paper](https://arxiv.org/abs/1904.06387)
+
+## Approach
+
+* The input is a sequence of trajectories *T1, ..., Tm* that are ranked in order of preference. That is, given any pair of trajectories, we know which of the two is better.
+
+* The setup is learning from observations: the learning agent has access neither to the true reward function nor to the actions taken by the demonstration policy.
+
+* Reward Inference
+
+    * A parameterized reward function *rθ* is trained on the ranking information using a binary classification loss that predicts which of two given trajectories would be ranked higher (see the sketch after this list).
+
+    * Given a trajectory, the reward function predicts a reward for each state. The sums of predicted rewards over the two trajectories are used to predict the preferred trajectory.
+
+    * T-REX uses partial trajectories instead of full trajectories as a data augmentation strategy.
+
+* Policy Optimization
+
+    * Once a reward function has been learned, standard RL approaches can be used to train a new policy.
+
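+* Below is a minimal PyTorch-style sketch of the pairwise ranking loss described above. It is only an illustration under assumed names and shapes (`RewardNet`, `trex_loss`, the MLP architecture, the snippet lengths); it is not the authors' implementation.
+
+```python
+# Sketch of the T-REX reward-inference loss (illustrative assumptions throughout).
+import torch
+import torch.nn as nn
+
+class RewardNet(nn.Module):
+    """Maps a single observation to a scalar reward (architecture is a placeholder)."""
+    def __init__(self, obs_dim, hidden_dim=64):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
+            nn.Linear(hidden_dim, 1),
+        )
+
+    def forward(self, obs):           # obs: (T, obs_dim) observations of one trajectory
+        return self.net(obs).sum()    # predicted return = sum of per-state rewards
+
+def trex_loss(reward_net, traj_low, traj_high):
+    # Binary classification over the pair: the higher-ranked trajectory
+    # (index 1) should receive the larger predicted return.
+    returns = torch.stack([reward_net(traj_low), reward_net(traj_high)])
+    label = torch.tensor([1])         # index of the preferred trajectory
+    return nn.functional.cross_entropy(returns.unsqueeze(0), label)
+
+# One training step on a single ranked pair of partial trajectories (random stand-in data).
+obs_dim = 8
+reward_net = RewardNet(obs_dim)
+optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
+traj_low = torch.randn(50, obs_dim)   # snippet from the lower-ranked trajectory
+traj_high = torch.randn(60, obs_dim)  # snippet from the higher-ranked trajectory
+
+optimizer.zero_grad()
+loss = trex_loss(reward_net, traj_low, traj_high)
+loss.backward()
+optimizer.step()
+```
+
+* In practice, such a loss would be computed over many sampled pairs of partial trajectory snippets, and (as noted in the results) an ensemble of reward networks is trained.
+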
+## Results
+
+* Environments: MuJoCo (Half Cheetah, Ant, Hopper) and Atari.
+
+* Demonstrations are generated using PPO (checkpointed at different stages of training).
+
+* An ensemble of networks is used to learn the reward function.
+
+* The proposed approach outperforms the baselines [Behaviour Cloning from Observations](https://arxiv.org/abs/1805.01954) and [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476).
+
+* In terms of reward extrapolation, T-REX can predict rewards for trajectories that are better than the demonstration trajectories.
+
+* Ablation studies considered the effect of adding noise (randomly swapping the preference between pairs of trajectories) and found that the model is somewhat robust to moderate amounts of noise.
\ No newline at end of file
diff --git a/site/_site b/site/_site
index f926066a..826701a0 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit f926066ac659bf08fb252325869e5d4c9b486240
+Subproject commit 826701a05e5d3b066ee509705b9033ef5b4914e7