Added T-REX paper
shagunsodhani committed Jun 14, 2019
1 parent 05e9d77 commit f4730cd
Showing 3 changed files with 52 additions and 1 deletion.
2 changes: 2 additions & 0 deletions README.md
@@ -5,6 +5,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations](https://shagunsodhani.com/papers-I-read/Extrapolating-Beyond-Suboptimal-Demonstrations-via-Inverse-Reinforcement-Learning-from-Observations)
* [Meta-Reinforcement Learning of Structured Exploration Strategies](https://shagunsodhani.com/papers-I-read/Meta-Reinforcement-Learning-of-Structured-Exploration-Strategies)
* [Good-Enough Compositional Data Augmentation](https://shagunsodhani.com/papers-I-read/Good-Enough-Compositional-Data-Augmentation)
* [Towards a natural benchmark for continual learning](https://shagunsodhani.com/papers-I-read/Towards-a-natural-benchmark-for-continual-learning)
* [Meta-Learning Update Rules for Unsupervised Representation Learning](https://shagunsodhani.com/papers-I-read/Meta-Learning-Update-Rules-for-Unsupervised-Representation-Learning)
@@ -0,0 +1,49 @@
---
layout: post
title: Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
comments: True
excerpt:
tags: ['2019', 'ICML 2019', 'Inverse Reinforcement Learning', 'Reinforcement Learning', AI, ICML, IRL, RL]

---
## Introduction

* The paper proposes a new inverse RL (IRL) algorithm, called Trajectory-ranked Reward EXtrapolation (T-REX), which learns a reward function from a collection of ranked trajectories.

* Standard IRL approaches aim to learn a reward function that "justifies" the demonstrations, and hence the resulting policy cannot outperform the demonstrator.

* In contrast, T-REX aims to learn a reward function that "explains" the ranking over demonstrations and can therefore learn a policy that outperforms the demonstrator.

* [Link to the paper](https://arxiv.org/abs/1904.06387)

## Approach

* The input is a sequence of trajectories *T<sub>1</sub>, ..., T<sub>m</sub>*, ranked in order of preference. That is, given any pair of trajectories, we know which of the two is better.

* The setup is learning from observations: the agent has access neither to the true reward function nor to the actions taken by the demonstrator.

* Reward Inference

    * A parameterized reward function *r<sub>&theta;</sub>* is trained on the ranking information with a binary classification loss that predicts which of two given trajectories is ranked higher.

    * Given a trajectory, the reward function predicts a reward for each state; the summed rewards of the two trajectories are compared to predict the preferred one (see the loss sketch after this list).

* T-REX uses partial trajectories instead of full trajectories as a data augmentation strategy.

* Policy Optimization

    * Once a reward function has been learned, standard RL algorithms can be used to train a new policy against it (a minimal environment-wrapper sketch follows this list).
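
Below is a minimal sketch (PyTorch; not the authors' implementation) of the reward-inference step. `RewardNet`, `sample_snippet`, the network sizes, and the snippet lengths are illustrative assumptions:

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardNet(nn.Module):
    """Per-state reward r_theta(s); the architecture here is an illustrative assumption."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, states):
        # states: (T, state_dim) -> scalar predicted return of the (partial) trajectory
        return self.net(states).sum()


def sample_snippet(traj, min_len=20, max_len=50):
    """Data augmentation: sample a random partial trajectory (assumes len(traj) >= min_len)."""
    length = random.randint(min_len, min(max_len, len(traj)))
    start = random.randint(0, len(traj) - length)
    return traj[start:start + length]


def trex_loss(reward_net, snippet_better, snippet_worse):
    """Binary classification loss over a pair of snippets: the higher-ranked
    snippet should receive the larger predicted return."""
    returns = torch.stack([reward_net(snippet_better), reward_net(snippet_worse)])
    label = torch.tensor([0])  # index 0 is the preferred snippet
    return F.cross_entropy(returns.unsqueeze(0), label)
```

For the policy-optimization step, one simple option (again an assumption, not necessarily what the paper does) is to wrap a Gym-style environment so that any standard RL algorithm sees the learned reward instead of the true one:

```python
class LearnedRewardEnv:
    """Wraps a Gym-style environment (classic 4-tuple step API assumed) so that an
    off-the-shelf RL algorithm optimizes the learned reward rather than the true one.
    Reuses `torch` and a trained `RewardNet` from the snippet above."""

    def __init__(self, env, reward_net):
        self.env = env
        self.reward_net = reward_net

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # the true reward is discarded
        with torch.no_grad():
            state = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            learned_reward = self.reward_net(state).item()
        return obs, learned_reward, done, info
```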

## Results

* Environments: MuJoCo (HalfCheetah, Ant, Hopper) and Atari.

* Demonstrations generated using PPO (checkpointed at different stages of training).

* An ensemble of networks is used to learn the reward function (see the short sketch after this list).

* The proposed approach outperforms the baselines [Behavioral Cloning from Observation](https://arxiv.org/abs/1805.01954) and [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476).

* In terms of reward extrapolation, T-REX can predict rewards for trajectories that are better than the demonstration trajectories.

* Ablation studies considered the effect of adding noise (randomly swapping the preference between pairs of trajectories) and found that the model is robust to noise up to an extent.
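
A small sketch of how an ensemble of reward networks could be combined at prediction time (the simple averaging here is an assumption; the paper's exact ensembling may differ), reusing the `RewardNet` sketch above:

```python
import torch

def ensemble_return(reward_nets, states):
    """Average the predicted return of a trajectory over independently trained reward networks."""
    with torch.no_grad():
        return torch.stack([net(states) for net in reward_nets]).mean().item()
```
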
2 changes: 1 addition & 1 deletion site/_site
Submodule _site updated from f92606 to 826701
