
Commit

Added DIAYN paper
shagunsodhani committed Feb 4, 2019
1 parent 8881498 commit 12d5e3f
Showing 3 changed files with 73 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -5,6 +5,7 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [Diversity is All You Need - Learning Skills without a Reward Function](https://shagunsodhani.in/papers-I-read/Diversity-is-All-You-Need-Learning-Skills-without-a-Reward-Function)
* [Modular meta-learning](https://shagunsodhani.in/papers-I-read/Modular-meta-learning)
* [Hierarchical RL Using an Ensemble of Proprioceptive Periodic Policies](https://shagunsodhani.in/papers-I-read/Hierarchical-RL-Using-an-Ensemble-of-Proprioceptive-Periodic-Policies)
* [Efficient Lifelong Learning with A-GEM](https://shagunsodhani.in/papers-I-read/Efficient-Lifelong-Learning-with-A-GEM)
@@ -0,0 +1,71 @@
---
layout: post
title: Diversity is All You Need - Learning Skills without a Reward Function
comments: True
excerpt:
tags: ['2019', 'ICLR 2019', 'Information Theory', 'Reinforcement Learning', Entropy, ICLR, RL, Unsupervised]
---

## Introduction

* The paper proposes an approach to learn useful skills without a reward function, by maximizing an information-theoretic objective with a maximum-entropy policy.

* Skills are defined as latent-conditioned policies that alter the state of the environment in a consistent way.

* [Link to the paper](https://arxiv.org/abs/1802.06070)

* [Link to the code](https://github.com/ben-eysenbach/sac)

## Setup

* An unsupervised "exploration" stage (where skills are learned without a reward function) is followed by a supervised stage (where the learned skills are used for the downstream task).

## Desirable Qualities of Skills

* Skills should dictate the states that the agent visits. Different skills should visit different states to be distinguishable.

* States (not actions) should be used to distinguish between skills, since not all actions change the state in a way that an outside observer can perceive.

* Skills are encouraged to be diverse and "exploratory" by requiring them to act as randomly as possible (i.e., have high entropy).

## Loss Formulation

* (S, A) - state and action

* z ~ p(z) - latent variable to condition the policy.

* Skill - policy conditioned on a fixed z.

* The objective is to maximize the mutual information between skills and states, MI(S; Z), i.e., the skill should control which states are visited, or equivalently, the skill should be inferable from the states visited.

* Simultaneously, minimize the mutual information between actions and skills given the state, MI(A; Z \| S), to ensure that the state (and not the action) is used to distinguish the skills.

* Maximize the entropy of the mixture of policies (obtained by sampling z from p(z) and then acting with the corresponding skill). These three terms combine into the objective sketched below.
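
A compact way to combine these three terms, written here as a sketch in the paper's notation (q<sub>&phi;</sub>(z \| s) is the learned discriminator introduced under Implementation below; the rewriting uses the standard mutual-information identities):

```latex
% Sketch: maximize skill-state MI and policy entropy, minimize skill-action MI
% given the state.
\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S)
                    = \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z]

% Replacing the intractable posterior p(z | s) with the learned discriminator
% q_phi(z | s) gives a variational lower bound; its per-state term is the
% pseudo-reward log q_phi(z | s) - log p(z) used during training.
\mathcal{F}(\theta) \ge \mathcal{H}[A \mid S, Z]
    + \mathbb{E}_{z \sim p(z),\, s \sim \pi(z)}\left[\log q_\phi(z \mid s) - \log p(z)\right]
```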

## Implementation

* Policy π(a \| s, z)

* The task reward is replaced by the pseudo-reward log q<sub>&phi;</sub>(z \| s) - log(p(z)).

* During unsupervised training, z is sampled at the start of the episode and then not changed during the episode.

* The learning agent gets rewarded for visiting states that are easy to discriminate, while the discriminator is updated to correctly predict z from the states visited. A minimal sketch of this training loop is given below.
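
A minimal sketch of this unsupervised training loop, under the assumption of a generic environment and a max-entropy RL agent (the names `env`, `agent`, and `Discriminator` below are hypothetical placeholders; the authors' actual implementation builds on the SAC codebase linked above):

```python
import numpy as np

class Discriminator:
    """Placeholder for q_phi(z | s); a real version is a neural classifier."""

    def __init__(self, num_skills):
        self.num_skills = num_skills

    def log_prob(self, state, z):
        # Log-probability the classifier assigns to skill z given the state.
        return np.log(1.0 / self.num_skills)  # stub: uniform prediction

    def update(self, state, z):
        # Train the classifier to predict z from the visited state (no-op stub).
        pass


def pseudo_reward(discriminator, state, z, p_z):
    # r(s, z) = log q_phi(z | s) - log p(z); replaces the task reward.
    return discriminator.log_prob(state, z) - np.log(p_z[z])


def run_unsupervised_episode(env, agent, discriminator, p_z):
    # Sample a skill once at the start of the episode and keep it fixed.
    z = np.random.choice(len(p_z), p=p_z)
    state, done = env.reset(), False
    while not done:
        action = agent.act(state, z)                # pi(a | s, z)
        next_state, _, done, _ = env.step(action)   # environment reward ignored
        r = pseudo_reward(discriminator, next_state, z, p_z)
        agent.store(state, z, action, r, next_state, done)  # replay buffer
        discriminator.update(next_state, z)         # learn to predict z from s
        state = next_state
    agent.learn()  # max-entropy (SAC-style) policy update on the pseudo-rewards
```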

## Observations

### Analysis of Learned Skills

* The agent learns a diverse set of primitive behaviors for all tasks ranging from 2 DoF to 111 DoF.

* For the inverted pendulum and mountain car tasks, the skills become increasingly diverse throughout training.

* Using a uniform prior for p(z), in place of a learned prior, allows for the discovery of more diverse skills.

* The proposed approach can be used as a pretraining technique where the best-performing primitives (from unsupervised training) can be finetuned with the task-specific rewards.

* The discovered skills can be used for hierarchical RL by learning a meta-policy (which chooses the skill to execute for k steps); a minimal sketch appears at the end of this section.

* Modifying the discriminator in the proposed formulation can bias DIAYN towards discovering particular types of policies. This provides a mechanism for incorporating "supervision" into the learning setup.

* The "discovered" primitives can also be used for imitation learning.
2 changes: 1 addition & 1 deletion site/_site
Submodule _site updated from a705a1 to 3179cc
