Add MuZero paper
shagunsodhani committed Dec 9, 2019
1 parent ba5b6b5 commit 8a78469
Showing 3 changed files with 129 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -5,6 +5,7 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://shagunsodhani.com/papers-I-read/Mastering-Atari,-Go,-Chess-and-Shogi-by-Planning-with-a-Learned-Model)
* [Gossip based Actor-Learner Architectures for Deep RL](https://shagunsodhani.com/papers-I-read/Gossip-based-Actor-Learner-Architectures-for-Deep-RL)
* [How to train your MAML](https://shagunsodhani.com/papers-I-read/How-to-train-your-MAML)
* [PHYRE - A New Benchmark for Physical Reasoning](https://shagunsodhani.com/papers-I-read/PHYRE-A-New-Benchmark-for-Physical-Reasoning)
@@ -0,0 +1,48 @@
---
layout: post
title: Gossip based Actor-Learner Architectures for Deep RL
comments: True
excerpt:
tags: ['2019', 'Deep Reinforcement Learning', 'Distributed Reinforcement Learning', 'Neurips 2019', 'Reinforcement Learning', AI, DRL, Neurips, RL]

---

* [Link to the paper](https://arxiv.org/abs/1906.04585)

* The paper considers the task of training an RL system by sampling data from multiple simulators (over parallel devices).

* The setup is that of a distributed RL setting with *n* agents or actor-learners (each composed of a single learner and several actors). These agents are trying to maximize a common value function.

* One existing approach is to perform on-policy updates with a shared policy. The policy can be updated either synchronously (which does not scale well) or asynchronously (which can be unstable due to stale gradients).

* Off-policy approaches allow for better computational efficiency but can be unstable during training.

* The paper proposes the Gossip-based Actor-Learner Architecture (GALA), which uses asynchronous communication (gossip) between the *n* agents to improve the training of deep RL models.

* These agents are expected to converge to the same policy.

* During training, the different agents are not required to share the same policy; it is sufficient that the agents' policies remain $\epsilon$-close to each other. This relaxation allows the policies to be trained asynchronously.

* The GALA approach is combined with A2C agents, resulting in GALA-A2C agents, which have better computational efficiency and scalability than A2C and perform similarly to A3C and Impala.

* Training alternates between one local policy-gradient (and TD) update and an asynchronous gossip step between agents.

* During the gossip step, each agent sends its parameters to some of the other agents (referred to as its peers) and updates its own parameters using the parameters received from the agents for which it is a peer (see the sketch at the end of this note).

* GALA agents are implemented using non-blocking communication so that they can operate asynchronously.

* The paper includes a proof that the policies learned by the different agents are within $\epsilon$ distance of each other (i.e., all the policies lie within an $\epsilon$-distance ball), thus ensuring that the policies do not diverge much from each other.

* Six games from the Atari 2600 suite are used for the experiments.

* Baselines: A2C, A3C, Impala

* GALA agents are configured in a directed ring graph topology.

* With A2C, as the number of simulators increases, the number of convergent runs (runs that reach a threshold reward) decreases.

* Using gossip algorithms increases or maintains the number of convergent runs. It also improves the performance, sample efficiency, and compute efficiency of A2C across all six games.

* When compared to Impala and A3C, GALA-A2C generally outperforms (or performs as well as) those baselines.

* Given that the learned policies remain within an $\epsilon$ ball, the agents' gradients are less correlated than those of the A2C agents.
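
As a rough illustration of the alternating local-update / gossip scheme described above, here is a minimal sketch in Python. The helpers `local_a2c_update` and `peers_of` are hypothetical placeholders, and the actual GALA agents use non-blocking communication between parallel processes, which is not modeled here.

```python
import numpy as np

# Hypothetical setup: each agent's policy is a flat parameter vector.
n_agents, dim = 4, 10
params = [np.random.randn(dim) for _ in range(n_agents)]

def peers_of(i):
    # Directed ring topology (as in the paper's experiments):
    # agent i sends its parameters to agent (i + 1) mod n.
    return [(i + 1) % n_agents]

def local_a2c_update(theta):
    # Placeholder for one local policy-gradient (and TD) update.
    return theta - 0.01 * np.random.randn(*theta.shape)

for step in range(100):
    # 1) Each agent performs one local learning step.
    params = [local_a2c_update(p) for p in params]

    # 2) Gossip step: each agent mixes its own parameters with the
    #    parameters received from the agents for which it is a peer.
    received = [[] for _ in range(n_agents)]
    for i in range(n_agents):
        for j in peers_of(i):
            received[j].append(params[i])
    params = [np.mean([params[i]] + received[i], axis=0) for i in range(n_agents)]
```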
@@ -0,0 +1,80 @@
---
layout: post
title: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
comments: True
excerpt:
tags: ['2019', 'Deep Reinforcement Learning', 'Reinforcement Learning', AI, DRL, Model-Based, Model-Free, Planning, RL]

---

## Introduction

* The paper presents the MuZero algorithm that performs planning with a learned model.

* The algorithm achieves state-of-the-art results on the Atari suite (where model-free approaches generally perform best) and on planning-oriented games like Chess and Go (where planning-based approaches generally perform best).

* [Link to the paper](https://arxiv.org/abs/1911.08265)

## Relation to standard Model-Based Approaches

* Model-based approaches generally focus on reconstructing the true environment state or the sequence of full observations.

* MuZero focuses on predicting only those aspects that are most relevant for planning - policy, value functions, and rewards.

## Approach

* The model consists of three components: the (representation) encoder, the dynamics function, and the prediction network.

* The learning agent has two kinds of interactions - real interactions (i.e., actions that are actually executed in the real environment) and hypothetical or imaginary interactions (i.e., actions that are executed only in the learned model, via the dynamics function).

* At any timestep *t*, the past observations *o<sub>1</sub>*, ... *o<sub>t</sub>* are encoded into the state *s<sub>t</sub>* using the encoder.

* The model is then unrolled for *K* steps by feeding it hypothetical actions for the next *K* timesteps (a minimal sketch of this unrolling appears at the end of this section).

* For each timestep *k = 1, ..., K*, the dynamics model predicts the immediate reward *r<sub>k</sub>* and a new hidden state *h<sub>k</sub>* using the previous hidden state *h<sub>k-1</sub>* and action *a<sub>k</sub>*.

* At the same time, the policy *p<sub>k</sub>* and the value function *v<sub>k</sub>* are computed using the prediction network.

* The initial hidden state *h<sub>0</sub>* is initialized using the state *s<sub>t</sub>*.

* Any MDP planning algorithm can be used to search for the optimal policy and value function, given the state transitions and rewards induced by the dynamics function.

* Specifically, MCTS (Monte Carlo Tree Search) is used, and the action *a<sub>t+1</sub>* (i.e., the action that is executed in the actual environment) is selected from the policy output by MCTS.
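
As a rough illustration of this unrolling, here is a minimal sketch; `encoder`, `dynamics`, and `prediction` are toy stand-ins for the three learned networks, and the MCTS-driven action selection from the paper is not reproduced.

```python
# Toy stand-ins for the three learned components (not the paper's networks).
def encoder(observations):        # o_1, ..., o_t -> s_t
    return sum(observations) / len(observations)

def dynamics(hidden, action):     # (h_{k-1}, a_k) -> (r_k, h_k)
    return 0.0, hidden + action

def prediction(hidden):           # h_k -> (p_k, v_k)
    return [0.5, 0.5], 0.0

def unroll(observations, actions):
    """Encode past observations into s_t, then unroll the model for K steps."""
    hidden = encoder(observations)   # h_0 is initialized from s_t
    outputs = []
    for action in actions:           # K hypothetical actions
        reward, hidden = dynamics(hidden, action)
        policy, value = prediction(hidden)
        outputs.append((reward, policy, value))
    return outputs

# Example: unroll for K = 3 hypothetical actions.
print(unroll(observations=[0.1, 0.2, 0.3], actions=[1.0, 0.0, 1.0]))
```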

## Collecting Data for the Replay Buffer

* At each timestep *t*, the MCTS algorithm is executed to choose the next action (which will be executed in the real environment).

* The resulting next observation *o<sub>t+1</sub>* and reward *r<sub>t+1</sub>* are stored, and the trajectory is written to the replay buffer at the end of the episode, as in the sketch below.
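
The loop below is a schematic of this data-collection process; `ToyEnv` and `run_mcts` are made-up placeholders, since the paper's actual environment interface and search procedure are not shown here.

```python
import random

class ToyEnv:
    """Tiny stand-in environment: two actions, episodes of length 5."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return float(action), (1.0 if action == 1 else 0.0), self.t >= 5

def run_mcts(observation_history):
    # Placeholder: the real agent runs MCTS over the learned model and
    # returns a visit-count-based policy for the current state.
    return [0.5, 0.5]

def collect_episode(env, replay_buffer):
    trajectory, done = [], False
    history = [env.reset()]
    while not done:
        policy = run_mcts(history)                    # search at every real step
        action = random.choices([0, 1], weights=policy)[0]
        obs, reward, done = env.step(action)
        trajectory.append((history[-1], action, policy, reward))
        history.append(obs)
    replay_buffer.append(trajectory)                  # written at episode end

buffer = []
collect_episode(ToyEnv(), buffer)
```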

## Objective

* For every hypothetical step *k*, the predicted policy, value, and reward are matched to their target values (see the sketch at the end of this section).

* The target policy is generated by the MCTS algorithm.

* The target value function and reward are generated by actually playing the game (or the MDP).
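
A minimal sketch of this K-step objective is shown below; plain cross-entropy and squared-error losses are used purely for illustration and do not reproduce the exact loss terms or scaling used in the paper.

```python
import math

def k_step_loss(predictions, targets):
    """Sum the per-step losses over K hypothetical steps.

    Both arguments are lists of (policy, value, reward) tuples, one per step,
    where each policy is a probability vector.
    """
    total = 0.0
    for (p_pred, v_pred, r_pred), (p_tgt, v_tgt, r_tgt) in zip(predictions, targets):
        policy_loss = -sum(t * math.log(max(p, 1e-8)) for p, t in zip(p_pred, p_tgt))
        value_loss = (v_pred - v_tgt) ** 2
        reward_loss = (r_pred - r_tgt) ** 2
        total += policy_loss + value_loss + reward_loss
    return total

# Example with K = 2 hypothetical steps.
preds = [([0.6, 0.4], 0.3, 0.0), ([0.5, 0.5], 0.1, 1.0)]
tgts = [([0.7, 0.3], 0.5, 0.0), ([0.4, 0.6], 0.2, 1.0)]
print(k_step_loss(preds, tgts))
```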

## Relation to AlphaZero

* MuZero leverages the search-based policy iteration from AlphaZero.

* It extends AlphaZero to setups with a single agent (where self-play is not possible) and setups with non-zero rewards at intermediate time steps.

* The encoder and the prediction functions are similar to the ones used by AlphaZero.

## Results

* *K* is set to 5.

* Environments: 57 games in Atari along with Chess, Go and Shogi

* MuZero achieves the same level of performance as AlphaZero for Chess and Shogi. In Go, MuZero slightly outperforms AlphaZero despite doing fewer computations per node in the search tree.

* In Atari, MuZero achieves a new state-of-the-art compared to both model-based and model-free approaches.

* The paper considers a variant called MuZero Reanalyze that reanalyzes old trajectories by re-running the MCTS algorithm with the updated network parameters. The motivation is to improve sample efficiency.

* MuZero performs well even when using a single simulation of MCTS (during inference).

* During training, using more MCTS simulations helps achieve better performance, though even just 6 simulations per move are sufficient to learn a good model for Ms. Pacman.
