This repository contains implementations of the most popular reinforcement learning algorithms, powered by TensorFlow 2.0 and TensorLayer 2.0. We aim to make the reinforcement learning tutorials simple, transparent and straightforward: this not only benefits new learners of reinforcement learning, but also makes it convenient for senior researchers to test their new ideas quickly.
A corresponding Springer textbook is also provided; you can get the free PDF if your institute has a Springer license. We have also released RLzoo for simple usage.
- python 3.5
- tensorflow >= 2.0.0 or tensorflow-gpu >= 2.0.0a0
- tensorlayer >= 2.0.1
- tensorflow-probability
If you encounter the error `AttributeError: module 'tensorflow' has no attribute 'contrib'` when running the code after installing tensorflow-probability, try:

```bash
pip install --upgrade tf-nightly-2.0-preview tfp-nightly
```
```bash
conda create --name tl python=3.6.4
conda activate tl
pip install tensorflow-gpu==2.0.0-rc1  # if no GPU, use: pip install tensorflow==2.0.0
pip install tensorlayer
pip install tensorflow-probability==0.9.0
pip install gym
pip install gym[atari]  # for other environments, use: pip install gym[all]
```
```bash
python tutorial_DDPG.py --train
```
We are currently open to any suggestions or pull requests that make this TensorLayer 2.0 reinforcement learning tutorial a better code repository for both new learners and senior researchers. Some of the algorithms mentioned in this markdown may not yet be available, since we are still implementing more RL algorithms and optimizing their performance. However, the algorithms listed above will be released in the coming weeks, and the repository will keep adding more advanced RL algorithms in the future.
For each tutorial, open a terminal and run `python ***.py --train` for training and `python ***.py --test` for testing.
The tutorial algorithms follow the same basic structure, as shown in file: ./tutorial_format.py
The pretrained models and learning curves for each algorithm are stored here. You can download the models and load the weights into the policies for testing.
- Q-learning
Code: ./tutorial_Qlearning.py
Paper: Technical Note: Q-Learning
Description: Q-learning is a non-deep-learning method that combines TD learning, off-policy updates and epsilon-greedy exploration. Central formula: Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A)). See David Silver's RL Tutorial Lecture 5 - Q-Learning for more details.
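The update above can be made concrete with a short tabular sketch. This is only an illustrative snippet, not the tutorial script; it assumes a small discrete Gym environment (FrozenLake-v0 here) and the classic 4-tuple `env.step` API, with `alpha`, `gamma` and `epsilon` as the usual hyper-parameters.

```python
# Minimal tabular Q-learning sketch (illustrative, not the tutorial script).
import numpy as np
import gym

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount, exploration rate

for episode in range(1000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy exploration
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        # off-policy TD update: bootstrap with the greedy action in the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```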
- Deep Q-Network (DQN)
Code: ./tutorial_DQN.py
Paper: Human-level control through deep reinforcement learning
Paper: Playing Atari with Deep Reinforcement Learning
Description: Deep Q-Network (DQN) is a method of TD learning, off-policy, with epsilon-greedy exploration (GLIE). It replaces the Q-table with a neural network Q(S, A; w) and performs gradient descent on the squared TD error: delta_w = alpha * (R + gamma * max_a Q(S', a; w) - Q(S, A; w)) * grad_w Q(S, A; w). See David Silver's RL Tutorial Lecture 5 - Q-Learning for more details.
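The same idea, written as a loss over a sampled minibatch, looks roughly like the sketch below. It is a sketch under assumptions, not the tutorial code: `q_net` and `target_q_net` stand for hypothetical tf.keras models mapping a batch of states to per-action Q-values (the target network being the periodically copied network used in the Nature DQN).

```python
import tensorflow as tf

gamma = 0.99

def dqn_loss(q_net, target_q_net, states, actions, rewards, next_states, dones):
    # TD target: r + gamma * max_a Q_target(s', a), zeroed at terminal states
    q_next = tf.reduce_max(target_q_net(next_states), axis=1)
    targets = rewards + gamma * (1.0 - dones) * q_next
    # Q(s, a) for the actions that were actually taken
    q_all = q_net(states)
    q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, q_all.shape[1]), axis=1)
    # mean squared TD error; its gradient gives the delta_w update above
    return tf.reduce_mean(tf.square(tf.stop_gradient(targets) - q_sa))
```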
- Double DQN / Dueling DQN / Noisy DQN
Code: ./tutorial_DQN_variants.py
Paper: Deep Reinforcement Learning with Double Q-learning
Description: We implement Double DQN, Dueling DQN and Noisy DQN here.
- The max operator in standard DQN uses the same values both to select and to evaluate an action: Q(s_t, a_t) = R_{t+1} + gamma * max_a Q_target(s_{t+1}, a).
- Double DQN addresses the overestimation problem of the max operator by decoupling selection from evaluation: Q(s_t, a_t) = R_{t+1} + gamma * Q_target(s_{t+1}, argmax_a Q(s_{t+1}, a)).
- Dueling DQN uses a dueling architecture in which the state value and the advantage of each action are estimated separately.
- Noisy DQN explores by adding parameter noise to the network.
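The difference between the two targets is easiest to see in code. The sketch below is illustrative only and reuses the hypothetical `q_net` / `target_q_net` models from the DQN example above: the online network selects the action, the target network evaluates it.

```python
import tensorflow as tf

gamma = 0.99

def double_dqn_target(q_net, target_q_net, rewards, next_states, dones):
    # select the next action with the online network ...
    next_actions = tf.argmax(q_net(next_states), axis=1)
    # ... but evaluate it with the target network
    q_next = target_q_net(next_states)
    q_next_selected = tf.reduce_sum(
        q_next * tf.one_hot(next_actions, q_next.shape[1]), axis=1)
    return rewards + gamma * (1.0 - dones) * q_next_selected
```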
- Prioritized Experience Replay
Code: ./tutorial_prioritized_replay.py
Paper: Prioritized Experience Replay
Description: Prioritized experience replay is an efficient replay method that replays important transitions more frequently. A segment tree data structure is used to speed up indexing.
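The core of that data structure is a binary sum tree whose leaves hold the transition priorities and whose internal nodes hold the sums of their children, so sampling proportionally to priority costs O(log N). Below is a minimal sketch of the idea; the tutorial's replay buffer is more complete, and the class and method names here are made up for illustration.

```python
# Minimal sum-tree sketch for proportional prioritized sampling (illustrative only).
import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity            # number of leaves (stored transitions)
        self.tree = np.zeros(2 * capacity)  # node i has children 2i and 2i+1; root at 1

    def update(self, index, priority):
        i = index + self.capacity           # position of the leaf
        self.tree[i] = priority
        while i > 1:                        # propagate the new sum up to the root
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self, value):
        # walk down from the root; `value` is uniform in [0, total priority)
        i = 1
        while i < self.capacity:
            if value < self.tree[2 * i]:
                i = 2 * i
            else:
                value -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.capacity            # index of the sampled transition

tree = SumTree(8)
for idx, priority in enumerate([1.0, 0.5, 2.0, 0.1]):
    tree.update(idx, priority)
print(tree.sample(np.random.uniform(0, tree.tree[1])))  # high-priority leaves are sampled more often
```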
- Distributional DQN (C51)
Code: ./tutorial_C51.py
Paper: A Distributional Perspective on Reinforcement Learning
Description: Categorical 51 (C51) is a distributional DQN algorithm, where 51 is the number of atoms. Instead of estimating a single expected action value, it models the value distribution as a categorical distribution over a fixed discrete support (the atoms).
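The sketch below shows how the fixed support of 51 atoms is built and how per-action Q-values are recovered from the predicted probabilities. It is illustrative only; `v_min`, `v_max` and the random `probs` stand in for the quantities a real C51 network would produce.

```python
import numpy as np

num_atoms, v_min, v_max = 51, -10.0, 10.0
atoms = np.linspace(v_min, v_max, num_atoms)        # fixed support z_1 ... z_51

# stand-in for the network output: one categorical distribution per action
probs = np.random.dirichlet(np.ones(num_atoms), size=4)  # shape (num_actions, num_atoms)

q_values = (probs * atoms).sum(axis=1)              # Q(s, a) = sum_i z_i * p_i(s, a)
greedy_action = int(np.argmax(q_values))
```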
- Actor-Critic (AC)
Code: ./tutorial_AC.py
Paper: Actor-Critic Algorithms
Description: An implementation of Advantage Actor-Critic, using the TD error as an estimate of the advantage.
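The update can be summarised in a few lines. This is a sketch under assumptions rather than the tutorial code: `critic` is a hypothetical model returning V(s), and `actor` is assumed to expose the log-probability of the taken action.

```python
import tensorflow as tf

gamma = 0.99

def ac_losses(actor, critic, state, action, reward, next_state, done):
    v = critic(state)                                     # V(s)
    v_next = tf.stop_gradient(critic(next_state))         # V(s'), no gradient through the target
    td_error = reward + gamma * (1.0 - done) * v_next - v
    critic_loss = tf.reduce_mean(tf.square(td_error))     # fit V(s) towards the TD target
    log_prob = actor(state, action)                       # log pi(a|s); assumed interface
    actor_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(td_error))  # TD error as advantage
    return actor_loss, critic_loss
```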
- Asynchronous Advantage Actor-Critic (A3C)
Code: ./tutorial_A3C.py
Paper: Asynchronous Methods for Deep Reinforcement Learning
Description: An implementation of Asynchronous Advantage Actor-Critic (A3C), using multi-threading to run parallel workers for distributed policy learning on top of the Actor-Critic structure.
- Soft Actor-Critic (SAC)
Code: ./tutorial_SAC.py
Paper: Soft Actor-Critic Algorithms and Applications
Description: The actor policy in SAC is stochastic and trained off-policy. The 'soft' in SAC refers to the trade-off between entropy and expected return: the additional entropy term encourages a more explorative policy. This implementation also updates the entropy coefficient automatically. This version of Soft Actor-Critic (SAC) contains 5 networks: 2 Q-networks, 2 target Q-networks and 1 policy network.
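The entropy-regularised Bellman target at the heart of SAC can be sketched as follows. Names such as `policy`, `target_q1` and `target_q2` are assumptions (the policy is assumed to return a sampled action together with its log-probability), and `alpha` is the entropy coefficient, which this implementation also learns automatically.

```python
import tensorflow as tf

gamma, alpha = 0.99, 0.2   # discount and entropy coefficient (alpha can also be learned)

def soft_q_target(policy, target_q1, target_q2, rewards, next_states, dones):
    next_actions, next_log_probs = policy(next_states)   # assumed interface: sample + log pi
    # clipped double-Q evaluation of the next state-action pair
    q_next = tf.minimum(target_q1(next_states, next_actions),
                        target_q2(next_states, next_actions))
    # soft value: subtract the entropy term alpha * log pi(a'|s')
    soft_value = q_next - alpha * next_log_probs
    return rewards + gamma * (1.0 - dones) * soft_value
```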
- Vanilla Policy Gradient (PG or REINFORCE)
Code: ./tutorial_PG.py
Paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation
Description: The policy gradient algorithm works by updating the policy parameters via stochastic gradient ascent on the policy performance. It is an on-policy algorithm that can be used for environments with either discrete or continuous action spaces. To apply it to a continuous action space, you need to change the last softmax layer and the choose_action function.
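For a discrete action space the core of the update reduces to the sketch below; `logits_net` is a hypothetical tf.keras model producing action logits, and `returns` are the discounted returns computed after an episode finishes.

```python
import tensorflow as tf

def reinforce_loss(logits_net, states, actions, returns):
    # log pi(a_t | s_t) from the network's logits
    logits = logits_net(states)
    log_probs = tf.nn.log_softmax(logits)
    log_pi_a = tf.reduce_sum(log_probs * tf.one_hot(actions, logits.shape[1]), axis=1)
    # gradient ascent on E[log pi(a|s) * G_t]  ==  gradient descent on the negative
    return -tf.reduce_mean(log_pi_a * returns)
```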
- Deep Deterministic Policy Gradient (DDPG)
Code: ./tutorial_DDPG.py
Paper: Continuous Control With Deep Reinforcement Learning
Description: An algorithm that concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
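A compact sketch of the two coupled updates is given below; `actor`, `critic` and their target copies are hypothetical tf.keras models (state -> action and (state, action) -> Q-value), not the tutorial's classes.

```python
import tensorflow as tf

gamma = 0.99

def ddpg_losses(actor, critic, target_actor, target_critic,
                states, actions, rewards, next_states, dones):
    # critic: regress Q(s, a) towards the Bellman target built from the target networks
    target_q = rewards + gamma * (1.0 - dones) * tf.squeeze(
        target_critic(next_states, target_actor(next_states)))
    critic_loss = tf.reduce_mean(
        tf.square(tf.stop_gradient(target_q) - tf.squeeze(critic(states, actions))))
    # actor: maximise Q(s, pi(s)), i.e. minimise its negative
    actor_loss = -tf.reduce_mean(critic(states, actor(states)))
    return actor_loss, critic_loss
```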
- Twin Delayed DDPG (TD3)
Code: ./tutorial_TD3.py
Paper: Addressing Function Approximation Error in Actor-Critic Methods
Description: DDPG suffers from problems such as overestimation of Q-values and sensitivity to hyper-parameters. Twin Delayed DDPG (TD3) is a variant of DDPG with several tricks:
- Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence "twin"), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
- Trick Two: "Delayed" Policy Updates. TD3 updates the policy (and the target networks) less frequently than the Q-functions.
- Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, making it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
The implementation of TD3 includes 6 networks: 2 Q-networks, 2 target Q-networks, 1 policy network and 1 target policy network. The actor policy in TD3 is deterministic, with Gaussian exploration noise.
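Tricks one and three can be sketched together as the target computation below; `target_actor`, `target_q1` and `target_q2` are hypothetical target networks, and the action range is assumed to be [-1, 1].

```python
import tensorflow as tf

gamma, noise_std, noise_clip = 0.99, 0.2, 0.5

def td3_target(target_actor, target_q1, target_q2, rewards, next_states, dones):
    # target policy smoothing: add clipped Gaussian noise to the target action
    next_actions = target_actor(next_states)
    noise = tf.clip_by_value(
        tf.random.normal(tf.shape(next_actions), stddev=noise_std),
        -noise_clip, noise_clip)
    next_actions = tf.clip_by_value(next_actions + noise, -1.0, 1.0)
    # clipped double-Q: use the smaller of the two target Q-values
    q_next = tf.minimum(target_q1(next_states, next_actions),
                        target_q2(next_states, next_actions))
    return rewards + gamma * (1.0 - dones) * tf.squeeze(q_next)
```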
- Trust Region Policy Optimization (TRPO)
Code: ./tutorial_TRPO.py
Paper: Trust Region Policy Optimization
Description: A PG method with a large step size can crash the policy's performance, and even a small step in parameter space can produce a large change in the policy. TRPO constrains the update in policy space using the KL divergence (rather than in parameter space), which enables monotonic improvement and avoids collapsed updates.
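The two quantities TRPO balances can be written compactly. The sketch below only shows the surrogate gain and a simple sample-based KL estimate, not the conjugate-gradient and line-search machinery TRPO actually uses to solve the constrained problem; names such as `log_probs_new` are assumptions.

```python
import tensorflow as tf

kl_limit = 0.01   # typical trust-region size

def trpo_quantities(log_probs_new, log_probs_old, advantages):
    # surrogate objective: E[(pi_new / pi_old) * A], maximised subject to ...
    ratio = tf.exp(log_probs_new - tf.stop_gradient(log_probs_old))
    surrogate = tf.reduce_mean(ratio * advantages)
    # ... a KL constraint: E[KL(pi_old || pi_new)] <= kl_limit
    approx_kl = tf.reduce_mean(tf.stop_gradient(log_probs_old) - log_probs_new)
    return surrogate, approx_kl
```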
- Proximal Policy Optimization (PPO)
Code: ./tutorial_PPO.py
Paper: Proximal Policy Optimization Algorithms
Description: A simple, single-threaded version of Proximal Policy Optimization (PPO). PPO is a family of first-order methods that use a few tricks to keep new policies close to old ones. PPO methods are significantly simpler to implement than TRPO, and empirically seem to perform at least as well.
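The clipped surrogate objective at the centre of PPO-Clip fits in a few lines; this is a sketch with assumed inputs (per-action log-probabilities under the new and old policies, plus advantage estimates), not the tutorial's training loop.

```python
import tensorflow as tf

clip_ratio = 0.2

def ppo_clip_loss(log_probs_new, log_probs_old, advantages):
    # probability ratio r_t(theta) = pi_new(a|s) / pi_old(a|s)
    ratio = tf.exp(log_probs_new - tf.stop_gradient(log_probs_old))
    clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # pessimistic (minimum) of the clipped and unclipped objectives
    return -tf.reduce_mean(tf.minimum(ratio * advantages, clipped * advantages))
```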
- Distributed Proximal Policy Optimization (DPPO)
Code: ./tutorial_DPPO.py
Paper: Emergence of Locomotion Behaviours in Rich Environments
Description: A distributed version of OpenAI's Proximal Policy Optimization (PPO). Workers are distributed to collect data in parallel; the workers' roll-outs are then paused while PPO is trained on the collected data.
- More in recent weeks
We typically use game environments from OpenAI Gym in our tutorials. Other environment sources, such as the DeepMind Control Suite and Marathon-Envs in Unity, all have wrappers that convert them into the Gym environment format; see here and here.
Our env wrapper: ./tutorial_wrappers.py
- @zsdonghao Hao Dong: AC, A3C, Q-Learning, DQN, PG
- @quantumiracle Zihan Ding: SAC, TD3
- @Tokarev-TT-33 Tianyang Yu, @initial-h Hongming Zhang: PG, DDPG, PPO, DPPO, TRPO
- @Officium Yanhua Huang: C51, DQN_variants, prioritized_replay, wrappers