
Comprehensive Reinforcement Learning Tutorial




This repository contains implementations of the most popular reinforcement learning algorithms, powered by TensorFlow 2.0 and TensorLayer 2.0. We aim to make this reinforcement learning tutorial simple, transparent and straightforward, as this not only benefits new learners of reinforcement learning, but also makes it convenient for senior researchers to test their new ideas quickly.

A corresponding Springer textbook is also provided; you can get the free PDF if your institute has a Springer license. We have also released RLzoo for simple usage.



Prerequisites:

  • python 3.5
  • tensorflow >= 2.0.0 or tensorflow-gpu >= 2.0.0a0
  • tensorlayer >= 2.0.1
  • tensorflow-probability

*** If you encounter the error AttributeError: module 'tensorflow' has no attribute 'contrib' when running the code after installing tensorflow-probability, try:

pip install --upgrade tf-nightly-2.0-preview tfp-nightly

Quick Start

conda create --name tl python=3.6.4  
conda activate tl
pip install tensorflow-gpu==2.0.0-rc1 # if no GPU, use pip install tensorflow==2.0.0
pip install tensorlayer
pip install tensorflow-probability==0.9.0
pip install gym
pip install gym[atari] # for others, use pip install gym[all]

python tutorial_DDPG.py --train

Status: Beta

We are currently open to any suggestions or pull requests that help make this reinforcement learning tutorial with TensorLayer 2.0 a better code repository for both new learners and senior researchers. Some of the algorithms mentioned in this markdown may not be available yet, since we are still implementing more RL algorithms and optimizing their performance. However, the algorithms listed above will come out in a few weeks, and the repository will keep adding more advanced RL algorithms in the future.

To Use:

For each tutorial, open a terminal and run:

python ***.py --train for training and python ***.py --test for testing.

The tutorial algorithms follow the same basic structure, as shown in file: ./tutorial_format.py

The pretrained models and learning curves for each algorithm are stored here. You can download the models and load the weights in the policies for tests.
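
All tutorial scripts share the same skeleton. Below is a rough, hypothetical sketch of that skeleton (class and function bodies are placeholders for illustration only; ./tutorial_format.py is the real template):

```python
# Hypothetical sketch of the shared tutorial layout; see ./tutorial_format.py
# for the real template. Names and the environment choice are illustrative.
import argparse
import gym

parser = argparse.ArgumentParser()
parser.add_argument('--train', dest='train', action='store_true', default=False)
parser.add_argument('--test', dest='test', action='store_true', default=False)
args = parser.parse_args()

env = gym.make('Pendulum-v0')   # environment used by the tutorial
agent = None                    # build the algorithm's networks / replay buffer here

if args.train:
    # interact with the environment, update the agent, then save the weights
    pass
if args.test:
    # load the saved weights and run evaluation episodes (optionally rendered)
    pass
```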

Table of Contents:

| Algorithms | Action Space | Tutorial Env | Papers |
| --- | --- | --- | --- |
| **value-based** | | | |
| Q-learning | Discrete | FrozenLake | Technical note: Q-learning. Watkins et al. 1992. |
| Deep Q-Network (DQN) | Discrete | FrozenLake | Human-level control through deep reinforcement learning. Mnih et al. 2015. |
| Prioritized Experience Replay | Discrete | Pong, CartPole | Prioritized experience replay. Schaul et al. 2015. |
| Dueling DQN | Discrete | Pong, CartPole | Dueling network architectures for deep reinforcement learning. Wang et al. 2015. |
| Double DQN | Discrete | Pong, CartPole | Deep reinforcement learning with double Q-learning. Van Hasselt et al. 2016. |
| Noisy DQN | Discrete | Pong, CartPole | Noisy networks for exploration. Fortunato et al. 2017. |
| Distributed DQN (C51) | Discrete | Pong, CartPole | A distributional perspective on reinforcement learning. Bellemare et al. 2017. |
| **policy-based** | | | |
| REINFORCE (PG) | Discrete/Continuous | CartPole | Reinforcement learning: An introduction. Sutton et al. 2011. |
| Trust Region Policy Optimization (TRPO) | Discrete/Continuous | Pendulum | Trust region policy optimization. Schulman et al. 2015. |
| Proximal Policy Optimization (PPO) | Discrete/Continuous | Pendulum | Proximal policy optimization algorithms. Schulman et al. 2017. |
| Distributed Proximal Policy Optimization (DPPO) | Discrete/Continuous | Pendulum | Emergence of locomotion behaviours in rich environments. Heess et al. 2017. |
| **actor-critic** | | | |
| Actor-Critic (AC) | Discrete/Continuous | CartPole | Actor-critic algorithms. Konda et al. 2000. |
| Asynchronous Advantage Actor-Critic (A3C) | Discrete/Continuous | BipedalWalker | Asynchronous methods for deep reinforcement learning. Mnih et al. 2016. |
| DDPG | Discrete/Continuous | Pendulum | Continuous control with deep reinforcement learning. Lillicrap et al. 2016. |
| TD3 | Discrete/Continuous | Pendulum | Addressing function approximation error in actor-critic methods. Fujimoto et al. 2018. |
| Soft Actor-Critic (SAC) | Discrete/Continuous | Pendulum | Soft actor-critic algorithms and applications. Haarnoja et al. 2018. |

Examples of RL Algorithms:

  • Q-learning

    Code: ./tutorial_Qlearning.py

    Paper: Technical Note Q-Learning

    Description:

    Q-learning is a non-deep-learning method that combines TD learning, off-policy updates and e-greedy exploration.
    
    Central formula:
    Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
    
    See David Silver RL Tutorial Lecture 5 - Q-Learning for more details.
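
    As a rough illustration (independent of the tutorial code, with illustrative hyper-parameters and the classic Gym API), the update above can be sketched as tabular Q-learning on FrozenLake:

    ```python
    # Minimal tabular Q-learning sketch; hyper-parameters are illustrative.
    import gym
    import numpy as np

    env = gym.make('FrozenLake-v0')
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    alpha, gamma, epsilon = 0.8, 0.95, 0.1

    for episode in range(2000):
        s = env.reset()
        done = False
        while not done:
            # e-greedy exploration
            a = env.action_space.sample() if np.random.rand() < epsilon else np.argmax(Q[s])
            s_next, r, done, _ = env.step(a)
            # Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    ```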
    
  • Deep Q-Network (DQN)

    Code: ./tutorial_DQN.py

    Paper: Human-level control through deep reinforcement learning

    Playing Atari with Deep Reinforcement Learning

    Description:

    Deep Q-Network (DQN) is a method combining TD learning, off-policy updates and e-greedy exploration (GLIE).
    
    Central formulas:
    Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A)),
    delta_w = alpha * (R + gamma * max_a Q(S', a, w) - Q(S, A, w)) * grad_w Q(S, A, w).
    
    See David Silver RL Tutorial Lecture 5 - Q-Learning for more details.
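
    A rough sketch of one such update in TensorFlow 2, using a target network (plain tf.keras is used here for brevity; names and hyper-parameters are illustrative, not the tutorial's exact code):

    ```python
    # Sketch of one DQN gradient step with a target network.
    import tensorflow as tf

    num_actions, gamma = 4, 0.99

    def build_qnet():
        return tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(num_actions)])

    q_net, target_net = build_qnet(), build_qnet()
    optimizer = tf.keras.optimizers.Adam(1e-3)

    def train_step(s, a, r, s_next, done):
        # TD target: R + gamma * max_a Q_target(S', a); bootstrapping is cut at terminal states
        q_next = tf.reduce_max(target_net(s_next), axis=1)
        target = r + gamma * (1.0 - done) * q_next
        with tf.GradientTape() as tape:
            q = tf.reduce_sum(q_net(s) * tf.one_hot(a, num_actions), axis=1)
            loss = tf.reduce_mean(tf.square(target - q))   # squared TD error
        grads = tape.gradient(loss, q_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
        # target_net weights are copied from q_net every few hundred steps (not shown)
    ```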
    
  • Double DQN / Dueling DQN / Noisy DQN

    Code: ./tutorial_DQN_variants.py

    Paper: Deep Reinforcement Learning with Double Q-learning

    Description:

    We implement Double DQN, Dueling DQN and Noisy DQN here.
    
    - The max operator in standard DQN uses the same values both to select and to evaluate an action:
    
       Q(s_t, a_t) = R_{t+1} + gamma * max_a Q_target(s_{t+1}, a).
    
    - Double DQN uses the following target to address the overestimation problem of the max operator:
    
       Q(s_t, a_t) = R_{t+1} + gamma * Q_target(s_{t+1}, argmax_a Q(s_{t+1}, a)).
    
    - Dueling DQN uses a dueling architecture, where the value of the state and the advantage of each action are estimated separately.
    
    - Noisy DQN explores by adding parametric noise to the network weights.
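
    The difference between the two targets can be sketched in a few lines (NumPy pseudo-batch; q_online and q_target are assumed to return per-action value arrays of shape (batch, num_actions)):

    ```python
    # Contrast of the DQN and Double DQN targets.
    import numpy as np

    def dqn_target(r, s_next, gamma, q_target):
        # the same network both selects and evaluates the action -> overestimation bias
        return r + gamma * np.max(q_target(s_next), axis=1)

    def double_dqn_target(r, s_next, gamma, q_online, q_target):
        # online network selects the action, target network evaluates it
        a_star = np.argmax(q_online(s_next), axis=1)
        return r + gamma * q_target(s_next)[np.arange(len(a_star)), a_star]
    ```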
    
  • Prioritized Experience Replay

    Code: ./tutorial_prioritized_replay.py

    Paper: Prioritized Experience Replay

    Description:

    Prioritized experience replay is an efficient replay method that replays important transitions more frequently. A segment tree data structure is used to speed up indexing.
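
    A small sketch of proportional prioritized sampling with importance-sampling weights (a flat array is used for clarity; the segment tree only accelerates the same computation):

    ```python
    # Proportional prioritized sampling with importance-sampling correction.
    # The segment tree in the tutorial speeds this up from O(N) to O(log N).
    import numpy as np

    def sample(priorities, batch_size, alpha=0.6, beta=0.4):
        probs = priorities ** alpha
        probs /= probs.sum()
        idx = np.random.choice(len(priorities), batch_size, p=probs)
        # importance-sampling weights correct the bias from non-uniform sampling
        weights = (len(priorities) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, weights

    # after learning, priorities[idx] are updated to |TD error| plus a small epsilon
    ```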
    
  • Distributed DQN (C51)

    Code: ./tutorial_C51.py

    Paper: A Distributional Perspective on Reinforcement Learning

    Description:

    Categorical 51 (C51) is a distributional DQN algorithm, where 51 is the number of atoms. Instead of estimating the expected value directly, it models the value distribution over a fixed set of support points (atoms).
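
    A short sketch of how the atoms are used to recover Q-values for action selection (Vmin/Vmax are illustrative; `probs` is assumed to have shape (num_actions, num_atoms) with rows summing to 1):

    ```python
    # Recovering Q-values from a categorical value distribution (C51).
    import numpy as np

    v_min, v_max, num_atoms = -10.0, 10.0, 51
    atoms = np.linspace(v_min, v_max, num_atoms)      # fixed support of the distribution

    def greedy_action(probs):
        q_values = (probs * atoms).sum(axis=1)        # Q(s, a) = sum_i z_i * p_i(s, a)
        return int(np.argmax(q_values))

    # training additionally projects the shifted support r + gamma * z onto `atoms`
    # and minimises the cross-entropy to the projected distribution (not shown)
    ```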
    
  • Actor-Critic (AC)

    Code: ./tutorial_AC.py

    Paper: Actor-Critic Algorithms

    Description:

    The implementation of Advantage Actor-Critic, using TD-error as the advantage.
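
    A sketch of the losses, using the TD error as the advantage (the networks and batch tensors `v`, `v_next`, `logp` are assumed to be defined elsewhere):

    ```python
    # Advantage Actor-Critic losses with the TD error as advantage.
    # `v` / `v_next` are critic values V(s), V(s'); `logp` is log pi(a|s).
    import tensorflow as tf

    def ac_losses(r, v, v_next, logp, done, gamma=0.9):
        td_error = r + gamma * (1.0 - done) * v_next - v   # advantage estimate
        critic_loss = tf.reduce_mean(tf.square(td_error))  # regress V(s) toward the TD target
        # stop_gradient keeps the critic's estimate fixed while updating the actor
        actor_loss = -tf.reduce_mean(logp * tf.stop_gradient(td_error))
        return actor_loss, critic_loss
    ```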
    
  • Asynchronous Advantage Actor-Critic (A3C)

    Code: ./tutorial_A3C.py

    Paper: Asynchronous Methods for Deep Reinforcement Learning

    Description:

    The implementation of Asynchronous Advantage Actor-Critic (A3C), using multi-threading for distributed policy learning on an Actor-Critic structure.
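
    The asynchronous part can be sketched as worker threads that compute gradients on local copies and apply them to one shared global model (the loss and environment loop are omitted; `build_local_model` and `compute_gradients` are hypothetical helpers, not functions from this repository):

    ```python
    # Sketch of the asynchronous update pattern in A3C.
    import threading

    def worker(global_model, global_optimizer, build_local_model, compute_gradients):
        local_model = build_local_model()
        while True:
            local_model.set_weights(global_model.get_weights())  # pull the latest weights
            grads = compute_gradients(local_model)                # roll out and differentiate
            global_optimizer.apply_gradients(zip(grads, global_model.trainable_variables))

    # threads = [threading.Thread(target=worker, args=(...)) for _ in range(n_workers)]
    ```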
    
  • Soft Actor-Critic (SAC)

    Code: ./tutorial_SAC.py

    Paper: Soft Actor-Critic Algorithms and Applications

    Description:

    The actor policy in SAC is stochastic and trained off-policy. The 'soft' in SAC refers to the trade-off between entropy and expected return: the additional entropy term encourages a more exploratory policy. This implementation also updates the entropy factor automatically.
    
    This implementation of Soft Actor-Critic (SAC) contains 5 networks:
    2 Q-networks, 2 target Q-networks and 1 policy network.
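
    The soft (entropy-regularised) Bellman target that the two Q-networks regress toward can be sketched as follows (`q1_targ`/`q2_targ` are target Q-values at (s', a'), with a' and `logp_next` sampled from the current stochastic policy; names are illustrative):

    ```python
    # Sketch of the soft Bellman target used by the two Q-networks in SAC.
    import tensorflow as tf

    def soft_q_target(r, done, q1_targ, q2_targ, logp_next, gamma=0.99, alpha=0.2):
        # take the smaller target Q to reduce overestimation, subtract the entropy term
        min_q = tf.minimum(q1_targ, q2_targ)
        return r + gamma * (1.0 - done) * (min_q - alpha * logp_next)

    # the entropy temperature `alpha` is itself learned automatically in this
    # implementation, by minimising alpha * (-logp - target_entropy)
    ```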
    
  • Vanilla Policy Gradient (PG or REINFORCE)

    Code: ./tutorial_PG.py

    Paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation

    Description:

    The policy gradient algorithm works by updating the policy parameters via stochastic gradient ascent on policy performance. It is an on-policy algorithm that can be used for environments with either discrete or continuous action spaces.
    
    To apply it to a continuous action space, you need to change the final softmax layer and the choose_action function.
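
    A sketch of the REINFORCE update: discounted returns are computed per episode and the policy's log-probabilities are weighted by them (normalising the returns is a common variance-reduction trick, not a requirement of the algorithm):

    ```python
    # REINFORCE sketch: gradient ascent on E[log pi(a|s) * G_t].
    import numpy as np
    import tensorflow as tf

    def discounted_returns(rewards, gamma=0.99):
        returns, g = np.zeros_like(rewards, dtype=np.float32), 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        returns -= returns.mean()                 # normalisation reduces variance
        return returns / (returns.std() + 1e-8)

    def pg_loss(logp, returns):
        # minimising -logp * G is gradient ascent on the policy objective
        return -tf.reduce_mean(logp * returns)
    ```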
    
  • Deep Deterministic Policy Gradient (DDPG)

    Code: ./tutorial_DDPG.py

    Paper: Continuous Control With Deep Reinforcement Learning

    Description:

    DDPG is an algorithm that concurrently learns a Q-function and a policy.
    
    It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
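
    One update step can be sketched as: fit the critic to the Bellman target built from the target networks, update the actor to maximise the critic, then softly update the targets (`actor`, `critic` and their `*_targ` copies are assumed tf.keras models defined elsewhere; the critic takes [state, action] inputs):

    ```python
    # Sketch of one DDPG update (illustrative, not the tutorial's exact code).
    import tensorflow as tf

    def ddpg_update(s, a, r, s_next, done, actor, critic, actor_targ, critic_targ,
                    actor_opt, critic_opt, gamma=0.99, tau=0.005):
        # critic: regress Q(s, a) toward r + gamma * Q'(s', mu'(s'))
        y = r + gamma * (1.0 - done) * critic_targ([s_next, actor_targ(s_next)])
        with tf.GradientTape() as tape:
            critic_loss = tf.reduce_mean(tf.square(y - critic([s, a])))
        grads = tape.gradient(critic_loss, critic.trainable_variables)
        critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

        # actor: deterministic policy gradient, maximise Q(s, mu(s))
        with tf.GradientTape() as tape:
            actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
        grads = tape.gradient(actor_loss, actor.trainable_variables)
        actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

        # soft (Polyak) update of the target networks
        for w, w_targ in zip(critic.trainable_variables + actor.trainable_variables,
                             critic_targ.trainable_variables + actor_targ.trainable_variables):
            w_targ.assign(tau * w + (1.0 - tau) * w_targ)
    ```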
    
  • Twin Delayed DDPG (TD3)

    Code: ./tutorial_TD3.py

    Paper: Addressing Function Approximation Error in Actor-Critic Methods

    Description:

    DDPG suffers from problems such as overestimation of Q-values and sensitivity to hyper-parameters.
    
    Twin Delayed DDPG (TD3) is a variant of DDPG with several tricks:
    
    - Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
    - Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function.
    - Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
    
    The implementation of TD3 includes 6 networks:
    2 Q-networks, 2 target Q-networks, 1 policy network, 1 target policy network.
    
    Actor policy in TD3 is deterministic, with Gaussian exploration noise.
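
    The target construction combining the tricks above can be sketched as follows (`actor_targ`, `q1_targ`, `q2_targ` are assumed target networks; actions lie in [-act_limit, act_limit]; hyper-parameters are illustrative):

    ```python
    # Sketch of the TD3 target: target policy smoothing + clipped double-Q.
    import tensorflow as tf

    def td3_target(r, s_next, done, actor_targ, q1_targ, q2_targ,
                   gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
        a_next = actor_targ(s_next)
        # Trick Three: smooth the target action with clipped Gaussian noise
        noise = tf.clip_by_value(tf.random.normal(tf.shape(a_next), stddev=noise_std),
                                 -noise_clip, noise_clip)
        a_next = tf.clip_by_value(a_next + noise, -act_limit, act_limit)
        # Trick One: clipped double-Q, take the smaller of the two target Q-values
        min_q = tf.minimum(q1_targ([s_next, a_next]), q2_targ([s_next, a_next]))
        return r + gamma * (1.0 - done) * min_q

    # Trick Two: the actor and target networks are updated only once every
    # few (e.g. 2) critic updates.
    ```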
    
  • Trust Region Policy Optimization (TRPO)

    Code: ./tutorial_TRPO.py

    Paper: Trust Region Policy Optimization

    Description:

    A policy-gradient update with a large step can collapse the policy's performance, and even a small step in parameter space can produce a large change in the policy.
    
    TRPO constrains the update step in policy space using a KL-divergence bound (rather than in parameter space), which enables monotonic performance improvement and avoids collapsed updates.
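
    The constrained objective can be sketched as follows (a minimal sketch of the surrogate only; in practice the step is found with conjugate gradient and a line search, which are not shown):

    ```python
    # TRPO step sketch: maximise the surrogate subject to a KL trust region.
    import tensorflow as tf

    def surrogate(logp, logp_old, adv):
        # E[ pi_new(a|s) / pi_old(a|s) * A(s, a) ]
        return tf.reduce_mean(tf.exp(logp - logp_old) * adv)

    # maximise surrogate(theta) subject to  mean KL(pi_old || pi_theta) <= delta
    ```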
    
  • Proximal Policy Optimization (PPO)

    Code: ./tutorial_PPO.py

    Paper: Proximal Policy Optimization Algorithms

    Description:

    A simple, single-threaded version of Proximal Policy Optimization (PPO).
    
    PPO is a family of first-order methods that use a few tricks to keep new policies close to the old one.
    
    PPO methods are significantly simpler to implement than TRPO, and empirically seem to perform at least as well.
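
    The clipped surrogate objective that keeps the new policy close to the old one can be sketched as (`logp`/`logp_old` are log-probabilities under the new and old policies, `adv` is the advantage estimate):

    ```python
    # PPO-Clip surrogate loss sketch.
    import tensorflow as tf

    def ppo_clip_loss(logp, logp_old, adv, clip_ratio=0.2):
        ratio = tf.exp(logp - logp_old)                   # pi_new(a|s) / pi_old(a|s)
        clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
        # pessimistic (minimum) of the clipped and unclipped objectives
        return -tf.reduce_mean(tf.minimum(ratio * adv, clipped))
    ```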
    
    
    
  • Distributed Proximal Policy Optimization (DPPO)

    Code: ./tutorial_DPPO.py

    Paper: Emergence of Locomotion Behaviours in Rich Environments

    Description:

    A distributed version of OpenAI's Proximal Policy Optimization (PPO).
    
    Workers are distributed to collect data in parallel; their roll-outs are then paused while PPO is trained on the collected data.
    
  • More coming in the following weeks

Environment:

We typically use game environments from OpenAI Gym for our tutorials. Other environment sources, such as the DeepMind Control Suite and Marathon-Envs in Unity, provide wrappers that convert them into the Gym environment format; see here and here.

Our env wrapper: ./tutorial_wrappers.py
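
All tutorials assume the standard (classic) Gym interaction loop, which is also what those wrappers emulate:

```python
# The standard Gym interface assumed by the tutorials (classic Gym API).
import gym

env = gym.make('CartPole-v0')
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()        # replace with the trained policy
    obs, reward, done, info = env.step(action)
env.close()
```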

Authors

  • @zsdonghao Hao Dong: AC, A3C, Q-Learning, DQN, PG
  • @quantumiracle Zihan Ding: SAC, TD3
  • @Tokarev-TT-33 Tianyang Yu @initial-h Hongming Zhang: PG, DDPG, PPO, DPPO, TRPO
  • @Officium Yanhua Huang: C51, DQN_variants, prioritized_replay, wrappers

Recommended Materials