This repository contains implementations of the most popular reinforcement learning algorithms, powered by TensorFlow 2.0 and TensorLayer 2.0. We aim to make the reinforcement learning tutorials simple, transparent and straightforward: this not only benefits new learners of reinforcement learning, but also makes it convenient for senior researchers to test their new ideas quickly.
A corresponding Springer textbook is also provided; you can get the free PDF if your institute has a Springer license. We have also released RLzoo for simple usage.
- python 3.5
- tensorflow >= 2.0.0 or tensorflow-gpu >= 2.0.0a0
- tensorlayer >= 2.0.1
- tensorflow-probability
If you encounter the error `AttributeError: module 'tensorflow' has no attribute 'contrib'` when running the code after installing tensorflow-probability, try:

```bash
pip install --upgrade tf-nightly-2.0-preview tfp-nightly
```
```bash
conda create --name tl python=3.6.4
conda activate tl
pip install tensorflow-gpu==2.0.0-rc1  # if no GPU, use: pip install tensorflow==2.0.0
pip install tensorlayer
pip install tensorflow-probability==0.9.0
pip install gym
pip install gym[atari]  # for other environments, use: pip install gym[all]
```
```bash
python tutorial_DDPG.py --train
```
We are currently open to any suggestions or pull requests that make this TensorLayer 2.0 reinforcement learning tutorial a better code repository for both new learners and senior researchers. Some of the algorithms mentioned in this markdown may not yet be available, since we are still implementing more RL algorithms and optimizing their performance. However, the algorithms listed above will be released in the coming weeks, and the repository will keep adding more advanced RL algorithms in the future.
For each tutorial, open a terminal and run `python ***.py --train` for training and `python ***.py --test` for testing.
The tutorial algorithms follow the same basic structure, as shown in file: ./tutorial_format.py
The pretrained models and learning curves for each algorithm are stored here. You can download the models and load the weights into the policies for testing.
- Q-learning
Code: ./tutorial_Qlearning.py
Paper: Technical Note: Q-Learning
Description: Q-learning is a non-deep-learning method that combines TD learning, off-policy updates and epsilon-greedy exploration. Central formula: Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A)). See David Silver's RL Tutorial Lecture 5 - Q-Learning for more details.
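The update above can be made concrete with a short tabular sketch. This is only an illustrative snippet, not the tutorial script; it assumes a small discrete Gym environment (FrozenLake-v0 here) and the classic 4-tuple `env.step` API, with `alpha`, `gamma` and `epsilon` as the usual hyper-parameters.

```python
# Minimal tabular Q-learning sketch (illustrative, not the tutorial script).
import numpy as np
import gym

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount, exploration rate

for episode in range(1000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy exploration
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        # off-policy TD update: bootstrap with the greedy action in the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```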
- Deep Q-Network (DQN)
Code: ./tutorial_DQN.py
Paper: Human-level control through deep reinforcement learning
Paper: Playing Atari with Deep Reinforcement Learning
Description: Deep Q-Network (DQN) is a method of TD learning, off-policy, with epsilon-greedy exploration (GLIE). It replaces the Q-table with a neural network Q(S, A; w) and performs gradient descent on the squared TD error: delta_w = alpha * (R + gamma * max_a Q(S', a; w) - Q(S, A; w)) * grad_w Q(S, A; w). See David Silver's RL Tutorial Lecture 5 - Q-Learning for more details.
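The same idea, written as a loss over a sampled minibatch, looks roughly like the sketch below. It is a sketch under assumptions, not the tutorial code: `q_net` and `target_q_net` stand for hypothetical tf.keras models mapping a batch of states to per-action Q-values (the target network being the periodically copied network used in the Nature DQN).

```python
import tensorflow as tf

gamma = 0.99

def dqn_loss(q_net, target_q_net, states, actions, rewards, next_states, dones):
    # TD target: r + gamma * max_a Q_target(s', a), zeroed at terminal states
    q_next = tf.reduce_max(target_q_net(next_states), axis=1)
    targets = rewards + gamma * (1.0 - dones) * q_next
    # Q(s, a) for the actions that were actually taken
    q_all = q_net(states)
    q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, q_all.shape[1]), axis=1)
    # mean squared TD error; its gradient gives the delta_w update above
    return tf.reduce_mean(tf.square(tf.stop_gradient(targets) - q_sa))
```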
- Double DQN / Dueling DQN / Noisy DQN
Code: ./tutorial_DQN_variants.py
Paper: Deep Reinforcement Learning with Double Q-learning
Description: We implement Double DQN, Dueling DQN and Noisy DQN here.
- The max operator in standard DQN uses the same values both to select and to evaluate an action: Q(s_t, a_t) = R_{t+1} + gamma * max_a Q_target(s_{t+1}, a).
- Double DQN addresses the overestimation problem of the max operator by decoupling selection from evaluation: Q(s_t, a_t) = R_{t+1} + gamma * Q_target(s_{t+1}, argmax_a Q(s_{t+1}, a)).
- Dueling DQN uses a dueling architecture in which the state value and the advantage of each action are estimated separately.
- Noisy DQN explores by adding parameter noise to the network.
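The difference between the two targets is easiest to see in code. The sketch below is illustrative only and reuses the hypothetical `q_net` / `target_q_net` models from the DQN example above: the online network selects the action, the target network evaluates it.

```python
import tensorflow as tf

gamma = 0.99

def double_dqn_target(q_net, target_q_net, rewards, next_states, dones):
    # select the next action with the online network ...
    next_actions = tf.argmax(q_net(next_states), axis=1)
    # ... but evaluate it with the target network
    q_next = target_q_net(next_states)
    q_next_selected = tf.reduce_sum(
        q_next * tf.one_hot(next_actions, q_next.shape[1]), axis=1)
    return rewards + gamma * (1.0 - dones) * q_next_selected
```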
- Prioritized Experience Replay
Code: ./tutorial_prioritized_replay.py
Paper: Prioritized Experience Replay
Description: Prioritized experience replay is an efficient replay method that replays important transitions more frequently. A segment tree data structure is used to speed up indexing.
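The core of that data structure is a binary sum tree whose leaves hold the transition priorities and whose internal nodes hold the sums of their children, so sampling proportionally to priority costs O(log N). Below is a minimal sketch of the idea; the tutorial's replay buffer is more complete, and the class and method names here are made up for illustration.

```python
# Minimal sum-tree sketch for proportional prioritized sampling (illustrative only).
import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity            # number of leaves (stored transitions)
        self.tree = np.zeros(2 * capacity)  # node i has children 2i and 2i+1; root at 1

    def update(self, index, priority):
        i = index + self.capacity           # position of the leaf
        self.tree[i] = priority
        while i > 1:                        # propagate the new sum up to the root
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self, value):
        # walk down from the root; `value` is uniform in [0, total priority)
        i = 1
        while i < self.capacity:
            if value < self.tree[2 * i]:
                i = 2 * i
            else:
                value -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.capacity            # index of the sampled transition

tree = SumTree(8)
for idx, priority in enumerate([1.0, 0.5, 2.0, 0.1]):
    tree.update(idx, priority)
print(tree.sample(np.random.uniform(0, tree.tree[1])))  # high-priority leaves are sampled more often
```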
- Distributional DQN (C51)
Code: ./tutorial_C51.py
Paper: A Distributional Perspective on Reinforcement Learning
Description: Categorical 51 (C51) is a distributional DQN algorithm, where 51 is the number of atoms. Instead of estimating a single expected action value, it models the value distribution as a categorical distribution over a fixed discrete support (the atoms).
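The sketch below shows how the fixed support of 51 atoms is built and how per-action Q-values are recovered from the predicted probabilities. It is illustrative only; `v_min`, `v_max` and the random `probs` stand in for the quantities a real C51 network would produce.

```python
import numpy as np

num_atoms, v_min, v_max = 51, -10.0, 10.0
atoms = np.linspace(v_min, v_max, num_atoms)        # fixed support z_1 ... z_51

# stand-in for the network output: one categorical distribution per action
probs = np.random.dirichlet(np.ones(num_atoms), size=4)  # shape (num_actions, num_atoms)

q_values = (probs * atoms).sum(axis=1)              # Q(s, a) = sum_i z_i * p_i(s, a)
greedy_action = int(np.argmax(q_values))
```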
- Actor-Critic (AC)
Code: ./tutorial_AC.py
Paper: Actor-Critic Algorithms
Description: An implementation of Advantage Actor-Critic, using the TD error as an estimate of the advantage.
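The update can be summarised in a few lines. This is a sketch under assumptions rather than the tutorial code: `critic` is a hypothetical model returning V(s), and `actor` is assumed to expose the log-probability of the taken action.

```python
import tensorflow as tf

gamma = 0.99

def ac_losses(actor, critic, state, action, reward, next_state, done):
    v = critic(state)                                     # V(s)
    v_next = tf.stop_gradient(critic(next_state))         # V(s'), no gradient through the target
    td_error = reward + gamma * (1.0 - done) * v_next - v
    critic_loss = tf.reduce_mean(tf.square(td_error))     # fit V(s) towards the TD target
    log_prob = actor(state, action)                       # log pi(a|s); assumed interface
    actor_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(td_error))  # TD error as advantage
    return actor_loss, critic_loss
```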
- Asynchronous Advantage Actor-Critic (A3C)
Code: ./tutorial_A3C.py
Paper: Asynchronous Methods for Deep Reinforcement Learning
Description: An implementation of Asynchronous Advantage Actor-Critic (A3C), using multi-threading to run parallel workers for distributed policy learning on top of the Actor-Critic structure.
- Soft Actor-Critic (SAC)
Code: ./tutorial_SAC.py
Paper: Soft Actor-Critic Algorithms and Applications
Description: The actor policy in SAC is stochastic and trained off-policy. The 'soft' in SAC refers to the trade-off between entropy and expected return: the additional entropy term encourages a more explorative policy. This implementation also updates the entropy coefficient automatically. This version of Soft Actor-Critic (SAC) contains 5 networks: 2 Q-networks, 2 target Q-networks and 1 policy network.
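The entropy-regularised Bellman target at the heart of SAC can be sketched as follows. Names such as `policy`, `target_q1` and `target_q2` are assumptions (the policy is assumed to return a sampled action together with its log-probability), and `alpha` is the entropy coefficient, which this implementation also learns automatically.

```python
import tensorflow as tf

gamma, alpha = 0.99, 0.2   # discount and entropy coefficient (alpha can also be learned)

def soft_q_target(policy, target_q1, target_q2, rewards, next_states, dones):
    next_actions, next_log_probs = policy(next_states)   # assumed interface: sample + log pi
    # clipped double-Q evaluation of the next state-action pair
    q_next = tf.minimum(target_q1(next_states, next_actions),
                        target_q2(next_states, next_actions))
    # soft value: subtract the entropy term alpha * log pi(a'|s')
    soft_value = q_next - alpha * next_log_probs
    return rewards + gamma * (1.0 - dones) * soft_value
```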
- Vanilla Policy Gradient (PG or REINFORCE)
Code: ./tutorial_PG.py
Paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation
Description: The policy gradient algorithm works by updating the policy parameters via stochastic gradient ascent on the policy performance. It is an on-policy algorithm that can be used for environments with either discrete or continuous action spaces. To apply it to a continuous action space, you need to change the last softmax layer and the choose_action function.
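For a discrete action space the core of the update reduces to the sketch below; `logits_net` is a hypothetical tf.keras model producing action logits, and `returns` are the discounted returns computed after an episode finishes.

```python
import tensorflow as tf

def reinforce_loss(logits_net, states, actions, returns):
    # log pi(a_t | s_t) from the network's logits
    logits = logits_net(states)
    log_probs = tf.nn.log_softmax(logits)
    log_pi_a = tf.reduce_sum(log_probs * tf.one_hot(actions, logits.shape[1]), axis=1)
    # gradient ascent on E[log pi(a|s) * G_t]  ==  gradient descent on the negative
    return -tf.reduce_mean(log_pi_a * returns)
```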
- Deep Deterministic Policy Gradient (DDPG)
Code: ./tutorial_DDPG.py
Paper: Continuous Control With Deep Reinforcement Learning
Description: An algorithm that concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
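A compact sketch of the two coupled updates is given below; `actor`, `critic` and their target copies are hypothetical tf.keras models (state -> action and (state, action) -> Q-value), not the tutorial's classes.

```python
import tensorflow as tf

gamma = 0.99

def ddpg_losses(actor, critic, target_actor, target_critic,
                states, actions, rewards, next_states, dones):
    # critic: regress Q(s, a) towards the Bellman target built from the target networks
    target_q = rewards + gamma * (1.0 - dones) * tf.squeeze(
        target_critic(next_states, target_actor(next_states)))
    critic_loss = tf.reduce_mean(
        tf.square(tf.stop_gradient(target_q) - tf.squeeze(critic(states, actions))))
    # actor: maximise Q(s, pi(s)), i.e. minimise its negative
    actor_loss = -tf.reduce_mean(critic(states, actor(states)))
    return actor_loss, critic_loss
```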
- Twin Delayed DDPG (TD3)
Code: ./tutorial_TD3.py
Paper: Addressing Function Approximation Error in Actor-Critic Methods
Description: DDPG suffers from problems such as overestimation of Q-values and sensitivity to hyper-parameters. Twin Delayed DDPG (TD3) is a variant of DDPG with several tricks:
- Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence "twin"), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
- Trick Two: "Delayed" Policy Updates. TD3 updates the policy (and the target networks) less frequently than the Q-functions.
- Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, making it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
The implementation of TD3 includes 6 networks: 2 Q-networks, 2 target Q-networks, 1 policy network and 1 target policy network. The actor policy in TD3 is deterministic, with Gaussian exploration noise.
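Tricks one and three can be sketched together as the target computation below; `target_actor`, `target_q1` and `target_q2` are hypothetical target networks, and the action range is assumed to be [-1, 1].

```python
import tensorflow as tf

gamma, noise_std, noise_clip = 0.99, 0.2, 0.5

def td3_target(target_actor, target_q1, target_q2, rewards, next_states, dones):
    # target policy smoothing: add clipped Gaussian noise to the target action
    next_actions = target_actor(next_states)
    noise = tf.clip_by_value(
        tf.random.normal(tf.shape(next_actions), stddev=noise_std),
        -noise_clip, noise_clip)
    next_actions = tf.clip_by_value(next_actions + noise, -1.0, 1.0)
    # clipped double-Q: use the smaller of the two target Q-values
    q_next = tf.minimum(target_q1(next_states, next_actions),
                        target_q2(next_states, next_actions))
    return rewards + gamma * (1.0 - dones) * tf.squeeze(q_next)
```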
- Trust Region Policy Optimization (TRPO)
Code: ./tutorial_TRPO.py
Paper: Trust Region Policy Optimization
Description: A PG method with a large step size can crash the policy's performance, and even a small step in parameter space can produce a large change in the policy. TRPO constrains the update in policy space using the KL divergence (rather than in parameter space), which enables monotonic improvement and avoids collapsed updates.
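The two quantities TRPO balances can be written compactly. The sketch below only shows the surrogate gain and a simple sample-based KL estimate, not the conjugate-gradient and line-search machinery TRPO actually uses to solve the constrained problem; names such as `log_probs_new` are assumptions.

```python
import tensorflow as tf

kl_limit = 0.01   # typical trust-region size

def trpo_quantities(log_probs_new, log_probs_old, advantages):
    # surrogate objective: E[(pi_new / pi_old) * A], maximised subject to ...
    ratio = tf.exp(log_probs_new - tf.stop_gradient(log_probs_old))
    surrogate = tf.reduce_mean(ratio * advantages)
    # ... a KL constraint: E[KL(pi_old || pi_new)] <= kl_limit
    approx_kl = tf.reduce_mean(tf.stop_gradient(log_probs_old) - log_probs_new)
    return surrogate, approx_kl
```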
- Proximal Policy Optimization (PPO)
Code: ./tutorial_PPO.py
Paper: Proximal Policy Optimization Algorithms
Description: A simple, single-threaded version of Proximal Policy Optimization (PPO). PPO is a family of first-order methods that use a few tricks to keep new policies close to old ones. PPO methods are significantly simpler to implement than TRPO, and empirically seem to perform at least as well.
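The clipped surrogate objective at the centre of PPO-Clip fits in a few lines; this is a sketch with assumed inputs (per-action log-probabilities under the new and old policies, plus advantage estimates), not the tutorial's training loop.

```python
import tensorflow as tf

clip_ratio = 0.2

def ppo_clip_loss(log_probs_new, log_probs_old, advantages):
    # probability ratio r_t(theta) = pi_new(a|s) / pi_old(a|s)
    ratio = tf.exp(log_probs_new - tf.stop_gradient(log_probs_old))
    clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # pessimistic (minimum) of the clipped and unclipped objectives
    return -tf.reduce_mean(tf.minimum(ratio * advantages, clipped * advantages))
```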
- Distributed Proximal Policy Optimization (DPPO)
Code: ./tutorial_DPPO.py
Paper: Emergence of Locomotion Behaviours in Rich Environments
Description: A distributed version of OpenAI's Proximal Policy Optimization (PPO). Workers are distributed to collect data in parallel; the workers' roll-outs are then paused while PPO is trained on the collected data.
- More in recent weeks
We typically use game environments from OpenAI Gym in our tutorials. Other environment sources, such as the DeepMind Control Suite and Marathon-Envs in Unity, all have wrappers that convert them into the Gym environment format; see here and here.
Our env wrapper: ./tutorial_wrappers.py
- @zsdonghao Hao Dong: AC, A3C, Q-Learning, DQN, PG
- @quantumiracle Zihan Ding: SAC, TD3
- @Tokarev-TT-33 Tianyang Yu, @initial-h Hongming Zhang: PG, DDPG, PPO, DPPO, TRPO
- @Officium Yanhua Huang: C51, DQN_variants, prioritized_replay, wrappers