Report and updated weights
aweeraman committed Jan 14, 2019
1 parent 644eaf8 commit dde65ed
Showing 3 changed files with 64 additions and 1 deletion.
63 changes: 63 additions & 0 deletions Report.md
@@ -0,0 +1,63 @@
# Project: Navigation

This project uses a Deep Q-Network (DQN) to train a reinforcement learning agent to collect yellow bananas
while avoiding blue bananas in a Unity ML-Agents environment.

## Learning algorithm

Q-Learning builds a Q-table that the agent uses to pick the best action for a given state. This becomes
impractical in environments with a large state space, because the table must cover every state. A Deep
Q-Network instead uses a neural network to approximate the Q-value of each action from the input state.
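
As a toy contrast with the network-based approach, below is a sketch of the tabular Q-learning update on a
hypothetical discrete problem (illustrative names and sizes only, not code from this repository):

```python
# Tabular Q-learning keeps one Q-value per (state, action) pair and nudges it
# toward the TD target after each transition. With the 37-dimensional
# continuous state of this environment such a table is infeasible, which is
# why a neural network approximates Q instead.
import numpy as np

n_states, n_actions = 10, 4        # toy discrete problem
alpha, gamma = 0.1, 0.99           # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=2, r=1.0, s_next=3)
```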

However, Deep Q-Learning has drawbacks. Reinforcement learning tends to be unstable or divergent when a
non-linear function approximator such as a neural network is used to represent Q. This instability comes
from the correlations present in the sequence of observations, the fact that small updates to Q may
significantly change the policy and the data distribution, and the correlations between Q and the target
values. [1]

To mitigate this, the solution uses experience replay, a biologically inspired technique that stores past
transitions and replays random samples of them during training. Sampling at random removes correlations in
the observation sequence and smooths changes in the data distribution.
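
A minimal sketch of such a replay buffer (illustrative only; the repository's actual implementation may
differ):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly at random."""

    def __init__(self, buffer_size=100_000, batch_size=64, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform random sampling breaks the temporal correlation of the
        # observation sequence before each gradient update.
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```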

## Model architecture and hyperparameters

* Fully connected layer 1: input 37 (state space), output 32, ReLU
* Fully connected layer 2: input 32, output 32, ReLU
* Fully connected layer 3: input 32, output 4 (action space)
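
A sketch of this network, assuming PyTorch (the weights.pth artifact suggests a PyTorch state dict); the
class and attribute names are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a 37-dimensional state to Q-values for the 4 discrete actions."""

    def __init__(self, state_size=37, action_size=4, hidden_size=32):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # raw Q-values, one per action
```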

The hyperparameters for tweaking and optimizing the learning algorithm were:

* max_t (750): maximum number of timesteps per episode
* eps_start (1.0): starting value of epsilon, for epsilon-greedy action selection
* eps_end (0.01): minimum value of epsilon
* eps_decay (0.9): multiplicative factor (per episode) for decreasing epsilon
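
To illustrate how quickly exploration decays under eps_decay = 0.9, here is a minimal schedule sketch (not
the repository's training loop):

```python
# Epsilon starts at eps_start and is multiplied by eps_decay after every
# episode, floored at eps_end.
eps_start, eps_end, eps_decay = 1.0, 0.01, 0.9

eps = eps_start
for episode in range(1, 11):
    print(f"episode {episode:2d}: eps = {eps:.4f}")
    eps = max(eps_end, eps_decay * eps)
```

With this schedule epsilon hits the eps_end floor of 0.01 after roughly 44 episodes, so exploration is
front-loaded into the early part of training.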

## Plot of rewards

Below is a training run of the above model architecture and hyperparameters:

```
Number of agents: 1
Number of actions: 4
Episode 100 Average Score: 3.97
Episode 200 Average Score: 9.51
Episode 287 Average Score: 13.12
Environment solved in 187 episodes! Average Score: 13.12
```

The plot of rewards for this run is as follows:

![Plot of rewards](https://raw.githubusercontent.com/aweeraman/deep-q-networks-navigation/master/graph.png)

## Ideas for future work

This architecture could be optimized further by training with different hyperparameters to obtain faster
and better learning outcomes. Two further approaches to try are:

* Double Q-Learning (see the sketch after this list)
* Delayed Q-Learning
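
A hedged sketch of the Double DQN target computation (hypothetical tensor shapes and network names, not
code from this repository): the online network selects the greedy next action while the target network
evaluates it, which reduces the overestimation bias of standard Q-learning targets.

```python
import torch

def double_dqn_targets(rewards, next_states, dones, gamma, qnetwork_local, qnetwork_target):
    # Action selection with the online (local) network ...
    best_actions = qnetwork_local(next_states).argmax(dim=1, keepdim=True)
    # ... but action evaluation with the target network.
    q_next = qnetwork_target(next_states).gather(1, best_actions)
    return rewards + gamma * q_next * (1 - dones)
```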

## Reference

[1] https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning
2 changes: 1 addition & 1 deletion bananas.py
@@ -9,7 +9,7 @@
import matplotlib.pyplot as plt

# https://github.com/udacity/deep-reinforcement-learning/blob/master/dqn/solution/Deep_Q_Network_Solution.ipynb
-def dqn(n_episodes=1000, max_t=750, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
+def dqn(n_episodes=1000, max_t=750, eps_start=1.0, eps_end=0.01, eps_decay=0.9):
"""Deep Q-Learning.
Params
Binary file modified weights.pth
Binary file not shown.
