Updated doc
aweeraman committed Feb 5, 2019
1 parent 5ae6530 commit 88d5f05
Showing 5 changed files with 62 additions and 92 deletions.
91 changes: 62 additions & 29 deletions README.md
@@ -1,4 +1,4 @@
# Udacity Deep Reinforcement Learning Nanodegree Project 1: Navigation
# Project: Navigation

This is a project that uses Deep Q-Networks to train an agent to capture yellow bananas and avoid
blue bananas through deep reinforcement learning in a Unity ML-Agents environment.
@@ -72,33 +72,6 @@
To customize hyperparameters and train the agent, execute the following:

```
$ python bananas.py --train
Mono path[0] = '/Users/anuradha/ninsei/udacity/bananas/Banana.app/Contents/Resources/Data/Managed'
Mono config path = '/Users/anuradha/ninsei/udacity/bananas/Banana.app/Contents/MonoBleedingEdge/etc'
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
Number of Brains: 1
Number of External Brains : 1
Lesson number : 0
Reset Parameters :
Unity brain name: BananaBrain
Number of Visual Observations (per agent): 0
Vector Observation space type: continuous
Vector Observation space size (per agent): 37
Number of stacked Vector Observation: 1
Vector Action space type: discrete
Vector Action space size (per agent): 4
Vector Action descriptions: , , ,
Number of agents: 1
Number of actions: 4
Episode 100 Average Score: 0.785
Episode 200 Average Score: 4.03
Episode 300 Average Score: 7.21
Episode 400 Average Score: 9.00
Episode 500 Average Score: 11.44
Episode 574 Average Score: 13.02
Environment solved in 474 episodes! Average Score: 13.02
```

# Environment details
@@ -118,7 +91,62 @@
The action space for the agent consists of the following four possible actions:

To solve the environment, the agent must achieve an average score of +13 or more over 100 consecutive episodes.

# Troubleshooting Tips
## Learning algorithm

Q-Learning is an approach that builds a Q-table, which the agent uses to determine the best action
for a given state. This technique becomes impractical and inefficient in environments with a large
state space. Deep Q-Networks, on the other hand, use a neural network to approximate the Q-value of
each action for a given input state.
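
For contrast, here is a minimal sketch of the tabular Q-learning update described above. It is illustrative only: the state/action counts and the alpha and gamma values are assumptions, not taken from this project (whose state space is continuous and therefore not suited to a table).

```
import numpy as np

n_states, n_actions = 16, 4          # assumed sizes, for illustration only
alpha, gamma = 0.1, 0.99             # assumed learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # the Q-table

def q_learning_update(state, action, reward, next_state, done):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    target = reward + (0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
```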

However, Deep Q-Learning has drawbacks. A common issue is that reinforcement learning tends
to be unstable or divergent when a non-linear function approximator such as a neural network is used to
represent Q. This instability comes from the correlations present in the sequence of observations, the fact
that small updates to Q may significantly change the policy and the data distribution, and the correlations
between Q and the target values. [1]

To mitigate this, the solution uses experience replay, a biologically inspired technique that replays a
random sample of prior experiences at each update, which removes correlations in the observation sequence
and smooths changes in the data distribution.
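
As an illustration, here is a minimal sketch of a uniform replay buffer. The class name, buffer size, and batch size are assumptions for illustration; the actual implementation lives in the project code.

```
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that stores experiences and samples them uniformly at random."""

    def __init__(self, buffer_size=100_000, batch_size=64):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are discarded automatically
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform random sampling breaks the temporal correlation between consecutive steps.
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```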

## Model architecture and hyperparameters

* Fully connected layer 1: Input 37 (state space), Output 32, ReLU activation
* Fully connected layer 2: Input 32, Output 32, ReLU activation
* Fully connected layer 3: Input 32, Output 4 (action space)
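
A minimal PyTorch sketch of this architecture follows. The layer sizes are taken from the list above; the class and attribute names are assumptions and not necessarily those used in the project code.

```
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a 37-dimensional state to Q-values for each of the 4 actions."""

    def __init__(self, state_size=37, action_size=4, hidden_size=32):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # raw Q-values, no activation on the output layer
```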

The hyperparameters for tweaking and optimizing the learning algorithm were:

* max_t (750): maximum number of timesteps per episode
* eps_start (1.0): starting value of epsilon, for epsilon-greedy action selection
* eps_end (0.01): minimum value of epsilon
* eps_decay (0.9): multiplicative factor (per episode) for decreasing epsilon
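
For illustration, here is a minimal sketch of how these values drive epsilon-greedy action selection during training. The function and variable names are assumptions; the project's actual training loop may differ.

```
import random
import numpy as np

def select_action(q_values, eps):
    """Epsilon-greedy: explore with probability eps, otherwise exploit the best Q-value."""
    if random.random() < eps:
        return random.randrange(len(q_values))  # random action (exploration)
    return int(np.argmax(q_values))             # greedy action (exploitation)

eps, eps_end, eps_decay = 1.0, 0.01, 0.9        # values from the list above
for episode in range(1, 11):
    # ... run one episode of at most max_t steps, calling select_action(q_values, eps) ...
    eps = max(eps_end, eps_decay * eps)         # decay epsilon once per episode
```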

## Plot of rewards

Below is a training run of the above model architecture and hyperparameters:

```
Number of agents: 1
Number of actions: 4
Episode 100 Average Score: 3.97
Episode 200 Average Score: 9.51
Episode 287 Average Score: 13.12
Environment solved in 187 episodes! Average Score: 13.12
```

The plot of rewards for this run is as follows:

![Plot of rewards](https://raw.githubusercontent.com/aweeraman/deep-q-networks-navigation/master/images/plot_of_rewards.png)

## Future work

This architecture could be optimized further by training with different hyperparameters to achieve
faster and better learning. A couple of additional approaches to try out are listed below, followed by
a sketch of the Double DQN target computation:

* Double Q-Learning
* Delayed Q-Learning
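
As a starting point, here is a hedged sketch of the Double Q-Learning (Double DQN) target, which selects the next action with the online network but evaluates it with the target network. The function name and the assumption that rewards and dones are column tensors are illustrative; this is not the project's current update rule.

```
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute Double DQN targets: action selection and evaluation use different networks."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)  # select with online net
        next_q = target_net(next_states).gather(1, best_actions)            # evaluate with target net
        return rewards + gamma * next_q * (1 - dones)                       # zero out terminal states
```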

## Troubleshooting Tips

If you run into an error such as the following when training the agent:

@@ -131,3 +159,8 @@
Modify ~/.matplotlib/matplotlibrc and add the following line:
```
backend: TkAgg
```

## Reference

[1] Wikipedia, Q-learning (Deep Q-learning section): https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning

63 changes: 0 additions & 63 deletions Report.md

This file was deleted.

Binary file removed graph_09.png
Binary file removed graph_099.png
File renamed without changes
