# Udacity Deep Reinforcement Learning Nanodegree Project: Navigation

This project uses a Deep Q-Network (DQN) to train an agent to collect yellow bananas while avoiding blue bananas in a Unity ML-Agents environment.

The steps below describe how to get this running on Linux/macOS:

## 1. Clone the repo

```
$ git clone https://github.com/aweeraman/deep-q-networks-navigation
```

## 2. Install Python & dependencies

Using the Anaconda distribution, create a new Python environment and install the required dependencies:

```
$ conda create -n dqn python=3.6
$ source activate dqn
$ pip install -r requirements.txt
```

## 3. Install the Unity Environment

Download a pre-built environment to run the agent; you do not need to install Unity for this. The environment is OS-specific, so download the version that matches your operating system.

For macOS, [use this link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana.app.zip)

After uncompressing, there should be a directory called "Banana.app" in the root directory of the repository.
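
To confirm the environment is unpacked in the right place, you can try loading it from Python before moving on. This is a minimal sanity check, assuming the default `Banana.app` path that `bananas.py` uses:

```
# Quick check that the Unity environment loads (run from the repository root)
from unityagents import UnityEnvironment

env = UnityEnvironment(file_name="Banana.app")
print("Default brain:", env.brain_names[0])  # expect: BananaBrain
env.close()
```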

## 4. Run the agent

To run the pre-trained agent, execute the following:

```
$ python bananas.py --run
Mono path[0] = '/Users/anuradha/ninsei/udacity/bananas/Banana.app/Contents/Resources/Data/Managed'
Mono config path = '/Users/anuradha/ninsei/udacity/bananas/Banana.app/Contents/MonoBleedingEdge/etc'
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
Number of Brains: 1
Number of External Brains : 1
Lesson number : 0
Reset Parameters :
Unity brain name: BananaBrain
Number of Visual Observations (per agent): 0
Vector Observation space type: continuous
Vector Observation space size (per agent): 37
Number of stacked Vector Observation: 1
Vector Action space type: discrete
Vector Action space size (per agent): 4
Vector Action descriptions: , , ,
Number of agents: 1
Number of actions: 4
States look like: [1. 0. 0. 0. 0.84408134 0.
0. 1. 0. 0.0748472 0. 1.
0. 0. 0.25755 1. 0. 0.
0. 0.74177343 0. 1. 0. 0.
0.25854847 0. 0. 1. 0. 0.09355672
0. 1. 0. 0. 0.31969345 0.
0. ]
States have length: 37
Score: 15.0
```
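
For reference, the `--run` path in `bananas.py` boils down to loading the saved weights and rolling out one greedy episode. A condensed sketch of the script's own logic (not a standalone program; `env`, `brain_name`, and `Agent` are set up earlier in the file):

```
agent = Agent(state_size=37, action_size=4, seed=0)
agent.qnetwork_local.load_state_dict(torch.load('weights.pth'))

env_info = env.reset(train_mode=False)[brain_name]
state = env_info.vector_observations[0]
score, done = 0, False
while not done:
    action = agent.act(state)                   # eps defaults to 0, so this is greedy
    env_info = env.step(action)[brain_name]
    score += env_info.rewards[0]
    state = env_info.vector_observations[0]
    done = env_info.local_done[0]
print("Score: {}".format(score))
```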

To train the agent, execute the following (hyperparameters can be customized in `bananas.py`; see the sketch after the output below):

```
$ python bananas.py --train
Mono path[0] = '/Users/anuradha/ninsei/udacity/bananas/Banana.app/Contents/Resources/Data/Managed'
Mono config path = '/Users/anuradha/ninsei/udacity/bananas/Banana.app/Contents/MonoBleedingEdge/etc'
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
Number of Brains: 1
Number of External Brains : 1
Lesson number : 0
Reset Parameters :
Unity brain name: BananaBrain
Number of Visual Observations (per agent): 0
Vector Observation space type: continuous
Vector Observation space size (per agent): 37
Number of stacked Vector Observation: 1
Vector Action space type: discrete
Vector Action space size (per agent): 4
Vector Action descriptions: , , ,
Number of agents: 1
Number of actions: 4
Episode 100 Average Score: 0.785
Episode 200 Average Score: 4.03
Episode 300 Average Score: 7.21
Episode 400 Average Score: 9.00
Episode 500 Average Score: 11.44
Episode 600 Average Score: 13.50
Episode 700 Average Score: 15.07
Episode 800 Average Score: 15.16
Episode 900 Average Score: 15.90
Episode 1000 Average Score: 16.83
```
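
Training is driven by the `dqn()` function in `bananas.py`, whose keyword arguments control the episode budget and the epsilon-greedy schedule. A minimal sketch of customizing them (the values below are illustrative, not tuned):

```
# In bananas.py, replace the default call `scores = dqn()` with, for example:
scores = dqn(n_episodes=2000,   # train for more episodes
             max_t=1000,        # max timesteps per episode
             eps_start=1.0,     # initial exploration rate
             eps_end=0.01,      # floor for epsilon
             eps_decay=0.99)    # decay epsilon faster than the default 0.995
```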

## Troubleshooting

If you run into an error such as the following when training the agent:

```
ImportError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are using (Ana)Conda please install python.app and replace the use of 'python' with 'pythonw'. See 'Working with Matplotlib on OSX' in the Matplotlib FAQ for more information.
```

Modify `~/.matplotlib/matplotlibrc` and add the following line:

```
backend: TkAgg
```
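
Alternatively, the matplotlib backend can be selected in code before `pyplot` is imported; this is standard matplotlib behavior rather than anything specific to this project:

```
import matplotlib
matplotlib.use("TkAgg")          # must be called before importing matplotlib.pyplot
import matplotlib.pyplot as plt
```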

**bananas.py**

from unityagents import UnityEnvironment
import numpy as np
import torch
import random
import argparse
from dqn_agent import Agent
from collections import deque
import matplotlib.pyplot as plt


# Adapted from the Udacity DQN solution:
# https://github.com/udacity/deep-reinforcement-learning/blob/master/dqn/solution/Deep_Q_Network_Solution.ipynb
def dqn(n_episodes=1000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.

    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            env_info = env.step(action)[brain_name]
            reward = env_info.rewards[0]
            next_state = env_info.vector_observations[0]
            done = env_info.local_done[0]
            score += reward
            agent.step(state, action, reward, next_state, done)
            state = next_state
            if done:
                break
        scores_window.append(score)        # save most recent score
        scores.append(score)               # save most recent score
        eps = max(eps_end, eps_decay*eps)  # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'weights.pth')
        if np.mean(scores_window) >= 13.0:  # the Banana environment is considered solved at +13 over 100 episodes
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'weights.pth')
            break
    return scores


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Train an agent to navigate a large world and collect yellow bananas, while avoiding blue bananas, using Deep Q-Networks')
    parser.add_argument('--train', help='Train the agent', action='store_true')
    parser.add_argument('--run', help='Run the agent', action='store_true')

    env = UnityEnvironment(file_name="Banana.app")

    # get the default brain
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]

    # reset the environment
    env_info = env.reset(train_mode=True)[brain_name]

    # number of agents in the environment
    print('Number of agents:', len(env_info.agents))

    # number of actions
    action_size = brain.vector_action_space_size
    print('Number of actions:', action_size)

    agent = Agent(state_size=37, action_size=4, seed=0)

    args = parser.parse_args()

    if args.train:
        scores = dqn()

        # plot the scores
        fig = plt.figure()
        ax = fig.add_subplot(111)
        plt.plot(np.arange(len(scores)), scores)
        plt.ylabel('Score')
        plt.xlabel('Episode #')
        plt.show()

    elif args.run:
        # load the trained weights into the local Q-network
        agent.qnetwork_local.load_state_dict(torch.load('weights.pth'))

        state = env_info.vector_observations[0]
        print('States look like:', state)
        state_size = len(state)
        print('States have length:', state_size)

        # roll out one episode with a greedy policy (eps defaults to 0 in agent.act)
        env_info = env.reset(train_mode=False)[brain_name]
        state = env_info.vector_observations[0]
        score = 0
        while True:
            action = agent.act(state)
            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]
            score += reward
            state = next_state
            if done:
                break

        print("Score: {}".format(score))
        env.close()

    else:
        parser.print_help()

**dqn_agent.py**

import numpy as np
import random
from collections import namedtuple, deque

from model import QNetwork

import torch
import torch.nn.functional as F
import torch.optim as optim

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate
UPDATE_EVERY = 4        # how often to update the network

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class Agent():
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, seed):
        """Initialize an Agent object.

        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)

        # Q-Network
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory) > BATCH_SIZE:
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, state, eps=0.):
        """Returns actions for given state as per current policy.

        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from target model
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from local model
        Q_expected = self.qnetwork_local(states).gather(1, actions)

        # Compute loss
        loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local, self.qnetwork_target, TAU)

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)


class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
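
# Example usage (for reference only; this mirrors how bananas.py drives the agent):
#
#     agent = Agent(state_size=37, action_size=4, seed=0)
#     action = agent.act(state, eps=0.1)                   # epsilon-greedy action for a 37-dim state
#     agent.step(state, action, reward, next_state, done)  # store experience and learn every UPDATE_EVERY steps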