
Mean reward first rises and then falls during the training process #8

Closed
meng-zha opened this issue Jul 17, 2024 · 2 comments

@meng-zha

Can you provide an example training log for Go2?
It's odd that the mean reward first rises and then falls during training.
All settings follow the repository defaults.
[Screenshots attached: 2024-07-17 14-48-54 and 2024-07-17 14-49-07]

@lupinjia commented Jul 17, 2024

I encountered the same situation before. By inspecting the simulation visualization, I found that the decline in mean_reward is mainly due to the decline in mean_episode_length. After training for several iterations, many environments terminate shortly after the episode starts. As short episodes become more frequent, mean_episode_length declines; since mean_reward is calculated by dividing reward_sum by max_episode_length, shorter episodes yield a smaller reward_sum and therefore a lower mean_reward.
You can check and refine your termination condition to possibly avoid this; a sketch of such a check follows. You could also increase the information the agent receives, since the original version is, in my view, a POMDP setting.

@meng-zha (Author)

Thanks for your kind advice. I retried with a different random seed and got a normal result this time.
I will also look into the termination condition, as you advised.
