[question] EvalCallback using MPI #1069
Hello,
yes, that is correct. I would recommend you to try to switch to SB3. Splitting the evaluation across the workers is possible but non-trivial and would require you to define a custom callback (cf. doc).
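For reference, a minimal sketch of such a custom callback, assuming stable-baselines >= 2.10 (where `BaseCallback` and `evaluate_policy` are available); the class name and hyper-parameters below are hypothetical, not from this thread:

```python
from stable_baselines.common.callbacks import BaseCallback
from stable_baselines.common.evaluation import evaluate_policy


class CustomEvalCallback(BaseCallback):
    """Evaluate the current policy every `eval_freq` calls to `_on_step`."""

    def __init__(self, eval_env, eval_freq=10000, n_eval_episodes=5, verbose=0):
        super(CustomEvalCallback, self).__init__(verbose)
        self.eval_env = eval_env
        self.eval_freq = eval_freq
        self.n_eval_episodes = n_eval_episodes

    def _on_step(self):
        if self.n_calls % self.eval_freq == 0:
            mean_reward, std_reward = evaluate_policy(
                self.model, self.eval_env, n_eval_episodes=self.n_eval_episodes
            )
            if self.verbose > 0:
                print("Eval mean reward: {:.2f} +/- {:.2f}".format(mean_reward, std_reward))
        return True  # returning False would stop training
```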
Hi @araffin - thanks for the reply! Does VecEnv parallelise the gradient computation, or just the env part? https://twitter.com/hardmaru/status/1260852988475658242 I've got PPO + MPI working really well on a multicore machine with a custom callback to handle the parallelisation of the evaluation. I'm also hesitant to switch to SB3 as it doesn't support Tensorflow, which is a shame. Thanks for your help!
Just the env part.
SB3 does not support MPI by default, but we would be happy to have an implementation of PPO with MPI in our contrib repo ;) See Stable-Baselines-Team/stable-baselines3-contrib#11
The decision to move to PyTorch and drop MPI (for the default install) was not arbitrary, see #733 and #366 ;)
@araffin has anything changed with regard to SB3 supporting MPI, or is it still not supported?
It is not (Stable-Baselines-Team/stable-baselines3-contrib#11, Stable-Baselines-Team/stable-baselines3-contrib#45), but contributions are welcome ;) However, with SB3 you can use multiple envs for evaluation, which provides a great speed-up.
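For illustration, a sketch of that suggestion in SB3; the env id, number of envs, frequencies and paths below are placeholder values, not from the thread:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

# Several evaluation envs run in parallel, so each evaluation pass is faster.
eval_env = make_vec_env("CartPole-v1", n_envs=4)

eval_callback = EvalCallback(
    eval_env,
    n_eval_episodes=20,           # spread over the 4 evaluation envs
    eval_freq=10_000,             # counted in steps of each training env
    best_model_save_path="./logs/",
)

train_env = make_vec_env("CartPole-v1", n_envs=8)
model = PPO("MlpPolicy", train_env)
model.learn(total_timesteps=100_000, callback=eval_callback)
```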
When running a training loop using MPI, the `EvalCallback` doesn't seem to make use of the parallelisation. For example, in this `train` function: https://github.com/hardmaru/slimevolleygym/blob/master/training_scripts/train_ppo_mpi.py
it seems that the `EvalCallback` will be called once per instance, after a combined total of `eval_freq` timesteps across all of the instances. This appears to be problematic if you want to use the callback to decide whether to save out a new best model, as there will be multiple attempts at calculating the average reward, and therefore the `best_model` file will potentially be overwritten several times on the same update. The best score will also be naturally inflated the more instances you have, since some of the reward calculations will come out slightly higher than average due to favourable random fluctuations. Is this correct?
If so, is there a way to instead split the `n_eval_episodes` across the workers and aggregate them into a single score?
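One possible way to do that split-and-aggregate, sketched with `mpi4py` on top of stable-baselines' `BaseCallback` and `evaluate_policy`; the class name, save path, and the assumption that all workers reach the evaluation step in lockstep are assumptions, not something confirmed in this thread:

```python
from mpi4py import MPI

from stable_baselines.common.callbacks import BaseCallback
from stable_baselines.common.evaluation import evaluate_policy


class MPIEvalCallback(BaseCallback):
    """Split `n_eval_episodes` over the MPI workers and aggregate one score."""

    def __init__(self, eval_env, eval_freq=10000, n_eval_episodes=100,
                 best_model_save_path="best_model", verbose=0):
        super(MPIEvalCallback, self).__init__(verbose)
        self.eval_env = eval_env
        self.eval_freq = eval_freq
        self.n_eval_episodes = n_eval_episodes
        self.best_model_save_path = best_model_save_path
        self.best_mean_reward = -float("inf")
        self.comm = MPI.COMM_WORLD

    def _on_step(self):
        # Assumes all workers step in lockstep (as with PPO1 + MPI), so every
        # rank reaches the collective calls below at the same time.
        if self.n_calls % self.eval_freq == 0:
            # Each worker only runs its share of the episodes.
            n_local = max(1, self.n_eval_episodes // self.comm.Get_size())
            episode_rewards, _ = evaluate_policy(
                self.model, self.eval_env,
                n_eval_episodes=n_local, return_episode_rewards=True,
            )
            # Aggregate into a single global mean reward across all workers.
            total_reward = self.comm.allreduce(sum(episode_rewards), op=MPI.SUM)
            total_episodes = self.comm.allreduce(len(episode_rewards), op=MPI.SUM)
            mean_reward = total_reward / total_episodes
            # Only rank 0 saves, so `best_model` is written at most once per evaluation.
            if self.comm.Get_rank() == 0 and mean_reward > self.best_mean_reward:
                self.best_mean_reward = mean_reward
                self.model.save(self.best_model_save_path)
        return True
```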