Motivated by #497, here is a summary of the current resume workflow, along with questions and thoughts for improvement:
MALA parallel Optuna usage is supposed to be as described in the Optuna docs -- just start the optimization script N times and let Optuna sync the processes via a shared database, no MPI or anything. So we'd execute e.g. examples/basic/ex04_hyperparameter_optimization.py or examples/advanced/ex05_checkpoint_hyperparameter_optimization.py N times in parallel. On HPC, we can do so in a batch job, e.g. with N=12, where we use mpirun just to distribute the Python processes across the nodes.
Each of the 12 processes will run its own trials, on one GPU each. The batch job will get killed after the designated runtime. Then unfinished trials will remain in the Optuna database in state RUNNING.
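For context, the bare shared-storage pattern from the Optuna docs that each of the N processes effectively runs looks roughly like this (a minimal sketch; the objective is a placeholder and MALA's HyperOptOptuna wrapping is omitted, but the study name and database URL match the command further below):

```python
import optuna

def objective(trial):
    # placeholder objective; in MALA this would be training/validating a network
    x = trial.suggest_float("x", -10, 10)
    return x**2

# every process points at the same study in the same shared database;
# Optuna coordinates the processes purely through this storage
study = optuna.create_study(
    study_name="hyperopt01",
    storage="sqlite:///hyperopt.db",
    load_if_exists=True,
)
study.optimize(objective, n_trials=100)
```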
The current workflow for resuming the study, which makes use of MALA's own resume tooling (see examples/advanced/ex05_checkpoint_hyperparameter_optimization.py), is this: before submitting the batch job again and letting the script do the resume work, a user needs to modify the database like so:

python3 -c "import mala; mala.HyperOptOptuna.requeue_zombie_trials('hyperopt01', 'sqlite:///hyperopt.db')"

which will set the RUNNING trials to state WAITING (more on that in #497).
If this works, the assumption is that when the study is resumed, Optuna will pick up and re-run those WAITING trials before carrying on with the rest of the study.
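For illustration, a similar re-queue effect can be had with Optuna's public API alone by re-enqueuing the parameters of the stuck trials as fresh WAITING trials. This is only a sketch of the idea, not a claim about what requeue_zombie_trials does internally (which apparently flips the state of the existing rows instead of adding new ones):

```python
import optuna
from optuna.trial import TrialState

study = optuna.load_study(study_name="hyperopt01", storage="sqlite:///hyperopt.db")

# trials left behind in state RUNNING by the killed batch job
zombies = study.get_trials(deepcopy=False, states=(TrialState.RUNNING,))

# re-enqueue their parameters as new WAITING trials, which the resumed
# workers pick up before asking the sampler for fresh suggestions
for t in zombies:
    study.enqueue_trial(t.params)
```

Note that this leaves the old RUNNING rows in place, so it is not a drop-in replacement for the command above.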
Questions:
Does "injecting" jobs like this disturb Optuna's operation in any way? In particular, since MALA pickle-dumps and re-loads the Optuna study object, is there state in that object and the database that must be in sync, i.e. do those trials have to be run? If not, then one could probably also delete them from the database (if that is the only source of truth for the study's state before we resume). In this case, the study will have missing data points from trials that have been suggested for a reason, so even if Optuna would resume fine, we may still want to re-run them from an optimization point of view.
Optuna has some resume functionality of its own -- is this used in MALA? If not, could it help here?
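If that refers to Optuna's built-in fault tolerance: recent Optuna versions can detect trials whose process died via a storage heartbeat, mark them FAILED and retry them automatically, which might make the manual requeue step unnecessary. A hedged sketch (the interval values are illustrative, and whether this interacts cleanly with MALA's pickled study object and checkpointing would need checking):

```python
import optuna
from optuna.storages import RDBStorage, RetryFailedTrialCallback

storage = RDBStorage(
    url="sqlite:///hyperopt.db",
    heartbeat_interval=60,  # seconds between heartbeats written by running trials
    grace_period=120,       # trials silent for longer than this are marked FAILED
    failed_trial_callback=RetryFailedTrialCallback(max_retry=3),
)

study = optuna.create_study(
    study_name="hyperopt01", storage=storage, load_if_exists=True
)
# study.optimize(...) as usual; crashed trials are then retried automatically
```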