
Optuna resume workflow overhaul #515

Open
elcorto opened this issue May 30, 2024 · 0 comments

Motivated by #497, here is a summary of the current resume workflow, along with questions and thoughts for improvement:

MALA's parallel Optuna usage follows the Optuna docs: start the optimization script N times and let Optuna synchronize the processes via a shared database -- no MPI or similar is needed. So we would execute e.g. examples/basic/ex04_hyperparameter_optimization.py or examples/advanced/ex05_checkpoint_hyperparameter_optimization.py N times in parallel. On HPC, we can do this in a batch job such as the one below with N=12, where mpirun is used only to distribute the Python processes across nodes:

#!/bin/bash
#SBATCH -N 3
#SBATCH --ntasks-per-node=4
#SBATCH --time=48:00:00
#SBATCH --job-name=HO
#SBATCH --gres=gpu:4
#SBATCH --mem=360G
#SBATCH -p some_ququw

module load ...
mpirun -np 12 python3 -u hyperopt01.py

Each of the 12 processes runs its own trials, on one GPU each. The batch job gets killed after the designated runtime, and any unfinished trials remain in the Optuna database in state RUNNING.

The current workflow for resuming the study, which uses MALA's own resume tooling (see examples/advanced/ex05_checkpoint_hyperparameter_optimization.py), is this: before submitting the batch job again and letting the script do the resume work, the user must modify the database like so:

python3 -c "import mala; mala.HyperOptOptuna.requeue_zombie_trials('hyperopt01', 'sqlite:///hyperopt.db')"

which sets the RUNNING trials to state WAITING (more on that in #497).

If this works, the assumption is that when Optuna resumes, it will pick up and re-run those trials before carrying on with the rest of the resumed study.

Questions:

  • Does "injecting" jobs like this disturb Optuna's operation in any way? In particular, since MALA pickle-dumps and re-loads the Optuna study object, is there state in that object and in the database that must be kept in sync, i.e. do those trials have to be re-run? If not, one could probably also delete them from the database (if that is the only source of truth for the study's state before resuming). In that case the study would have missing data points from trials that were suggested for a reason, so even if Optuna resumed fine, we may still want to re-run them from an optimization point of view.
  • Optuna has some resume functionality of its own. Is this used in MALA? If not, could it help here?