
Optuna resume workflow overhaul #515

Open
elcorto opened this issue May 30, 2024 · 0 comments

Motivated by #497, here is a summary of the current resume workflow, along with questions and thoughts for improvement:

MALA's parallel Optuna usage follows the Optuna docs: start the optimization script N times and let Optuna synchronize the processes via a shared database -- no MPI or similar is needed. So we would execute e.g. examples/basic/ex04_hyperparameter_optimization.py or examples/advanced/ex05_checkpoint_hyperparameter_optimization.py N times in parallel. On HPC, we can do this in a batch job such as the one below with N=12, where mpirun is used only to distribute the Python processes across nodes:

#!/bin/bash
#SBATCH -N 3
#SBATCH --ntasks-per-node=4
#SBATCH --time=48:00:00
#SBATCH --job-name=HO
#SBATCH --gres=gpu:4
#SBATCH --mem=360G
#SBATCH -p some_ququw

module load ...
mpirun -np 12 python3 -u hyperopt01.py

Each of the 12 processes runs its own trials, on one GPU each. The batch job gets killed after the designated runtime, and any unfinished trials remain in the Optuna database in state RUNNING.

The current workflow for resuming the study, which uses MALA's own resume tooling (see examples/advanced/ex05_checkpoint_hyperparameter_optimization.py), is this: before submitting the batch job again and letting the script do the resume work, the user must modify the database like so:

python3 -c "import mala; mala.HyperOptOptuna.requeue_zombie_trials('hyperopt01', 'sqlite:///hyperopt.db')"

which sets the RUNNING trials to state WAITING (more on that in #497).

If this works, the assumption is that when Optuna resumes, it will pick up and re-run those trials before carrying on with the rest of the resumed study.

Questions:

  • Does "injecting" jobs like this disturb Optuna's operation in any way? In particular, since MALA pickle-dumps and re-loads the Optuna study object, is there state in that object and in the database that must be kept in sync, i.e. do those trials have to be re-run? If not, one could probably also delete them from the database (if that is the only source of truth for the study's state before resuming). In that case the study would have missing data points from trials that were suggested for a reason, so even if Optuna resumed fine, we may still want to re-run them from an optimization point of view.
  • Optuna has some resume functionality of its own. Is this used in MALA? If not, could it help here?