Commit

Fix error of getting return value in multi-node training
acherstyx committed Apr 10, 2024
1 parent 3e0c0e8 commit 7b85491
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion hydra_plugins/hydra_torchrun_launcher/_core.py
@@ -18,6 +18,7 @@
 from hydra.types import TaskFunction
 from hydra.core.utils import (
     JobReturn,
+    JobStatus,
     configure_log,
     filter_overrides,
     run_job,
@@ -102,7 +103,10 @@ def launch(
 )

 # We assume that main process has rank 0
-ret.return_value = ret.return_value[0]
+# Return value from launch_agent with type Dict[int, Any], where the key is **global rank**.
+logger.debug("Return value: %s", ret.return_value)
+if 0 in ret.return_value:
+    ret.return_value = ret.return_value[0]
 runs.append(ret)
 configure_log(
     launcher.config.hydra.hydra_logging, launcher.config.hydra.verbose
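
The guard added in this hunk can be illustrated in isolation. The sketch below is not part of the commit: `pick_main_return_value` is a hypothetical helper, and the per-node dictionaries are made-up stand-ins for what `launch_agent` might return in a two-node job with two processes per node. It only assumes what the new comment states, namely that the mapping is keyed by global rank, so a node that does not host rank 0 has no key `0`.

```python
from typing import Any, Dict


def pick_main_return_value(per_rank: Dict[int, Any]) -> Any:
    """Hypothetical helper mirroring the guard added in this commit.

    Returns rank 0's value when this node hosts global rank 0;
    otherwise returns the per-rank mapping unchanged.
    """
    if 0 in per_rank:
        return per_rank[0]
    return per_rank


# Illustrative data only: node 0 hosts global ranks 0 and 1.
node0 = {0: "result", 1: "result"}
# Node 1 hosts global ranks 2 and 3, so key 0 is absent.
node1 = {2: "result", 3: "result"}

first = pick_main_return_value(node0)
second = pick_main_return_value(node1)

# The pre-fix behavior indexed [0] unconditionally, which fails on node 1.
try:
    node1[0]
    old_behavior_failed = False
except KeyError:
    old_behavior_failed = True
```

On a node without global rank 0, the return value stays a dictionary rather than a single result, so callers that consume it may still need to handle both shapes.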