Fix config of user facing execution parameters in spawning elastic tasks #1677

fg91 · 2023-06-06T09:28:14Z

TL;DR

When using @task(task_config=flytekitplugins.kfpytorch.Elastic()), the task function is started in a number of worker processes using torch elastic_launch (torchrun). The processes can be created using fork or spawn which is controlled by the arg Elastic(start_method=...).

When using fork, the child process inherits a copy of the parent process's memory, including the flyte context and the user-facing execution parameters (ctx = flytekit.current_context()).

When spawning, however, fresh processes are started and the flyte context and the execution parameters are not transferred to the child process currently. This means that within a task with @task(task_config=Elastic(start_method="spawn")) the execution id and the checkpoint cannot be accessed from the execution parameters.

This PR fixes this by setting up the flyte context in the spawned worker processes.

Type

Bug Fix
Feature
Plugin

Are all requirements met?

Complete description

In the spawned worker processes I call flytekit.bin.entrypoint.setup_execution, which sets up the flyte context the same way as when a normal Python task is started. The raw data prefix and checkpoint paths are transferred from the parent process.
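
To illustrate why the values have to be transferred explicitly, here is a minimal stdlib-only sketch. It is not the plugin's code: it launches a fresh interpreter with subprocess as a stand-in for a spawned worker, and the names PARENT_STATE, CHILD_CODE and run_fresh_worker, as well as the example paths, are hypothetical. The fresh process sees none of the parent's in-memory state and ends up only with what is handed to it.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for state that exists only in the parent's memory,
# analogous to the raw data prefix and checkpoint paths in the flyte context.
PARENT_STATE = {"raw_output_prefix": "/tmp/raw", "checkpoint_path": "/tmp/ckpt"}

# Code executed by the fresh interpreter. It cannot see PARENT_STATE and
# rebuilds its own state from the JSON payload passed on the command line.
CHILD_CODE = """
import json, sys
payload = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
print(json.dumps(payload))
"""


def run_fresh_worker(transferred=None):
    """Start a fresh interpreter and return the state it ends up with."""
    args = [sys.executable, "-c", CHILD_CODE]
    if transferred is not None:
        # Explicit transfer: serialize the parent's values for the child.
        args.append(json.dumps(transferred))
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(run_fresh_worker())              # nothing transferred
    print(run_fresh_worker(PARENT_STATE))  # values handed over explicitly
```

With fork, no such hand-over is needed because the child starts from a copy of the parent's memory.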

Tracking Issue

NA

Follow-up issue

NA

codecov · 2023-06-06T09:40:57Z

Codecov Report

Merging #1677 (1263dab) into master (3370a96) will decrease coverage by 0.03%.
The diff coverage is n/a.

❗ Current head 1263dab differs from pull request most recent head d71c3bb. Consider uploading reports for the commit d71c3bb to get more accurate results

@@            Coverage Diff             @@
##           master    #1677      +/-   ##
==========================================
- Coverage   71.03%   71.00%   -0.03%     
==========================================
  Files         336      336              
  Lines       30798    30781      -17     
  Branches     5589     5576      -13     
==========================================
- Hits        21876    21855      -21     
- Misses       8375     8379       +4     
  Partials      547      547

see 15 files with indirect coverage changes

fg91 · 2023-06-06T10:14:25Z

plugins/flytekit-kf-pytorch/tests/test_elastic_task.py

+        ("spawn", "", False),
+        ("spawn", "f12345678", True),
+        ("fork", "local", False),


When spawning, the execution_id.name, .project, .domain, ... are set to the default value "" here when the FLYTE_INTERNAL_EXECUTION_ID, ... env vars are not set, i.e. during a local execution.

When executing a workflow/task locally, these execution identifiers are normally set to "local", which happens here. Since the parent process's stack is copied during forking, "local" is set when using this start method.
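
The difference can be summarised with a simplified model. This is only a sketch and does not call flytekit: the functions parent_execution_name and worker_execution_name are hypothetical, and the assumption that the parent falls back to "local" exactly when FLYTE_INTERNAL_EXECUTION_ID is unset is a simplification of the behaviour described above.

```python
ENV_VAR = "FLYTE_INTERNAL_EXECUTION_ID"


def parent_execution_name(env):
    # Simplified assumption: a local execution (env var unset) uses "local".
    return env.get(ENV_VAR, "local")


def worker_execution_name(start_method, env):
    """Execution name visible in a worker process under this simplified model."""
    if start_method == "fork":
        # The forked worker inherits whatever the parent already set.
        return parent_execution_name(env)
    # The spawned worker rebuilds the context and defaults to "".
    return env.get(ENV_VAR, "")


if __name__ == "__main__":
    print(repr(worker_execution_name("spawn", {})))
    print(repr(worker_execution_name("spawn", {ENV_VAR: "f12345678"})))
    print(repr(worker_execution_name("fork", {})))
```

The three calls mirror what the parametrized test cases in the diff above appear to encode.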

Accepting this difference between forking and spawning in a local execution might be a pragmatic compromise but is something that gives me a bit of grief.

If we want to remove this difference, I see two options for doing so.

1. Do not set "" as the default value for execution id name, project, domain, ... in flytekit.bin.entrypoint.setup_execution. Would this have any undesired effect?

2. Maintain an adapted copy of setup_execution here. This would, however, lead to quite some code duplication, which wouldn't be nice either.

cosmicBboy · 2023-06-20T13:27:14Z

will defer to @eapolinario @pingsutw on this point

pingsutw · 2023-06-20T21:00:50Z

I think that's fine. We don't use project, domain, and name in the local execution, right?

Signed-off-by: Fabio Grätz <[email protected]>

fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 2cc8283 to a1e0a8e Compare June 6, 2023 09:40

fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from a1e0a8e to 56d91f5 Compare June 6, 2023 13:51

fg91 marked this pull request as ready for review June 6, 2023 14:30

fg91 requested review from wild-endeavor, kumare3, eapolinario, pingsutw and cosmicBboy as code owners June 6, 2023 14:30

fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 56d91f5 to 1263dab Compare June 7, 2023 10:01

cosmicBboy approved these changes Jun 20, 2023

pingsutw approved these changes Jun 20, 2023

Fix config of user facing execution parameters in spawning elastic tasks

d71c3bb

Signed-off-by: Fabio Grätz <[email protected]>

fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 1263dab to d71c3bb Compare June 21, 2023 17:10

pingsutw approved these changes Jun 23, 2023

pingsutw merged commit d7bfa6e into master Jun 26, 2023

fg91 mentioned this pull request Jul 3, 2024

Add myself to code owners of flytekit-kf-pytorch #2556

Merged

3 tasks
