Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix config of user facing execution parameters in spawning elastic tasks #1677

Merged
merged 1 commit into from
Jun 26, 2023

Conversation

fg91
Copy link
Member

@fg91 fg91 commented Jun 6, 2023

TL;DR

When using @task(task_config=flytekitplugins.kfpytorch.Elastic()), the task function is started in a number of worker processes using torch elastic_launch (torchrun). The processes can be created using fork or spawn which is controlled by the arg Elastic(start_method=...).

When using fork, the child process inherits a copy of the parent process' stack including the flyte context and the user facing execution parameters ctx = flytekit.current_context().

When spawning, however, fresh processes are started and the flyte context and the execution parameters are not transferred to the child process currently. This means that within a task with @task(task_config=Elastic(start_method="spawn")) the execution id and the checkpoint cannot be accessed from the execution parameters.

This PR fixes this by setting up the flyte context in the spawned worker processes.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

In the spawned worker processes I call flytekit.bin.entrypoint.setup_execution which sets up the flyte context the same way as when a normal python task is started. Raw data prefix and checkpoint pathes are transferred from the parent process.

Tracking Issue

NA

Follow-up issue

NA

@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 2cc8283 to a1e0a8e Compare June 6, 2023 09:40
@codecov
Copy link

codecov bot commented Jun 6, 2023

Codecov Report

Merging #1677 (1263dab) into master (3370a96) will decrease coverage by 0.03%.
The diff coverage is n/a.

❗ Current head 1263dab differs from pull request most recent head d71c3bb. Consider uploading reports for the commit d71c3bb to get more accurate results

@@            Coverage Diff             @@
##           master    #1677      +/-   ##
==========================================
- Coverage   71.03%   71.00%   -0.03%     
==========================================
  Files         336      336              
  Lines       30798    30781      -17     
  Branches     5589     5576      -13     
==========================================
- Hits        21876    21855      -21     
- Misses       8375     8379       +4     
  Partials      547      547              

see 15 files with indirect coverage changes

Comment on lines +74 to +76
("spawn", "", False),
("spawn", "f12345678", True),
("fork", "local", False),
Copy link
Member Author

@fg91 fg91 Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When spawning, the execution_id.name, .project, .domain, ... are set to the default value "" here when the FLYTE_INTERNAL_EXECUTION_ID, ... env vars are not set, i.e. during a local execution.

When executing a workflow/task locally, these execution identifiers are normally set to "local" which happens here. Since the parent processes stack is copied during forking, "local" is set when using this start method.

Accepting this difference between forking and spawning in a local execution might be a pragmatic compromise but is something that gives me a bit of grief.

If we want to remove this difference, I see two options for doing so.

  • Not set "" as the default value for execution id name, project, domain, ... in flytekit.bin.entrypoint.setup_execution. Would this have any undesired effect?
  • Maintain an adapted copy of setup_execution here, which would, however, lead to quite some code duplication which wouldn't be nice either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will defer to @eapolinario @pingsutw on this point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine. We don't use project, domain, and name in the local execution, right?

@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from a1e0a8e to 56d91f5 Compare June 6, 2023 13:51
@fg91 fg91 marked this pull request as ready for review June 6, 2023 14:30
@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 56d91f5 to 1263dab Compare June 7, 2023 10:01
Comment on lines +74 to +76
("spawn", "", False),
("spawn", "f12345678", True),
("fork", "local", False),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will defer to @eapolinario @pingsutw on this point

Comment on lines +74 to +76
("spawn", "", False),
("spawn", "f12345678", True),
("fork", "local", False),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine. We don't use project, domain, and name in the local execution, right?

@fg91 fg91 force-pushed the fabio/fix/kfpytorch-elastic-execution-params branch from 1263dab to d71c3bb Compare June 21, 2023 17:10
@pingsutw pingsutw merged commit d7bfa6e into master Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants