Fix config of user facing execution parameters in spawning elastic tasks #1677
Conversation
Force-pushed from 2cc8283 to a1e0a8e.
Codecov Report
| | master | #1677 | +/- |
| --- | --- | --- | --- |
| Coverage | 71.03% | 71.00% | -0.03% |
| Files | 336 | 336 | |
| Lines | 30798 | 30781 | -17 |
| Branches | 5589 | 5576 | -13 |
| Hits | 21876 | 21855 | -21 |
| Misses | 8375 | 8379 | +4 |
| Partials | 547 | 547 | |
("spawn", "", False), | ||
("spawn", "f12345678", True), | ||
("fork", "local", False), |
When spawning, `execution_id.name`, `.project`, `.domain`, ... are set to the default value `""` here when the `FLYTE_INTERNAL_EXECUTION_ID`, ... env vars are not set, i.e. during a local execution.
When executing a workflow/task locally, these execution identifiers are normally set to `"local"`, which happens here. Since the parent process's stack is copied during forking, `"local"` is set when using this start method.
Accepting this difference between forking and spawning in a local execution might be a pragmatic compromise, but it is something that gives me a bit of grief. If we want to remove this difference, I see two options (see the sketch below):
- Not setting `""` as the default value for the execution id name, project, domain, ... in `flytekit.bin.entrypoint.setup_execution`. Would this have any undesired side effects?
- Maintaining an adapted copy of `setup_execution` here, which would, however, lead to quite a bit of code duplication, which wouldn't be nice either.
will defer to @eapolinario @pingsutw on this point
I think that's fine. We don't use project, domain, and name in the local execution, right?
Force-pushed from a1e0a8e to 56d91f5.
Force-pushed from 56d91f5 to 1263dab.
("spawn", "", False), | ||
("spawn", "f12345678", True), | ||
("fork", "local", False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will defer to @eapolinario @pingsutw on this point
("spawn", "", False), | ||
("spawn", "f12345678", True), | ||
("fork", "local", False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine. We don't use project, domain, and name in the local execution, right?
Signed-off-by: Fabio Grätz <[email protected]>
Force-pushed from 1263dab to d71c3bb.
TL;DR
When using `@task(task_config=flytekitplugins.kfpytorch.Elastic())`, the task function is started in a number of worker processes via `torchelastic_launch` (`torchrun`). The processes can be created using fork or spawn, which is controlled by the arg `Elastic(start_method=...)`.
When using `fork`, the child process inherits a copy of the parent process's stack, including the flyte context and the user-facing execution parameters `ctx = flytekit.current_context()`.
When spawning, however, fresh processes are started, and the flyte context and the execution parameters are currently not transferred to the child process. This means that within a task with `@task(task_config=Elastic(start_method="spawn"))`, the execution id and the checkpoint cannot be accessed from the execution parameters.
This PR fixes this by setting up the flyte context in the spawned worker processes.
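A minimal usage sketch of the scenario this fixes; the task body and config values (`nnodes`, `nproc_per_node`) are illustrative, not taken from this PR:

```python
import flytekit
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

# Illustrative config values; only start_method is relevant to this PR.
@task(task_config=Elastic(nnodes=1, nproc_per_node=2, start_method="spawn"))
def train() -> str:
    # With this fix, the flyte context is also set up in spawned workers,
    # so the execution id and checkpoint are available here.
    ctx = flytekit.current_context()
    return ctx.execution_id.name
```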
Type
Are all requirements met?
Complete description
In the spawned worker processes I call `flytekit.bin.entrypoint.setup_execution`, which sets up the flyte context the same way as when a normal Python task is started. The raw data prefix and checkpoint paths are transferred from the parent process.
Tracking Issue
NA
Follow-up issue
NA