Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle resource overrides in KF plugins #4467

Merged
merged 15 commits into from
Nov 30, 2023
Merged

Conversation

jeevb
Copy link
Contributor

@jeevb jeevb commented Nov 21, 2023

Tracking issue

Fixes: #4422

Describe your changes

This PR:

  • Adds a helper to override interruptibility and task resources in TaskExecutionContext before passing to ToK8sPodSpec where the base pod spec is constructed, and hydrated with defaults and additional configuration (e.g. affinity, tolerations, etc.).
  • Updates dask and spark plugins to use this new helper for overriding interruptibility of the driver pods
  • Updates KF plugins (i.e. mpi, pytorch and tensorflow) to correctly handle resource overrides specified in the respective task configs. This allows for correctly applying resource-specific tolerations (e.g. GPU) to the pod spec.

Config:

plugins:
  k8s:
    resource-tolerations:
      nvidia.com/gpu:
      - key: "flyte/GPU"
        operator: "Exists"
        effect: "NoSchedule"

Task decorator:

@task(
    task_config=MPIJob(
        launcher=Launcher(
            replicas=1,
        ),
        worker=Worker(
            replicas=1,
            limits=Resources(cpu="2", mem="1000Mi", gpu="1"),
        ),
    ),
    retries=3,
    cache=True,
    cache_version="0.1",
    limits=Resources(cpu="2", mem="1000Mi"),
)

Launcher tolerations:

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

Worker tolerations:

  tolerations:
  - effect: NoSchedule
    key: flyte/GPU
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Setup Process

Screenshots

Note to reviewers

Related PRs

Signed-off-by: Jeev B <[email protected]>
Signed-off-by: Jeev B <[email protected]>
Signed-off-by: Jeev B <[email protected]>
Signed-off-by: Jeev B <[email protected]>
Signed-off-by: Jeev B <[email protected]>
@jeevb jeevb force-pushed the jeev/fix-kfplugins branch from 55063aa to cfc2f24 Compare November 21, 2023 22:51
Copy link

codecov bot commented Nov 21, 2023

Codecov Report

Attention: 156 lines in your changes are missing coverage. Please review.

Comparison is base (7878b8e) 59.63% compared to head (307bead) 59.52%.

❗ Current head 307bead differs from pull request most recent head 7df36cf. Consider uploading reports for the commit 7df36cf to get more accurate results

Files Patch % Lines
...ks/pluginmachinery/flytek8s/plugin_exec_context.go 0.00% 75 Missing ⚠️
.../plugins/k8s/kfoperators/common/common_operator.go 1.61% 61 Missing ⚠️
...lugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi.go 72.72% 5 Missing and 4 partials ⚠️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 69.56% 4 Missing and 3 partials ⚠️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go 88.57% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4467      +/-   ##
==========================================
- Coverage   59.63%   59.52%   -0.12%     
==========================================
  Files         638      636       -2     
  Lines       53995    53838     -157     
==========================================
- Hits        32201    32045     -156     
- Misses      19262    19265       +3     
+ Partials     2532     2528       -4     
Flag Coverage Δ
unittests 59.52% <32.17%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jeevb jeevb marked this pull request as ready for review November 21, 2023 23:26
return nil, err
}
launcherReplica.RestartPolicy = common.ParseRestartPolicy(launcherReplicaSpec.GetRestartPolicy())
launcherReplicaSpec, err = common.ToReplicaSpecWithOverrides(ctx, taskCtx, kfMPITaskExtraArgs.GetLauncherReplicas(), kubeflowv1.MPIJobDefaultContainerName, true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to confirm if launcher pods should be treated as special and be non-interruptible.

Copy link
Contributor Author

@jeevb jeevb Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No clear evidence to suggest one way or the other. This pod is responsible for launching the jobs on the MPI workers (somewhat akin to Ray head nodes), so will give it special treatment for now.

taskCtxOptions := []flytek8s.PluginTaskExecutionContextOption{}
// Master should always run as non-interruptible
if isMaster {
taskCtxOptions = append(taskCtxOptions, flytek8s.WithInterruptible(false))
Copy link
Contributor Author

@jeevb jeevb Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that "master" pods will be tagged as non-interruptible. Currently this is set for PyTorch masters and MPI launchers.

Copy link
Contributor

@eapolinario eapolinario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you!

Re: whether kfoperators master pods should be non-interruptible, why do we have do consider those as special? Is it just for consistency with the other k8s plugins (dask and spark have their driver/master pods as non-interruptible?

flyteplugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi.go Outdated Show resolved Hide resolved
flyteplugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi.go Outdated Show resolved Hide resolved
@jeevb
Copy link
Contributor Author

jeevb commented Nov 23, 2023

This looks great, thank you!

Re: whether kfoperators master pods should be non-interruptible, why do we have do consider those as special? Is it just for consistency with the other k8s plugins (dask and spark have their driver/master pods as non-interruptible?

If it would be detrimental to lose the master, we should run them as non-interruptible. Same logic as dask and spark, yes.

Signed-off-by: Jeev B <[email protected]>
@jeevb jeevb force-pushed the jeev/fix-kfplugins branch from 3253966 to 1d29e02 Compare November 23, 2023 00:33
return nil, flyteerr.Errorf(flyteerr.BadTaskSpecification, "Unable to create replica spec: [%v]", err.Error())
}
launcherReplicaSpec = replicaSpec.DeepCopy()
// TODO (jeev): Is this even a valid configuration. Can there be more than 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintains parity for now, but will need to follow up in a subsequent investigation.

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Nov 27, 2023
hamersaw
hamersaw previously approved these changes Nov 28, 2023
Copy link
Contributor

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. would feel more confident with a few more eyes here. @fg91 @yubofredwang mind taking a quick look?

@hamersaw hamersaw requested a review from fg91 November 28, 2023 15:39
pingsutw
pingsutw previously approved these changes Nov 30, 2023
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 30, 2023
Copy link
Member

@fg91 fg91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with the change that a pytorchjob master replica should always be non-interruptible. The master replica doesn't start the worker replicas or restart them in case they are preempted. If one of the worker replicas is preempted, all other pods including the master replica have to stop/be restarted. So there is no value in preventing the master replica from being preempted. Or am I overlooking something here?

I think differentiating between master and worker replicas doesn't make that much sense for pytorchjobs in the first place and in the newer elastic pytorchjobs, there are no more master replicas at all.

Could you please revert this change (unless I'm overlooking something)?


I appreciate that you tagged me @hamersaw @jeevb since we rely on pytorch jobs a lot :)
I ran a flytepropeller with your changes in our staging cluster and apart from the non-interruptible master replica everything works as expected.

However, for pytorchjobs everything worked as expected also before. Do I understand it correctly that the large diff in pytorch.go comes from the fact that now the overriding of resources and resource/interruptible tolerations happens more centrally instead of being handled separately in every single plugin?

@dosubot dosubot bot removed the lgtm This PR has been approved by a maintainer label Nov 30, 2023
@jeevb
Copy link
Contributor Author

jeevb commented Nov 30, 2023

I went ahead and reverted forcing non-interruptibility on KF plugin masters. This keeps parity with the current implementation for now.

@fg91: The big change here was to apply resource overrides from plugin-specific config before ToK8sPodSpec is called. This ensures that things like resource-based tolerations (maybe not relevant for you guys since GKE injects GPU tolerations automatically), and other additional pod configuration work correctly - that function is doing quite a bit. The remaining changes in the PyTorch plugin are just structural to make the code less repetitive and easier to maintain - lots of commonalities between the KF plugins.

@jeevb jeevb requested review from fg91 and hamersaw November 30, 2023 17:01
@fg91
Copy link
Member

fg91 commented Nov 30, 2023

I went ahead and reverted forcing non-interruptibility on KF plugin masters. This keeps parity with the current implementation for now.

Parity also for MPI? Asking because I know only for pytorch that the master shouldn't always be non-interruptible - I don't know for MPI ..

@fg91: The big change here was to apply resource overrides to before ToK8sPodSpec is called. This ensures that things like resource-based tolerations (maybe not relevant for you guys since GKE injects GPU tolerations automatically), and other additional pod configuration work correctly - that function is doing quite a bit. The remaining changes in the PyTorch plugin are just structural to make the code less repetitive and easier to maintain - lots of commonalities between the KF plugins.

Sounds great! Out of curiosity, does this mean that resource tolerations are automatically handled correctly if another plugin is added because this is handled before the plugin builds the resource?

@jeevb
Copy link
Contributor Author

jeevb commented Nov 30, 2023

Yes keeping parity for MPI as well for now, since no one has been able to provide a convincing argument one way or the other. I don't know MPI well enough. Let me know if you have thoughts @fg91.

Sounds great! Out of curiosity, does this mean that resource tolerations are automatically handled correctly if another plugin is added because this is handled before the plugin builds the resource?

This is still handled when the plugin is building the resource. The overrides in this case are derived from parsing the plugin configuration. These changes basically give plugins a simple mechanism to inject overrides and benefit from the helpers in pluginmachinery.

@fg91
Copy link
Member

fg91 commented Nov 30, 2023

Yes keeping parity for MPI as well for now, since no one has been able to provide a convincing argument one way or the other. I don't know MPI well enough. Let me know if you have thoughts @fg91.

I don't know MPI well enough either unfortunately 🤷 But I'm sure about pytorch jobs ... and as long as we keep parity here, I think this is good to merge.

Sounds great! Out of curiosity, does this mean that resource tolerations are automatically handled correctly if another plugin is added because this is handled before the plugin builds the resource?

This is still handled when the plugin is building the resource. The overrides in this case are derived from parsing the plugin configuration. These changes basically give plugins a simple mechanism to inject overrides and benefit from the helpers in pluginmachinery.

Nice. I always disliked that the plugins had to reinvent the wheel here instead of using shared helpers.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 30, 2023
@jeevb jeevb merged commit 7481b3d into master Nov 30, 2023
41 checks passed
@jeevb jeevb deleted the jeev/fix-kfplugins branch November 30, 2023 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size:XL This PR changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GPU tolerations are not correctly passed to MPI workers
5 participants