Implement DaskJob resource #483
Hi @jacobtomlinson! Upfront, thanks for the great work on the operator; it should make life easier on multiple ends! I've already played around with it a bit and so far things are looking really good 👍

For a bit of context, I am currently working on integrating … I've already managed to create a …

So TL;DR: if there is anything I can help with on this issue, just let me know.
Glad the recent work is useful to you! I'm excited to have folks trying it out.

Implementing …

I don't know anything about Flyte, but I'm curious what features you need from the …
Sounds good! If we see any errors, we'll try to open PRs whenever we can 👍

I am by no means an expert on backend Flyte plugins, but the way these work is that they construct a K8s resource (of any kind), submit that resource to K8s, and then watch its state. The Go interface a plugin needs to implement is here; an example implementation (a Spark plugin) can be found here.

The … This is the corresponding …
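The "construct a resource, submit it, watch its state" plugin pattern described above can be sketched roughly as follows. This is an illustrative sketch only: the names (`build_resource`, `watch_until_terminal`, the phase strings, and the `kubernetes.dask.org/v1` group/version) are assumptions for the example, not real Flyte or dask-kubernetes APIs.

```python
# Hypothetical sketch of the construct/submit/watch plugin pattern.
# All names here are illustrative stand-ins, not real Flyte APIs.

def build_resource(task_spec):
    """Construct a K8s custom resource (modelled as a plain dict) from a task spec."""
    return {
        "apiVersion": "kubernetes.dask.org/v1",  # assumed group/version
        "kind": "DaskJob",
        "metadata": {"name": task_spec["name"]},
        "spec": task_spec["spec"],
    }

def watch_until_terminal(get_phase, terminal=("Succeeded", "Failed")):
    """Poll the resource's reported phase until it reaches a terminal state."""
    while True:
        phase = get_phase()
        if phase in terminal:
            return phase

# Toy usage: a fake phase sequence standing in for the K8s watch stream.
phases = iter(["Pending", "Running", "Succeeded"])
resource = build_resource({"name": "demo", "spec": {}})
result = watch_until_terminal(lambda: next(phases))
print(resource["kind"], result)  # DaskJob Succeeded
```

A real plugin would submit the resource via the Kubernetes API and consume a watch stream rather than polling a callback, but the state machine is the same.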
In KubeFlow there are many operators that allow users to run training jobs as a one-shot task. For example, PyTorch has a `PyTorchJob` CRD which creates a cluster of pods to run a training task, runs them to completion, then cleans up when it is done.

We should add something similar to the operator here so that folks will have familiar tools. We can reuse the `DaskCluster` CRD within the `DaskJob` (nesting this will be trivial thanks to the work by @samdyzon in #452).

I've thought about a few approaches (see alternatives in the details below), but this is my preferred option:

1. The user creates a `DaskJob` with a nested `Job` spec and `DaskCluster` spec.
2. The operator creates a `Job` resource that runs the client code and a `DaskCluster` resource that will be leveraged by the client code.
3. When the `Job` creates its `Pod` (this is done by the kubelet), the operator adopts the `DaskCluster` to the `Pod` so that it will be cascade deleted on completion of the `Job`.

This approach will only support non-parallel jobs.
**Alternative approaches that I have discounted**

One way to implement this would be to reuse as many existing resources and behaviours as possible, but it would require access to the `Jobs` API and for the client code to be resilient to waiting for the cluster to start.

1. The user creates a `DaskJob` with a nested `Job` spec and `DaskCluster` spec.
2. The operator creates a `Job` resource that runs the client code.
3. When the `Job` creates a `Pod`, the operator creates a `DaskCluster` resource and adopts it to the `Pod`.
4. When the `Job` finishes and the `Pod` is removed, the `DaskCluster` will be cascade deleted automatically.

We could also create the `DaskCluster` and `Job` at the same time, or the `DaskCluster` first and then the `Job`.

Alternatively, we could reimplement some of what a `Job` does in the operator, which would give us a little more control over the startup order and require less of the API.

1. The user creates a `DaskJob` with a nested `Pod` spec and `DaskCluster` spec.
2. The operator creates a `DaskCluster` resource.
3. The operator creates a `Pod` from the spec that runs the client code.
4. The operator watches the `Pod` to wait for it to finish.
5. When the `Pod` is done, the `Pod` and `DaskCluster` resources are deleted.

We also probably want to support a nested autoscaler resource in the `DaskJob` too, so #451 will need to be merged.