Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #275      +/-   ##
==========================================
+ Coverage   62.37%   63.02%   +0.64%
==========================================
  Files         147      148       +1
  Lines       11816    12148     +332
==========================================
+ Hits         7370     7656     +286
- Misses       3882     3912      +30
- Partials      564      580      +16
```

Flags with carried forward coverage won't be shown.
Hi @hamersaw @eapolinario @pingsutw!
This is a first draft of the `dask` backend plugin. I mostly wanted to share progress with you here and have one small question about it.
It already runs successfully, but the code is still lacking:
- Configuration options (currently everything is hardcoded) -> A PR for the Python API should come soon in `flytekit`; that would probably be the best place to discuss the interface?
- Any sort of testing
go/tasks/plugins/k8s/dask/dask.go (Outdated)
```go
// Fetches the job runner pod that belongs to a DaskJob resource. The DaskJob CR
// itself carries no status, so the pod is looked up directly via the clientset.
func getJobPodFromJobResource(job *daskAPI.DaskJob, ctx context.Context) (*v1.Pod, error) {
	clientset, err := getClientset()
	if err != nil {
		return nil, err
	}

	// The job runner pod is named after the DaskJob with a fixed postfix and
	// lives in the same namespace as the DaskJob.
	jobPodName := job.ObjectMeta.Name + jobRunnerPodPostfix
	jobPodNamespace := job.ObjectMeta.Namespace
	pod, err := clientset.CoreV1().Pods(jobPodNamespace).Get(ctx, jobPodName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return pod, nil
}
```
At the moment, the `DaskJob` CR has no notion of a state. So to get the state, I am currently instantiating a k8s `clientset`, querying for the job pod, and using that to determine the state, similar to how the `pod` plugin does it.
My questions here would be:
- Is this something that is acceptable? From what I've seen, usually only propeller has a notion of a `clientset`; other plugins rely on the state within the CR provided to the plugin by propeller. If this is not an option, I can try to contribute to the `dask`-operator, where we could potentially save some state.
- If yes, should there be configuration which allows authenticating using a local kubeconfig, or should we only support in-cluster mode?
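For context on the in-cluster vs. local kubeconfig question, here is a minimal client-go sketch of what such a clientset helper could look like. The helper name and the kubeconfig fallback are illustrative assumptions, not the plugin's actual code:

```go
// Illustrative only: builds a clientset from the in-cluster service account,
// optionally falling back to a local kubeconfig (e.g. for local development).
// This mirrors standard client-go usage; it is not the plugin's implementation.
package example

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func newClientset(kubeconfigPath string) (*kubernetes.Clientset, error) {
	// Prefer the in-cluster config (service account token mounted into the pod).
	cfg, err := rest.InClusterConfig()
	if err != nil && kubeconfigPath != "" {
		// Fall back to an explicit kubeconfig, e.g. ~/.kube/config, for local runs.
		cfg, err = clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	}
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}
```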
Ideally, plugins should not create a new k8s config. This context is better kept within propeller and injected in.
Is it that you want to get a reference to the driver pod? We can do that in a different way.
@kumare3 Exactly, this would only be used to get the driver pod (`JobPod` in the `dask` terminology). The task phase would then be based on the state of this pod (similar to how the pod plugin determines the task phase).
Happy to also do this a different way 👍
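A self-contained sketch of the phase mapping being described here; the phase names and the function are simplified placeholders, not the actual flyteplugins API:

```go
// Illustrative sketch: derive a task phase from the job runner pod's status,
// similar in spirit to how the pod plugin does it. Phase names are placeholders.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

type taskPhase string

const (
	phaseQueued    taskPhase = "QUEUED"
	phaseRunning   taskPhase = "RUNNING"
	phaseSucceeded taskPhase = "SUCCEEDED"
	phaseFailed    taskPhase = "FAILED"
)

// taskPhaseFromJobPod maps the pod phase onto a coarse task phase.
func taskPhaseFromJobPod(pod *v1.Pod) taskPhase {
	switch pod.Status.Phase {
	case v1.PodSucceeded:
		return phaseSucceeded
	case v1.PodFailed:
		return phaseFailed
	case v1.PodRunning:
		return phaseRunning
	default: // Pending, Unknown
		return phaseQueued
	}
}

func main() {
	pod := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodRunning}}
	fmt.Println(taskPhaseFromJobPod(pod)) // RUNNING
}
```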
This is awesome! Thanks for working on it, I see it's been a bit of effort across different repos as well. One question: in doing some reading on the dask operator here, it sounds like once the DaskJob CRD is created, control is handed over to the Job Runner Pod (hope the terminology is correct). So much so that "Once the job Pod runs to completion the cluster is removed automatically". This is why you are forced to retrieve the JobPod to gather the status of the job, because, although a bit unusual for CRDs, the status is not tracked at the top level. Does aborting the JobPod shut down the cluster as well? If so, FlytePropeller uses the BuildIdentityResource function for checking status, aborting, and finalizing the task - maybe we could set this to the JobPod rather than the DaskJob?
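A rough sketch of this suggestion, pointing the identity resource at the pod instead of the DaskJob. The interface below is a simplified stand-in for illustration only, not the exact flyteplugins `BuildIdentityResource` signature:

```go
// Illustrative sketch only: the identity resource tells propeller which object
// kind to watch for status. Returning an empty Pod here (instead of an empty
// DaskJob) would make propeller track the job runner pod. The interface is a
// simplified stand-in, not the real flyteplugins k8s plugin interface.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

type identityResourceBuilder interface {
	BuildIdentityResource(ctx context.Context) (runtime.Object, error)
}

type daskPlugin struct{}

func (daskPlugin) BuildIdentityResource(ctx context.Context) (runtime.Object, error) {
	// Watch the job runner Pod, since the DaskJob CR carries no status of its own.
	return &v1.Pod{
		TypeMeta: metav1.TypeMeta{Kind: "Pod", APIVersion: v1.SchemeGroupVersion.String()},
	}, nil
}

func main() {
	var b identityResourceBuilder = daskPlugin{}
	obj, _ := b.BuildIdentityResource(context.Background())
	fmt.Printf("identity resource: %T\n", obj)
}
```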
Force-pushed from d4b5a00 to dfe92b6.
Hi @hamersaw! Sorry for the slow progress on this, but last time I tried to continue, I ran into weird issues with my Flyte dev setup on my private machine.
I might be missing something here though? Otherwise, I would propose that I try to add state to the CRD on the `dask`-operator side.
The only big thing I notice is figuring out resources. I don't like that there is parsing-specific (i.e. protobuf to k8s) code here. It looks like this functionality has already been implemented in FlytePropeller, which unfortunately cannot be included as a dependency here.
IIUC this uses the task-defined resources as defaults for everything and then allows overriding through the configuration. I think this is a sane way of doing it - but it should certainly be documented. I think the two options are:
(1) Refactor flytepropeller to move the code to flyteplugins.
(2) Use the k8s version of resources as part of the configuration. I think we would have to unmarshal them similar to how we do podspecs in the pod plugin. Honestly, I do not know everything about what this would look like in flytekit, but presumably we could then use the k8s Resource objects as part of the flytekit definition as well.
I'm not sure which is the better route here. Thoughts?
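For illustration of the protobuf-to-k8s parsing being discussed, here is a minimal sketch of converting name/value resource entries (as flyteidl models resources) into k8s `ResourceRequirements`. The `resourceEntry` struct is a simplified stand-in for the actual flyteidl types, not the plugin's real code:

```go
// Illustrative sketch: convert task-declared resource entries (name/value string
// pairs, as flyteidl models them) into k8s ResourceRequirements. The entry type
// is a simplified stand-in for the actual flyteidl message.
package example

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

type resourceEntry struct {
	Name  string // e.g. "cpu", "memory"
	Value string // e.g. "500m", "1Gi"
}

func toResourceList(entries []resourceEntry) (v1.ResourceList, error) {
	out := v1.ResourceList{}
	for _, e := range entries {
		q, err := resource.ParseQuantity(e.Value)
		if err != nil {
			return nil, fmt.Errorf("invalid quantity %q for resource %q: %w", e.Value, e.Name, err)
		}
		out[v1.ResourceName(e.Name)] = q
	}
	return out, nil
}

// toResourceRequirements builds requests/limits; defaults (e.g. platform or
// task-level resources) would be applied by the caller before this step.
func toResourceRequirements(requests, limits []resourceEntry) (*v1.ResourceRequirements, error) {
	req, err := toResourceList(requests)
	if err != nil {
		return nil, err
	}
	lim, err := toResourceList(limits)
	if err != nil {
		return nil, err
	}
	return &v1.ResourceRequirements{Requests: req, Limits: lim}, nil
}
```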
Force-pushed from c06eb67 to 7f8d0b5.
@hamersaw Good point about the resource parsing, I've refactored this out of
Force-pushed from 017bdb7 to ae66bc4.
Force-pushed from 2bea897 to 565c942.
Once we get the FlyteIDL 1.4.0 -> 1.3.x fixed, let's go ahead and merge!
@hamersaw I've reordered the imports and changed the `flyteidl` version to 1.3.2. I've also removed the
* First working version with DaskCluster
* Update plugin
* Add container customization
* Add correct `getTaskPhase`
* Refactor dask.go
* Use new dask operator which includes a status
* Add first tests and use data from flyteidl
* Refactor tests
* Add support for custom namespace
* Add support for passing on annotations
* Add support for env vars
* Add default container logic to job runner
* Add TestGetTaskPhaseDask
* Add logs to task info
* Fix
* Use tagged version of dask go operator
* Fix linting issues
* Refactor `ToK8sResourceRequirements`
* Use platform resources by default
* Fix incorrect resources
* Remove namespace
* Don't restart job runner and scheduler
* Run `go mod tidy` after rebase
* Run formatter
* Update to new `flyteidl`
* Add support for interruptible workers
* Update flytekit to 1.4.0
* Fix linting errors
* Update `flyteidl` to 1.3.2
* Reorder imports

Signed-off-by: Bernhard Stadlbauer <[email protected]>
TL;DR
This PR adds a backend `dask` plugin using the `dask-kubernetes` operator to manage the `dask` cluster lifecycle.
Type
Are all requirements met?
Complete description
The plugin works similarly to the already existing `spark` plugin. It uses the `DaskJob` (docs) Custom Resource, whose client pod connects to the spun-up cluster. This is similar to how the `SparkApplication` Custom Resource works for the `spark` plugin.
TODO before this can be merged:
- Update `flyteidl` after "Add initial `dask` plugin IDL #minor" (flyteidl#339) is merged and released
flyteorg/flyte#427
Follow-up issue
NA