Implement DaskJob resource #483

Closed · jacobtomlinson opened this issue May 13, 2022 · 3 comments · Fixed by #504

@jacobtomlinson (Member) commented May 13, 2022

In KubeFlow there are many operators that allow users to run training jobs as one-shot tasks. For example, the PyTorchJob CRD creates a cluster of pods that run a training task to completion and then cleans up when it is done.

We should add something similar to the operator here so that folks will have familiar tools. We can reuse the DaskCluster CRD within the DaskJob (nesting this will be trivial thanks to the work by @samdyzon in #452).

I've thought about a few approaches (see alternatives in the details below) but this is my preferred option:

  • User creates DaskJob with nested Job spec and DaskCluster spec.
  • Operator creates Job resource that runs the client code and a DaskCluster resource that will be leveraged by the client code.
  • When the Job creates its Pod (this is done by the kubelet) the operator adopts the DaskCluster to the Pod so that it will be cascade deleted on completion of the Job.

This approach will only support non-parallel jobs.
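To make the preferred option above concrete, here is a rough sketch of what submitting such a DaskJob might look like with the Kubernetes Python client. The API group is assumed to match the existing DaskCluster CRD, and the kind and field names (`spec.job`, `spec.cluster`) are purely illustrative; the real schema would be settled in the implementing PR.

```python
# Purely illustrative sketch of a DaskJob resource with a nested Job (pod) spec
# and DaskCluster spec. The API group/kind/field names are assumptions, not the
# final schema.
from kubernetes import client, config

dask_job = {
    "apiVersion": "kubernetes.dask.org/v1",  # assumed to match the DaskCluster CRD
    "kind": "DaskJob",
    "metadata": {"name": "example-job"},
    "spec": {
        # One-shot client code; the operator would wrap this in a Job resource
        "job": {
            "spec": {
                "containers": [
                    {
                        "name": "client",
                        "image": "ghcr.io/dask/dask:latest",
                        "args": ["python", "/scripts/train.py"],
                    }
                ],
                "restartPolicy": "Never",
            }
        },
        # Nested DaskCluster spec, reused as-is from the existing CRD
        "cluster": {
            "spec": {
                "worker": {
                    "replicas": 3,
                    "spec": {
                        "containers": [
                            {"name": "worker", "image": "ghcr.io/dask/dask:latest"}
                        ]
                    },
                },
                "scheduler": {
                    "spec": {
                        "containers": [
                            {"name": "scheduler", "image": "ghcr.io/dask/dask:latest"}
                        ]
                    },
                },
            }
        },
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    "kubernetes.dask.org", "v1", "default", "daskjobs", dask_job
)
```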

Alternative approaches that I have discounted

One way to implement this would be to reuse as many existing resources and behaviours as possible, but it would require access to the Jobs API and would need the client code to tolerate waiting for the cluster to start.

  • User creates DaskJob with nested Job spec and DaskCluster spec.
  • Operator creates Job resource that runs the client code.
  • When the Job creates a Pod the operator creates a DaskCluster resource and adopts it to the Pod.
  • When the Job finishes and the Pod is removed the DaskCluster will be cascade deleted automatically.

We could also create the DaskCluster and Job at the same time or the DaskCluster first then the Job.
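In both the preferred option and this alternative, the "adoption" step amounts to setting an ownerReference on the DaskCluster that points at the Job's Pod, so Kubernetes garbage collection cascade-deletes the cluster when the Pod goes away. A minimal sketch with the Kubernetes Python client (the resource names and the `daskclusters` plural here are placeholders):

```python
# Sketch of adopting a DaskCluster to a Pod via an ownerReference so the cluster
# is garbage collected when the Pod is deleted. Names are placeholders.
from kubernetes import client, config

config.load_kube_config()
pod = client.CoreV1Api().read_namespaced_pod("example-job-abc12", "default")

owner_patch = {
    "metadata": {
        "ownerReferences": [
            {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": pod.metadata.name,
                "uid": pod.metadata.uid,
                "blockOwnerDeletion": True,
                "controller": False,
            }
        ]
    }
}

client.CustomObjectsApi().patch_namespaced_custom_object(
    "kubernetes.dask.org", "v1", "default", "daskclusters",
    "example-job-cluster", owner_patch,
)
```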

Alternatively, we could reimplement some of what a Job does in the operator, which would give us a little more control over the startup order and require less of the API.

  • User creates DaskJob with nested Pod spec and DaskCluster spec.
  • Operator creates the DaskCluster resource.
  • When the cluster is running the operator creates a Pod from the spec that runs the client code.
  • The operator polls the Pod to wait for it to finish.
  • When the Pod is done the Pod and DaskCluster resources are deleted.
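For this variant, the polling step is straightforward with the Kubernetes Python client. A rough sketch follows; names are placeholders, and a real operator would use event handlers or timers rather than a blocking loop:

```python
# Rough sketch of polling the client Pod until it reaches a terminal phase and
# then cleaning up. Names are placeholders; a real operator would not block.
import time

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

while True:
    pod = core.read_namespaced_pod("example-job-client", "default")
    if pod.status.phase in ("Succeeded", "Failed"):
        break
    time.sleep(5)

# Delete the client Pod and the DaskCluster resource once the work is done
core.delete_namespaced_pod("example-job-client", "default")
client.CustomObjectsApi().delete_namespaced_custom_object(
    "kubernetes.dask.org", "v1", "default", "daskclusters", "example-job-cluster"
)
```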

We probably also want to support a nested autoscaler resource in the DaskJob, so we will need #451 to be merged.

@bstadlbauer (Collaborator) commented

Hi @jacobtomlinson!

Up front, thanks for the great work on the operator; it should make life easier on multiple fronts! I've already played around with it a bit and so far things are looking really good 👍 For a bit of context, I am currently working on integrating Dask into Flyte by providing a Dask Flyte task which manages the ephemeral Dask cluster lifecycle. There are already similar plugins for Spark, etc.

I've already managed to create a flytekit (i.e. pure Python) plugin (the code is not public yet) which utilizes experimental.KubeCluster to spin up and then delete the Dask cluster. It would be really nice to convert this to a backend (Go) plugin (which would create the required CRD), but that would require the DaskJob resource.
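For context, the lifecycle that plugin manages looks roughly like the sketch below. This is not the actual plugin code (which is not public), and the KubeCluster arguments shown are assumptions about the experimental API:

```python
# Simplified sketch of an ephemeral Dask cluster lifecycle managed from Python,
# in the spirit of the flytekit plugin described above. The KubeCluster
# arguments are assumptions about the experimental API, not the plugin itself.
from dask.distributed import Client
from dask_kubernetes.experimental import KubeCluster


def run_on_ephemeral_cluster(fn, *args):
    cluster = KubeCluster(name="flyte-task-cluster", n_workers=3)
    try:
        with Client(cluster) as dask_client:
            return dask_client.submit(fn, *args).result()
    finally:
        # Tear the cluster down once the task has finished
        cluster.close()
```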

So TL;DR: If there is anything I can help with on this issue, just let me know

@jacobtomlinson (Member, Author) commented

Glad the recent work is useful to you! I'm excited to have folks trying it out. Implementing DaskJob is on my task list for this week, although I do want to try and land #451 first, as I also want to nest the autoscaling resource within DaskJob so that it can be adaptive.

I don't know anything about Flyte, but I'm curious: what features do you need from the DaskJob CRD that you can't get from the DaskCluster CRD? Does it specifically need a Job-style resource rather than a Deployment-style resource?

@bstadlbauer (Collaborator) commented

Sounds good! If we see any errors, we'll try to open PRs whenever we can 👍

I am by no means an expert on backend Flyte plugins, but the way these work is that they construct a K8s resource (of any kind), submit that resource to K8s, and then watch its state. The Go interface a plugin needs to implement is here; an example implementation (a Spark plugin) can be found here. The Spark plugin uses the Spark operator and submits a SparkApplication CRD, which (as I understand it) contains both the cluster specification and the job that needs to run.
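In Kubernetes terms, that "submit the resource and watch its state" pattern applied to a future DaskJob could look something like the sketch below; the status field is invented here, since the CRD does not exist yet:

```python
# Sketch of the "submit a custom resource, then watch its state" pattern a
# backend plugin would use, applied to a hypothetical DaskJob. The
# "status.jobStatus" field is invented for illustration only.
from kubernetes import client, config, watch

config.load_kube_config()
crds = client.CustomObjectsApi()

w = watch.Watch()
for event in w.stream(
    crds.list_namespaced_custom_object,
    "kubernetes.dask.org", "v1", "default", "daskjobs",
):
    obj = event["object"]
    if obj["metadata"]["name"] != "example-job":
        continue
    if obj.get("status", {}).get("jobStatus") in ("Successful", "Failed"):
        w.stop()
```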

This is the corresponding Dask-related issue in Flyte: flyteorg/flyte#427
