[Backend][Plugin] Support for Dask clustered tasks in Flyte #427
Dask can be deployed to Kubernetes; the template is shown here. Supporting this would help users a lot and enable writing really short tasks. Coupled with cluster re-use (coming later) or cluster gateways (daskhub), and with support for a Coiled task in the future, this would let users use Dask more effectively and make Flyte + Dask work well together.

@task(
    config=Dask(
        workers=4,
        worker_resources=....,
        # [worker_pod_template=...]  # optional; also, the command should probably be hard-coded client side?
    ),
    resources=Resources(....),  # Driver resources
)
def my_dask_program():
    pass
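For illustration, the body of such a task could use the standard dask.distributed client unchanged. A minimal sketch, assuming the backend plugin injects the scheduler address into the Dask configuration (e.g. via the DASK_SCHEDULER_ADDRESS environment variable) so that Client() resolves it:

import dask.array as da
from distributed import Client

def my_dask_program():
    # Assumption: the plugin points Client() at the provisioned scheduler;
    # once created, the client becomes the default executor for .compute().
    client = Client()
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    print(x.mean().compute())  # evaluated across the cluster's workers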
This was discussed a little on Slack: https://app.slack.com/client/TN89P6GGK/CNMKCU6FR/thread/CNMKCU6FR-1648660418.322249

Currently we do not use Dask, nor do we use Coiled's hosted platform for Dask. However, both are really interesting to us as a way of migrating away from our current, home-rolled workflow orchestration solution and having someone else run our workloads for us. The primary interest in Dask is its drop-in nature with respect to dataframes and NumPy arrays. We currently employ a streaming solution (using mmap) that lets us do this work on one node; being able to scale up to multiple nodes without needing code changes in dependent areas of our project would be an instant win for us. From an integration point of view I found the Dask + Prefect video informative. A couple of things in that video stand out to me as desirable from any integration.
Unfortunately I don't have much more to add other than opinions at the moment; however, as we rework our stack to work with Flyte there's a high chance we find ourselves going down this path, in which case we'll keep you apprised.
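To make the drop-in claim above concrete, here is a small sketch of the same reduction written against NumPy and against dask.array; only the import and the chunking differ:

import numpy as np
import dask.array as da

x_np = np.random.random((4_000, 4_000))
print(x_np.mean())

# The Dask version has the same API, but evaluates lazily, chunk by chunk,
# and can transparently scale out across multiple worker nodes.
x_da = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
print(x_da.mean().compute())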
This is a great summary. Let me chalk out the effort and see when we can accommodate this.
Also, we can always start with a flytekit plugin.
@kumare3 Quick update on this: we are working on a flytekit plugin. I've also looked into creating a backend plugin and have a working prototype, capable of managing the cluster lifecycle. Currently, this is waiting on dask/dask-kubernetes#483.
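For context, the cluster lifecycle such a backend plugin manages maps roughly onto the dask-kubernetes operator API that dask/dask-kubernetes#483 introduced. A minimal sketch (class and argument names as in the dask-kubernetes operator docs; treat them as illustrative rather than the plugin's actual implementation, which lives in Go):

from dask_kubernetes.operator import KubeCluster

# Create a cluster; the backend plugin would do this once per task execution.
cluster = KubeCluster(name="flyte-dask-example", n_workers=4)
client = cluster.get_client()  # connect a dask.distributed client to it
# ... run the task body against `client` ...
cluster.close()  # tear the cluster down when the task finishes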
@bstadlbauer this is awesome. Please let us know how we can help. There is some momentum now in adding Flyte + Ray support. We will also be working on reusing Ray clusters across multiple tasks in a Flyte workflow. Once you have your Dask plugin, we will start modifying things towards this common way of reusing clusters.
@kumare3 Great, thank you! Reusing clusters would be super helpful! I've looked at the Spark plugin as a reference.
@bstadlbauer the backend plugin is flexible. Spark is peculiar because it starts the cluster and runs the app. We actually prefer that you run a separate driver, as that can speed up Flyte even more and give fantastic control (a lesson learnt through many issues with Spark). Flyte can run the user code as a separate pod and then monitor it. This also helps with reuse.
@kumare3 Oh that's nice! Is there a plugin that does this already?
Not today, but we are working on the Ray plugin. Let me add you to a Slack thread.
Quick update:
Quick update from my end. I had some time this weekend to finish things. Sorry for this taking so long; the last weeks have been quite busy. Overall, this would be the order in which the PRs need to go in:
All PRs are in, closing this task. Thanks again for this awesome contribution @bstadlbauer! 🚀
Why would this plugin be helpful to the Flyte community
Users could write very short-running distributed array jobs using Dask. This makes it possible to have very small runtime jobs multiplexed onto the same set of nodes.
Type of Plugin
Can you help us with the implementation?
Additional context
This would really help express some ideas that do not fit Spark and are not as heavyweight as Flyte batch jobs.