[BUG] A failed wf node leaves the other nodes (spark tasks) running until they finish #263

Closed
3 of 20 tasks
EngHabu opened this issue Apr 13, 2020 · 1 comment
Labels
bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers

Comments

@EngHabu
Contributor

EngHabu commented Apr 13, 2020

Describe the bug
If a node in a workflow fails, the entire workflow fails. If other nodes are running Spark tasks at that point, their underlying CRDs continue to run until they finish, wasting resources.

Expected behavior
If the workflow shows as failed, the underlying CRDs it created should be cleaned up.
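
A minimal sketch of the expected behavior, to make the cleanup concrete. This is not FlytePropeller's implementation: `Node`, `Workflow`, `abort`, and `on_node_failed` are hypothetical names, and the `resource_deleted` flag is only a stand-in for deleting the real CRD.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Phase(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"


@dataclass
class Node:
    """Hypothetical workflow node backed by one cluster resource (e.g. a Spark CRD)."""
    name: str
    phase: Phase = Phase.RUNNING
    resource_deleted: bool = False

    def abort(self) -> None:
        # Only nodes that are still running hold a live resource to clean up.
        if self.phase is Phase.RUNNING:
            self.resource_deleted = True  # stand-in for deleting the CRD
            self.phase = Phase.ABORTED


@dataclass
class Workflow:
    """Hypothetical workflow that owns a set of nodes."""
    nodes: List[Node] = field(default_factory=list)
    phase: Phase = Phase.RUNNING

    def on_node_failed(self, failed: Node) -> None:
        # One failed node fails the whole workflow...
        failed.phase = Phase.FAILED
        self.phase = Phase.FAILED
        # ...and the abort is propagated to every other node, so nothing
        # keeps running after the workflow is reported as failed.
        for node in self.nodes:
            if node is not failed:
                node.abort()


wf = Workflow(nodes=[Node("spark-a"), Node("spark-b"), Node("done-c")])
wf.nodes[2].phase = Phase.SUCCEEDED  # already finished before the failure
wf.on_node_failed(wf.nodes[0])
```

In this sketch `spark-b` ends up aborted with its resource marked deleted, while the already finished `done-c` is left untouched.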

Flyte component

  • Overall
  • Flyte Setup and Installation scripts
  • Flyte Documentation
  • Flyte communication (slack/email etc)
  • FlytePropeller
  • FlyteIDL (Flyte specification language)
  • Flytekit (Python SDK)
  • FlyteAdmin (Control Plane service)
  • FlytePlugins
  • DataCatalog
  • FlyteStdlib (common libraries)
  • FlyteConsole (UI)
  • Other

Environment

  • Sandbox (local or on one machine)
  • Cloud hosted
    • AWS
    • GCP
    • Azure
  • Baremetal
  • Other
@EngHabu EngHabu added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Apr 13, 2020
@kumare3
Contributor

kumare3 commented Apr 22, 2020

@EngHabu does the abort not propagate?

@EngHabu EngHabu closed this as completed Apr 22, 2020
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
* Fix: register remaining tasks without default plugins

Signed-off-by: Filipe Regadas <[email protected]>

* Add test

Signed-off-by: Filipe Regadas <[email protected]>

* fixup! Add test

Signed-off-by: Filipe Regadas <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
* Migrate to golang-jwt/jwt/v4

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* go mod tidy

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Move to go 1.17

Signed-off-by: Haytham Abuelfutuh <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: pmahindrakar-oss <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 9, 2023
* Fix: register remaining tasks without default plugins

Signed-off-by: Filipe Regadas <[email protected]>

* Add test

Signed-off-by: Filipe Regadas <[email protected]>

* fixup! Add test

Signed-off-by: Filipe Regadas <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 21, 2023
* Migrate to golang-jwt/jwt/v4

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* go mod tidy

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Move to go 1.17

Signed-off-by: Haytham Abuelfutuh <[email protected]>
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 21, 2023
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: pmahindrakar-oss <[email protected]>
austin362667 pushed a commit to austin362667/flyte that referenced this issue May 7, 2024
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: pmahindrakar-oss <[email protected]>
robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this issue Jul 2, 2024
Signed-off-by: Flyte-Bot <[email protected]>

Co-authored-by: pmahindrakar-oss <[email protected]>
troychiu pushed a commit that referenced this issue Jul 8, 2024
## Overview
This PR enables graceful aborts (rather than panics) when a fasttask times out waiting for worker availability.

## Test Plan
Tested locally.

## Rollout Plan (if applicable)
This can be rolled out along with any other changes.

## Upstream Changes
Should this change be upstreamed to OSS (flyteorg/flyte)? If so, please check this box for auditing. Note, this is the responsibility of each developer. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).
- [ ] To be upstreamed

## Jira Issue
https://unionai.atlassian.net/browse/EXO-103

## Checklist
* [ ] Added tests
* [ ] Ran a deploy dry run and shared the terraform plan
* [ ] Added logging and metrics
* [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list)
* [ ] Updated documentation
2 participants