[BUG] Workflows nodes sometimes remain in "Running" state even when task fails #333
Closed
2 of 20 tasks
Labels: bug (Something isn't working)
Comments
kumare3 added the bug and untriaged labels on May 29, 2020.
kumare3 pushed a commit to flyteorg/flytepropeller that referenced this issue on May 29, 2020:
Abort always fails for a task if task was already in a terminal state - success, failure or retryable fail. This is because the event publish fails. This fix ensures an event is not published for terminal cases.
- [x] Bug Fix
- [ ] Feature
- [ ] Plugin
- [x] Code completed
- [x] Smoke tested
- [x] Unit tests added
- [x] Code documentation added
- [x] Any pending items have an associated Issue
NA flyteorg/flyte#333 NA
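The guard described in the commit message above can be sketched roughly as follows. This is an illustration in Python only; flytepropeller itself is written in Go, and the names `Phase`, `TERMINAL_PHASES`, `abort_task`, and `publish_event` are hypothetical and do not come from its codebase. The sketch assumes that "terminal" covers the three states listed in the message: success, failure, and retryable failure.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical task phases used only for this sketch."""
    RUNNING = auto()
    SUCCESS = auto()
    PERMANENT_FAILURE = auto()
    RETRYABLE_FAILURE = auto()


# Assumption: these are the phases the commit message calls "terminal".
TERMINAL_PHASES = {
    Phase.SUCCESS,
    Phase.PERMANENT_FAILURE,
    Phase.RETRYABLE_FAILURE,
}


def abort_task(phase, publish_event):
    """Abort a task, publishing an abort event only for non-terminal tasks.

    Returns True if an event was published, False if it was skipped.
    """
    if phase in TERMINAL_PHASES:
        # The task already finished, so no further event is published.
        # Publishing here is what made the abort fail before the fix.
        return False
    publish_event("ABORTED")
    return True
```

With this shape, aborting a task that has already finished becomes a no-op for event publishing, so the abort itself can complete and the node is free to move out of the "Running" state.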
kumare3 removed the untriaged label on May 30, 2020.
This is now merged. Should be part of 0.4.0.
eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Dec 6, 2022.

eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Dec 20, 2022:
* Single node GPU training example
  Signed-off-by: Ketan Umare <[email protected]>
* Minor fix related to tensorboard in PyTorch (flyteorg#334)
  Signed-off-by: Jinserk Baik <[email protected]>
* updated pytorch training example
  Signed-off-by: Ketan Umare <[email protected]>
* updated
  Signed-off-by: Ketan Umare <[email protected]>
* wandb integration, code lint, content
  Signed-off-by: Samhita Alla <[email protected]>
* remove misplaced text
  Signed-off-by: Samhita Alla <[email protected]>
* add pytorch in tests' manifest
  Signed-off-by: Samhita Alla <[email protected]>
* changed pytorch to mnist
  Signed-off-by: Samhita Alla <[email protected]>
* dockerfile
  Signed-off-by: Samhita Alla <[email protected]>
* update link
  Signed-off-by: cosmicBboy <[email protected]>
* update deps
  Signed-off-by: cosmicBboy <[email protected]>
Co-authored-by: Jinserk Baik <[email protected]>
Co-authored-by: Samhita Alla <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>

eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Dec 20, 2022:
* update pytorch multi-gpu example, incorporate comments @samhita-alla @kumare3
  Signed-off-by: Niels Bantilan <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Samhita Alla <[email protected]>
  Signed-off-by: Niels Bantilan <[email protected]>
Co-authored-by: Samhita Alla <[email protected]>

eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Dec 20, 2022:
Signed-off-by: Flyte-Bot <[email protected]>
Co-authored-by: flyte-bot <[email protected]>
pingsutw pushed a commit to pingsutw/flyte-monorepo that referenced this issue on Apr 4, 2023:
Abort always fails for a task if task was already in a terminal state - success, failure or retryable fail. This is because the event publish fails. This fix ensures an event is not published for terminal cases.
- [x] Bug Fix
- [ ] Feature
- [ ] Plugin
- [x] Code completed
- [x] Smoke tested
- [x] Unit tests added
- [x] Code documentation added
- [x] Any pending items have an associated Issue
NA flyteorg/flyte#333 NA
eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Jul 24, 2023:
Signed-off-by: Daniel Rammer <[email protected]>

eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Aug 9, 2023.

eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Aug 21, 2023:
Signed-off-by: Daniel Rammer <[email protected]>

eapolinario pushed a commit to eapolinario/flyte that referenced this issue on Apr 30, 2024:
Signed-off-by: Flyte-Bot <[email protected]>
Co-authored-by: flyte-bot <[email protected]>

austin362667 pushed a commit to austin362667/flyte that referenced this issue on May 7, 2024:
Signed-off-by: Flyte-Bot <[email protected]>
Co-authored-by: flyte-bot <[email protected]>

robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this issue on Jul 2, 2024:
Signed-off-by: Flyte-Bot <[email protected]>
Co-authored-by: flyte-bot <[email protected]>
troychiu pushed a commit that referenced this issue on Jul 8, 2024:
…for the containers (#333)

## Overview

Union secrets injected env vars should appear at the beginning of the env list. This requirement came from an issue faced during the NIMs POC, where the sidecar container required a secret to be passed in with a specific env var name.

The NGC sidecar container requires a secret to be passed in the env var `NGC_API_KEY`. Since Union-injected secrets use the `_UNION_` prefix, we couldn't define the secret as `NGC_API_KEY` directly, as it would be injected as `_UNION_NGC_API_KEY`.

The `_UNION_` prefix is added to distinguish the secret env vars injected by the webhook. Leaving that functionality unchanged, the proposal is to use https://kubernetes.io/docs/tasks/inject-data-application/define-interdependent-environment-variables/ which allows you to define `NGC_API_KEY` as follows: `NGC_API_KEY= $(_UNION_NGC_API_KEY)`

The change also removes duplicates if the user tries to define the same env var that Union is injecting.

## Test Plan

Before the change

```
k describe pods -n development agd92xq6rbhsvn25g7qb
Environment:
  FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:using_secrets.main
  FLYTE_INTERNAL_EXECUTION_ID: agd92xq6rbhsvn25g7qb
  FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
  FLYTE_INTERNAL_EXECUTION_DOMAIN: development
  FLYTE_ATTEMPT_NUMBER: 0
  FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
  FLYTE_INTERNAL_TASK_DOMAIN: development
  FLYTE_INTERNAL_TASK_NAME: using_secrets.fn
  FLYTE_INTERNAL_TASK_VERSION: zEKw37ArzIKUrfgKOlUHUg
  FLYTE_INTERNAL_PROJECT: flytesnacks
  FLYTE_INTERNAL_DOMAIN: development
  FLYTE_INTERNAL_NAME: using_secrets.fn
  FLYTE_INTERNAL_VERSION: zEKw37ArzIKUrfgKOlUHUg
  FLYTE_SECRETS_ENV_PREFIX: _UNION_
  _UNION_MY-CUSTOM-SECRET: Thisisasecret\r
```

After the change on dogfood-gcp

```
k describe pods -n development av8hbdjlmf5lzc8gbp5k
Environment:
  _UNION_MY-CUSTOM-SECRET: Thisisasecret\r
  FLYTE_SECRETS_ENV_PREFIX: _UNION_
  FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:using_secrets.main
  FLYTE_INTERNAL_EXECUTION_ID: av8hbdjlmf5lzc8gbp5k
  FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
  FLYTE_INTERNAL_EXECUTION_DOMAIN: development
  FLYTE_ATTEMPT_NUMBER: 0
  FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
  FLYTE_INTERNAL_TASK_DOMAIN: development
  FLYTE_INTERNAL_TASK_NAME: using_secrets.fn
  FLYTE_INTERNAL_TASK_VERSION: zEKw37ArzIKUrfgKOlUHUg
  FLYTE_INTERNAL_PROJECT: flytesnacks
  FLYTE_INTERNAL_DOMAIN: development
  FLYTE_INTERNAL_NAME: using_secrets.fn
  FLYTE_INTERNAL_VERSION: zEKw37ArzIKUrfgKOlUHUg
```

Notice the position of `_UNION_MY-CUSTOM-SECRET`. Any Union secrets show up at the beginning of the list of env vars.

## Rollout Plan (if applicable)

Roll out to staging and then to the demo tenant for the NIMS feature.

## Upstream Changes

Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).

- [ ] To be upstreamed to OSS

## Issue

*TODO: Link Linear issue(s) using [magic words](https://linear.app/docs/github#magic-words). `fixes` will move to merged status, while `ref` will only link the PR.*

## Checklist

* [ ] Added tests
* [ ] Ran a deploy dry run and shared the terraform plan
* [ ] Added logging and metrics
* [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list)
* [ ] Updated documentation
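The ordering matters because Kubernetes expands a `$(VAR)` reference only against variables defined earlier in the same container's env list. The reordering and de-duplication described in the PR above can be sketched as follows. This is a minimal illustration in Python, not the actual webhook code: the function name `merge_env` and the tuple representation are hypothetical, and the sketch assumes that when a user-defined variable collides with an injected one, the injected value wins (the PR text says duplicates are removed but does not say which copy is kept).

```python
def merge_env(injected, user_defined):
    """Place injected secret env vars first and drop user duplicates.

    Both arguments are lists of (name, value) tuples. Injected entries
    come first so that later entries can reference them with $(NAME).
    """
    injected_names = {name for name, _ in injected}
    # Assumption: on a name collision, the injected entry is kept.
    remaining = [
        (name, value)
        for name, value in user_defined
        if name not in injected_names
    ]
    return list(injected) + remaining


injected = [("_UNION_NGC_API_KEY", "s3cret")]
user_defined = [
    ("FLYTE_ATTEMPT_NUMBER", "0"),
    ("_UNION_NGC_API_KEY", "user-override"),
    ("NGC_API_KEY", "$(_UNION_NGC_API_KEY)"),
]
merged = merge_env(injected, user_defined)
```

In the merged list, `_UNION_NGC_API_KEY` precedes `NGC_API_KEY`, so the `$(_UNION_NGC_API_KEY)` reference can be resolved when the container starts.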
Describe the bug
Nodes remain in the "Running" state even after the task and workflow have failed.
Expected behavior
All nodes should appear in the failed state.
Flyte component
To Reproduce
Steps to reproduce the behavior:
Run a node with a bad image (one that causes an image pull failure) and observe the node state.
Screenshots
Environment
Flyte component
Additional context
NA