Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Calling abort rather than finalize on permanent failure #449

Merged
merged 1 commit into from
Jun 6, 2022

Conversation

hamersaw
Copy link
Contributor

@hamersaw hamersaw commented Jun 3, 2022

TL;DR

Currently FlytePropeller calls "finalize" rather than "abort" on permanent failures. Consequently, during instances where a resource is actively running we fail to abort it. For example when the number of timeouts exceeds the maximum number of retries we automatically transition from a retryable failure to a permanent failure.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

I do not believe there are any unintended side-effects by calling abort. If a node has not yet been executed (ex. dynamic, subworkflow, etc) the abort should still be successful. If this is not the case, we will need to revisit this logic to ensure we fail correctly.

Tracking Issue

fixes flyteorg/flyte#2298

Follow-up issue

NA

@hamersaw hamersaw requested review from kumare3 and EngHabu as code owners June 3, 2022 01:21
@codecov
Copy link

codecov bot commented Jun 3, 2022

Codecov Report

Merging #449 (65abf93) into master (ac88f95) will decrease coverage by 0.00%.
The diff coverage is 42.85%.

❗ Current head 65abf93 differs from pull request most recent head 1069fe2. Consider uploading reports for the commit 1069fe2 to get more accurate results

@@ -659,7 +659,7 @@ func (c *nodeExecutor) handleNode(ctx context.Context, dag executors.DAGStructur

if currentPhase == v1alpha1.NodePhaseFailing {
logger.Debugf(ctx, "node failing")
if err := c.finalize(ctx, h, nCtx); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we, instead, change finalize implementation in webapi to call delete?

Having both finalize and abort has always bothered me.. I think we only need one (perhaps finalize)... If you look at the implementations in all plugins, the code is pretty much the same... And when it isn't, it's usually a bug waiting to happen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly do that. I had two thoughts:
(1) The plugin abstraction, especially since users may write there own, makes it difficult to know if this issue exists in other plugins as well as the webapi async. On a permanent failure we should probably be attempting to abort anyways right? If we make the change in the webapi plugin are we potentially missing out on this fix in other plugins?
(2) I agree the two calls is confusing - especially when they are very similar. Since the existing node-level abort implementation calls both Abort and Finaliz if we introduce a Delete call in the webapi plugin are there any concerns about handling ResourceNotFound errors in webapi plugins? Just want to make sure we are not going to introduce downstream issues in custom plugins. This should already be handled right?

Copy link
Contributor

@EngHabu EngHabu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

@hamersaw hamersaw merged commit ac293ed into master Jun 6, 2022
@hamersaw hamersaw deleted the bug/abort-on-permanent-failure branch June 6, 2022 22:30
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] [flytepropeller] Timed out tasks are not cancelled
2 participants