[BUG] [flytepropeller] Timed out tasks are not cancelled #2298

Closed
2 tasks done
fediazgon opened this issue Mar 29, 2022 · 2 comments · Fixed by flyteorg/flytepropeller#449

Comments

@fediazgon

Describe the bug

The AsyncPlugin Delete method is not called after a task handled by that plugin times out. However, according to the internal documentation, propeller should call Delete:

// Delete the object in the remote service using the resource key. Flyte will call this API at least once. If the
// resource has already been deleted, the API should not fail.
Delete(ctx context.Context, tCtx DeleteContext) error

We've experienced this with the BigQuery webapi plugin. We see the following message in the logs, but the job keeps running:

"Current execution for the node timed out; timeout configured: 3h0m0s"

According to @hamersaw, it might be an issue with propeller handling timeouts as retryable errors.

Expected behavior

After a node timeout, the associated task should be aborted.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@fediazgon fediazgon added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Mar 29, 2022
@hamersaw
Contributor

For a little more context. Right now propeller just marks the task as a retryable failure and moves on. The correct way to handle this is probably to best-effort abort the node before moving on.

@hamersaw hamersaw self-assigned this Apr 6, 2022
@hamersaw hamersaw added this to the 1.0.1 milestone Apr 6, 2022
@hamersaw hamersaw removed the untriaged This issues has not yet been looked at by the Maintainers label Apr 6, 2022
@EngHabu EngHabu modified the milestones: 1.0.1, 1.0.2 May 11, 2022
@hamersaw
Contributor

> For a little more context. Right now propeller just marks the task as a retryable failure and moves on. The correct way to handle this is probably to best-effort abort the node before moving on.

On a second look, this seems to be the correct functionality as implemented in this PR. When processing a RetryableFailure, FlytePropeller attempts to abort the node, which calls abort on the internal webapi CorePlugin and subsequently calls Delete.

@fediazgon, has this issue persisted? Can you check the logs for a call to abort on the CorePlugin or for cancellation of the BigQuery job?
