[BUG] [flytepropeller] Timed out tasks are not cancelled #2298

fediazgon · 2022-03-29T11:46:32Z

Describe the bug

The AsyncPlugin Delete method is not called after a task handled by that plugin times out. However, according to the internal documentation, propeller should call Delete:

// Delete the object in the remote service using the resource key. Flyte will call this API at least once. If the
// resource has already been deleted, the API should not fail.
Delete(ctx context.Context, tCtx DeleteContext) error

We've experienced this with the bigquery webapi plugin. The following message appears in the logs, but the job keeps running:

"Current execution for the node timed out; timeout configured: 3h0m0s"

According to @hamersaw, it might be an issue with propeller handling timeouts as retryable errors.

Expected behavior

After a node timeout, the associated task should be aborted.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes


hamersaw · 2022-03-29T15:59:34Z

For a little more context: right now propeller just marks the task as a retryable failure and moves on. The correct way to handle this is probably to best-effort abort the node before moving on.

hamersaw · 2022-05-17T15:12:34Z

> For a little more context: right now propeller just marks the task as a retryable failure and moves on. The correct way to handle this is probably to best-effort abort the node before moving on.

On a second look, this seems to be the correct functionality as implemented in this PR. When processing a RetryableFailure, FlytePropeller attempts to abort the node, which calls abort on the internal webapi CorePlugin and subsequently calls Delete.

@fediazgon, has this issue persisted? Can you check the logs for calls to abort on the CorePlugin or for cancellation of the bigquery job?

fediazgon added the bug and untriaged labels Mar 29, 2022

hamersaw self-assigned this Apr 6, 2022

hamersaw added this to the 1.0.1 milestone Apr 6, 2022

hamersaw removed the untriaged label Apr 6, 2022

EngHabu modified the milestones: 1.0.1, 1.0.2 May 11, 2022

hamersaw mentioned this issue Jun 3, 2022

Calling abort rather than finalize on permanent failure flyteorg/flytepropeller#449

Merged

8 tasks

hamersaw closed this as completed in flyteorg/flytepropeller#449 Jun 6, 2022
