The AsyncPlugin Delete method is not called after a task handled by that plugin times out. However, according to the internal documentation, propeller should call Delete:
// Delete the object in the remote service using the resource key. Flyte will call this API at least once. If the
// resource has already been deleted, the API should not fail.
Delete(ctx context.Context, tCtx DeleteContext) error
We've experienced this with the bigquery webapi plugin. We see the following message in the logs, but the job keeps running:
"Current execution for the node timed out; timeout configured: 3h0m0s"
For a little more context: right now propeller just marks the task as a retryable failure and moves on. The correct way to handle this is probably to make a best-effort attempt to abort the node before moving on.
On a second look, this seems to be the correct functionality as implemented in this PR. When processing a RetryableFailure, FlytePropeller attempts to abort the node, which calls abort on the internal webapi CorePlugin, which in turn calls Delete.
Describe the bug
The AsyncPlugin Delete method is not called after a task handled by that plugin times out. However, according to the internal documentation, propeller should call Delete.
We've experienced this with the bigquery webapi plugin. We see the following message in the logs, but the job keeps running:
"Current execution for the node timed out; timeout configured: 3h0m0s"
According to @hamersaw, it might be an issue with propeller handling timeouts as retryable errors.
Expected behavior
After a node timeout, the associated task should be aborted
Additional context to reproduce
No response
Screenshots
No response