I've realized that this failure may stem from the podman config having the read_only = true setting rather than from the pcp timeout specifically.
Describe the bug
When running a complex workflow on a low-powered ARM system, we occasionally hit step timeouts, which the Arcaflow engine responds to by cancelling the step. After that step is cancelled, a series of dependent steps fail, and ultimately the workflow fails with a traceback:
To reproduce
Consistently reproducible on the slow ARM systems we are testing on with the workflow below (disable-steps branch). Presumably, setting closure_wait_timeout for the pcp step in the workflow-stressng.yaml file to a low value such as 1000 will trigger the failure even on a powerful system.
https://gitlab.com/redhat/edge/tests/perfscale/arcaflow-workflow-auto-perf/-/tree/disable-steps?ref_type=heads
Use the example-input-quick.yaml input file, and set all enable_<step> flags to False except enable_stressng: True
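For anyone preparing the input by hand, the flag change described above can be sketched in Python. This is only an illustration under assumptions: it treats the input as a flat mapping already loaded from the YAML file, and every key other than enable_stressng (enable_step_a, enable_step_b, duration) is a hypothetical placeholder, not a name taken from example-input-quick.yaml.

```python
def enable_only(config, keep="enable_stressng"):
    """Return a copy of config where every enable_<step> flag is False except `keep`."""
    result = dict(config)
    for key in result:
        # Only touch the enable_<step> flags; leave all other settings alone.
        if key.startswith("enable_"):
            result[key] = (key == keep)
    return result


# Hypothetical input: only enable_stressng is named in this report.
example = {
    "enable_stressng": False,
    "enable_step_a": True,
    "enable_step_b": True,
    "duration": 10,
}

updated = enable_only(example)
print(updated)
```

The function returns a new mapping and leaves the original untouched, so the unmodified input stays available for comparison runs.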