"Panic: runtime error" when a workflow step is force closed after a timeout #238

dustinblack · 2024-12-16T11:31:14Z

Describe the bug

While running a complex workflow on a low-powered ARM system, we occasionally hit step timeouts, which the Arcaflow engine responds to by cancelling the step. After that step is cancelled, a series of dependent steps fail, and ultimately the workflow fails with a traceback:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x135d790]

goroutine 358 [running]:
go.flow.arcalot.io/engine/workflow.(*executableWorkflow).Execute(0x400026d180, {0x1b4f450, 0x40006601e0}, {0x1553560?, 0x4000666b10?})
	/home/runner/go/pkg/mod/go.flow.arcalot.io/[email protected]/workflow/workflow.go:256 +0x1360
go.flow.arcalot.io/engine/internal/step/foreach.(*runningStep).executeSubWorkflows.func1()
	/home/runner/go/pkg/mod/go.flow.arcalot.io/[email protected]/internal/step/foreach/provider.go:882 +0x1f0
created by go.flow.arcalot.io/engine/internal/step/foreach.(*runningStep).executeSubWorkflows in goroutine 322
	/home/runner/go/pkg/mod/go.flow.arcalot.io/[email protected]/internal/step/foreach/provider.go:864 +0xc8
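
For readers unfamiliar with this class of failure, here is a minimal Go sketch of how a panic with this exact message arises and how a nil check turns it into an ordinary error. It is not the engine's code: the `stepOutput` type and both functions are hypothetical, and the trace does not show what `workflow.go:256` actually dereferences. The assumption here is only that some pointer was nil after the step was cancelled.

```go
package main

import (
	"errors"
	"fmt"
)

// stepOutput is a hypothetical stand-in for whatever value the engine
// reads at workflow.go:256; the real type is not visible in the trace.
type stepOutput struct {
	OutputID string
}

// errNoOutput is a hypothetical error for a step that produced no result.
var errNoOutput = errors.New("step produced no output (cancelled or failed)")

// describeUnguarded reads a field through out without checking it.
// The deferred recover converts the resulting panic into a string so
// the failure can be inspected instead of crashing the process.
func describeUnguarded(out *stepOutput) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	return out.OutputID // panics when out is nil
}

// describeGuarded checks for nil first and reports an error instead.
func describeGuarded(out *stepOutput) (string, error) {
	if out == nil {
		return "", errNoOutput
	}
	return out.OutputID, nil
}

func main() {
	fmt.Println(describeUnguarded(nil))
	_, err := describeGuarded(nil)
	fmt.Println(err)
}
```

Without a recover, an unhandled panic in any goroutine terminates the whole Go program, which would be consistent with the entire workflow run dying here rather than only the affected sub-workflow.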

To reproduce

Consistently reproducible on the slow ARM systems we are testing on, using the workflow below (disable-steps branch). Presumably, setting closure_wait_timeout for the pcp step in the workflow-stressng.yaml file to a low value such as 1000 will trigger the failure even on a powerful system.

https://gitlab.com/redhat/edge/tests/perfscale/arcaflow-workflow-auto-perf/-/tree/disable-steps?ref_type=heads

Use the example-input-quick.yaml input file, and set all enable_<step> to False except enable_stressng: True

arcaflow --config config.yaml --input example-input-quick.yaml


dustinblack · 2024-12-16T12:01:51Z

output.txt

Full workflow output with traceback attached.

dustinblack · 2024-12-16T12:38:33Z

I've realized that this failure may be caused by the podman config having the read_only = true setting rather than by the pcp timeout specifically.

jaredoconnell · 2024-12-16T17:11:16Z

Fixed with #239

dustinblack added the bug Something isn't working label Dec 16, 2024

jaredoconnell closed this as completed Dec 16, 2024
