"Panic: runtime error" when a workflow step is force closed after a timeout #238

Closed
dustinblack opened this issue Dec 16, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@dustinblack
Member

Describe the bug

When running a complex workflow on a low-powered ARM system, we occasionally hit step timeouts, which the Arcaflow engine responds to by cancelling the step. After that step is cancelled, a series of dependent steps fail, and ultimately the workflow crashes with a panic and traceback:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x135d790]

goroutine 358 [running]:
go.flow.arcalot.io/engine/workflow.(*executableWorkflow).Execute(0x400026d180, {0x1b4f450, 0x40006601e0}, {0x1553560?, 0x4000666b10?})
	/home/runner/go/pkg/mod/go.flow.arcalot.io/[email protected]/workflow/workflow.go:256 +0x1360
go.flow.arcalot.io/engine/internal/step/foreach.(*runningStep).executeSubWorkflows.func1()
	/home/runner/go/pkg/mod/go.flow.arcalot.io/[email protected]/internal/step/foreach/provider.go:882 +0x1f0
created by go.flow.arcalot.io/engine/internal/step/foreach.(*runningStep).executeSubWorkflows in goroutine 322
	/home/runner/go/pkg/mod/go.flow.arcalot.io/[email protected]/internal/step/foreach/provider.go:864 +0xc8
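
For context on this class of panic: Go raises "invalid memory address or nil pointer dereference" when code reads a field or calls a method through a nil pointer, and the addr=0x18 in the signal line is consistent with a field read at a small offset from a nil pointer. The sketch below is only an illustration of that pattern and of a defensive nil check. It is not the engine's code; `stepResult`, `runStep`, and `outputID` are hypothetical names, and the assumption that a force-closed step yields a nil result is a guess, not something confirmed by the traceback.

```go
package main

import (
	"errors"
	"fmt"
)

// stepResult is a hypothetical stand-in for whatever a finished step returns.
type stepResult struct {
	OutputID string
}

// runStep simulates a step. When forceClosed is true it returns neither a
// result nor an error, which is the assumed (unconfirmed) failure mode.
func runStep(forceClosed bool) (*stepResult, error) {
	if forceClosed {
		return nil, nil
	}
	return &stepResult{OutputID: "success"}, nil
}

// outputID reads the output ID defensively. Without the nil check,
// res.OutputID on a nil res would panic with a nil pointer dereference.
func outputID(res *stepResult, err error) (string, error) {
	if err != nil {
		return "", err
	}
	if res == nil {
		return "", errors.New("step produced no result")
	}
	return res.OutputID, nil
}

func main() {
	for _, forceClosed := range []bool{false, true} {
		id, err := outputID(runStep(forceClosed))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("output:", id)
	}
}
```

With a check like this, a missing result surfaces as an ordinary error that the caller can report, rather than crashing the whole process from inside a goroutine.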

To reproduce

Consistently reproducible on the slow ARM systems we are testing on with the workflow below (disable-steps branch). Presumably, setting closure_wait_timeout for the pcp step in the workflow-stressng.yaml file to a low value such as 1000 will trigger the failure even on a powerful system.

https://gitlab.com/redhat/edge/tests/perfscale/arcaflow-workflow-auto-perf/-/tree/disable-steps?ref_type=heads

Use the example-input-quick.yaml input file, and set every enable_<step> option to False except enable_stressng, which should be True.

arcaflow --config config.yaml --input example-input-quick.yaml

@dustinblack added the bug label on Dec 16, 2024
@dustinblack
Member Author

output.txt

Full workflow output with traceback attached.

@dustinblack
Member Author

I've realized that this failure may stem from the podman config having the read_only = true setting rather than from the pcp timeout specifically.

@jaredoconnell
Contributor

Fixed with #239
