worker agent experienced a fatal error; aborting job #62

Open
iliana opened this issue Aug 30, 2024 · 3 comments

Comments

@iliana

iliana commented Aug 30, 2024

https://buildomat.eng.oxide.computer/wg/0/details/01J6JDMDGGV0TWS2FZ4NNYK5KG/IVBVeOSe2r64WYXoXFus5vZACPV8GDTwMFM49pFiK0KwvToN/01J6JDNAR2GCC48VK3B433REY1#S4121 (https://github.com/oxidecomputer/omicron/runs/29494306519)

I've seen this a few times but apparently have never filed an issue. Are we able to figure out what the fatal error was?

@jclulow
Collaborator

jclulow commented Aug 31, 2024

Yes, I believe that one is related to the occasional NVMe stalls we've seen since moving to the Nitro AWS stuff. I/O just sort of stops, the machine panics after the I/O deadman fires (~16 minutes later), and when it starts back up the buildomat agent realises it has been restarted and aborts the job.

@iliana
Author

iliana commented Sep 1, 2024

Ah. This has reactivated a memory:

> For an experience similar to EBS volumes attached to Xen instances, we recommend setting nvme_core.io_timeout to the highest value possible. For current kernels, the maximum is 4294967295, while for earlier kernels the maximum is 255.

If I recall correctly, Linux's default timeout (not that that would apply here) is not long enough to deal with the fact that the underlying storage interface on Nitro hardware might crash or be updated, and it takes longer than 30 seconds to reconcile the in-flight I/O.
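
For reference, a minimal sketch of how one might check the setting quoted above on a Linux guest. It assumes the parameter is exposed at the usual sysfs path for module parameters, and the names `read_io_timeout`, `MAX_CURRENT`, and `MAX_EARLIER` are made up for illustration. This is Linux-specific, so it would not apply to the machines in this issue.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the nvme_core module parameter on Linux.
IO_TIMEOUT_PATH = Path("/sys/module/nvme_core/parameters/io_timeout")

# Maximums from the quoted recommendation: a 32-bit value on current
# kernels and an 8-bit value on earlier ones.
MAX_CURRENT = 2**32 - 1  # 4294967295
MAX_EARLIER = 2**8 - 1   # 255


def read_io_timeout(path: Path = IO_TIMEOUT_PATH) -> Optional[int]:
    """Return the NVMe I/O timeout in seconds, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Missing file (module not loaded, non-Linux host) or unexpected content.
        return None


if __name__ == "__main__":
    timeout = read_io_timeout()
    if timeout is None:
        print("nvme_core.io_timeout is not readable on this system")
    else:
        print(f"nvme_core.io_timeout = {timeout} s (maximum on current kernels: {MAX_CURRENT})")
```

The path is a parameter so the function can be pointed at a test file; on a host without the `nvme_core` module it simply returns `None`.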

@jclulow
Collaborator

jclulow commented Sep 1, 2024

That's distressingly long! The deadman is 1000 seconds, though, or 16 of your earth minutes. I'm not sure it's necessarily their fault and not something we are flubbing in the driver so far. The challenge is it's hard to make a crash dump when your disk won't speak to you!
