`worker agent experienced a fatal error; aborting job` #62

iliana · 2024-08-30T20:42:20Z

https://buildomat.eng.oxide.computer/wg/0/details/01J6JDMDGGV0TWS2FZ4NNYK5KG/IVBVeOSe2r64WYXoXFus5vZACPV8GDTwMFM49pFiK0KwvToN/01J6JDNAR2GCC48VK3B433REY1#S4121 (https://github.com/oxidecomputer/omicron/runs/29494306519)

I've seen this a few times but apparently have never filed an issue. Are we able to figure out what the fatal error was?

jclulow · 2024-08-31T08:34:55Z

Yes, I believe that one is related to the occasional NVMe stalls we've seen since moving to the Nitro AWS stuff. I/O just sort of stops, the machine panics after the I/O deadman fires (~16 minutes later), and when it starts back up the buildomat agent realises it has been restarted and aborts the job.
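For illustration, the restart detection described above could be sketched roughly like this. This is not buildomat's actual code: the marker file, the boot ID comparison, and the name `check_for_restart` are all hypothetical, chosen only to show how an agent might notice at startup that the machine rebooted underneath a running job.

```python
import json
import tempfile
from pathlib import Path


def check_for_restart(marker: Path, current_boot_id: str) -> str:
    """Return "fresh", "same-boot", or "restarted" (hypothetical states)."""
    if not marker.exists():
        # First start for this job: record which boot it began under.
        marker.write_text(json.dumps({"boot_id": current_boot_id}))
        return "fresh"
    recorded = json.loads(marker.read_text())["boot_id"]
    # A different boot ID means the machine went down while the job ran.
    return "same-boot" if recorded == current_boot_id else "restarted"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        marker = Path(d) / "job.marker"
        print(check_for_restart(marker, "boot-a"))  # prints "fresh"
        print(check_for_restart(marker, "boot-b"))  # prints "restarted"
```

In a sketch like this, a "restarted" result is the point where the agent would give up on the job, since it cannot know what state the interrupted work was left in.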

iliana · 2024-09-01T02:52:03Z

Ah. This has reactivated a memory:

> For an experience similar to EBS volumes attached to Xen instances, we recommend setting `nvme_core.io_timeout` to the highest value possible. For current kernels, the maximum is 4294967295, while for earlier kernels the maximum is 255.

If I recall correctly, Linux's default timeout (not that it would apply here) is not long enough to cope with the fact that the underlying storage interface on Nitro hardware might crash or be updated, and reconciling the in-flight I/O can take longer than 30 seconds.
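On a Linux guest, the value currently in effect can be checked through sysfs. The following is a minimal sketch, assuming the `nvme_core` module is loaded and exposes its parameter at the usual `/sys/module/.../parameters/` path; the helper name `read_io_timeout` is made up for this example, and none of it applies to the illumos-based buildomat workers.

```python
from pathlib import Path
from typing import Optional

# Usual sysfs location for this module parameter on Linux (assumed present).
DEFAULT_PARAM = Path("/sys/module/nvme_core/parameters/io_timeout")
# Maximum quoted above for current kernels.
MAX_CURRENT_KERNELS = 4294967295


def read_io_timeout(param: Path = DEFAULT_PARAM) -> Optional[int]:
    """Return the I/O timeout in seconds, or None if it cannot be read."""
    try:
        return int(param.read_text().strip())
    except (OSError, ValueError):
        # Module not loaded, not Linux, or unexpected file contents.
        return None


if __name__ == "__main__":
    value = read_io_timeout()
    if value is None:
        print("nvme_core.io_timeout is not available on this system")
    elif value < MAX_CURRENT_KERNELS:
        print(f"io_timeout is {value}s, below the recommended maximum")
    else:
        print(f"io_timeout is {value}s")
```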

jclulow · 2024-09-01T02:54:28Z

That's distressingly long! The deadman is 1000 seconds, though, or 16 of your earth minutes. I'm not sure it's necessarily their fault and not something we are flubbing in the driver so far. The challenge is it's hard to make a crash dump when your disk won't speak to you!
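For scale, the two timeouts mentioned in this thread compare as follows. This is just the arithmetic, using the 1000 second deadman and the 30 second figure from the comments above; the function name is hypothetical.

```python
DEADMAN_SECONDS = 1000       # I/O deadman, per the comment above
LINUX_DEFAULT_SECONDS = 30   # Linux figure mentioned earlier in the thread


def as_minutes_seconds(total: int) -> tuple:
    """Split a duration in seconds into whole minutes and leftover seconds."""
    return divmod(total, 60)


if __name__ == "__main__":
    minutes, seconds = as_minutes_seconds(DEADMAN_SECONDS)
    print(f"{minutes} min {seconds} s")  # prints "16 min 40 s"
    # How many 30 second windows fit inside the deadman interval.
    print(DEADMAN_SECONDS // LINUX_DEFAULT_SECONDS)  # prints 33
```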
