Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle hard failures (eg. kernel panic) during bootstrap #1143

Merged
merged 1 commit into from
Jul 3, 2023

Conversation

trvrnrth
Copy link
Contributor

This change configures cloud-init to run the bootstrap script on every boot so that we can check for previous failed attempts and mark the instance as unhealthy.

This caters for the scenario where the kernel panics during a bootstrap attempt which leaves the instance non-operational on reboot but marked healthy.

This change configures cloud-init to run the bootstrap script on every
boot so that we can check for previous failed attempts and mark the
instance as unhealthy.

This caters for the scenario where the kernel panics during a bootstrap
attempt which leaves the instance non-operational on reboot but marked
healthy.
return 1
fi
}
check_docker
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure here will now end up in the error trap rather than exiting immediately leaving the instance non-operational.

Copy link
Contributor

@triarius triarius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @trvrnrth, this looks like a good improvement.

I suppose it's quite possible that one of the earlier scripts would fail too. But it looks like in those cases, the instance won't be marked as unhealthy. I do wonder if they are idempotent. It looks like if the instance is restarted, they will run again. So I think in that case, we may want to skip them in the if they completed previously. In particular, it looks like bin/bk-mount-instance-storage.sh will format the volume again on restart, which seems like something we may not want to do. You may want to do something about that, but as that's the status quo, it's no impediment to merging this.

Having said that, kernel panics causing the startup scripts to fail is not something we encounter regularly, so we haven't seen the need to make these scripts resilient to them. Do you have any insight into why they may be happening?

@triarius triarius merged commit 518e756 into buildkite:master Jul 3, 2023
@trvrnrth
Copy link
Contributor Author

trvrnrth commented Jul 3, 2023

Having said that, kernel panics causing the startup scripts to fail is not something we encounter regularly, so we haven't seen the need to make these scripts resilient to them. Do you have any insight into why they may be happening?

The issue that led to this is that we were seeing relatively frequent occurences of the kernel watchdog detecting soft lockups during installation of gcloud as part of our bootstrap. I'm unsure what the true underlying cause is.

We don't currently run with instance storage enabled, but my understanding is that the boothooks will run on every boot so you're probably correct that at least some portion of them should also be skipped if they have previously completed successfully.

@trvrnrth trvrnrth deleted the handle-hard-bootstrap-failure branch July 3, 2023 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants