-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keep directories when SIGINT sent to daemon #881
Keep directories when SIGINT sent to daemon #881
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have some of these manual tests as automated tests with this change?
2898ab4
to
7182dc3
Compare
I tried, but I don't actually know how to kill the snapshotter with a SIGINT. I see that we have KillMatchingProcess, which is called by rebootContainerd, but this sends a SIGKILL, not a SIGINT. Even if I could pass it in, I tried to manually kill but |
Give https://man7.org/linux/man-pages/man1/kill.1.html a try and if still stuck lmk and we can look together. |
7182dc3
to
5f36c19
Compare
I ended up using I added the test and ensured that it crashed with the previous version and passed with this one. EDIT: I was wrong, but now I should be right (I hope!) I tried to do a goroutine at first but they weren't guaranteed to terminate with its encapsulating testing function. Hopefully
|
b8f3729
to
7529c33
Compare
7529c33
to
9fe318b
Compare
13f39aa
to
a81d56f
Compare
Signed-off-by: David Son <[email protected]>
a81d56f
to
1e7b33f
Compare
…#767) Issue #, if available: re-verified #412 - Through extensive e2e test debugging, I noticed that soci and stargz snapshotters weren't persisting data as expected. After debugging, I found some context in these two PRs: - awslabs/soci-snapshotter#881 - containerd/stargz-snapshotter#1526 Unfortunately, neither of them are deployed yet, so I've implemented a hacky workaround for now. After this change, an image/container can be pull/run, the VM can be restarted, and then the container can be re-started again. *Description of changes:* - Redo how BuildKit/Stargz/SOCI are related to containerd using [systemd's `PartOf` ](https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#PartOf=) - this ensures that all of these services are restarted when containerd is restarted, which the lack of has caused errors in the past - Create some missing directories that might throw errors in cloud-init - Ensure that `SIGTERM` is used to kill the snapshotter services for now *Testing done:* - manual testing - [x] I've reviewed the guidance in CONTRIBUTING.md #### License Acceptance By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Signed-off-by: Justin Alvarez <[email protected]>
Issue #, if available:
#275, #832
Description of changes:
On SIGINT, the daemon will perform some cleanup functions, including removing the directories for each of the snapshot layers. However, the metadata does not reflect this change. This meant that, upon restarting the snapshot, the snapshot would not start unless
allow_invalid_mounts_on_restart=true
was set in the config. Even with the variable set, one had to manually remove the snapshot to get the daemon in a working state.This was pretty frustrating behavior. What should be a graceful kill ends up being more work than using a non-graceful kill (i.e. SIGKILL/SIGTERM, which allow the snapshots to instantly start working again, no questions asked). Additionally, it was unintuitive (to me) that killing the daemon also wipes all the snapshotter data, forcing one to re-pull an image every time the daemon is killed.
This change removes the cleanup step of wiping the snapshot directories. It will still unmount the FUSE mounts, but the snapshotter already tries to restore snapshots on startup, which just unmounts any mounts leftover and remounts existing mounts in the metadata, so it ties in very well to snapshot startup
Testing performed:
I performed some basic testing, largely including pulling an image + running a container with it after a SIGINT and confirmed normal behavior.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.