Guacamole instances randomly freezing #26

Closed
jsf9k opened this issue Aug 12, 2020 · 7 comments
Labels
bug This issue or pull request addresses broken functionality

Comments

@jsf9k
Member

jsf9k commented Aug 12, 2020

🐛 Bug Report

Guacamole instances running recently built AMIs randomly freeze. When this happens:

  • All ssh sessions freeze
  • New ssh sessions cannot be established
  • Guacamole sessions freeze

To Reproduce

Steps to reproduce the behavior:

  • Launch an instance from a recently built AMI generated from this repository
  • Wait long enough, and it will eventually freeze
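The freeze can take an unpredictable amount of time to appear, so a simple polling loop run from another machine can record roughly when it happens. This is a minimal sketch and not part of the original report: `wait_for_freeze` is a hypothetical helper, and the host name in the usage comment is a placeholder.

```shell
# wait_for_freeze INTERVAL COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND every INTERVAL seconds until it fails,
# then prints a UTC timestamp marking when the instance stopped responding.
wait_for_freeze() {
  local interval="$1"
  shift
  # Keep probing while the command succeeds; its output is discarded.
  while "$@" >/dev/null 2>&1; do
    sleep "$interval"
  done
  echo "no response at $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
}

# Example (placeholder host): a new ssh session that cannot be opened within
# 30 seconds counts as a freeze, matching the symptoms listed above.
# wait_for_freeze 60 timeout 30 ssh -o BatchMode=yes guacamole.example.internal true
```

Using a fresh ssh connection as the probe mirrors the reported symptom that new sessions cannot be established; `timeout` keeps a half-open connection from hanging the loop.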

Expected behavior

Such instances should not freeze.

@jsf9k jsf9k added the bug This issue or pull request addresses broken functionality label Aug 12, 2020
@jsf9k jsf9k self-assigned this Aug 12, 2020
@jsf9k
Member Author

jsf9k commented Aug 12, 2020

I tried tailing the logs and watching dmesg output while waiting for such an instance to freeze, but each time I saw nothing of interest. The freeze appeared to happen before anything could be written to the logs, which made the issue look very much like a kernel panic.

@jsf9k
Member Author

jsf9k commented Aug 12, 2020

To test the theory that this issue was caused by a kernel panic, I manually enabled kernel crash dumps on running Guacamole instances. I mostly followed these instructions, although I found some of the steps listed there to be superfluous. What is actually required is:

  1. Modify the Guacamole server's security group and network ACLs to allow port 80 outbound, so that apt-get can download packages from the Debian mirrors over HTTP.
  2. sudo apt-get install kdump-tools linux-image-$(uname -r)-dbg
  3. Append nmi_watchdog=1 to the kernel command line in /etc/default/grub (e.g. the GRUB_CMDLINE_LINUX_DEFAULT setting)
  4. sudo update-grub
  5. Reboot the instance
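Step 3 can be scripted so that it is safe to run more than once. The sketch below assumes the parameter goes into the quoted `GRUB_CMDLINE_LINUX_DEFAULT` value of /etc/default/grub (Debian cloud images may also set options in `GRUB_CMDLINE_LINUX` or under /etc/default/grub.d/), relies on GNU `sed -i`, and uses a hypothetical helper name, `add_kernel_param`.

```shell
# add_kernel_param FILE PARAM
# Hypothetical helper: appends PARAM (e.g. nmi_watchdog=1) inside the quoted
# GRUB_CMDLINE_LINUX_DEFAULT value in FILE, unless it is already present.
# PARAM must not contain regex metacharacters, '/' or '"'. Requires GNU sed.
add_kernel_param() {
  local file="$1" param="$2"
  # Create an empty setting if the file does not have one yet.
  grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" ||
    echo 'GRUB_CMDLINE_LINUX_DEFAULT=""' >> "$file"
  # Already present as a whole word inside the quotes: nothing to do.
  if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
    return 0
  fi
  # Insert " PARAM" before the closing quote, then drop the leading space
  # that is left behind when the original value was empty.
  sed -i -e "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${param}\"/" \
         -e "s/^GRUB_CMDLINE_LINUX_DEFAULT=\" /GRUB_CMDLINE_LINUX_DEFAULT=\"/" "$file"
}

# Example, run as root, followed by steps 4 and 5:
# add_kernel_param /etc/default/grub nmi_watchdog=1 && update-grub && reboot
```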

@jsf9k
Member Author

jsf9k commented Aug 12, 2020

@jsf9k
Member Author

jsf9k commented Aug 12, 2020

We caught a kernel panic on env0 in our staging environment. I added the information to the original Debian bug report, since it looks like our problem is indeed a result of the recent kernel changes.

Links for investigating kernel crash dumps using the crash utility:
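A minimal sketch for locating the newest dump and opening it with crash. It assumes the Debian kdump-tools default layout (dumps written to /var/crash/<timestamp>/dump.<timestamp>; check KDUMP_COREDIR in /etc/default/kdump-tools) and that the debug kernel package from step 2 installs an uncompressed kernel image under /usr/lib/debug/boot/. Both paths are assumptions worth verifying on the instance, and `latest_dump` is a hypothetical helper.

```shell
# latest_dump CRASH_DIR
# Hypothetical helper: prints the newest dump file under CRASH_DIR, assuming
# the CRASH_DIR/<YYYYMMDDHHMM>/dump.<YYYYMMDDHHMM> layout. Prints nothing if
# no dump exists.
latest_dump() {
  # Timestamped directory names sort chronologically, so the last one wins.
  ls -1d "$1"/*/dump.* 2>/dev/null | sort | tail -n 1
}

# Example (paths are assumptions, see above):
# sudo crash "/usr/lib/debug/boot/vmlinux-$(uname -r)" "$(latest_dump /var/crash)"
```

Inside crash, commands such as `bt` (backtrace of the panicking task) and `log` (kernel ring buffer at the time of the crash) are usually the first things to look at.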

@jsf9k
Member Author

jsf9k commented Aug 12, 2020

It took several failed attempts (partly because doing kernel foo in Docker is not a great way to test), but I modified the Guacamole AMI to boot into an older kernel that we had used previously and know to be good. See these PRs:
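The linked PRs may take a different approach, so the following is only one possible way to boot an older kernel: keep the older kernel package installed and point `GRUB_DEFAULT` at its menu entry. GRUB accepts a "submenu>entry" title path; the kernel version in the example is hypothetical and must match a title in the generated /boot/grub/grub.cfg exactly. `set_grub_default` is a hypothetical helper and relies on GNU `sed -i`.

```shell
# set_grub_default FILE ENTRY
# Hypothetical helper: replaces any GRUB_DEFAULT line in FILE with one that
# selects ENTRY. ENTRY must not contain '"'. Requires GNU sed.
set_grub_default() {
  local file="$1" entry="$2"
  # Remove the existing setting (no-op if absent), then append the new one.
  sed -i '/^GRUB_DEFAULT=/d' "$file"
  printf 'GRUB_DEFAULT="%s"\n' "$entry" >> "$file"
}

# Example, run as root (kernel version is hypothetical):
# set_grub_default /etc/default/grub \
#   'Advanced options for Debian GNU/Linux>Debian GNU/Linux, with Linux 4.19.0-9-cloud-amd64'
# update-grub
```

Pinning by menu title keeps the newer kernel installed but unused, which makes it easy to undo once a fixed kernel is available.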

Once the kernel bug has been resolved and is available in Debian, we can undo these changes.

@jsf9k
Member Author

jsf9k commented Aug 12, 2020

This issue probably shows up on the Guacamole instance because we run Docker there, and Docker relies heavily on cgroups, which is where the kernel bug lies.

@jsf9k
Member Author

jsf9k commented Aug 12, 2020

Resolved in the PRs listed above, as well as #24.


1 participant