Deadlock by doing create/start/getContainer via varlink connection #3572
Some more info:
So this is somewhere between I wonder if we're not seeing some sort of issue with Varlink trying to order requests here, leading to a sort of priority inversion?
@towe75 I am back from PTO and ready to look at this. It is very similar to another reported issue I was debugging today, but you are doing it in the context of a sample program, and I'd like to use it. It looks like you had a special environment (imports) when you did this. Any chance you could take a quick look and modify your example so it can be built from github.com/containers/libpod/towe75 or some such?
@baude thank you for checking. Find my example attached to this post: podman-issue3572.zip It contains: ├── go.mod Just unpack it anywhere and run `go build`. It's module-enabled, so there is no need to place it into GOPATH.
@towe75 do you hang out on #podman on freenode?
@towe75 can you confirm that if you run podman varlink outside of systemd, you see no deadlock?
@baude sorry, I could not catch you on freenode, I was quite busy.
In another shell I ran my test binary in a loop. I was able to produce 30 containers without a deadlock, whereas with systemd it usually stops working after 2-3 iterations. I'll do a longer test later. So it looks like the systemd socket has some effect.
@baude is working on this one still, I believe. We've tracked the deadlock into c/storage.
So, I finally managed to debug this and find the culprit. So, here is what is happening:
Now in the podman codebase there are several places also handling LISTEN_FDS:
Boom, the varlink connection This code block closed the varlink connection. The whole LISTEN_FDS handling makes no sense anyway, because LISTEN_FDS begins at Anyway, with this code block commented out, the issue is gone: I don't know what the use case of this code block is or how it would work without adjusting LISTEN_FDNAMES. Also, one has to review the correct handling of the code in You really have to take care of the fact that some code may have used one of the LISTEN_FDS, the fd has been closed, and the number has been reused. One marker could be: So, if the CLOEXEC flag is set, don't close the fd on execve. But that is only guesswork too.
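For reference on the fd numbering discussed above: under the systemd socket-activation protocol (what `sd_listen_fds` implements), passed descriptors start at fd 3 and are only meant for the process whose PID matches LISTEN_PID. Below is a minimal Go sketch of that interpretation. `inheritedFDs` is a hypothetical helper written for illustration and is not taken from libpod. It only computes which numbers the environment claims; it cannot tell whether one of those numbers has since been closed and reused, which is the hazard described in this comment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// listenFDsStart mirrors SD_LISTEN_FDS_START from the systemd protocol:
// passed descriptors begin at fd 3, right after stdin/stdout/stderr.
const listenFDsStart = 3

// inheritedFDs (hypothetical helper) returns the descriptor numbers that the
// LISTEN_PID/LISTEN_FDS pair claims for the process with the given pid.
// It returns nil when the values are absent, malformed, or addressed to a
// different process.
func inheritedFDs(listenPID, listenFDs string, pid int) []int {
	p, err := strconv.Atoi(listenPID)
	if err != nil || p != pid {
		return nil // variables were not set for this process
	}
	n, err := strconv.Atoi(listenFDs)
	if err != nil || n <= 0 {
		return nil
	}
	fds := make([]int, 0, n)
	for i := 0; i < n; i++ {
		fds = append(fds, listenFDsStart+i)
	}
	return fds
}

func main() {
	fds := inheritedFDs(os.Getenv("LISTEN_PID"), os.Getenv("LISTEN_FDS"), os.Getpid())
	fmt.Println("fds claimed by socket activation:", fds)
}
```

The LISTEN_PID check matters here: a child process that merely inherits the environment should not treat fds 3 and up as activation sockets.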
I'll tag in @giuseppe here - I'm pretty sure he wrote this for
I am not really familiar with this part, but some things I've observed:
Hmmm. My original assumption was that it was your code - it's definitely not mine... I think it might well end up being @rhatdan
commit 989f5e3
I don't think it makes sense to use LISTEN_FDS and NOTIFY_SOCKET if podman is in varlink server mode. This should really be skipped here.
I agree these should only be handled for non-'remote' podman.
add ability to not activate sd_notify when running under varlink as it causes deadlocks and hangs. Fixes: containers#3572 Signed-off-by: baude <[email protected]>
Make sure that Podman passes the LISTEN_* environment into containers. Similar to runc, LISTEN_PID is set to 1. Also remove conditionally passing the LISTEN_FDS as extra files. The condition was wrong (inverted) and introduced to fix containers#3572 which related to running under varlink which has been dropped entirely with Podman 3.0. Note that the NOTIFY_SOCKET and LISTEN_* variables are cleared when running `system service`. Fixes: containers#10443 Signed-off-by: Daniel J Walsh <[email protected]> Signed-off-by: Valentin Rothberg <[email protected]>
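The commit message above notes that NOTIFY_SOCKET and the LISTEN_* variables are cleared when running `system service`. As a rough illustration of that kind of scrubbing, here is a minimal Go sketch. It is not the actual Podman change; `scrubActivationEnv` is a hypothetical name, and treating every `LISTEN_`-prefixed variable as activation state is an assumption made for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// scrubActivationEnv (hypothetical helper) returns a copy of env without
// NOTIFY_SOCKET and without any variable whose name starts with "LISTEN_",
// so a long-running server does not hand them on to code that would
// misread fds 3+ as socket-activation descriptors.
func scrubActivationEnv(env []string) []string {
	kept := make([]string, 0, len(env))
	for _, kv := range env {
		name := kv
		if i := strings.IndexByte(kv, '='); i >= 0 {
			name = kv[:i]
		}
		if name == "NOTIFY_SOCKET" || strings.HasPrefix(name, "LISTEN_") {
			continue // drop systemd activation/notify state
		}
		kept = append(kept, kv)
	}
	return kept
}

func main() {
	clean := scrubActivationEnv(os.Environ())
	fmt.Println("variables kept:", len(clean))
}
```

The result can be assigned to the `Env` field of an `exec.Cmd` so that child processes start without the activation variables.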
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
I'm working on #3387 and discovered a deadlock when repeating a CreateContainer/StartContainer/GetContainer sequence multiple times over a systemd varlink connection.
GetContainer will then block until I restart the io.podman systemd service. podman on the CLI is also blocked until I restart the service.
Steps to reproduce the issue:
Additional information you deem important (e.g. issue happens only occasionally):
The number of repetitions varies: sometimes it locks up on the 2nd try, sometimes it takes a while.
Output of `podman version`:
Output of `podman info --debug`:
Additional environment details (AWS, VirtualBox, physical, etc.):
I am running this in a nested Fedora 30 LXC container on an Ubuntu 18.04 host (Linux 4.4.0-116-generic)
I also tried other varlink calls instead of GetContainer():
I also tried sleeping before GetContainer(), but it did not change the behavior.