-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
network connect/disconnect: new flake #11248
Comments
I will take a look. |
#11091 merged on Aug 6. compose flakes since then: compose: simple_port_map - curl (port 5000) failed with status 7
compose: env_and_volume - curl (port 5000) failed with status 7
compose: env_and_volume : port 5001
compose: mount_and_label - curl (port 5000) failed with status 7
(EDIT: I guess only the first of those is rootless) |
OK from reading
So that means the socket file exists but no one is listening on it. The rootlessport process did create the socket with |
I don't have a reproducer, but it should be easy to add a debug statement, something like |
Well, that failed quickly (podman-3.3.0-0.13.rc2.fc34.x86_64): $ while :;do bats --filter disconnect /usr/share/podman/test/system/500-networking.bats || break;done
...
✗ podman network connect/disconnect with port forwarding
(from function `is' in file /usr/share/podman/test/system/helpers.bash, line 474,
in test file /usr/share/podman/test/system/500-networking.bats, line 447)
`is "$output" "$random_1" "curl 127.0.0.1:/index.txt should work again"' failed with status 56
$ podman rm --all --force
$ podman ps --all --external --format {{.ID}} {{.Names}}
$ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
quay.io/libpod/testimage:20210610 9f9ec7f2fdef
$ podman network create testnet-xQXsssXgCB
/home/fedora/.config/cni/net.d/testnet-xQXsssXgCB.conflist
$ podman network create testnet2-O1sbJx5lkf
/home/fedora/.config/cni/net.d/testnet2-O1sbJx5lkf.conflist
$ podman run -d --network testnet-xQXsssXgCB quay.io/libpod/testimage:20210610 top
564449877f9ebf99ddb5e6b205f8b8665d7f5033ff2f69e99021a40198b6852a
$ podman run -d -p 12345:80 --network testnet-xQXsssXgCB -v /tmp/podman_bats.EafKjG/hello.txt:/var/www/index.txt:Z -w /var/www quay.io/libpod/testimage:20210610 /bin/busybox-extras httpd -f -p 80
75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
$ podman inspect 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83 --format {{(index .NetworkSettings.Networks "testnet-xQXsssXgCB").IPAddress}}
10.89.0.3
$ podman inspect 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83 --format {{(index .NetworkSettings.Networks "testnet-xQXsssXgCB").MacAddress}}
9a:4a:da:2c:da:2c
$ podman network disconnect testnet-xQXsssXgCB 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
fedora 54185 0.0 0.0 6140 836 pts/0 S+ 07:45 0:00 grep rootlessport
$ podman network connect testnet-xQXsssXgCB 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
time="2021-08-18T07:45:48-04:00" level=warning msg="Could not reload rootless port mappings, port forwarding may no longer work correctly: dial unix /run/user/1000/libpod/tmp/rp/75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83: connect: connection refused"
fedora 54415 0.0 0.0 6140 840 pts/0 S+ 07:45 0:00 grep rootlessport
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#| FAIL: curl 127.0.0.1:/index.txt should work again
#| expected: 'CHbNHcWcT4eryCWX672X8W6fJGwKre'
#| actual: ''
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Obviously I didn't get the |
Here's a better one:
(Blank line separators added for ease of reading). This shows that |
Thanks so my suspicion is right. Now the fun part, find the killer :) |
Did you tried this in a 1minutetip vm? I got this to fail after 45 min and found out that the process is not dead after disconnect, it is dead after the curl command which makes sure that the port connection does not work. I also noticed that the rootlessport process did not exited the normal way, so it either got killed or it paniced. I tried to reproduce with a patched podman which redirects the stdout and stderr to a file instead of /dev/null so I could get the output and see if there is a stack trace or something. However with this patch it now runs over two hours without failure. |
Yes, this was |
This is the patch I use: diff --git a/pkg/rootlessport/rootlessport_linux.go b/pkg/rootlessport/rootlessport_linux.go
index ede216bfe..f988a0c61 100644
--- a/pkg/rootlessport/rootlessport_linux.go
+++ b/pkg/rootlessport/rootlessport_linux.go
@@ -120,11 +120,12 @@ func parent() error {
select {
case s := <-sigC:
if s == unix.SIGPIPE {
- if f, err := os.OpenFile("/dev/null", os.O_WRONLY, 0755); err == nil {
+ if f, err := os.OpenFile("/tmp/rootlessport", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0755); err == nil {
unix.Dup2(int(f.Fd()), 1) // nolint:errcheck
unix.Dup2(int(f.Fd()), 2) // nolint:errcheck
f.Close()
}
+ logrus.Info("got SIGPIPE")
}
case <-exitC:
}
@@ -265,6 +266,7 @@ outer:
if _, err := ioutil.ReadAll(exitR); err != nil {
return err
}
+ logrus.Info("exit")
return nil
}
@edsantiago Could you try to build this one see if it fails for you. And provide the /tmp/rootlessport file if it does.
Yes that is what I did too. Failed after 45 min, and after 30 min in another run. With my patch it did not failed for now well over two hours. |
I just brought up a VM, ran the reproducer, and it failed on the second iteration (a few seconds). Can I offer you my VM for you to use? I don't have a good setup for building podman in VMs, it will take me too long. PM me on IRC. Just to confirm, though: I am running as user |
OK I know what is causing this. The problem is SIGPIPE kills the process. Stdout and stderr are attached to the podman parent process and when podman exits and the rootlessport process tries to write to stdout/err it will fail with SIGPIPE, the code handles this signal and uses dup2 two set /dev/null to both stdout and stderr. I am not sure why but sometimes this seems to fail and the process continues to run into SIGPIPE errors and it gets killed eventually. |
When the rootlessport process is started the stdout/stderr are attached to the podman process. However once everything is setup podman exits and when the rootlessport process tries to write to stdout it will fail with SIGPIPE. The code handles this signal and puts /dev/null to stdout and stderr but this is not robust. I do not understand the exact cause but sometimes the process is still killed by SIGPIPE. Either go lost the signal or the process got already killed before the goroutine could handle it. Instead of handling SIGPIPE just set /dev/null to stdout and stderr before podman exits. With this there should be no race and no way to run into SIGPIPE errors. [NO TESTS NEEDED] Fixes containers#11248 Signed-off-by: Paul Holzinger <[email protected]>
When the rootlessport process is started the stdout/stderr are attached to the podman process. However once everything is setup podman exits and when the rootlessport process tries to write to stdout it will fail with SIGPIPE. The code handles this signal and puts /dev/null to stdout and stderr but this is not robust. I do not understand the exact cause but sometimes the process is still killed by SIGPIPE. Either go lost the signal or the process got already killed before the goroutine could handle it. Instead of handling SIGPIPE just set /dev/null to stdout and stderr before podman exits. With this there should be no race and no way to run into SIGPIPE errors. [NO TESTS NEEDED] Fixes containers#11248 Signed-off-by: Paul Holzinger <[email protected]>
Followup to #11091: looks like there's still something flaking:
sys: podman network connect/disconnect with port forwarding
The text was updated successfully, but these errors were encountered: