Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

network connect/disconnect: new flake #11248

Closed
edsantiago opened this issue Aug 17, 2021 · 12 comments · Fixed by #11269
Closed

network connect/disconnect: new flake #11248

edsantiago opened this issue Aug 17, 2021 · 12 comments · Fixed by #11269
Assignees
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. rootless

Comments

@edsantiago
Copy link
Member

Followup to #11091: looks like there's still something flaking:

[+1340s] not ok 241 podman network connect/disconnect with port forwarding
         # (from function `is' in file test/system/helpers.bash, line 474,
         #  in test file test/system/500-networking.bats, line 445)
         #   `is "$output" "$random_1" "curl 127.0.0.1:/index.txt should work again"' failed with status 56
         # $ podman rm --all --force
         # $ podman ps --all --external --format {{.ID}} {{.Names}}
         # $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # quay.io/libpod/testimage:20210610 9f9ec7f2fdef
         # $ podman network create testnet-FA2Qkj7JP6
         # /home/some24215dude/.config/cni/net.d/testnet-FA2Qkj7JP6.conflist
         # $ podman network create testnet2-HVIhBV5Cvx
         # /home/some24215dude/.config/cni/net.d/testnet2-HVIhBV5Cvx.conflist
         # $ podman run -d --network testnet-FA2Qkj7JP6 quay.io/libpod/testimage:20210610 top
         # 0b007115b381d4c40e11b6948839dbf4d572fa899d5ec5dc6096a566dcc0b89d
         # $ podman run -d -p 12345:80 --network testnet-FA2Qkj7JP6 -v /tmp/podman_bats.nacKs4/hello.txt:/var/www/index.txt:Z -w /var/www quay.io/libpod/testimage:20210610 /bin/busybox-extras httpd -f -p 80
         # 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9
         # $ podman inspect 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9 --format {{(index .NetworkSettings.Networks "testnet-FA2Qkj7JP6").IPAddress}}
         # 10.89.0.3
         # $ podman inspect 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9 --format {{(index .NetworkSettings.Networks "testnet-FA2Qkj7JP6").MacAddress}}
         # b6:f0:5d:b8:8f:81
         # $ podman network disconnect testnet-FA2Qkj7JP6 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9
         # $ podman network connect testnet-FA2Qkj7JP6 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9
         # time="2021-08-11T15:40:40-05:00" level=warning msg="Could not reload rootless port mappings, port forwarding may no longer work correctly: dial unix /run/user/6817/libpod/tmp/rp/5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9: connect: connection refused"
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: curl 127.0.0.1:/index.txt should work again
         # #| expected: 'V2nBRSTMbI5tU89XO1xo5by7lYEdOT'
         # #|   actual: ''
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

sys: podman network connect/disconnect with port forwarding

@edsantiago edsantiago added flakes Flakes from Continuous Integration rootless labels Aug 17, 2021
@Luap99
Copy link
Member

Luap99 commented Aug 17, 2021

I will take a look.
@edsantiago Btw do you see rootless compose test flakes? I think I saw it still failing sometimes.

@edsantiago
Copy link
Member Author

edsantiago commented Aug 17, 2021

#11091 merged on Aug 6. compose flakes since then:

compose: simple_port_map - curl (port 5000) failed with status 7

compose: env_and_volume - curl (port 5000) failed with status 7

compose: env_and_volume : port 5001

compose: mount_and_label - curl (port 5000) failed with status 7

(EDIT: I guess only the first of those is rootless)

@Luap99
Copy link
Member

Luap99 commented Aug 18, 2021

OK from reading connect(2)

ECONNREFUSED
A connect() on a stream socket found no one listening on the remote address.

So that means the socket file exists but no one is listening on it. The rootlessport process did create the socket with net.Listen() otherwise we would get a ENOENT error. For some reason rootlessport is no longer listen on it. I think the rootlessport process exited somehow (maybe it got killed). @edsantiago Do you have a reproducer or can we add debugging output to the test, e.g. ps ux to see if the containers-rootlessport process is still active.

@edsantiago
Copy link
Member Author

I don't have a reproducer, but it should be easy to add a debug statement, something like ps auxww | grep rootlessport immediately before and then immediately after the podman network connect (before the is).

@edsantiago
Copy link
Member Author

Well, that failed quickly (podman-3.3.0-0.13.rc2.fc34.x86_64):

$ while :;do bats --filter disconnect /usr/share/podman/test/system/500-networking.bats  || break;done
...
 ✗ podman network connect/disconnect with port forwarding
   (from function `is' in file /usr/share/podman/test/system/helpers.bash, line 474,
    in test file /usr/share/podman/test/system/500-networking.bats, line 447)
     `is "$output" "$random_1" "curl 127.0.0.1:/index.txt should work again"' failed with status 56
   $ podman rm --all --force
   $ podman ps --all --external --format {{.ID}} {{.Names}}
   $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210610 9f9ec7f2fdef
   $ podman network create testnet-xQXsssXgCB
   /home/fedora/.config/cni/net.d/testnet-xQXsssXgCB.conflist
   $ podman network create testnet2-O1sbJx5lkf
   /home/fedora/.config/cni/net.d/testnet2-O1sbJx5lkf.conflist
   $ podman run -d --network testnet-xQXsssXgCB quay.io/libpod/testimage:20210610 top
   564449877f9ebf99ddb5e6b205f8b8665d7f5033ff2f69e99021a40198b6852a
   $ podman run -d -p 12345:80 --network testnet-xQXsssXgCB -v /tmp/podman_bats.EafKjG/hello.txt:/var/www/index.txt:Z -w /var/www quay.io/libpod/testimage:20210610 /bin/busybox-extras httpd -f -p 80
   75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
   $ podman inspect 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83 --format {{(index .NetworkSettings.Networks "testnet-xQXsssXgCB").IPAddress}}
   10.89.0.3
   $ podman inspect 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83 --format {{(index .NetworkSettings.Networks "testnet-xQXsssXgCB").MacAddress}}
   9a:4a:da:2c:da:2c
   $ podman network disconnect testnet-xQXsssXgCB 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
   fedora     54185  0.0  0.0   6140   836 pts/0    S+   07:45   0:00 grep rootlessport
   $ podman network connect testnet-xQXsssXgCB 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
   time="2021-08-18T07:45:48-04:00" level=warning msg="Could not reload rootless port mappings, port forwarding may no longer work correctly: dial unix /run/user/1000/libpod/tmp/rp/75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83: connect: connection refused"
   fedora     54415  0.0  0.0   6140   840 pts/0    S+   07:45   0:00 grep rootlessport
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: curl 127.0.0.1:/index.txt should work again
   #| expected: 'CHbNHcWcT4eryCWX672X8W6fJGwKre'
   #|   actual: ''
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Obviously I didn't get the ps right, because it's empty both before and after. Anyhow, the above failed for me in just a few minutes. Hope it helps.

@edsantiago
Copy link
Member Author

Here's a better one:

   $ podman inspect 6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c --format {{(index .NetworkSettings.Networks "testnet-k45lQzQV1m").MacAddress}}
   ca:ab:38:81:81:80

   fedora    136264  0.0  2.5 1285612 47548 pts/0   Sl   07:51   0:00 containers-rootlessport
   fedora    136270  0.0  2.5 1064416 47364 pts/0   Sl   07:51   0:00 containers-rootlessport-child
   fedora    136801  0.0  0.0   6140   772 pts/0    S+   07:51   0:00 grep rootlessport

   $ podman network disconnect testnet-k45lQzQV1m 6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c

   fedora    137125  0.0  0.0   6140   844 pts/0    S+   07:51   0:00 grep rootlessport

   $ podman network connect testnet-k45lQzQV1m 6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c
   time="2021-08-18T07:51:24-04:00" level=warning msg="Could not reload rootless port mappings, port forwarding may no longer work correctly: dial unix /run/user/1000/libpod/tmp/rp/6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c: connect: connection refused"

   fedora    137455  0.0  0.0   6140   776 pts/0    S+   07:51   0:00 grep rootlessport

   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: curl 127.0.0.1:/index.txt should work again
   #| expected: 'QFIstX9yQIhun6sUBPtoMYmX5qUJjq'
   #|   actual: ''
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Blank line separators added for ease of reading). This shows that rootlessport is present before podman network disconnect, then is gone and never comes back. I will leave further diagnosis in your hands.

@Luap99
Copy link
Member

Luap99 commented Aug 18, 2021

Thanks so my suspicion is right. Now the fun part, find the killer :)

@Luap99
Copy link
Member

Luap99 commented Aug 18, 2021

Did you tried this in a 1minutetip vm? I got this to fail after 45 min and found out that the process is not dead after disconnect, it is dead after the curl command which makes sure that the port connection does not work. I also noticed that the rootlessport process did not exited the normal way, so it either got killed or it paniced.

I tried to reproduce with a patched podman which redirects the stdout and stderr to a file instead of /dev/null so I could get the output and see if there is a stack trace or something. However with this patch it now runs over two hours without failure.

@edsantiago
Copy link
Member Author

Yes, this was 1minutetip -n f34, with dnf --enablerepo=updates-testing install podman-tests. It consistently took 1-2 minutes to fail.

@Luap99
Copy link
Member

Luap99 commented Aug 18, 2021

This is the patch I use:

diff --git a/pkg/rootlessport/rootlessport_linux.go b/pkg/rootlessport/rootlessport_linux.go
index ede216bfe..f988a0c61 100644
--- a/pkg/rootlessport/rootlessport_linux.go
+++ b/pkg/rootlessport/rootlessport_linux.go
@@ -120,11 +120,12 @@ func parent() error {
                select {
                case s := <-sigC:
                        if s == unix.SIGPIPE {
-                               if f, err := os.OpenFile("/dev/null", os.O_WRONLY, 0755); err == nil {
+                               if f, err := os.OpenFile("/tmp/rootlessport", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0755); err == nil {
                                        unix.Dup2(int(f.Fd()), 1) // nolint:errcheck
                                        unix.Dup2(int(f.Fd()), 2) // nolint:errcheck
                                        f.Close()
                                }
+                               logrus.Info("got SIGPIPE")
                        }
                case <-exitC:
                }
@@ -265,6 +266,7 @@ outer:
        if _, err := ioutil.ReadAll(exitR); err != nil {
                return err
        }
+       logrus.Info("exit")
        return nil
 }

@edsantiago Could you try to build this one see if it fails for you. And provide the /tmp/rootlessport file if it does.

Yes, this was 1minutetip -n f34, with dnf --enablerepo=updates-testing install podman-tests. It consistently took 1-2 minutes to fail.

Yes that is what I did too. Failed after 45 min, and after 30 min in another run. With my patch it did not failed for now well over two hours.

@edsantiago
Copy link
Member Author

I just brought up a VM, ran the reproducer, and it failed on the second iteration (a few seconds). Can I offer you my VM for you to use? I don't have a good setup for building podman in VMs, it will take me too long. PM me on IRC.

Just to confirm, though: I am running as user fedora, with the magic loginctl enable-linger fedora turned on. Did you remember to do the magic loginctl?

@Luap99
Copy link
Member

Luap99 commented Aug 18, 2021

OK I know what is causing this. The problem is SIGPIPE kills the process. Stdout and stderr are attached to the podman parent process and when podman exits and the rootlessport process tries to write to stdout/err it will fail with SIGPIPE, the code handles this signal and uses dup2 two set /dev/null to both stdout and stderr. I am not sure why but sometimes this seems to fail and the process continues to run into SIGPIPE errors and it gets killed eventually.

Luap99 added a commit to Luap99/libpod that referenced this issue Aug 18, 2021
When the rootlessport process is started the stdout/stderr are attached
to the podman process. However once everything is setup podman exits and
when the rootlessport process tries to write to stdout it will fail with
SIGPIPE. The code handles this signal and puts /dev/null to stdout and
stderr but this is not robust. I do not understand the exact cause but
sometimes the process is still killed by SIGPIPE. Either go lost the
signal or the process got already killed before the goroutine could
handle it.

Instead of handling SIGPIPE just set /dev/null to stdout and stderr
before podman exits. With this there should be no race and no way to
run into SIGPIPE errors.

[NO TESTS NEEDED]

Fixes containers#11248

Signed-off-by: Paul Holzinger <[email protected]>
Luap99 added a commit to Luap99/libpod that referenced this issue Aug 19, 2021
When the rootlessport process is started the stdout/stderr are attached
to the podman process. However once everything is setup podman exits and
when the rootlessport process tries to write to stdout it will fail with
SIGPIPE. The code handles this signal and puts /dev/null to stdout and
stderr but this is not robust. I do not understand the exact cause but
sometimes the process is still killed by SIGPIPE. Either go lost the
signal or the process got already killed before the goroutine could
handle it.

Instead of handling SIGPIPE just set /dev/null to stdout and stderr
before podman exits. With this there should be no race and no way to
run into SIGPIPE errors.

[NO TESTS NEEDED]

Fixes containers#11248

Signed-off-by: Paul Holzinger <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. rootless
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants