network connect/disconnect: new flake #11248

edsantiago · 2021-08-17T15:03:00Z

Followup to #11091: looks like there's still something flaking:

[+1340s] not ok 241 podman network connect/disconnect with port forwarding
         # (from function `is' in file test/system/helpers.bash, line 474,
         #  in test file test/system/500-networking.bats, line 445)
         #   `is "$output" "$random_1" "curl 127.0.0.1:/index.txt should work again"' failed with status 56
         # $ podman rm --all --force
         # $ podman ps --all --external --format {{.ID}} {{.Names}}
         # $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # quay.io/libpod/testimage:20210610 9f9ec7f2fdef
         # $ podman network create testnet-FA2Qkj7JP6
         # /home/some24215dude/.config/cni/net.d/testnet-FA2Qkj7JP6.conflist
         # $ podman network create testnet2-HVIhBV5Cvx
         # /home/some24215dude/.config/cni/net.d/testnet2-HVIhBV5Cvx.conflist
         # $ podman run -d --network testnet-FA2Qkj7JP6 quay.io/libpod/testimage:20210610 top
         # 0b007115b381d4c40e11b6948839dbf4d572fa899d5ec5dc6096a566dcc0b89d
         # $ podman run -d -p 12345:80 --network testnet-FA2Qkj7JP6 -v /tmp/podman_bats.nacKs4/hello.txt:/var/www/index.txt:Z -w /var/www quay.io/libpod/testimage:20210610 /bin/busybox-extras httpd -f -p 80
         # 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9
         # $ podman inspect 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9 --format {{(index .NetworkSettings.Networks "testnet-FA2Qkj7JP6").IPAddress}}
         # 10.89.0.3
         # $ podman inspect 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9 --format {{(index .NetworkSettings.Networks "testnet-FA2Qkj7JP6").MacAddress}}
         # b6:f0:5d:b8:8f:81
         # $ podman network disconnect testnet-FA2Qkj7JP6 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9
         # $ podman network connect testnet-FA2Qkj7JP6 5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9
         # time="2021-08-11T15:40:40-05:00" level=warning msg="Could not reload rootless port mappings, port forwarding may no longer work correctly: dial unix /run/user/6817/libpod/tmp/rp/5f10d636830a5d8aea5648a735386ff6471639fcdf7ef19bc859ec2bce349df9: connect: connection refused"
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: curl 127.0.0.1:/index.txt should work again
         # #| expected: 'V2nBRSTMbI5tU89XO1xo5by7lYEdOT'
         # #|   actual: ''
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

sys: podman network connect/disconnect with port forwarding

fedora-33 : sys podman fedora-33 rootless host
- PR Release notes for Podman v3.3.0-RC2 #11193
  - 08-11 16:43
fedora-34 : sys podman fedora-34 rootless host
- PR Add support for pod inside of user namespace. #10589
  - 08-09 16:39
ubuntu-2010 : sys podman ubuntu-2010 rootless host
- PR make sure that signal buffers are sufficiently big #11240
  - 08-17 07:37
- PR Bump github.com/rootless-containers/rootlesskit from 0.14.3 to 0.14.4 #11125
  - 08-04 05:02

Luap99 · 2021-08-17T16:44:12Z

I will take a look.
@edsantiago By the way, do you still see rootless compose test flakes? I think I saw them still failing sometimes.

edsantiago · 2021-08-17T16:48:20Z

#11091 merged on Aug 6. compose flakes since then:

compose: simple_port_map - curl (port 5000) failed with status 7

fedora-34 : compose test on fedora-34 (rootless)
- PR Add until filter to podman pod ps #11173
  - 08-10 16:25

compose: env_and_volume - curl (port 5000) failed with status 7

fedora-34 : compose test on fedora-34 (root)
- PR skip flaking auto-update test #11176
  - 08-10 05:21

compose: env_and_volume : port 5001

fedora-34 : compose test on fedora-34 (root)
- PR skip flaking auto-update test #11176
  - 08-10 05:21

compose: mount_and_label - curl (port 5000) failed with status 7

fedora-34 : compose test on fedora-34 (root)
- PR Reproducible Builds: trim embedded cgo paths #11160
  - 08-09 14:26

(EDIT: I guess only the first of those is rootless)

Luap99 · 2021-08-18T08:24:04Z

OK from reading connect(2)

ECONNREFUSED
A connect() on a stream socket found no one listening on the remote address.

So that means the socket file exists but no one is listening on it. The rootlessport process did create the socket with net.Listen(); otherwise we would get an ENOENT error. For some reason rootlessport is no longer listening on it. I think the rootlessport process exited somehow (maybe it got killed). @edsantiago Do you have a reproducer, or can we add debugging output to the test, e.g. ps ux, to see if the containers-rootlessport process is still active?

edsantiago · 2021-08-18T11:34:41Z

I don't have a reproducer, but it should be easy to add a debug statement, something like ps auxww | grep rootlessport immediately before and then immediately after the podman network connect (before the is).

edsantiago · 2021-08-18T11:49:03Z

Well, that failed quickly (podman-3.3.0-0.13.rc2.fc34.x86_64):

$ while :;do bats --filter disconnect /usr/share/podman/test/system/500-networking.bats  || break;done
...
 ✗ podman network connect/disconnect with port forwarding
   (from function `is' in file /usr/share/podman/test/system/helpers.bash, line 474,
    in test file /usr/share/podman/test/system/500-networking.bats, line 447)
     `is "$output" "$random_1" "curl 127.0.0.1:/index.txt should work again"' failed with status 56
   $ podman rm --all --force
   $ podman ps --all --external --format {{.ID}} {{.Names}}
   $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210610 9f9ec7f2fdef
   $ podman network create testnet-xQXsssXgCB
   /home/fedora/.config/cni/net.d/testnet-xQXsssXgCB.conflist
   $ podman network create testnet2-O1sbJx5lkf
   /home/fedora/.config/cni/net.d/testnet2-O1sbJx5lkf.conflist
   $ podman run -d --network testnet-xQXsssXgCB quay.io/libpod/testimage:20210610 top
   564449877f9ebf99ddb5e6b205f8b8665d7f5033ff2f69e99021a40198b6852a
   $ podman run -d -p 12345:80 --network testnet-xQXsssXgCB -v /tmp/podman_bats.EafKjG/hello.txt:/var/www/index.txt:Z -w /var/www quay.io/libpod/testimage:20210610 /bin/busybox-extras httpd -f -p 80
   75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
   $ podman inspect 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83 --format {{(index .NetworkSettings.Networks "testnet-xQXsssXgCB").IPAddress}}
   10.89.0.3
   $ podman inspect 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83 --format {{(index .NetworkSettings.Networks "testnet-xQXsssXgCB").MacAddress}}
   9a:4a:da:2c:da:2c
   $ podman network disconnect testnet-xQXsssXgCB 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
   fedora     54185  0.0  0.0   6140   836 pts/0    S+   07:45   0:00 grep rootlessport
   $ podman network connect testnet-xQXsssXgCB 75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83
   time="2021-08-18T07:45:48-04:00" level=warning msg="Could not reload rootless port mappings, port forwarding may no longer work correctly: dial unix /run/user/1000/libpod/tmp/rp/75291ed370cc8979dffe7b6ab8a69cdac42e7fd04d33797bf2039f478eee5f83: connect: connection refused"
   fedora     54415  0.0  0.0   6140   840 pts/0    S+   07:45   0:00 grep rootlessport
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: curl 127.0.0.1:/index.txt should work again
   #| expected: 'CHbNHcWcT4eryCWX672X8W6fJGwKre'
   #|   actual: ''
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Obviously I didn't get the ps right, because it's empty both before and after. Anyhow, the above failed for me in just a few minutes. Hope it helps.

edsantiago · 2021-08-18T11:57:10Z

Here's a better one:

   $ podman inspect 6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c --format {{(index .NetworkSettings.Networks "testnet-k45lQzQV1m").MacAddress}}
   ca:ab:38:81:81:80

   fedora    136264  0.0  2.5 1285612 47548 pts/0   Sl   07:51   0:00 containers-rootlessport
   fedora    136270  0.0  2.5 1064416 47364 pts/0   Sl   07:51   0:00 containers-rootlessport-child
   fedora    136801  0.0  0.0   6140   772 pts/0    S+   07:51   0:00 grep rootlessport

   $ podman network disconnect testnet-k45lQzQV1m 6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c

   fedora    137125  0.0  0.0   6140   844 pts/0    S+   07:51   0:00 grep rootlessport

   $ podman network connect testnet-k45lQzQV1m 6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c
   time="2021-08-18T07:51:24-04:00" level=warning msg="Could not reload rootless port mappings, port forwarding may no longer work correctly: dial unix /run/user/1000/libpod/tmp/rp/6ce34fe04f29019c47ef73af12e7801112618fabe698ae47591b7060d522ba3c: connect: connection refused"

   fedora    137455  0.0  0.0   6140   776 pts/0    S+   07:51   0:00 grep rootlessport

   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: curl 127.0.0.1:/index.txt should work again
   #| expected: 'QFIstX9yQIhun6sUBPtoMYmX5qUJjq'
   #|   actual: ''
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Blank line separators added for ease of reading). This shows that rootlessport is present before podman network disconnect, then is gone and never comes back. I will leave further diagnosis in your hands.

Luap99 · 2021-08-18T12:06:00Z

Thanks, so my suspicion was right. Now the fun part: find the killer :)

Luap99 · 2021-08-18T16:12:01Z

Did you try this in a 1minutetip VM? I got this to fail after 45 min and found out that the process is not dead after the disconnect; it is dead after the curl command that checks that the port connection no longer works. I also noticed that the rootlessport process did not exit the normal way, so it either got killed or it panicked.

I tried to reproduce with a patched podman which redirects stdout and stderr to a file instead of /dev/null, so I could get the output and see if there is a stack trace or something. However, with this patch it has now run for over two hours without failure.

edsantiago · 2021-08-18T16:14:47Z

Yes, this was 1minutetip -n f34, with dnf --enablerepo=updates-testing install podman-tests. It consistently took 1-2 minutes to fail.

Luap99 · 2021-08-18T16:18:18Z

This is the patch I use:

diff --git a/pkg/rootlessport/rootlessport_linux.go b/pkg/rootlessport/rootlessport_linux.go
index ede216bfe..f988a0c61 100644
--- a/pkg/rootlessport/rootlessport_linux.go
+++ b/pkg/rootlessport/rootlessport_linux.go
@@ -120,11 +120,12 @@ func parent() error {
                select {
                case s := <-sigC:
                        if s == unix.SIGPIPE {
-                               if f, err := os.OpenFile("/dev/null", os.O_WRONLY, 0755); err == nil {
+                               if f, err := os.OpenFile("/tmp/rootlessport", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0755); err == nil {
                                        unix.Dup2(int(f.Fd()), 1) // nolint:errcheck
                                        unix.Dup2(int(f.Fd()), 2) // nolint:errcheck
                                        f.Close()
                                }
+                               logrus.Info("got SIGPIPE")
                        }
                case <-exitC:
                }
@@ -265,6 +266,7 @@ outer:
        if _, err := ioutil.ReadAll(exitR); err != nil {
                return err
        }
+       logrus.Info("exit")
        return nil
 }

@edsantiago Could you try to build this one and see if it fails for you? If it does, please provide the /tmp/rootlessport file.

> Yes, this was 1minutetip -n f34, with dnf --enablerepo=updates-testing install podman-tests. It consistently took 1-2 minutes to fail.

Yes, that is what I did too. It failed after 45 min, and after 30 min in another run. With my patch it has not failed for well over two hours now.

edsantiago · 2021-08-18T16:26:35Z

I just brought up a VM, ran the reproducer, and it failed on the second iteration (a few seconds). Can I offer you my VM for you to use? I don't have a good setup for building podman in VMs, it will take me too long. PM me on IRC.

Just to confirm, though: I am running as user fedora, with the magic loginctl enable-linger fedora turned on. Did you remember to do the magic loginctl?

Luap99 · 2021-08-18T18:59:05Z

OK, I know what is causing this. The problem is that SIGPIPE kills the process. Stdout and stderr are attached to the podman parent process, and when podman exits and the rootlessport process tries to write to stdout/stderr, the write fails with SIGPIPE. The code handles this signal and uses dup2 to set /dev/null on both stdout and stderr. I am not sure why, but sometimes this seems to fail, the process keeps running into SIGPIPE errors, and eventually it gets killed.

When the rootlessport process is started, its stdout/stderr are attached to the podman process. However, once everything is set up podman exits, and when the rootlessport process then tries to write to stdout it fails with SIGPIPE. The code handles this signal and puts /dev/null on stdout and stderr, but this is not robust. I do not understand the exact cause, but sometimes the process is still killed by SIGPIPE. Either Go lost the signal or the process was already killed before the goroutine could handle it.

Instead of handling SIGPIPE, just set /dev/null on stdout and stderr before podman exits. With this there should be no race and no way to run into SIGPIPE errors.

[NO TESTS NEEDED]

Fixes containers#11248

Signed-off-by: Paul Holzinger <[email protected]>

edsantiago added the flakes (Flakes from Continuous Integration) and rootless labels Aug 17, 2021

edsantiago assigned Luap99 Aug 17, 2021

Luap99 mentioned this issue Aug 18, 2021

fix rootlessport flake #11269

Merged

openshift-merge-robot closed this as completed in #11269 Aug 18, 2021

github-actions bot added the "locked - please file new issue/PR" label Sep 21, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
