
Can't connect to docker container's exposed port with network's gateway ip from other container #5588

Closed
1 of 3 tasks
illesguy opened this issue Apr 16, 2021 · 15 comments

@illesguy

  • I have tried with the latest version of Docker Desktop
  • I have tried disabling enabled experimental features
  • I have uploaded Diagnostics

Expected behavior

When creating a Docker container with an exposed port (not binding it to a specific host port), I expect to be able to reach the application running in it from another container on the same network, using the network's gateway IP and the port the container was assigned on the host. HostPort in the container's HostConfig.PortBindings should be empty. This is the behaviour on Ubuntu.

Actual behavior

When creating a Docker container with an exposed port (not binding it), I can reach the container from the host with localhost and the newly assigned port, but not from another container using the network's gateway IP and the newly assigned port. HostPort in the container's HostConfig.PortBindings is set to "0". This is the behaviour on Mac.

Information

The problem is reproducible. We are not sure whether it is new; we only just noticed it, so it may have come with an update. Steps to reproduce are below. When we run a container with the -p <port> option to map the container's port to a randomly assigned one on the host machine, HostPort under HostConfig.PortBindings in the container's config is set to "0"; on Ubuntu it is an empty string. If we specify the host port to bind to with -p <host_port>:<container_port>, it works fine; in that case HostPort under HostConfig.PortBindings is the host_port we specified on both Mac and Ubuntu. We suspect this is causing the issue.
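
For illustration, the difference is visible directly in docker inspect. A minimal sketch based on the behaviour described above (the postgres:9.6 image and the 15432 host port are arbitrary examples):

# Random host port: on Mac, HostPort under HostConfig.PortBindings shows "0"; on Ubuntu, "".
C1=$(docker run -d -p 5432 -e POSTGRES_PASSWORD=test postgres:9.6)
docker inspect -f '{{json .HostConfig.PortBindings}}' $C1

# Explicit host port: HostPort shows "15432" on both platforms.
C2=$(docker run -d -p 15432:5432 -e POSTGRES_PASSWORD=test postgres:9.6)
docker inspect -f '{{json .HostConfig.PortBindings}}' $C2

docker rm -f $C1 $C2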

macOS version: 10.15.6
Docker Desktop Version: 3.3.1
Docker version: 20.10.5

Ubuntu host tested on: 18.04.5
Docker version: 20.10.6

Reproduce

The following script reproduces the issue; it fails on Mac and succeeds on Ubuntu:

#!/usr/bin/env bash

echo "Starting PostgreSQL database in container"
CONTAINER=$(docker run -d -p 5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test postgres:9.6)
echo "PostgreSQL database started in container: $CONTAINER"
sleep 5  # Making sure db will be responsive

PORT=$(docker port $CONTAINER | head -n 1 | rev | cut -d: -f1 | rev)
GATEWAY_IP=$(docker network inspect -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}' bridge)
echo
echo "Gateway IP: $GATEWAY_IP, port: $PORT"

echo
echo "Querying database version from another container"
docker run postgres:9.6 psql "host=$GATEWAY_IP port=$PORT dbname=test user=test password=test" -c "select version()"

echo
echo "HostConfig PortBindings of container: $CONTAINER"
docker inspect -f '{{range $p, $conf := .HostConfig.PortBindings}} {{$p}} -> HostIp: {{(index $conf 0).HostIp}} HostPort: {{(index $conf 0).HostPort}} {{end}}' $CONTAINER

echo
echo "NetworkSettings Ports of container: $CONTAINER"
docker inspect -f '{{range $p, $conf := .NetworkSettings.Ports}} {{$p}} -> HostIp: {{(index $conf 0).HostIp}} HostPort: {{(index $conf 0).HostPort}} {{end}}' $CONTAINER

echo
echo "Stopping $CONTAINER"
docker stop $CONTAINER
@nk412

nk412 commented Apr 16, 2021

+1 seeing the same issue

@djs55
Contributor

djs55 commented Apr 16, 2021

Thanks for your report.

For maximum compatibility/portability between systems I recommend

  • using docker run -p to expose ports on the host (as you are doing)
  • but to access one container from another on the same network use the internal DNS name and IP, for example:
% docker network create test  
% docker run --rm --net=test --name=container1 -it alpine sh
...
% docker run --rm --net=test --name=container2 -it alpine sh
/ # ping container1
PING container1 (172.18.0.2): 56 data bytes
64 bytes from 172.18.0.2: seq=0 ttl=64 time=0.596 ms
^C
--- container1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.596/0.596/0.596 ms

The networking implementation on macOS is different to native Linux as there are two kernels with separate network stacks. On Linux docker run -p has the side-effect you observe, but on macOS the port has to be exposed differently, due to the different kernel.
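
Applied to the repro script above, the recommended pattern would look roughly like this (a sketch; the network name test and container name pg are arbitrary):

docker network create test

# Start the database on the user-defined network; no host port mapping is needed
# for container-to-container traffic.
docker run -d --net=test --name=pg -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test postgres:9.6
sleep 5  # Give the database a moment to become responsive

# Reach it from another container on the same network by name and internal port.
docker run --rm --net=test postgres:9.6 psql "host=pg port=5432 dbname=test user=test password=test" -c "select version()"

docker rm -f pg
docker network rm test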

@illesguy
Author

illesguy commented Apr 16, 2021

The issue we are having is within a Python library (https://github.com/testcontainers/testcontainers-python) that uses the approach I've described (network gateway IP + port) to communicate between the containers it starts, so we can't change that. For now we were able to work around this by explicitly binding the ports we want to expose instead of letting Docker assign them.
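
For example, the docker run line in the repro script above can bind an explicit host port (a sketch; 15432 is an arbitrary choice):

# Explicitly choose the host port instead of letting Docker assign one.
CONTAINER=$(docker run -d -p 15432:5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test postgres:9.6)
PORT=15432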

But if using the internal DNS name is the recommended approach, maybe that is a ticket to open on the repository of the library we are using.

@ag-TJNII

ag-TJNII commented May 5, 2021

On Linux docker run -p has the side-effect you observe, but on macOS the port has to be exposed differently, due to the different kernel.

So is this a 'Won't Fix', then? This used to work fine on Docker for Mac. I support Dockerized utilities in both Linux and Mac environments and use this behavior often. I also tried hitting the listening container by its IP when troubleshooting, and that didn't work. When did this behavior change? I didn't see any notes about it in the changelogs.

I'm also unable to get back to a working state by uninstalling and installing an older version. I suspect this is due to a recent update of the networking components, and I noticed the uninstall does not ask for privileged access like the install does. Does an uninstall also remove the networking components?

@goatherder

I see the same behaviour with https://github.com/testcontainers/testcontainers-go

@docker-robott
Collaborator

Issues go stale after 90 days of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale

@rnorth

rnorth commented Aug 26, 2021

We're looking into something that looks awfully similar with testcontainers-java, whereby GATEWAY_IP:PORT connections fail in Docker for Mac & Windows from one container to another.

testcontainers/testcontainers-java#4395

We're not quite sure where our investigations will lead us yet, but we'll update here if we find something useful.

@rnorth

rnorth commented Aug 26, 2021

/remove-lifecycle stale

^ on the basis that this used to work just fine and has become relied upon, so removal without warning is not without impacts.

@rnorth

rnorth commented Aug 26, 2021

Based on some initial tinkering there's something weird going on here, but it gives us some ideas for a workaround that could be applied to testcontainers-java and probably other similar libs:

# Running INSIDE a container which has the docker socket volume-mounted (docker wormhole pattern):

$ docker run -d -p 8080 SOME_IMAGE
# find out the host port using docker ps
$ curl http://172.17.0.1:58427
curl: (7) Failed to connect to 172.17.0.1 port 58427 after 0 ms: Connection refused

$ docker run -d -p 0:8080 SOME_IMAGE
# find out the host port using docker ps
$ curl http://172.17.0.1:55012
# SUCCESS

I was always under the impression that -p PORT and -p 0:PORT would be effectively the same, but I'm sure there are many moving parts that make this not such a simple assertion 😆.

Taking @illesguy's repro script above and simply sticking a 0: in the right place seems to fix it (with Docker for Mac 3.5.2, launching the script from inside a container).
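
Concretely, the only change needed in the repro script is the port argument (a sketch of the before/after):

# Before: random host port, fails from another container on Docker for Mac
CONTAINER=$(docker run -d -p 5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test postgres:9.6)

# After: 0: still asks for a random host port, but the gateway IP connection now succeeds
CONTAINER=$(docker run -d -p 0:5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test postgres:9.6)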

@djs55, for various use cases (particularly CI environments with 'docker wormhole' arrangements) the GATEWAY_IP:PORT pattern has, for a long time, been really quite important to us and the network+named container pattern would be quite hard for users to migrate to.

Could you perhaps give us a steer on:

  • whether we would be safe to adopt a workaround resembling the -p 0:... approach, or whether there's an imminent risk that this might also break?
  • given the apparent simplicity of the workaround, might this actually be fixable in Docker Desktop?

Thank you in advance.

@djs55
Contributor

djs55 commented Oct 4, 2021

@rnorth sorry for the delay in replying. The observation about 0:8080 working but 8080 not working is interesting. I'll add a ticket to our backlog to investigate: it might be fixable as you suggest.

@djs55
Contributor

djs55 commented Oct 5, 2021

I've done a bit of investigating and I believe this problem is due to the fix for docker/for-win#10008. On Mac and Windows, dockerd runs inside the VM, in a network namespace. When a random port is needed to expose a container (e.g. docker run -p 80), there was a bug where dockerd would choose a port which was free inside its network namespace but which wasn't free on the host. This was fixed by choosing a random port on both the host and on 127.0.0.1 in Linux, connecting them together, and rewriting the docker ps results over /var/run/docker.sock to show the correct value for the host:

[Diagram: Desktop port forwarding]

There was a bug in this bugfix, in the test for "is the user requesting a random port". The code checks for the port being "", but not the port being "0". So the reason that -p 0:... works at the moment is because it's avoiding this bugfix, by accident.

Considering the possibility of making -p 80 and -p 0:80 behave the same i.e. fixing the bug in the bugfix, I can see 2 further problems:

  1. the bugfix binds internally on 127.0.0.1 inside the Linux network namespace, so the port isn't available on the gateway IP
  2. even if the port were available on the gateway IP, it would be the internal port rather than the external one visible in docker ps on the host.

I could probably fix (1) by binding internally on 0.0.0.0 rather than 127.0.0.1; hopefully this would be safe because the Linux network namespace with dockerd is almost empty. However this still leaves the problem of discovering the internal port number.

It is possible to bypass the Docker API proxy which is rewriting the port number:

  • docker run -v /var/run/docker.sock:/var/run/docker.sock docker ps: would show the host port
  • docker run -v /var/run/docker.sock.raw:/var/run/docker.sock docker ps: would show the internal port

I don't know how convenient this would be to change. It's a bit of an internal detail of Docker Desktop.

So in summary:

  1. -p 0:... seems to work by accident rather than design 😱
  2. I can't guarantee anything, but I think there's no imminent danger of it stopping working: I'll add a test case which checks that it continues to work and link it to this issue, so we can avoid breaking it if possible / give you advance warning
  3. we need to think of a better fix for the original issue and this one

@gesellix

gesellix commented Oct 5, 2021

I'll add a test case which checks that it continues to work and link it to this issue, so we can avoid breaking it if possible / give you advance warning

❤️

@rnorth

rnorth commented Oct 6, 2021

@djs55 thank you so much for the detailed analysis and explanation! I really appreciate you spending time looking into this.

We'll have some internal talks to discuss what we do at our end, but my immediate thoughts are (in no particular order):

I can't guarantee anything, but I think there's no imminent danger of it stopping working: I'll add a test case which checks that it continues to work and link it to this issue, so we can avoid breaking it if possible / give you advance warning

I think I'd strongly echo @gesellix's ❤️ reaction: this ought to help us implement the 'accidental workaround' with much more confidence. Having the behaviour be covered by a test is a great idea - thanks.

It is possible to bypass the Docker API proxy which is rewriting the port number... I don't know how convenient this would be to change.

I learned something new today 😄. Slightly longer term, this wouldn't be a trivial change for us, but might be something we could use from the Testcontainers side if essential. We already have to cater for other differences when we're running inside a container, so extending this might be practical if we can do it safely. My only concern would be that we'd be even deeper into coupling to Docker Desktop internals, so we have to consider that.

We'll keep this ticket updated with plans from the Testcontainers side.

Thanks again.


@docker-robott
Collaborator

Closed issues are locked after 30 days of inactivity.
This helps our team focus on active issues.

If you have found a problem that seems similar to this, please open a new issue.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle locked

@docker locked and limited conversation to collaborators Mar 5, 2022