Use direct server return in east-west overlay load balancing #2270

ctelfer · 2018-09-20T14:17:12Z

This is a WIP to update the load balancing to support direct server return (DSR). The notion, at a high level is that instead of the IPVS load balancer modifying the destination IP address of a packet, it only modifies the destination MAC address and leaves the VIP in place. Furthermore, it avoids performing SNAT on the outgoing packet. The tasks in a service would each be programmed with the VIP(s) as an IP alias on the loopback interface so that they can receive and accept said packet. This requires that the stack enable IP forwarding but libnetwork sets this by default (in Linux) already for other reasons. To round out the picture, the server must also have ARP configured so that it does not respond to ARP queries for the VIP nor attempt ARP queries with the VIP as a source protocol address.

This approach will not easily work for ingress processing because libnetwork uses SNAT to ensure the routability of traffic from the outside network. However, this does address a long-standing concern for L7 load balancers running in the host network and balancing traffic to internal Docker networks (similar to moby/moby#35082). In this case, the L7 load balancer in the host network would be limited in the number of unique 5-tuples that it can open to the service task(s) leading to the connection tracking recycling issues mentioned. This change would also address issue that some folks have voiced about NAT hiding the original address of the client in east-west traffic.

The PR so far does not have any support for Windows networking which is why it is a WIP. I have tested it with Linux clusters and it has passed my tests thus far.

selansen · 2018-09-20T23:18:59Z

Lint Failure
🐳 lint
ipvs/ipvs.go:54:2: don't use ALL_CAPS in Go names; use CamelCase
ipvs/ipvs.go:54:2: exported const CONN_F_FWD_MASK should have comment (or a comment on this block) or be unexported
ipvs/ipvs.go:55:2: don't use ALL_CAPS in Go names; use CamelCase

ctelfer · 2018-09-21T18:59:23Z

Thanks. I'd patched it but forgot to push. Have updated now. In general, am not sure how I feel about the CamelCase rule for constants that are meant to be one-to-one mappings with constants from the OS (as in this case). It would seem like those should keep the same form in order to be easy to discover / line up and verify.

selansen · 2018-09-27T18:20:10Z

@ctelfer, which release are we planning to pick up this new change ?

ctelfer · 2018-09-28T14:14:28Z

I don't know. I just wanted to put this out there to make sure it was captured. If we can't find a good way to match the behavior in Windows we may have to introduce swarm changes or a new type of network (.e.g. "overlay2"). My one thought for windows integration so far is that we could do an ingress NAT for traffic coming in on the VIPs to direct the traffic to a container on the node. Ideal would be to match on the MAC to direct to a specific container, but we could do a second layer of load balancing where the incoming node load balancer knows the identity of all containers on the local node. This is actually something we don't currently track and so its a bit of an overhaul to add it. It also seems like a Bad Idea to have two layers of load balancing where the second doesn't serve any real purpose except NAT.

fcrisciani · 2018-10-09T00:32:02Z

osl/sandbox.go

@@ -51,6 +51,10 @@ type Sandbox interface {
 	// RemoveAliasIP removes the passed IP address from the named interface
 	RemoveAliasIP(ifName string, ip *net.IPNet) error

+	// DisableARPForVIP disables ARP replies and requests for VIP addresses
+	// on a particular interface
+	DisableARPForVIP(ifName string) error


better DisableARPForIfc?

Hrmm.. Don't object to renaming, but the suggestion is less accurate. The function doesn't disable ARP for all addresses. It just disables it for addresses assigned to a different interface. We really only care that the VIP addresses have ARP suppressed for purposes of this API. But the operation does change behavior slightly. Say you had:

+---------------+ eth0(10.1.0.2) ------+ container +------ eth1(10.2.0.7) +---------------+

The way containers are currently configured, if you ARPed 10.2.0.7 on eth0, you'd get a response back indicating that that address was on that interface. The ARP configurations that this function performs would prevent that. (the contaienr would ignore incoming ARPs for 10.2.0.7 on eth0) Ideally, we would be configuring this behavior ONLY for the VIP addresses (which get assigned to lo). Unfortunately, there isn't a nice way to do that without using arptables as far as I know. And requiring arptables and the like on docker installations is not feasible.

As far as ARPing across interfaces goes, my personal opinion is that this behavior should be the restriction that this function puts in place should be the default. But the Linux devs disagree. In any case I don't think that the more relaxed behavior is something an application developer should rely on. (see https://lwn.net/Articles/45373/) So the restriction should be ok. It would also be a property of the network, in any case. (i.e. only applies to endpoints that are on a "dsr"-option-enabled "overlay" network).

I see the point, it's difficult to express with a concise name. nvm for the moment then

ctelfer · 2018-10-10T18:34:41Z

I have pushed a change that makes the DSR behavior an overlay-network-specific property that one must enable by adding --opt dsr to the docker network create... command. (or by setting "dsr" in the Options map passed to a NetworkCreate() call for an overlay network.)

fcrisciani · 2018-10-10T22:23:59Z

sandbox.go

@@ -767,7 +775,10 @@ func (sb *sandbox) releaseOSSbox() {
 	}

 	for _, ep := range sb.getConnectedEndpoints() {
-		releaseOSSboxResources(osSbox, ep)
+		ep.Lock()


the releaseOSSboxResources is already getting the ep.Lock can we check this inside the function itself?

facepalm Yep, that's much more sensible...

Modify the loadbalancing for east-west traffic to use direct routing rather than NAT and update tasks to use direct service return under linux. This avoids hiding the source address of the sender and improves the performance in single-client/single-server tests. Signed-off-by: Chris Telfer <[email protected]>

Allow DSR to be a configurable option through a generic option to the overlay driver. On the one hand this approach makes sense insofar as only overlay networks can currently perform load balancing. On the other hand, this approach has several issues. First, should we create another type of swarm scope network, this will prevent it working. Second, the service core code is separate from the driver code and the driver code can't influence the core data structures. So the driver code can't set this option itself. Therefore, implementing in this way requires some hack code to test for this option in controller.NewNetwork. A more correct approach would be to make this a generic option for any network. Then the driver could ignore, reject or be unaware of the option depending on the chosen model. This would require changes to: * libnetwork - naturally * the docker API - to carry the option * swarmkit - to propagate the option * the docker CLI - to support the option * moby - to translate the API option into a libnetwork option Given the urgency of requests to address this issue, this approach will be saved for a future iteration. Signed-off-by: Chris Telfer <[email protected]>

fcrisciani

LGTM

ctelfer force-pushed the lbdsr branch from d2d62ab to fe37214 Compare September 20, 2018 19:49

ctelfer mentioned this pull request Oct 8, 2018

[18.06] Improve scalability of the Linux load balancing docker-archive/engine#16

Merged

fcrisciani reviewed Oct 9, 2018

View reviewed changes

fcrisciani reviewed Oct 10, 2018

View reviewed changes

ctelfer added 2 commits October 11, 2018 14:13

ctelfer changed the title ~~WIP DO NOT MERGE: Use direct server return in east-west overlay load balancing~~ Use direct server return in east-west overlay load balancing Oct 12, 2018

ctelfer force-pushed the lbdsr branch from 40d0e4a to 0c76756 Compare October 12, 2018 00:25

fcrisciani approved these changes Oct 12, 2018

View reviewed changes

fcrisciani merged commit be20dfe into moby:master Oct 12, 2018

ctelfer deleted the lbdsr branch October 12, 2018 00:45

fcrisciani mentioned this pull request Oct 16, 2018

Vendor libnetwork moby/moby#38031

Merged

thaJeztah mentioned this pull request Oct 16, 2018

Backport DSR feature to bump_18.09 #2278

Merged

ctelfer mentioned this pull request Nov 14, 2018

[18.09] Bump libnetwork to 6da50d19 for DSR load balancing changes docker-archive/engine#93

Merged

arkodg mentioned this pull request Aug 7, 2019

Docker 19.03.1 changes the source IP address moby/moby#39665

Open

arkodg mentioned this pull request Oct 16, 2019

Add documentation for dsr option for Overlay Networks docker/docs#9652

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use direct server return in east-west overlay load balancing #2270

Use direct server return in east-west overlay load balancing #2270

ctelfer commented Sep 20, 2018

selansen commented Sep 20, 2018

ctelfer commented Sep 21, 2018

selansen commented Sep 27, 2018

ctelfer commented Sep 28, 2018

fcrisciani Oct 9, 2018

ctelfer Oct 10, 2018

fcrisciani Oct 10, 2018

ctelfer commented Oct 10, 2018

fcrisciani Oct 10, 2018

ctelfer Oct 11, 2018

fcrisciani left a comment

Use direct server return in east-west overlay load balancing #2270

Use direct server return in east-west overlay load balancing #2270

Conversation

ctelfer commented Sep 20, 2018

selansen commented Sep 20, 2018

ctelfer commented Sep 21, 2018

selansen commented Sep 27, 2018

ctelfer commented Sep 28, 2018

fcrisciani Oct 9, 2018

Choose a reason for hiding this comment

ctelfer Oct 10, 2018

Choose a reason for hiding this comment

fcrisciani Oct 10, 2018

Choose a reason for hiding this comment

ctelfer commented Oct 10, 2018

fcrisciani Oct 10, 2018

Choose a reason for hiding this comment

ctelfer Oct 11, 2018

Choose a reason for hiding this comment

fcrisciani left a comment

Choose a reason for hiding this comment