Use direct server return in east-west overlay load balancing #2270

Merged
merged 2 commits into from
Oct 12, 2018

Conversation

Contributor

@ctelfer ctelfer commented Sep 20, 2018

This is a WIP to update the load balancing to support direct server return (DSR). At a high level, instead of the IPVS load balancer modifying the destination IP address of a packet, it modifies only the destination MAC address and leaves the VIP in place. It also avoids performing SNAT on the outgoing packet. Each task in a service is programmed with the VIP(s) as an IP alias on the loopback interface so that it can receive and accept such packets. This requires that the stack enable IP forwarding, but libnetwork already sets this by default (on Linux) for other reasons. To round out the picture, the server must also have ARP configured so that it neither responds to ARP queries for the VIP nor issues ARP queries with the VIP as the source protocol address.
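To make the difference concrete, here is a minimal sketch that models the two forwarding styles on a toy packet. The `Packet` and `Backend` types, the function names, and all addresses are hypothetical illustrations, not libnetwork or IPVS code.

```go
package main

import "fmt"

// Packet is a simplified frame holding only the fields that matter here.
type Packet struct {
	SrcIP  string
	DstIP  string
	DstMAC string
}

// Backend is a hypothetical service task behind the load balancer.
type Backend struct {
	IP  string
	MAC string
}

// forwardNAT models the NAT-style path: the destination IP is rewritten to
// the backend's address and the source is replaced by the load balancer's
// own address (SNAT), so the backend never sees the original client.
func forwardNAT(p Packet, b Backend, lbIP string) Packet {
	p.SrcIP = lbIP
	p.DstIP = b.IP
	p.DstMAC = b.MAC
	return p
}

// forwardDSR models direct server return: only the destination MAC changes.
// The VIP stays in place as the destination IP and the client source address
// is preserved, which is why the backend must hold the VIP on loopback.
func forwardDSR(p Packet, b Backend) Packet {
	p.DstMAC = b.MAC
	return p
}

func main() {
	in := Packet{SrcIP: "10.0.0.5", DstIP: "10.0.0.100", DstMAC: "02:00:00:00:00:01"}
	task := Backend{IP: "10.0.0.7", MAC: "02:00:00:00:00:07"}

	fmt.Printf("NAT: %+v\n", forwardNAT(in, task, "10.0.0.2"))
	fmt.Printf("DSR: %+v\n", forwardDSR(in, task))
}
```

In the DSR case the task receives a packet still addressed to the VIP, so it can reply to the client directly with the VIP as its source.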

This approach will not easily work for ingress processing because libnetwork uses SNAT to ensure the routability of traffic from the outside network. However, it does address a long-standing concern for L7 load balancers running in the host network and balancing traffic to internal Docker networks (similar to moby/moby#35082). In that case, the L7 load balancer in the host network is limited in the number of unique 5-tuples that it can open to the service task(s), leading to the connection tracking recycling issues mentioned there. This change would also address the issue some folks have raised about NAT hiding the original address of the client in east-west traffic.

The PR so far does not have any support for Windows networking, which is why it is a WIP. I have tested it with Linux clusters and it has passed my tests thus far.

@selansen
Collaborator

Lint Failure
🐳 lint
ipvs/ipvs.go:54:2: don't use ALL_CAPS in Go names; use CamelCase
ipvs/ipvs.go:54:2: exported const CONN_F_FWD_MASK should have comment (or a comment on this block) or be unexported
ipvs/ipvs.go:55:2: don't use ALL_CAPS in Go names; use CamelCase

@ctelfer
Contributor Author

ctelfer commented Sep 21, 2018

Thanks. I'd patched it but forgot to push. Have updated now. In general, am not sure how I feel about the CamelCase rule for constants that are meant to be one-to-one mappings with constants from the OS (as in this case). It would seem like those should keep the same form in order to be easy to discover / line up and verify.
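One way to satisfy the linter while keeping the mapping easy to verify is to name the kernel macro in a comment next to each constant. The sketch below assumes the forwarding-method values from the kernel's `linux/ip_vs.h`; the Go identifiers and the `fwdMethod` helper are illustrative and may not match what the PR finally used.

```go
package main

import "fmt"

// Forwarding-method bits of the IPVS connection flags. Each constant mirrors
// the kernel macro named in its trailing comment so the two can be lined up.
const (
	ConnFwdMask        uint32 = 0x0007 // IP_VS_CONN_F_FWD_MASK
	ConnFwdMasq        uint32 = 0x0000 // IP_VS_CONN_F_MASQ
	ConnFwdLocalNode   uint32 = 0x0001 // IP_VS_CONN_F_LOCALNODE
	ConnFwdTunnel      uint32 = 0x0002 // IP_VS_CONN_F_TUNNEL
	ConnFwdDirectRoute uint32 = 0x0003 // IP_VS_CONN_F_DROUTE
)

// fwdMethod extracts the forwarding method from a connection-flags word by
// masking off every bit outside the forwarding-method field.
func fwdMethod(flags uint32) uint32 {
	return flags & ConnFwdMask
}

func main() {
	// 0x0100 stands in for an unrelated higher flag bit.
	flags := uint32(0x0100) | ConnFwdDirectRoute
	fmt.Println(fwdMethod(flags) == ConnFwdDirectRoute)
}
```

A single comment on the `const` block is enough for golint's "should have comment" check, and a search for the kernel macro name still finds the Go constant.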

@selansen
Collaborator

@ctelfer, which release are we planning to pick this change up in?

@ctelfer
Contributor Author

ctelfer commented Sep 28, 2018

I don't know. I just wanted to put this out there to make sure it was captured. If we can't find a good way to match the behavior in Windows, we may have to introduce swarm changes or a new type of network (e.g. "overlay2"). My one thought for Windows integration so far is that we could do an ingress NAT for traffic coming in on the VIPs to direct the traffic to a container on the node. Ideally we would match on the MAC to direct traffic to a specific container, but we could do a second layer of load balancing where the incoming node load balancer knows the identity of all containers on the local node. This is actually something we don't currently track, so it's a bit of an overhaul to add it. It also seems like a Bad Idea to have two layers of load balancing where the second doesn't serve any real purpose except NAT.

@@ -51,6 +51,10 @@ type Sandbox interface {
// RemoveAliasIP removes the passed IP address from the named interface
RemoveAliasIP(ifName string, ip *net.IPNet) error

// DisableARPForVIP disables ARP replies and requests for VIP addresses
// on a particular interface
DisableARPForVIP(ifName string) error

better DisableARPForIfc?

Contributor Author

Hrmm.. Don't object to renaming, but the suggestion is less accurate. The function doesn't disable ARP for all addresses. It just disables it for addresses assigned to a different interface. We really only care that the VIP addresses have ARP suppressed for purposes of this API. But the operation does change behavior slightly. Say you had:

                            +---------------+
       eth0(10.1.0.2) ------+   container   +------ eth1(10.2.0.7)
                            +---------------+

The way containers are currently configured, if you ARPed 10.2.0.7 on eth0, you'd get a response back indicating that the address was on that interface. The ARP configuration that this function performs would prevent that (the container would ignore incoming ARPs for 10.2.0.7 on eth0). Ideally, we would configure this behavior ONLY for the VIP addresses (which get assigned to lo). Unfortunately, there isn't a nice way to do that without using arptables as far as I know, and requiring arptables and the like on docker installations is not feasible.

As far as ARPing across interfaces goes, my personal opinion is that the restriction this function puts in place should be the default behavior, but the Linux devs disagree. In any case, I don't think the more relaxed behavior is something an application developer should rely on (see https://lwn.net/Articles/45373/), so the restriction should be OK. It would also be a property of the network in any case (i.e. it only applies to endpoints that are on a "dsr"-option-enabled "overlay" network).
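The two ARP reply policies discussed here can be modeled in a few lines. This is only a sketch of the decision logic: the `host` type, both method names, and the VIP 10.1.0.100 on lo are hypothetical. The strict policy corresponds to what the Linux `arp_ignore=1` sysctl documents (reply only when the target address is configured on the incoming interface); whether the function uses exactly that sysctl is an assumption, not something stated in this thread.

```go
package main

import "fmt"

// host maps interface names to the addresses configured on them.
type host map[string][]string

// owns reports whether ip appears in addrs.
func owns(addrs []string, ip string) bool {
	for _, a := range addrs {
		if a == ip {
			return true
		}
	}
	return false
}

// repliesRelaxed models the default policy: answer an ARP request for any
// local address, no matter which interface the request arrived on.
func (h host) repliesRelaxed(_ string, target string) bool {
	for _, addrs := range h {
		if owns(addrs, target) {
			return true
		}
	}
	return false
}

// repliesStrict models the restricted policy: answer only when the target
// address is configured on the interface that received the request.
func (h host) repliesStrict(inIf, target string) bool {
	return owns(h[inIf], target)
}

func main() {
	c := host{
		"eth0": {"10.1.0.2"},
		"eth1": {"10.2.0.7"},
		"lo":   {"10.1.0.100"}, // hypothetical VIP aliased on loopback
	}
	fmt.Println("relaxed, 10.2.0.7 on eth0:", c.repliesRelaxed("eth0", "10.2.0.7"))
	fmt.Println("strict,  10.2.0.7 on eth0:", c.repliesStrict("eth0", "10.2.0.7"))
	fmt.Println("strict,  VIP on eth0:     ", c.repliesStrict("eth0", "10.1.0.100"))
	fmt.Println("strict,  10.1.0.2 on eth0:", c.repliesStrict("eth0", "10.1.0.2"))
}
```

Because the VIP lives on lo rather than on eth0, the strict policy suppresses replies for it as a side effect, which is the outcome this API cares about. The cost is that cross-interface replies such as 10.2.0.7 on eth0 are suppressed too.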

I see the point, it's difficult to express with a concise name. nvm for the moment then

@ctelfer
Contributor Author

ctelfer commented Oct 10, 2018

I have pushed a change that makes the DSR behavior an overlay-network-specific property that one must enable by adding --opt dsr to the docker network create... command. (or by setting "dsr" in the Options map passed to a NetworkCreate() call for an overlay network.)
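As a rough sketch of how such an option could be detected, the helper below checks for the presence of a "dsr" key in a driver options map. Only the key name and the overlay restriction come from the comment above; the `lbMode` type and `lbModeFor` function are hypothetical and stand in for whatever check the PR actually performs.

```go
package main

import "fmt"

// dsrOption is the option key named in the comment above.
const dsrOption = "dsr"

// lbMode is a hypothetical enum for the load-balancing mode of a network.
type lbMode int

const (
	lbModeNAT lbMode = iota // default: NAT-based load balancing
	lbModeDSR               // direct server return
)

// lbModeFor picks the mode from the driver name and its options. Only the
// presence of the key matters, because "--opt dsr" carries no value.
func lbModeFor(driver string, opts map[string]string) lbMode {
	if driver != "overlay" {
		return lbModeNAT
	}
	if _, ok := opts[dsrOption]; ok {
		return lbModeDSR
	}
	return lbModeNAT
}

func main() {
	fmt.Println(lbModeFor("overlay", map[string]string{"dsr": ""}) == lbModeDSR)
	fmt.Println(lbModeFor("overlay", nil) == lbModeNAT)
	fmt.Println(lbModeFor("bridge", map[string]string{"dsr": ""}) == lbModeNAT)
}
```

Testing for key presence rather than a truthy value keeps the CLI form and the Options-map form equivalent.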

sandbox.go Outdated
@@ -767,7 +775,10 @@ func (sb *sandbox) releaseOSSbox() {
}

for _, ep := range sb.getConnectedEndpoints() {
releaseOSSboxResources(osSbox, ep)
ep.Lock()

releaseOSSboxResources already takes ep.Lock; can we do this check inside the function itself?

Contributor Author

facepalm Yep, that's much more sensible...
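The pattern being suggested, where the callee takes the lock itself instead of every caller doing so, might look like the sketch below. The `endpoint` type, its `aliases` field, and `releaseResources` are simplified stand-ins, not the real libnetwork definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is a simplified stand-in for the real endpoint type.
type endpoint struct {
	sync.Mutex
	aliases []string // hypothetical per-endpoint state guarded by the mutex
}

// releaseResources locks the endpoint itself, copies the state it needs, and
// unlocks before doing the actual work, so callers never touch the lock.
func releaseResources(ep *endpoint, remove func(alias string)) {
	ep.Lock()
	aliases := append([]string(nil), ep.aliases...)
	ep.Unlock()

	for _, a := range aliases {
		remove(a)
	}
}

func main() {
	ep := &endpoint{aliases: []string{"10.0.0.100/32", "10.0.0.101/32"}}
	releaseResources(ep, func(alias string) {
		fmt.Println("removing alias", alias)
	})
}
```

Keeping the lock inside the function avoids a double-lock deadlock if a caller already holds it by mistake, since Go's `sync.Mutex` is not reentrant and there is then only one place that acquires it.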

Modify the load balancing for east-west traffic to use direct routing
rather than NAT and update tasks to use direct server return under
Linux.  This avoids hiding the source address of the sender and improves
the performance in single-client/single-server tests.

Signed-off-by: Chris Telfer <[email protected]>

Allow DSR to be a configurable option through a generic option to the
overlay driver.  On the one hand this approach makes sense insofar as
only overlay networks can currently perform load balancing.  On the
other hand, this approach has several issues.  First, should we create
another type of swarm scope network, this will prevent it working.
Second, the service core code is separate from the driver code and the
driver code can't influence the core data structures.  So the driver
code can't set this option itself.  Therefore, implementing in this way
requires some hack code to test for this option in
controller.NewNetwork.

A more correct approach would be to make this a generic option for any
network.  Then the driver could ignore, reject or be unaware of the option
depending on the chosen model.  This would require changes to:
  * libnetwork - naturally
  * the docker API - to carry the option
  * swarmkit - to propagate the option
  * the docker CLI - to support the option
  * moby - to translate the API option into a libnetwork option
Given the urgency of requests to address this issue, this approach will
be saved for a future iteration.

Signed-off-by: Chris Telfer <[email protected]>
@ctelfer ctelfer changed the title from "WIP DO NOT MERGE: Use direct server return in east-west overlay load balancing" to "Use direct server return in east-west overlay load balancing" on Oct 12, 2018

@fcrisciani fcrisciani left a comment

LGTM
