Uneven connection distribution #92
I'm not sure I understand which part of this you're saying is an ld-relay issue. We don't have any control over your container system's load-balancing behavior (it's not clear to me what you're using; Kubernetes?). I'm probably not correctly understanding your point about the rolling restart, but it seems to me that if pods were being restarted one at a time, you would at worst end up with the connections being distributed across all but one; I can't visualize how they would get bunched up into a smaller subset, unless you were doing something like restarting half of the pods all at once. I guess it would be possible to implement some kind of hard time limit on connections, but certainly not anything as short as 1 second, and unless it was randomized in some way I would expect it to cause a similar musical-chairs problem, as a bunch of connections would all get restarted at once.
The container system is Kubernetes, but we do have an AWS ELB load-balancing TCP traffic in front of ld-relay. Maybe ld-relay isn't doing anything wrong, but the load is definitely uneven, as you can see from the graph. There's a general issue for us here: out of the box, putting an established load balancer like an ELB in front of it does not end up distributing traffic evenly. Currently, in steady state, with 10 replicas, the busiest one has 144 conns (according to our metrics).

We are looking for a solution here, whether that be the relay somehow automatically redistributing connections, or implementing timeouts so that connections can be shed. Another potential idea here is for the relay to have some awareness of how much memory it has (which can be inferred from cgroup limits) and limit the number of connections it accepts. We were seeing that, when the busiest replicas ran out of memory, we went into cascading failure as the newly shifted connections kept overloading the next-busiest replica.
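To sketch the cgroup idea (this is an illustration only, not anything ld-relay does today; the file path is the cgroup v1 location, and the per-connection byte cost is a made-up parameter): a server could read its own memory limit and refuse new stream connections with a 503 once an estimated per-connection cost would exceed it, letting the load balancer retry another replica.

```go
// Package connlimit sketches a memory-aware connection cap (illustration only).
package connlimit

import (
	"bytes"
	"net/http"
	"os"
	"strconv"
	"sync/atomic"
)

// cgroupMemoryLimit reads the container's memory limit from the cgroup v1
// path (cgroup v2 exposes it at /sys/fs/cgroup/memory.max instead).
func cgroupMemoryLimit() (int64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(string(bytes.TrimSpace(b)), 10, 64)
}

// LimitConnections wraps a handler and rejects new requests with 503 once the
// estimated memory cost of the open connections would exceed the given
// fraction (headroom) of the cgroup limit.
func LimitConnections(next http.Handler, perConnBytes int64, headroom float64) http.Handler {
	limit, err := cgroupMemoryLimit()
	if err != nil {
		return next // no readable v1 limit (e.g. not in a container); don't cap
	}
	maxConns := int64(float64(limit) * headroom / float64(perConnBytes))
	var current int64

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&current, 1) > maxConns {
			atomic.AddInt64(&current, -1)
			http.Error(w, "at connection capacity, try another replica", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&current, -1)
		next.ServeHTTP(w, r)
	})
}
```

Something like `LimitConnections(mux, 2<<20, 0.5)` would cap connections at roughly half the container's memory divided by an assumed 2 MiB per connection; the right numbers would have to come from actual measurement of the relay's per-connection footprint.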
Thanks for mentioning this - it sounds like there may be a separate issue, because it's not supposed to behave that way. That is, the way ld-relay is implemented, it should be sharing a single flag data store for all connections, not copying the data set for every connection (it does have to temporarily copy data to build the initial stream event when a connection is made, but that is not retained). So we'll have to take a closer look at what is being allocated. This is the first report we've had of memory use varying so greatly between instances.
Also, pardon my momentary confusion about your rolling restart scenario. Indeed, it makes sense that if you started with an equal distribution and then restarted instances one at a time, and each one's connections got redistributed evenly among the rest, you would end up with an unequal distribution heavily weighted toward the instances that got restarted earlier. It took me a minute to see that, though, because for whatever reason we do not (as far as I know) have a similar issue when we do rolling restarts of the stream service. But that probably just means I'm less familiar with our back-end architecture than I am with the SDKs and Relay.
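As a rough illustration of that effect, assuming reconnects always spread evenly over whichever instances are live at that moment: start with 4 instances holding 30 connections each (120 total) and restart them one at a time. The first instance's 30 connections split across the other three (40 each); the second's 40 split across the three instances live at that point; and so on. After the full rolling restart the counts come out to roughly 55, 41, 24, and 0, ordered from first restarted to last restarted, so the earliest-restarted instances end up carrying most of the load.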
Sorry that this issue has gone a long time without an update. We are getting ready to release Relay Proxy 6.0.0 fairly soon, and it includes the ability to set a maximum lifetime for stream connections from SDKs to Relay, so that Relay will automatically disconnect the client after that amount of time, forcing it to reconnect and be redistributed by the load balancer.

On the related comment about Relay using an unexpectedly large amount of memory per connection: unfortunately we have not been able to reproduce this and haven't found anything in the code that would account for it.
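For illustration of the maximum-lifetime approach, giving each connection its own slightly randomized deadline avoids the "musical chairs" effect mentioned earlier, because reconnects are staggered instead of synchronized. Below is a minimal Go sketch of that general technique; it is not Relay's actual implementation, and the handler, heartbeat event, and durations are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// streamWithLifetime serves a trivial SSE stream but closes each connection
// after a randomized lifetime, so reconnects are spread out over time rather
// than all landing at once.
func streamWithLifetime(base time.Duration, jitter float64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")

		// Each connection gets its own deadline: base ± jitter (as a fraction of base).
		offset := time.Duration((rand.Float64()*2 - 1) * jitter * float64(base))
		deadline := time.NewTimer(base + offset)
		defer deadline.Stop()

		heartbeat := time.NewTicker(10 * time.Second)
		defer heartbeat.Stop()

		for {
			select {
			case <-r.Context().Done():
				return // client disconnected
			case <-deadline.C:
				return // lifetime reached; the client reconnects via the load balancer
			case t := <-heartbeat.C:
				fmt.Fprintf(w, "event: heartbeat\ndata: %s\n\n", t.Format(time.RFC3339))
				flusher.Flush()
			}
		}
	}
}

func main() {
	// For example, ~30 minutes per connection with ±20% jitter.
	http.Handle("/stream", streamWithLifetime(30*time.Minute, 0.2))
	http.ListenAndServe(":8080", nil)
}
```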
The 6.0.0 release is now available. It supports a new configuration option for setting this maximum stream connection lifetime.
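For anyone landing here later, enabling it looks roughly like the snippet below. This is a sketch only: the names shown (maxClientConnectionTime in the config file, MAX_CLIENT_CONNECTION_TIME as an environment variable) are an assumption based on the 6.x configuration docs, so verify the exact option name and accepted duration format against the 6.0.0 release notes.

```
# config file (sketch; confirm the option name in the 6.0.0 docs)
[Main]
maxClientConnectionTime = 10m

# or, when configuring with environment variables
MAX_CLIENT_CONNECTION_TIME=10m
```

A value on the order of minutes to tens of minutes seems like a reasonable starting point: long enough to avoid constant reconnect churn, short enough that the load balancer can rebalance within a rolling restart's timescale.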
@alecthomas @mightyguava I'll close the issue now, but please feel free to reopen it if you have questions or problems regarding this feature. |
Thanks for the update! We’ll try it out soon.
Describe the bug
We experienced an ld-relay outage today that we believe was due to very unevenly distributed connections across ld-relay pods. A chart of per-pod connection counts illustrates this quite clearly: the spread is large, ranging from 257 connections on the busiest pod down to 51 on the least busy, out of roughly 1800 total connections.
In addition to the uneven steady state distribution, you can see the large spike in connections during a rolling restart. As pods terminate, their connections are moved to the remaining live pods, which keep them and do not redistribute them to other pods.
To reproduce
Run 10+ instances of ld-relay with many inbound connections.
Expected behavior
Connections to be evenly distributed.
We've experienced similar behaviour with long-running HTTP/2 connections, and our solution was to actively terminate the connections after some time period (e.g. 500 ms or 1 s). That was for a very high-QPS service, though, so timeouts that short might defeat the purpose of LD using SSE.
Relay version
Language version, developer tools
go1.13.5