Uneven connection distribution #92
I'm not sure I understand which part of this you're saying is an ld-relay issue. We don't have any control over your container system's load-balancing behavior (it's not clear to me what you're using; Kubernetes?). I'm probably not correctly understanding your point about the rolling restart, but it seems to me that if pods were being restarted one at a time, you would at worst end up with the connections being distributed across all but one; I can't visualize how they would get bunched up into a smaller subset, unless you were doing something like restarting half of the pods all at once. I guess it would be possible to implement some kind of hard time limit on connections, but certainly not anything as short as 1 second, and unless it was randomized in some way I would expect it to cause a similar musical-chairs problem, as a bunch of connections would all get restarted at once.
The container system is Kubernetes, but we do have an AWS ELB load-balancing TCP traffic in front of ld-relay. Maybe ld-relay isn't doing anything wrong, but the load is definitely uneven, as you can see from the graph. There's a general issue for us here: out of the box, putting an established load balancer like an ELB in front of it does not end up distributing traffic evenly. Currently, in steady state, with 10 replicas, the busiest one has 144 conns (according to our metrics).

We are looking for a solution here, whether that be the relay somehow automatically redistributing connections, or implementing timeouts so that connections can be shed. Another potential idea here is for the relay to have some awareness of how much memory it has (which can be inferred from cgroup limits) and limit the number of connections it accepts. We were seeing that, when the busiest replicas ran out of memory, we went into cascading failure as the newly shifted connections kept overloading the next-busiest replica.
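To sketch the cgroup idea (this is an illustration only, not anything ld-relay does today; the file path is the cgroup v1 location, and the per-connection byte cost is a made-up parameter): a server could read its own memory limit and refuse new stream connections with a 503 once an estimated per-connection cost would exceed it, letting the load balancer retry another replica.

```go
// Package connlimit sketches a memory-aware connection cap (illustration only).
package connlimit

import (
	"bytes"
	"net/http"
	"os"
	"strconv"
	"sync/atomic"
)

// cgroupMemoryLimit reads the container's memory limit from the cgroup v1
// path (cgroup v2 exposes it at /sys/fs/cgroup/memory.max instead).
func cgroupMemoryLimit() (int64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(string(bytes.TrimSpace(b)), 10, 64)
}

// LimitConnections wraps a handler and rejects new requests with 503 once the
// estimated memory cost of the open connections would exceed the given
// fraction (headroom) of the cgroup limit.
func LimitConnections(next http.Handler, perConnBytes int64, headroom float64) http.Handler {
	limit, err := cgroupMemoryLimit()
	if err != nil {
		return next // no readable v1 limit (e.g. not in a container); don't cap
	}
	maxConns := int64(float64(limit) * headroom / float64(perConnBytes))
	var current int64

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&current, 1) > maxConns {
			atomic.AddInt64(&current, -1)
			http.Error(w, "at connection capacity, try another replica", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&current, -1)
		next.ServeHTTP(w, r)
	})
}
```

Something like `LimitConnections(mux, 2<<20, 0.5)` would cap connections at roughly half the container's memory divided by an assumed 2 MiB per connection; the right numbers would have to come from actual measurement of the relay's per-connection footprint.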
Thanks for mentioning this - it sounds like there may be a separate issue, because it's not supposed to behave that way. That is, the way ld-relay is implemented, it should be sharing a single flag data store for all connections, not copying the data set for every connection (it does have to temporarily copy data to build the initial stream event when a connection is made, but that is not retained). So we'll have to take a closer look at what is being allocated. This is the first report we've had of memory use varying so greatly between instances.
Also, pardon my momentary confusion about your rolling restart scenario. Indeed, it makes sense that if you started with an equal distribution and then restarted instances one at a time, and each one's connections got redistributed evenly among the rest, you would end up with an unequal distribution heavily weighted toward the instances that got restarted earlier. It took me a minute to see that, though, because for whatever reason we do not (as far as I know) have a similar issue when we do rolling restarts of the stream service. But that probably just means I'm less familiar with our back-end architecture than I am with the SDKs and Relay.
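As a rough illustration of that effect, assuming reconnects always spread evenly over whichever instances are live at that moment: start with 4 instances holding 30 connections each (120 total) and restart them one at a time. The first instance's 30 connections split across the other three (40 each); the second's 40 split across the three instances live at that point; and so on. After the full rolling restart the counts come out to roughly 55, 41, 24, and 0, ordered from first restarted to last restarted, so the earliest-restarted instances end up carrying most of the load.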
Sorry that this issue has gone a long time without an update. We are getting ready to release Relay Proxy 6.0.0 fairly soon, and it includes the ability to set a maximum lifetime for stream connections from SDKs to Relay, so that Relay will automatically disconnect the client after that amount of time, forcing it to reconnect and be redistributed by the load balancer.

On the related comment about Relay using an unexpectedly large amount of memory per connection: unfortunately we have not been able to reproduce this and haven't found anything in the code that would account for it.
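For illustration of the maximum-lifetime approach, giving each connection its own slightly randomized deadline avoids the "musical chairs" effect mentioned earlier, because reconnects are staggered instead of synchronized. Below is a minimal Go sketch of that general technique; it is not Relay's actual implementation, and the handler, heartbeat event, and durations are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// streamWithLifetime serves a trivial SSE stream but closes each connection
// after a randomized lifetime, so reconnects are spread out over time rather
// than all landing at once.
func streamWithLifetime(base time.Duration, jitter float64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")

		// Each connection gets its own deadline: base ± jitter (as a fraction of base).
		offset := time.Duration((rand.Float64()*2 - 1) * jitter * float64(base))
		deadline := time.NewTimer(base + offset)
		defer deadline.Stop()

		heartbeat := time.NewTicker(10 * time.Second)
		defer heartbeat.Stop()

		for {
			select {
			case <-r.Context().Done():
				return // client disconnected
			case <-deadline.C:
				return // lifetime reached; the client reconnects via the load balancer
			case t := <-heartbeat.C:
				fmt.Fprintf(w, "event: heartbeat\ndata: %s\n\n", t.Format(time.RFC3339))
				flusher.Flush()
			}
		}
	}
}

func main() {
	// For example, ~30 minutes per connection with ±20% jitter.
	http.Handle("/stream", streamWithLifetime(30*time.Minute, 0.2))
	http.ListenAndServe(":8080", nil)
}
```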
The 6.0.0 release is now available. It supports a new configuration option for setting this maximum stream connection lifetime.
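For anyone landing here later, enabling it looks roughly like the snippet below. This is a sketch only: the names shown (maxClientConnectionTime in the config file, MAX_CLIENT_CONNECTION_TIME as an environment variable) are an assumption based on the 6.x configuration docs, so verify the exact option name and accepted duration format against the 6.0.0 release notes.

```
# config file (sketch; confirm the option name in the 6.0.0 docs)
[Main]
maxClientConnectionTime = 10m

# or, when configuring with environment variables
MAX_CLIENT_CONNECTION_TIME=10m
```

A value on the order of minutes to tens of minutes seems like a reasonable starting point: long enough to avoid constant reconnect churn, short enough that the load balancer can rebalance within a rolling restart's timescale.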
@alecthomas @mightyguava I'll close the issue now, but please feel free to reopen it if you have questions or problems regarding this feature. |
Thanks for the update! We’ll try it out soon.
Describe the bug
We experienced an ld-relay outage today that we believe was due to very unevenly distributed connections across ld-relay pods. A chart of per-pod connection counts illustrates this quite clearly: the spread is large, ranging from 257 connections on the busiest pod down to 51 on the least busy, out of roughly 1800 total connections.
In addition to the uneven steady state distribution, you can see the large spike in connections during a rolling restart. As pods terminate, their connections are moved to the remaining live pods, which keep them and do not redistribute them to other pods.
To reproduce
Run 10+ instances of ld-relay with many inbound connections.
Expected behavior
Connections to be evenly distributed.
We've experienced similar behaviour with long-running HTTP/2 connections, and our solution was to actively terminate the connections after some time period (e.g. 500 ms or 1 s). That was for a very high-QPS service, though, so timeouts that short might defeat the purpose of LD using SSE.
Relay version
Language version, developer tools
go1.13.5