Latency spikes #208
Hey Kai, sorry for the late response, and thanks for the details.
Hm, with such a low number of active subscribers, it's strange to see latency spikes. Let me think about what could cause them.

A bit about how things work under the hood. We use a pool of workers (we call them "gates" internally) to deliver broadcast messages; every broadcast passes through a particular gate, where we iterate over the subscribers and perform the actual writes. Each stream is processed by a single gate, but different streams may be processed by different gates concurrently. This is by design: we want to guarantee the order of broadcasted messages within a stream, so we cannot deliver them concurrently.

The blocking could happen due to a growing backlog of messages to deliver to everyone in the global stream's gate. There is also a buffer, or a queue (or a channel), before entering the gates, but since per-stream broadcasts are not affected, we may conclude that it handles the load without any problems. Thus, the problem lies somewhere deeper. Each gate uses a buffered channel (size of 256) as a queue for broadcasts. When it overflows, an attempt to send a broadcast message to this gate blocks until there is enough capacity. That would block the whole hub (all gates), something that rarely happens (and should not happen at all), and we don't observe this either.

Then we deal with clients. Writing data to a socket is usually a quick operation: in most cases, we just move the data to the underlying I/O buffer, and the rest is not our business (well, not precisely). Still, this buffer has a size, and that size affects how often we perform syscalls, which could be blocking. You can experiment with the WS buffer sizes via the corresponding anycable-go configuration options.

Also, in the Pro version we have a slightly different (more memory-efficient) mechanism for handling WS reads and writes. Worth trying here, too.
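To make the routing described above concrete, here is a minimal Go sketch of the pattern. It is not the actual anycable-go source: the type names, the FNV hashing, and the gate count are illustrative assumptions; only the per-stream gate assignment and the 256-slot buffered queue come from the description above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// broadcast is a hypothetical message: a stream name plus a payload.
type broadcast struct {
	stream  string
	payload string
}

// gate owns a buffered queue and delivers broadcasts for the streams
// assigned to it one at a time, preserving per-stream order.
type gate struct {
	queue chan broadcast
}

func newGate(size int) *gate {
	g := &gate{queue: make(chan broadcast, size)}
	go func() {
		for b := range g.queue {
			// Deliver to every subscriber of b.stream here.
			// A slow delivery loop grows the backlog of this gate only.
			fmt.Printf("delivering %q to subscribers of %s\n", b.payload, b.stream)
		}
	}()
	return g
}

// hub routes each stream to a fixed gate by hashing its name, so
// messages within one stream are never reordered.
type hub struct {
	gates []*gate
}

func newHub(gates, queueSize int) *hub {
	h := &hub{gates: make([]*gate, gates)}
	for i := range h.gates {
		h.gates[i] = newGate(queueSize)
	}
	return h
}

func (h *hub) Broadcast(stream, payload string) {
	f := fnv.New32a()
	f.Write([]byte(stream))
	idx := int(f.Sum32() % uint32(len(h.gates)))
	// If this gate's queue (256 entries here) is full, the send blocks
	// the caller until the gate drains some of its backlog.
	h.gates[idx].queue <- broadcast{stream: stream, payload: payload}
}

func main() {
	h := newHub(4, 256)
	h.Broadcast("global", "latency_check")
	h.Broadcast("per_user:42", "latency_check")
	time.Sleep(100 * time.Millisecond) // let the gates print before exiting
}
```

In this layout a slow delivery loop for `global` backs up only its own gate's queue, which would match the per-stream nature of the spikes, while a completely full queue would block the shared `Broadcast` call for everyone.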
Very interesting! Thanks for the details about the implementation. I ended up splitting the same load across multiple streams, and that eliminated the issue. I also noticed that the load was higher than I reported initially. As it's variable, it's hard to get a precise number, but I'd estimate it at more like 10 KB/s, and many individual messages were around 10 KB. Given what you say about buffer sizes, the theory about syscalls being blocking sounds quite plausible. It does surprise me that it could cause such a big latency spike even if it requires 10 syscalls instead of 1, but I guess things could build up with the variable load. Unfortunately, since I've split the data across multiple streams I no longer have an easy way to test other fixes, but if it's necessary I could probably duplicate the traffic to a local app for testing.
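For what it's worth, the buffer-size-versus-syscalls effect is easy to see in isolation. The following standalone Go sketch (not AnyCable code; the 100-byte chunking and the buffer sizes are arbitrary assumptions) uses a counting writer in place of a socket, where each call would correspond to one write syscall on a real connection, and pushes the same ~10 KB through a small and a large write buffer.

```go
package main

import (
	"bufio"
	"fmt"
)

// countingWriter stands in for a TCP socket: every Write call here
// would be one write(2) syscall on a real connection.
type countingWriter struct {
	calls int
}

func (w *countingWriter) Write(p []byte) (int, error) {
	w.calls++
	return len(p), nil
}

func main() {
	chunk := make([]byte, 100) // the payload arrives in small pieces

	for _, size := range []int{1024, 16 * 1024} {
		var sock countingWriter
		bw := bufio.NewWriterSize(&sock, size)
		// Push ~10 KB through the buffer, 100 bytes at a time.
		for i := 0; i < 100; i++ {
			bw.Write(chunk)
		}
		bw.Flush()
		fmt.Printf("%5d-byte buffer -> %d socket writes for 10 KB\n", size, sock.calls)
	}
}
```

With the 1 KB buffer the 10 KB ends up as ten socket writes, versus a single flush with the 16 KB buffer, which is roughly the "10 syscalls instead of 1" case; whether that alone accounts for a ten-second spike is a separate question.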
Not necessary, but that would help to investigate this further.
In production (running on Render), I'm seeing occasional latency spikes. End to end, it typically takes a few tens of milliseconds for a message to go through, but occasionally this spikes to ten seconds.

I'm testing end-to-end latency with

```ruby
ActionCable.server.broadcast 'mission_control', { event: 'latency_check', info: { timestamp: Time.now.to_f } }
```

and then `console.log`-ing it on the frontend. I'm pretty sure the latency isn't being introduced by Redis; I can monitor it with `redis.psubscribe('__anycable__')` and see that latency there is low (typically ~1 ms). I'm also pretty sure that this isn't happening on the frontend (e.g. browser bandwidth limitations): across multiple tabs, multiple devices, and multiple networks, these latency spikes come at precisely the same time. In terms of versions, I'm using anycable-go 1.5.4 and gem version 1.2.2.

Interestingly, this latency seems to be per stream. For example, if I have `stream_from 'global'` and `stream_from "per_user:#{user.to_gid_param}"`, I can broadcast the same message to both simultaneously and `global` has the latency spikes, whereas the per-user one does not. `global`, naturally, has a lot more load on it: roughly 1 Kbps, while `per_user` has essentially zero load. The server I'm running it on has plenty of headroom, only using ~10% of available memory and ~2% of available CPU. There are ~20 subscribers to the channel.

I'll continue trying to reproduce this in a more minimal environment than my full production app, but I wanted to flag these latency spikes in case anyone else is running into them or has run into them before.
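For anyone who wants to reproduce the Redis-side check without a Ruby console, here is a rough Go equivalent using go-redis. The Redis address and the pattern are assumptions (localhost and the `__anycable__` channel used above); the payload is printed verbatim, since its exact format depends on the broadcast adapter, and can be compared against the embedded `timestamp` by eye.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Pattern-subscribe to the broadcast channel and stamp every
	// incoming message with the local receive time.
	pubsub := rdb.PSubscribe(ctx, "__anycable__")
	defer pubsub.Close()

	for msg := range pubsub.Channel() {
		fmt.Printf("%.6f %s %s\n",
			float64(time.Now().UnixNano())/1e9, msg.Channel, msg.Payload)
	}
}
```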