Latency spikes #208

Open
KMarshland opened this issue Oct 10, 2024 · 3 comments

@KMarshland

In production (running on Render), I'm seeing occasional latency spikes. End-to-end, it typically takes a few tens of milliseconds for a message to go through, but occasionally this spikes to ten seconds.

I'm testing end-to-end latency with ActionCable.server.broadcast 'mission_control', { event: 'latency_check', info: { timestamp: Time.now.to_f } }, then console.log-ing it on the frontend. I'm pretty sure the latency isn't being introduced by Redis; I can monitor it with redis.psubscribe('__anycable__') and see that latency there is low (typically ~1ms). I'm also pretty sure that this isn't happening on the frontend (e.g. browser bandwidth limitations): across multiple tabs, multiple devices, and multiple networks, these latency spikes come at precisely the same time. In terms of versions, I'm using anycable-go 1.5.4 and gem version 1.2.2.
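(For reference, a minimal Go sketch of that Redis-side check; the go-redis client and the localhost address are assumptions here, and it simply logs the arrival time next to each raw payload on the __anycable__ channel so it can be compared with the embedded timestamp.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Pattern-subscribe to the internal AnyCable broadcast channel.
	pubsub := rdb.PSubscribe(ctx, "__anycable__")
	defer pubsub.Close()

	for msg := range pubsub.Channel() {
		// Print the arrival time next to the raw payload; comparing it with
		// the timestamp embedded in the broadcast shows the Redis-side latency.
		fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339Nano), msg.Payload)
	}
}
```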

Interestingly, this latency seems to be per stream. For example, if I have stream_from 'global' and stream_from "per_user:#{user.to_gid_param}", I can broadcast the same message to both simultaneously and global has the latency spikes, whereas the per-user one does not. global, naturally, has a lot more load on it: roughly 1 Kbps, while per_user has essentially zero load. The server I'm running it on has plenty of headroom, only using ~10% of available memory and ~2% of available CPU. There are ~20 subscribers to the channel.

I'll continue trying to reproduce this in a more minimal environment than my full production app, but I wanted to flag these latency spikes in case anyone else is running into them or has run into them before.

@palkan
Member

palkan commented Oct 16, 2024

Hey Kai,

Sorry for the late response, and thanks for the details.

> There are ~20 subscribers to the channel.

Hm, with such a low number of active subscribers, it's strange to see latency spikes. Let me think about what could cause them.

A bit about how things work under the hood:

We use a pool of workers (we call them "gates" internally) to deliver broadcast messages; thus, every broadcast passes through a particular gate, where we iterate over subscribers and perform the actual socket.send(data) calls. Broadcasts are processed in a FIFO manner, one by one. Why, then, is there a difference in behaviour between the global stream and the per-user streams?

First, each stream is processed by a single gate, but different streams may be processed by different gates concurrently. This is by design: we want to guarantee the order of broadcast messages within a stream, so we cannot deliver them concurrently.
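A simplified Go sketch of that picture (illustrative only, not the actual anycable-go source; in particular, the hash-based gate assignment is an assumption about how streams could be pinned to gates):

```go
package hub

import "hash/fnv"

// broadcast is one message destined for a single stream.
type broadcast struct {
	stream string
	data   []byte
}

// socket stands in for a client's WebSocket connection.
type socket struct{}

func (s *socket) send(data []byte) {
	// In the real server this writes to the underlying connection
	// (and may block if the client or the OS write buffer is slow).
}

// gate drains its queue in FIFO order and writes each broadcast to every
// subscriber of the stream, one socket at a time, so a slow write delays
// every message queued behind it on the same gate.
type gate struct {
	queue       chan broadcast
	subscribers map[string][]*socket // stream name -> connected sockets
}

func (g *gate) run() {
	for msg := range g.queue {
		for _, s := range g.subscribers[msg.stream] {
			s.send(msg.data)
		}
	}
}

// gateIndex pins a stream to one of n gates by hashing its name: every
// broadcast for a given stream goes through the same gate (preserving its
// order), while different streams may land on different gates and be
// delivered concurrently.
func gateIndex(stream string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(stream))
	return int(h.Sum32() % uint32(n))
}
```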

The blocking could happen due to a growing backlog of messages to deliver to everyone in the global stream's gate. There is also a buffer, or a queue (or a channel), before entering the gates, but since the per-user broadcasts are not affected, we may conclude that it handles the load without any problems. Thus, the problem lies somewhere deeper.

Each gate uses a buffered channel (with a size of 256) as a queue for broadcasts. When it is full, an attempt to send a broadcast message to this gate blocks until there is enough capacity. That would block the whole hub (all gates), something that rarely happens (and should not happen at all, but...), and we don't observe that here either.
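In Go terms, that back-pressure looks roughly like this (a sketch under the description above; the payload type is simplified to a byte slice):

```go
package hub

// newQueue creates the per-gate queue: a buffered channel whose capacity
// (256, per the description above) bounds how many broadcasts can pile up
// before the sender starts to block.
func newQueue() chan []byte {
	return make(chan []byte, 256)
}

// enqueue is a plain channel send: it blocks once the buffer is full, which
// is how a single slow gate could stall its producer, i.e. the whole hub.
func enqueue(queue chan []byte, msg []byte) {
	queue <- msg // blocks when len(queue) == cap(queue)
}
```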

Then, we deal with clients.

Writing data to a socket is usually a quick operation: in most cases, we just move data to the underlying I/O buffer, and the rest is not our business (well, not precisely). Still, this buffer has a finite size, which affects how often we perform syscalls, and those can be blocking. You can experiment with the WS buffers using the --write_buffer_size and --read_buffer_size options (both default to 1024 bytes).
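Here's a small, self-contained Go experiment illustrating that relationship; bufio stands in for the WebSocket library's internal write buffer, and the 128-byte chunking is an arbitrary stand-in for how a framing layer feeds it:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter counts how many Write calls (roughly, syscalls on a real
// socket) reach the underlying destination.
type countingWriter struct {
	dst   io.Writer
	calls int
}

func (w *countingWriter) Write(p []byte) (int, error) {
	w.calls++
	return w.dst.Write(p)
}

func main() {
	payload := make([]byte, 10*1024) // roughly the message size from this thread

	for _, size := range []int{1024, 16 * 1024} {
		cw := &countingWriter{dst: io.Discard}
		bw := bufio.NewWriterSize(cw, size)

		// Feed the payload in small pieces, the way an encoder or framing
		// layer typically streams data into a socket buffer.
		for off := 0; off < len(payload); off += 128 {
			end := off + 128
			if end > len(payload) {
				end = len(payload)
			}
			bw.Write(payload[off:end])
		}
		bw.Flush()

		fmt.Printf("buffer %5d bytes -> %d writes\n", size, cw.calls)
	}
}
```

With a 1 KB buffer it takes about ten underlying writes to push a 10 KB message out, versus a single write with a 16 KB buffer.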

Also, in the Pro version we have a slightly different (more memory-efficient) mechanism for dealing with WS reads and writes. Worth trying here, too.

@KMarshland
Author

Very interesting! Thanks for the details about the implementation. I ended up splitting the same load across multiple streams, and that eliminated the issue. I did also notice that the load was higher than I reported initially. As it's variable, it's hard to get a precise number, but I'd estimate it more at 10 KB/s, and many individual messages were around 10 KB. Given what you say about buffer sizes, the theory about blocking syscalls sounds quite plausible. It does surprise me that it could cause such a big latency spike even if it requires 10 syscalls instead of 1, but I guess things could build up with the variable load.

Unfortunately, since I've split the data across multiple streams, I no longer have an easy way to test other fixes, but if it's necessary I could probably duplicate the traffic to a local app for testing.

@palkan
Member

palkan commented Oct 17, 2024

> if it's necessary I could probably duplicate the traffic to a local app for testing

Not necessary, but it would help to investigate this further.
