
Added read timeout to socket #254

Closed
wants to merge 13 commits into from

Conversation

@foeken (Contributor) commented Oct 14, 2014

After some production runs, the code seems to hold up properly. To create a foolproof consumer you will need to set the recover_from_connection_close option. If one node is removed (graceful shutdown for maintenance), the connections are closed; without this setting the consumer will not reconnect.

Perhaps it should default to true?
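
(For illustration, a minimal sketch of how a consumer might opt into this behaviour; the URI and recovery interval below are placeholders, not values from this PR.)

    require "bunny"

    # Hedged example: recover_from_connection_close makes Bunny re-open the
    # connection after a server-initiated connection.close (e.g. a node being
    # shut down for maintenance).
    conn = Bunny.new("amqp://guest:guest@localhost:5672",
                     recover_from_connection_close: true,
                     network_recovery_interval: 5)
    conn.start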

One issue we encountered was that in some cases the socket would block while reconnecting. I have not been able to find a specific reason why this happens, but the result is a consumer that 'hangs'.

The code below adds the @read_write_timeout behaviour to reading too; previously it was only applied to writes to the socket. This ensures that even in these cases the reconnect mechanism will kick in.

We found no other issues in production.

Update: The blocking socket was probably caused by stunnel, which we use to enable SSL with client certificates.

@michaelklishin (Member)

Setting it to true means you won't be able to close such connections via the management UI, or rather, they will immediately reconnect.

-      @socket.read_fully(*args)
+      begin
+        if @read_write_timeout
+          Bunny::Timeout.timeout(@read_write_timeout, Bunny::ClientTimeout) do
Review comment (Member):

I'm really not happy about bringing Bunny::Timeout.timeout back. It creates a new thread per invocation which

  • Confuses the hell out of you during debugging
  • Has a very noticeable effect on throughput

Please find another way to specify a timeout on the socket.
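
(For context, a rough sketch, not Bunny's actual code, of what a thread-per-call timeout looks like; this is where the debugging noise and throughput cost come from.)

    require "timeout"

    # Each invocation spins up a watchdog thread that sleeps for the allotted
    # time and then raises an exception into the calling thread.
    def naive_timeout(seconds, klass = Timeout::Error)
      caller_thread = Thread.current
      watchdog = Thread.new do
        sleep(seconds)
        caller_thread.raise(klass, "execution expired")
      end
      yield
    ensure
      watchdog.kill if watchdog
    end

    # Usage: naive_timeout(5) { socket.read(frame_size) }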

@michaelklishin (Member)

I can't merge this, it uses Bunny::Timeout.timeout which we've been removing where possible. It creates a new thread per invocation and that leads to all kinds of unexpected stuff.

@michaelklishin (Member)

Socket::SO_SNDTIMEO and Socket::SO_RCVTIMEO sound like what we want. Let's get rid of Bunny::Timeout for socket writes, too, while we are at it. Mike Perham discovered this a while ago.
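
(A hedged sketch of the setsockopt approach being suggested here; as noted later in this thread, these options turned out not to be honoured by Ruby's IO layer on modern Rubies, so this is illustrative only.)

    require "socket"

    sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
    # struct timeval { tv_sec = 5, tv_usec = 0 }, the common packing idiom
    timeval = [5, 0].pack("l_2")
    sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, timeval)
    sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDTIMEO, timeval)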

@foeken (Contributor, Author) commented Oct 14, 2014

Okay, I didn't delve deeper into the implementation for the timeout class. I've read Mike's article and I'll see what I can do :)

I have no issues with keeping the default at 'false' for the setting, but then perhaps the negative aspects of the option should be made clearer, e.g. in a robust usage example?

@michaelklishin (Member)

@foeken it is documented. I'm not sure what else we can do.

@foeken (Contributor, Author) commented Oct 14, 2014

Agreed, I did not see that. In an ideal world we would be able to distinguish a shutdown/restart/force kill from a manual action ;) Perhaps the provided close message could be used to that effect?

@michaelklishin (Member)

We can't, because the only way to detect a management UI close is via the message, which can be altered by whoever performs the operation. Please let's keep this PR about switching socket timeouts away from Ruby's Timeout everywhere we can.

@foeken (Contributor, Author) commented Oct 14, 2014

Okay, no problem. I'll start working on it tomorrow. It might take some figuring out to get the low-level socket behaviour right without sacrificing platform compatibility, though, but I'll try my best.

@michaelklishin (Member)

@foeken we already set some socket options conditionally in the Bunny::Socket implementation for CRuby.

@@ -183,8 +183,20 @@ def flush
   @socket.flush if @socket
 end

-def read_fully(*args)
-  @socket.read_fully(*args)
+def read_fully(count)
Review comment (Contributor, Author):

I did not see any other uses of this method, so I changed it.

@foeken (Contributor, Author) commented Oct 15, 2014

I did some tests using the native socket timeout options (SO_RCVTIMEO and SO_SNDTIMEO), but they seem to be very broken on any Ruby above 1.8.7. I tried on three operating systems, and only on 1.8.7 did the native timeouts work (confirmed by: http://moret1979.wordpress.com/2014/09/03/socket-read-timeout/).

Looking at @mperham's memcache-client, it still uses them (https://github.com/mperham/memcache-client/blob/master/lib/memcache.rb#L1084), although I highly doubt they have any effect.

The IO.select-based timeout approach seems to be the most widely adopted for reliable and fast timeouts. Support for it was already there; I just hooked it in.
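
(A minimal sketch of an IO.select-based read timeout along these lines; the method and error names are illustrative, not the exact Bunny::Transport code.)

    require "socket"

    ReadTimeout = Class.new(StandardError)

    # Read exactly `count` bytes, waiting at most `timeout` seconds whenever
    # the socket has no data available.
    def read_fully_with_timeout(socket, count, timeout)
      buffer = ""
      while buffer.bytesize < count
        begin
          buffer << socket.read_nonblock(count - buffer.bytesize)
        rescue IO::WaitReadable
          unless IO.select([socket], nil, nil, timeout)
            raise ReadTimeout, "no data received within #{timeout}s"
          end
          retry
        end
      end
      buffer
    end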

I'll start to yank the timeouts for the writes next.

@michaelklishin (Member)

@foeken thanks. I'm curious what should be used for writes then.

@foeken (Contributor, Author) commented Oct 15, 2014

I am using this as a template: https://github.com/lann/tcp-timeout-ruby/blob/master/lib/tcp_timeout.rb. But so far the main issue is how to simulate a write timeout in the first place :D
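
(A sketch of the write side following that template; names are illustrative, not the final Bunny code.)

    require "socket"

    WriteTimeout = Class.new(StandardError)

    # Write all of `data`, waiting at most `timeout` seconds whenever the
    # socket's send buffer is full and the write would block.
    def write_fully_with_timeout(socket, data, timeout)
      written = 0
      while written < data.bytesize
        begin
          written += socket.write_nonblock(data.byteslice(written..-1))
        rescue IO::WaitWritable
          unless IO.select(nil, [socket], nil, timeout)
            raise WriteTimeout, "socket not writable within #{timeout}s"
          end
          retry
        end
      end
      written
    end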

@michaelklishin (Member)

@foeken use an iptables rule that drops traffic (instead of rejecting it).

@michaelklishin (Member)

@foeken if you don't have much time to properly test things with iptables, feel free to run any stress tests you can. In the end, only running things under load for a long time can prove or disprove the effectiveness of either approach.

We also have benchmarks that can be used to assess the throughput effect of a change.

@foeken (Contributor, Author) commented Oct 15, 2014

Okay, I have a basic working setup. The write will time out, but only if the socket buffer is full and it has to block while sending. So if we write less than the buffer size, it will not time out (yet). I can probably fix that by using blocking writes (just like before; I rewrote the wrong method for now).

Currently I have only implemented the CRuby socket method; the SSLSocket and JRuby parts are somewhat outside my comfort zone. Could I just follow the same strategy that is currently used for the SSLSocket on the read part?

@michaelklishin (Member)

@foeken just do the same for JRuby. If JRuby's socket implementation in 1.7 has issues, it's a JRuby problem, and 1.8 has fixed a ton of I/O related incompatibilities and issues.

We don't generally support JRuby because March Hare is a much better option on JRuby anyway, with a nearly identical API. So don't lose sleep over JRuby compatibility.

@foeken (Contributor, Author) commented Oct 15, 2014

Okay, the current implementation is finished. I ended up keeping the non-blocking write; I was wrong in assuming it might cause issues. Looking at the way write_nonblock works, I believe this will be fine, as it writes the data to the OS buffer right away.

I left the JRuby implementation alone as you suggested and finished the SSL implementation using the read_fully method as an example.

I ran all the benchmarks and there is no noticeable increase or decrease in performance on my machine.

Locally I keep having issues getting the lower-level tests consistently green. This is also the case for the current master, so I am not sure whether these changes have an effect. I had similar issues in the previous PR.

I am having several issues with the stress tests. Previously the read method would just wait forever for the RabbitMQ server to 'catch up'. The stress tests publish massive numbers of messages and then start to read from the socket. It takes RabbitMQ far longer to respond to those reads than the current read/write timeout allows.

In order to get those tests running I would either have to disable the timeout by default or increase it to somewhere in the region of 2 minutes. I am not sure how to proceed here.

@foeken (Contributor, Author) commented Oct 15, 2014

Okay, we ran into an interesting issue. The current read timeout needs to be higher than the heartbeat interval of the server. The RabbitMQ default for heartbeats is 580 seconds.

If it is lower, the connection will constantly time out because of my 'fix', since the reader_loop just waits until it times out, over and over again.

Our original issue was the socket being kept alive by stunnel, causing the connection to hang. This should have been fixed automatically by the heartbeat, but heartbeat frames are discarded and the client does not check whether there has been a heartbeat timeout.

If we increase the default socket timeout to the same as, or higher than, the negotiated heartbeat value, we have a solution. But the real/proper solution would be to actually check for a heartbeat timeout.

The current changes do have merit since they remove some instances of the Timeout class, but they do not actually improve performance.
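
(A sketch of the heartbeat-relative read timeout idea described above; the constant and method names are illustrative, not the exact ones from the final commits.)

    DEFAULT_READ_TIMEOUT = 5

    # Keep the socket read timeout comfortably above the negotiated heartbeat
    # interval so an idle but healthy connection is never treated as timed out.
    def effective_read_timeout(negotiated_heartbeat)
      if negotiated_heartbeat && negotiated_heartbeat > 0
        negotiated_heartbeat * 2
      else
        DEFAULT_READ_TIMEOUT
      end
    end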

@@ -39,7 +41,7 @@ def initialize(session, host, port, opts)
@logger = session.logger
@tls_enabled = tls_enabled?(opts)

@read_write_timeout = opts[:socket_timeout] || 3
Review comment (Contributor, Author):

It might be a good idea to split the read and write timeouts here. If we are using the heartbeat * 2 strategy, the write timeout has essentially increased from 3 seconds to a potential 2 * 580 seconds, which might be a bit too long.

Review comment (Member):

Go for it.

@foeken (Contributor, Author) commented Oct 16, 2014

Okay, I split the two timeouts. Why is the read_write_timeout used for the continuations timeout, btw?

I think this is ready now, could you review the changes?

@michaelklishin (Member)

@foeken feel free to introduce another timeout.

As for timeouts vs. heartbeats, I think it's fine to set the value higher than the heartbeat and to recommend using lower heartbeat values, since heartbeats eliminate an issue that socket timeouts won't: it takes a long time to detect a lost TCP connection with common default OS configurations.

I'd expect that rejected packets would immediately result in a socket timeout. Try it. Dropping packets is a pretty extreme case. If that's what happens, I'm curious whether we can simply stay with a fairly low (say, 10 seconds) socket timeout by default.

@foeken (Contributor, Author) commented Oct 17, 2014

Okay I fixed some inconsistent naming and split both timeouts. Rejected sent packets do cause a timeout after the WRITE_TIMEOUT value (5 seconds). The READ_TIMEOUT is set to twice the heartbeat value.

I believe this is finished.

@michaelklishin (Member)

@foeken fantastic, thank you so much. I will get to it right after we ship RabbitMQ 3.4.0, which should be next week.

@foeken (Contributor, Author) commented Oct 17, 2014

Cool. We just deployed the code to a server. So we'll even get some actual results from it :)

@foeken (Contributor, Author) commented Oct 17, 2014

I fixed two distinct things that we found in production (maybe these should be separated later):

  • The logger seemed to cache the first host even after it had switched hosts (causing confusion)
  • When the connection breaks after a successful connect, it would never cycle to the next host

@michaelklishin (Member)

👍

@foeken (Contributor, Author) commented Oct 17, 2014

We are having some other issues on the server:

=INFO REPORT==== 17-Oct-2014::16:04:54 ===
accepting AMQP connection <0.977.0> (127.0.0.1:34746 -> 127.0.0.1:5672)

=ERROR REPORT==== 17-Oct-2014::16:04:57 ===
closing AMQP connection <0.977.0> (127.0.0.1:34746 -> 127.0.0.1:5672):
{handshake_error,starting,25953,
                {amqp_error,frame_error,
                            "type 114, all octets = <<>>: {frame_too_large,1679849327,4088}",
                            none}}

It worked with the previous version, so something is still broken. I'll continue my search on Monday. If you have any ideas I would appreciate them.

@michaelklishin (Member)

Interesting. This typically means a serialisation problem (which we haven't had reported for over 6 months) or that multiple write operations on the socket were not synchronised. I suspect the write_nonblock_fully changes.

@foeken (Contributor, Author) commented Oct 17, 2014

Aha! That's a good place to start ;)


@foeken (Contributor, Author) commented Oct 20, 2014

I made the host index cycle when a connection is retried. It was possible to get stuck in a state where the hosts would never cycle and you'd end up with a hanging Bunny consumer/publisher.
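
(An illustrative sketch of cycling through the configured hosts on retry; the class and host names are placeholders, not the actual Bunny session internals.)

    # Rotate through the known hosts so a retry never gets stuck on a dead node.
    class HostRotation
      def initialize(hosts)
        @hosts = hosts
        @index = 0
      end

      def next_host
        host = @hosts[@index % @hosts.size]
        @index += 1
        host
      end
    end

    rotation = HostRotation.new(%w[rabbit1.internal rabbit2.internal])
    rotation.next_host # => "rabbit1.internal"
    rotation.next_host # => "rabbit2.internal"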

Unfortunately I have not been able to reproduce the handshake errors. For now it seems to be very solid. I'll keep it running and see what else comes up.

@michaelklishin (Member)

@foeken thanks for the update. I'll take a look first thing after RabbitMQ 3.4.0 is out (should be tomorrow).

@michaelklishin (Member)

Merged it manually. I also discovered why some tests failed when executed together with the connection recovery ones; should be fine now.

@jemc (Contributor) commented May 30, 2015

I have a question about timeouts. I used git blame to check history of commits for the lines in question and it led me here, so I thought I'd comment in this thread rather than raise a new issue.

In the codebase for Channel, (almost) all of the calls to wait_on_continuations are surrounded by an enclosing timeout, as in this example:

      Bunny::Timeout.timeout(wait_on_continuations_timeout, ClientTimeout) do
        @last_queue_bind_ok = wait_on_continuations
      end

Note that wait_on_continuations_timeout is just an alias:

    def wait_on_continuations_timeout
      @connection.transport_write_timeout
    end

However, in wait_on_continuations, we see another timeout for the blocking call:

@continuations.poll(@connection.continuation_timeout)

I'd like to merge these two timeouts into one if possible, and use the "native" timeout for ContinuationQueue instead of using a watchdog-thread-based timeout (Bunny::Timeout). One approach would be to determine which timeout of the two is lesser at runtime, then use that one. However, I'm not sure whether we actually need both timeout values.
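
(A sketch of the "use the lesser of the two" option; plain arguments stand in for the real Channel/Connection objects, so the names here are illustrative.)

    # Pick the stricter of the two configured timeouts, ignoring nils.
    def effective_continuation_timeout(transport_write_timeout, continuation_timeout)
      [transport_write_timeout, continuation_timeout].compact.min
    end

    # The call site could then collapse to a single timeout:
    #   @last_queue_bind_ok = @continuations.poll(effective_continuation_timeout(...))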

It seems most logical to me to just use @connection.continuation_timeout and ignore @connection.transport_write_timeout, because this is a timeout around a protocol operation, not a write operation. However, I don't fully understand all of @foeken's logic in these commits and this thread, and I didn't want to undo work I didn't understand the reason for.

Is there any reason not to eliminate the extra timeouts here?

@michaelklishin (Member)

@jemc your idea can work quite well. Note that continuation timeouts are actually configured by the users, unlike write timeouts, and assume an exception in the calling thread, not the I/O one (although if we make I/O timeouts trigger a recovery, that may be good enough).

Also note that we have a single-threaded mode which is still supported. Shouldn't change anything for timeouts but continuations implementation may look less than obvious because of that.
