
Added read timeout to socket #254

Closed
wants to merge 13 commits into from

Conversation

@foeken (Contributor) commented Oct 14, 2014

After some production runs, the code seems to hold up properly. To create a foolproof consumer you will need to set the recover_from_connection_close option. If one node is removed (graceful shutdown for maintenance), the connections are closed; without this setting the consumer will not reconnect.

Perhaps it should default to true?
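
(For illustration, a minimal sketch of how a consumer might opt into this behaviour; the URI and recovery interval below are placeholders, not values from this PR.)

    require "bunny"

    # Hedged example: recover_from_connection_close makes Bunny re-open the
    # connection after a server-initiated connection.close (e.g. a node being
    # shut down for maintenance).
    conn = Bunny.new("amqp://guest:guest@localhost:5672",
                     recover_from_connection_close: true,
                     network_recovery_interval: 5)
    conn.start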

One issue we encountered was that in some cases the socket would block while reconnecting. I have not been able to find a specific reason why this happens, but the result is a consumer that 'hangs'.

The code below adds the @read_write_timeout behaviour to reading too; previously it was only applied to writes to the socket. This ensures that even in these cases the reconnect mechanism will kick in.

We found no other issues in production.

Update: The blocking socket was probably caused by stunnel, which we use to enable SSL with client certificates.

@michaelklishin (Member)

Setting it to true means you won't be able to close such connections via the management UI, or rather, they will immediately reconnect.

-      @socket.read_fully(*args)
+      begin
+        if @read_write_timeout
+          Bunny::Timeout.timeout(@read_write_timeout, Bunny::ClientTimeout) do
Review comment (Member):

I'm really not happy about bringing Bunny::Timeout.timeout back. It creates a new thread per invocation which

  • Confuses the hell out of you during debugging
  • Has a very noticeable effect on throughput

Please find another way to specify a timeout on the socket.
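
(For context, a rough sketch, not Bunny's actual code, of what a thread-per-call timeout looks like; this is where the debugging noise and throughput cost come from.)

    require "timeout"

    # Each invocation spins up a watchdog thread that sleeps for the allotted
    # time and then raises an exception into the calling thread.
    def naive_timeout(seconds, klass = Timeout::Error)
      caller_thread = Thread.current
      watchdog = Thread.new do
        sleep(seconds)
        caller_thread.raise(klass, "execution expired")
      end
      yield
    ensure
      watchdog.kill if watchdog
    end

    # Usage: naive_timeout(5) { socket.read(frame_size) }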

@michaelklishin (Member)

I can't merge this, it uses Bunny::Timeout.timeout which we've been removing where possible. It creates a new thread per invocation and that leads to all kinds of unexpected stuff.

@michaelklishin (Member)

Socket::SO_SNDTIMEO and Socket::SO_RCVTIMEO sound like what we want. Let's get rid of Bunny::Timeout for socket writes, too, while we are at it. Mike Perham discovered this a while ago.
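
(A hedged sketch of the setsockopt approach being suggested here; as noted later in this thread, these options turned out not to be honoured by Ruby's IO layer on modern Rubies, so this is illustrative only.)

    require "socket"

    sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
    # struct timeval { tv_sec = 5, tv_usec = 0 }, the common packing idiom
    timeval = [5, 0].pack("l_2")
    sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, timeval)
    sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDTIMEO, timeval)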

@foeken (Contributor, Author) commented Oct 14, 2014

Okay, I didn't delve deeper into the implementation for the timeout class. I've read Mike's article and I'll see what I can do :)

I have no issues with keeping the default at 'false' for the setting, but then perhaps the negative aspects of the option should be made clearer, e.g. in a robust usage example?

@michaelklishin (Member)

@foeken it is documented. I'm not sure what else we can do.

@foeken (Contributor, Author) commented Oct 14, 2014

Agreed, I did not see that. In an ideal world we would be able to distinguish a shutdown/restart/force kill from a manual action ;) Perhaps the provided close message could be used to that effect?

@michaelklishin (Member)

We can't, because the only way to detect a management UI close is via the message, which can be altered by whoever performs the operation. Please let's keep this PR about switching socket timeouts away from Ruby's Timeout everywhere we can.

@foeken (Contributor, Author) commented Oct 14, 2014

Okay, no problem. I'll start working on it tomorrow. It might take some figuring out to get the low-level socket behaviour right without sacrificing platform compatibility, though, but I'll try my best.

@michaelklishin (Member)

@foeken we already set some socket options conditionally in the Bunny::Socket implementation for CRuby.

@@ -183,8 +183,20 @@ def flush
   @socket.flush if @socket
 end

-def read_fully(*args)
-  @socket.read_fully(*args)
+def read_fully(count)
Review comment (Contributor, Author):

I did not see any other uses of this method, so I changed it.

@foeken (Contributor, Author) commented Oct 15, 2014

I did some tests using the native socket timeout options (SO_RCVTIMEO and SO_SNDTIMEO), but they seem to be very broken on any Ruby above 1.8.7. I tried on three operating systems, and only on 1.8.7 did the native timeouts work (confirmed by: http://moret1979.wordpress.com/2014/09/03/socket-read-timeout/).

Looking at @mperham's memcache-client, it still uses them (https://github.com/mperham/memcache-client/blob/master/lib/memcache.rb#L1084), although I highly doubt they have any effect.

The IO.select-based timeout approach seems to be the most widely adopted for reliable and fast timeouts. Support for it was already there; I just hooked it in.
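
(A minimal sketch of an IO.select-based read timeout along these lines; the method and error names are illustrative, not the exact Bunny::Transport code.)

    require "socket"

    ReadTimeout = Class.new(StandardError)

    # Read exactly `count` bytes, waiting at most `timeout` seconds whenever
    # the socket has no data available.
    def read_fully_with_timeout(socket, count, timeout)
      buffer = ""
      while buffer.bytesize < count
        begin
          buffer << socket.read_nonblock(count - buffer.bytesize)
        rescue IO::WaitReadable
          unless IO.select([socket], nil, nil, timeout)
            raise ReadTimeout, "no data received within #{timeout}s"
          end
          retry
        end
      end
      buffer
    end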

I'll start to yank the timeouts for the writes next.

@michaelklishin (Member)

@foeken thanks. I'm curious what should be used for writes then.

@foeken (Contributor, Author) commented Oct 15, 2014

I am using this as a template: https://github.com/lann/tcp-timeout-ruby/blob/master/lib/tcp_timeout.rb. But so far the main issue is how to simulate a write timeout in the first place :D
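
(A sketch of the write side following that template; names are illustrative, not the final Bunny code.)

    require "socket"

    WriteTimeout = Class.new(StandardError)

    # Write all of `data`, waiting at most `timeout` seconds whenever the
    # socket's send buffer is full and the write would block.
    def write_fully_with_timeout(socket, data, timeout)
      written = 0
      while written < data.bytesize
        begin
          written += socket.write_nonblock(data.byteslice(written..-1))
        rescue IO::WaitWritable
          unless IO.select(nil, [socket], nil, timeout)
            raise WriteTimeout, "socket not writable within #{timeout}s"
          end
          retry
        end
      end
      written
    end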

@michaelklishin (Member)

@foeken use an iptables rule that drops traffic (instead of rejecting it).

@michaelklishin (Member)

@foeken if you don't have much time to properly test things with iptables, feel free to run any stress tests you can. In the end, only running things under load for a long time can prove or disprove the effectiveness of either approach.

We also have benchmarks that can be used to assess the throughput effect of a change.

@foeken (Contributor, Author) commented Oct 15, 2014

Okay, I have a basic working setup. The write will time out, but only if the socket buffer is full and it has to block while sending. So if we write less than the buffer size, it will not time out (yet). I can probably fix that by using blocking writes (just like before; I rewrote the wrong method for now).

Currently I have only implemented the CRuby socket method; the SSLSocket and JRuby parts are somewhat outside my comfort zone. Could I just follow the same strategy that is currently used for the SSLSocket on the read part?

@michaelklishin (Member)

@foeken just do the same for JRuby. If JRuby's socket implementation in 1.7 has issues, it's a JRuby problem, and 1.8 has fixed a ton of I/O related incompatibilities and issues.

We don't generally support JRuby because March Hare is a much better option on JRuby anyway, with a nearly identical API. So don't lose sleep over JRuby compatibility.

@foeken (Contributor, Author) commented Oct 15, 2014

Okay, the current implementation is finished. I ended up keeping the non-blocking write; I was wrong in assuming it might cause issues. Looking at the way write_nonblock works, I believe this will be fine, as it writes the data to the OS buffer right away.

I left the JRuby implementation alone as you suggested and finished the SSL implementation using the read_fully method as an example.

I ran all the benchmarks and there is no noticeable increase or decrease in performance on my machine.

Locally I keep having issues getting the lower-level tests consistently green. This is also the case for the current master, so I am not sure whether these changes have an effect. I had similar issues in the previous PR.

I am having several issues with the stress tests. Previously the read method would just wait forever for the RabbitMQ server to 'catch up'. The stress tests publish massive numbers of messages and then start to read from the socket. It takes RabbitMQ far longer to respond to those reads than the current read/write timeout allows.

In order to get those tests running I would either have to disable the timeout by default or increase it to somewhere in the region of 2 minutes. I am not sure how to proceed here.

@foeken (Contributor, Author) commented Oct 15, 2014

Okay, we ran into an interesting issue. The current read timeout needs to be higher than the heartbeat interval of the server. The RabbitMQ default for heartbeats is 580 seconds.

If it is lower, the connection will constantly time out because of my 'fix', since the reader_loop just waits until it times out, over and over again.

Our original issue was the socket being kept alive by stunnel, causing the connection to hang. This should have been fixed automatically by the heartbeat, but heartbeat frames are discarded and the client does not check whether there has been a heartbeat timeout.

If we increase the default socket timeout to the same as, or higher than, the negotiated heartbeat value, we have a solution. But the real/proper solution would be to actually check for a heartbeat timeout.

The current changes do have merit since they remove some instances of the Timeout class, but they do not actually improve performance.
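
(A sketch of the heartbeat-relative read timeout idea described above; the constant and method names are illustrative, not the exact ones from the final commits.)

    DEFAULT_READ_TIMEOUT = 5

    # Keep the socket read timeout comfortably above the negotiated heartbeat
    # interval so an idle but healthy connection is never treated as timed out.
    def effective_read_timeout(negotiated_heartbeat)
      if negotiated_heartbeat && negotiated_heartbeat > 0
        negotiated_heartbeat * 2
      else
        DEFAULT_READ_TIMEOUT
      end
    end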

@@ -39,7 +41,7 @@ def initialize(session, host, port, opts)
@logger = session.logger
@tls_enabled = tls_enabled?(opts)

@read_write_timeout = opts[:socket_timeout] || 3
Review comment (Contributor, Author):

It might be a good idea to split the read and write timeouts here. If we are using the heartbeat * 2 strategy, the write timeout has essentially increased from 3 seconds to a potential 2 * 580 seconds, which might be a bit too long.

Review comment (Member):

Go for it.

@foeken (Contributor, Author) commented Oct 16, 2014

Okay, I split the two timeouts. Why is the read_write_timeout used for the continuations timeout, btw?

I think this is ready now, could you review the changes?

@michaelklishin (Member)

@foeken feel free to introduce another timeout.

As for timeouts vs. heartbeats, I think it's fine to set the value higher than the heartbeat and to recommend using lower heartbeat values, since heartbeats eliminate an issue that socket timeouts won't: it takes a long time to detect a lost TCP connection with common default OS configurations.

I'd expect that rejected packets would immediately result in a socket timeout. Try it. Dropping packets is a pretty extreme case. If that's what happens, I'm curious whether we can simply stay with a fairly low (say, 10 seconds) socket timeout by default.

@foeken (Contributor, Author) commented Oct 17, 2014

Okay I fixed some inconsistent naming and split both timeouts. Rejected sent packets do cause a timeout after the WRITE_TIMEOUT value (5 seconds). The READ_TIMEOUT is set to twice the heartbeat value.

I believe this is finished.

@michaelklishin (Member)

@foeken fantastic, thank you so much. I will get to it right after we ship RabbitMQ 3.4.0, which should be next week.

@foeken (Contributor, Author) commented Oct 17, 2014

Cool. We just deployed the code to a server. So we'll even get some actual results from it :)

@foeken (Contributor, Author) commented Oct 17, 2014

I fixed two distinct things that we found in production (maybe these should be separated later):

  • The logger seemed to cache the first host even after it had switched hosts (causing confusion)
  • When the connection breaks after a successful connect, it would never cycle to the next host

@michaelklishin (Member)

👍

@foeken (Contributor, Author) commented Oct 17, 2014

We are having some other issues on the server:

=INFO REPORT==== 17-Oct-2014::16:04:54 ===
accepting AMQP connection <0.977.0> (127.0.0.1:34746 -> 127.0.0.1:5672)

=ERROR REPORT==== 17-Oct-2014::16:04:57 ===
closing AMQP connection <0.977.0> (127.0.0.1:34746 -> 127.0.0.1:5672):
{handshake_error,starting,25953,
                {amqp_error,frame_error,
                            "type 114, all octets = <<>>: {frame_too_large,1679849327,4088}",
                            none}}

It worked with the previous version, so something is still broken. I'll continue my search on Monday. If you have any ideas I would appreciate them.

@michaelklishin (Member)

Interesting. This typically means a serialisation problem (which we haven't had reported for over 6 months) or that multiple write operations on the socket were not synchronised. I suspect the write_nonblock_fully changes.

@foeken (Contributor, Author) commented Oct 17, 2014

Aha! That's a good place to start ;)


@foeken (Contributor, Author) commented Oct 20, 2014

I made the host index cycle when a connection is retried. It was possible to get stuck in a state where the hosts would never cycle and you'd end up with a hanging Bunny consumer/publisher.
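
(An illustrative sketch of cycling through the configured hosts on retry; the class and host names are placeholders, not the actual Bunny session internals.)

    # Rotate through the known hosts so a retry never gets stuck on a dead node.
    class HostRotation
      def initialize(hosts)
        @hosts = hosts
        @index = 0
      end

      def next_host
        host = @hosts[@index % @hosts.size]
        @index += 1
        host
      end
    end

    rotation = HostRotation.new(%w[rabbit1.internal rabbit2.internal])
    rotation.next_host # => "rabbit1.internal"
    rotation.next_host # => "rabbit2.internal"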

Unfortunately I have not been able to reproduce the handshake errors. For now it seems to be very solid. I'll keep it running and see what else comes up.

@michaelklishin (Member)

@foeken thanks for the update. I'll take a look first thing after RabbitMQ 3.4.0 is out (should be tomorrow).

@michaelklishin (Member)

Merged it manually. I also discovered why some tests failed when executed together with the connection recovery ones; should be fine now.

@jemc (Contributor) commented May 30, 2015

I have a question about timeouts. I used git blame to check history of commits for the lines in question and it led me here, so I thought I'd comment in this thread rather than raise a new issue.

In the codebase for Channel, (almost) all of the calls to wait_on_continuations are surrounded by an enclosing timeout, as in this example:

      Bunny::Timeout.timeout(wait_on_continuations_timeout, ClientTimeout) do
        @last_queue_bind_ok = wait_on_continuations
      end

Note that wait_on_continuations_timeout is just an alias:

    def wait_on_continuations_timeout
      @connection.transport_write_timeout
    end

However, in wait_on_continuations, we see another timeout for the blocking call:

@continuations.poll(@connection.continuation_timeout)

I'd like to merge these two timeouts into one if possible, and use the "native" timeout for ContinuationQueue instead of using a watchdog-thread-based timeout (Bunny::Timeout). One approach would be to determine which timeout of the two is lesser at runtime, then use that one. However, I'm not sure whether we actually need both timeout values.
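
(A sketch of the "use the lesser of the two" option; plain arguments stand in for the real Channel/Connection objects, so the names here are illustrative.)

    # Pick the stricter of the two configured timeouts, ignoring nils.
    def effective_continuation_timeout(transport_write_timeout, continuation_timeout)
      [transport_write_timeout, continuation_timeout].compact.min
    end

    # The call site could then collapse to a single timeout:
    #   @last_queue_bind_ok = @continuations.poll(effective_continuation_timeout(...))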

It seems most logical to me to just use @connection.continuation_timeout and ignore @connection.transport_write_timeout, because this is a timeout around a protocol operation, not a write operation. However, I don't fully understand all of @foeken's logic in these commits and this thread, and I didn't want to undo work I didn't understand the reason for.

Is there any reason not to eliminate the extra timeouts here?

@michaelklishin (Member)

@jemc your idea can work quite well. Note that continuation timeouts are actually configured by the users, unlike write timeouts, and assume an exception in the calling thread, not the I/O one (although if we make I/O timeouts trigger a recovery, that may be good enough).

Also note that we have a single-threaded mode which is still supported. Shouldn't change anything for timeouts but continuations implementation may look less than obvious because of that.
