Added read timeout to socket #254
Conversation
Setting it to true means you won't be able to close such connections via the management UI, or rather, they will immediately re-connect.
```ruby
@socket.read_fully(*args)
begin
  if @read_write_timeout
    Bunny::Timeout.timeout(@read_write_timeout, Bunny::ClientTimeout) do
```
I'm really not happy about bringing `Bunny::Timeout.timeout` back. It creates a new thread per invocation, which:

- confuses the hell out of you during debugging
- has a very noticeable throughput effect

Please find another way to specify a timeout on the socket. I can't merge this while it uses `Bunny::Timeout.timeout`.
Okay, I didn't delve deeper into the implementation of the timeout class. I've read Mike's article and I'll see what I can do :) I have no issues with keeping the default at 'false' for the setting, but then perhaps it should be made clearer, with a robust example noting the negative aspects of the option?
@foeken it is documented. I'm not sure what else we can do.
Agreed, I did not see that. In an ideal world we would be able to distinguish between a forced kill during shutdown/restart and a manual action ;) Perhaps the provided message can be used to that effect?
We can't, because the only way to detect a management UI close is via the message, which can be altered by whoever performs the operation. Please let's keep this PR about switching socket timeouts away from Ruby's Timeout everywhere we can.
Okay, no problem. I'll start working on it tomorrow. It might take some figuring out how to get the low level socket behaviour without sacrificing platform compatibility though, but I'll try my best.
@foeken we already set some socket options conditionally in…
```diff
@@ -183,8 +183,20 @@ def flush
     @socket.flush if @socket
   end

-  def read_fully(*args)
-    @socket.read_fully(*args)
+  def read_fully(count)
```
I did not see any other uses of this method, so I changed it.
I did some tests using native socket timeout options (…). Looking at the memcache-client from @mperham, they still use them though: https://github.com/mperham/memcache-client/blob/master/lib/memcache.rb#L1084 The IO.select timeout method seems to be the most widely adopted for reliable and speedy timeouts. The support for this was already there; I just hooked it in. I'll start to yank the timeouts for the writes next.
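For reference, the IO.select-based read timeout idea looks roughly like this. It is only a sketch with illustrative names, not Bunny's actual internals:

```ruby
require "socket"
require "timeout"

# Minimal sketch of an IO.select-based read timeout (illustrative names).
# IO.select blocks for at most `timeout` seconds waiting for the socket to
# become readable, so no timer thread is ever spawned.
def read_fully_with_timeout(sock, count, timeout)
  buffer = "".dup
  while buffer.bytesize < count
    begin
      buffer << sock.read_nonblock(count - buffer.bytesize)
    rescue IO::WaitReadable
      unless IO.select([sock], nil, nil, timeout)
        raise Timeout::Error, "socket read timed out after #{timeout}s"
      end
      retry
    end
  end
  buffer
end
```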
@foeken thanks. I'm curious what should be used for writes then.
I am using this: https://github.com/lann/tcp-timeout-ruby/blob/master/lib/tcp_timeout.rb as a template. But so far the main issue is how to simulate a write timeout in the first place :D
@foeken using an iptables rule that drops traffic (instead of rejecting it).
@foeken if you don't have much time to properly test things with iptables, feel free to run any stress tests you can. In the end, only running things under load for a long time can prove or disprove the effectiveness of either approach. We also have benchmarks that can be used to assess the throughput effect of a change.
Okay, I have a basic working setup. The write will time out, but only if the socket buffer is filled and it has to block while sending. So if we write less than the buffer size it will not time out (yet). I can probably fix that by using blocking writes (just like before; I rewrote the wrong method now). Currently I have only implemented the C Ruby socket method; the SSLSockets and JRuby parts are somewhat outside my comfort zone. Could I just follow the same strategy that is currently used for the SSLSocket on the read part?
@foeken just do the same for JRuby. If JRuby's socket implementation in 1.7 has issues, it's a JRuby problem, and 1.8 has fixed a ton of I/O related incompatibilities and issues. We don't generally support JRuby because March Hare is a much better option on JRuby anyway, with a nearly identical API. So don't lose sleep over JRuby compatibility.
Okay, the current implementation is finished. I ended up keeping the non-blocking write. I was wrong in assuming this might cause issues: looking at the way write_nonblock works, I believe this will be fine, as it writes the data to the OS buffer right away. I left the JRuby implementation alone as you suggested and finished the SSL implementation using the read_fully method as an example.

I ran all the benchmarks and there is no noticeable increase or decrease in performance on my machine. Locally I keep having issues getting the lower level tests consistently green. This is also the case for the current master, so whether these changes have an effect I am not sure. I had similar issues in the previous PR.

I am having several issues with the stress tests. Previously the read method would just wait forever for the RabbitMQ server to 'catch up'. The stress tests publish massive amounts of messages and then start to read from the socket. It takes way longer for RabbitMQ to respond to those reads than the current read/write timeout. In order to get those tests running I would either have to disable the timeout by default or increase it to somewhere in the region of 2 minutes. I am not sure how to proceed here.
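For reference, the non-blocking write path described above can be sketched roughly like this (illustrative names, not the exact code from the PR):

```ruby
require "socket"
require "timeout"

# Rough sketch of a non-blocking write with a timeout. write_nonblock hands
# data to the OS send buffer right away and only raises IO::WaitWritable once
# that buffer is full; at that point IO.select enforces the write timeout.
def write_fully_with_timeout(sock, data, timeout)
  written = 0
  while written < data.bytesize
    begin
      written += sock.write_nonblock(data.byteslice(written, data.bytesize - written))
    rescue IO::WaitWritable
      unless IO.select(nil, [sock], nil, timeout)
        raise Timeout::Error, "socket write timed out after #{timeout}s"
      end
      retry
    end
  end
  written
end
```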
Okay, we ran into an interesting issue. The current read timeout needs to be higher than the heartbeat interval of the server. The RabbitMQ default for heartbeats is 580 seconds. If it is lower, the connection will constantly time out due to my 'fix', since the reader_loop just waits until it times out over and over again.

Our original issue was the socket being kept alive due to stunnel, causing the connection to hang. This should have been fixed automatically by the heartbeat, but heartbeat frames are discarded and the client does not check whether there has been a heartbeat timeout.

If we increase the default socket timeout to the same or a higher value than the negotiated heartbeat value, we have a solution. But the real/proper solution would be to actually check for a heartbeat timeout. The current changes do have merit, since they remove some instances of the Timeout class, but they do not actually improve performance.
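The "real/proper solution" mentioned here, actually detecting a missed heartbeat rather than relying on a long read timeout, could look roughly like the sketch below. All names are hypothetical, not Bunny's reader loop:

```ruby
# Sketch only: hypothetical names. By convention a peer is usually treated
# as dead after roughly two missed heartbeat intervals.
def heartbeat_timed_out?(last_frame_received_at, negotiated_heartbeat)
  return false if negotiated_heartbeat.nil? || negotiated_heartbeat.zero?

  (Time.now - last_frame_received_at) > (negotiated_heartbeat * 2)
end

# In a reader loop, every received frame (including heartbeat frames) would
# refresh last_frame_received_at; if the check above ever returns true, the
# connection is treated as lost and recovery can kick in instead of the loop
# waiting on the socket forever.
```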
```diff
@@ -39,7 +41,7 @@ def initialize(session, host, port, opts)
     @logger      = session.logger
     @tls_enabled = tls_enabled?(opts)

     @read_write_timeout = opts[:socket_timeout] || 3
```
It might be a good idea to split the read and write timeout here. If we are using the heartbeat * 2 strategy, the write timeout has essentially increased from 3 seconds to a potential 2 * 580 seconds, which might be a bit too long.
Go for it.
Okay, I split the two timeouts. Why is the read_write_timeout used as the continuations timeout, btw? I think this is ready now; could you review the changes?
@foeken feel free to introduce another timeout. As for timeouts vs. heartbeats, I think it's fine to set the value higher than the heartbeat and recommend using lower heartbeat values, since they eliminate an issue that socket timeouts won't: it takes a long time to detect a lost TCP connection with common default OS configurations.

I'd expect that rejected packets would immediately result in a socket timeout. Try it. Dropping packets is a pretty extreme case. If that's what happens, I'm curious whether we can simply stay with a fairly low (say, 10 seconds) socket timeout by default.
Okay, I fixed some inconsistent naming and split both timeouts. Rejected sent packets do cause a timeout after the WRITE_TIMEOUT value (5 seconds). The READ_TIMEOUT is set to twice the heartbeat value. I believe this is finished.
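For reference, the split-timeout shape described above might look roughly like this. The constant and option names are illustrative, and `negotiated_heartbeat` is a hypothetical accessor; only the 5-second write timeout and the "read timeout = heartbeat * 2" values come from the discussion:

```ruby
# Sketch only: illustrative names, not necessarily what was merged.
WRITE_TIMEOUT = 5 # seconds: a write that blocks this long is treated as failed

def configure_timeouts(opts, negotiated_heartbeat)
  write_timeout = opts[:write_timeout] || WRITE_TIMEOUT
  # Keep the read timeout above the heartbeat interval so an idle but healthy
  # connection (only heartbeat frames flowing) never trips it.
  read_timeout  = opts[:read_timeout] || negotiated_heartbeat * 2
  [read_timeout, write_timeout]
end
```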
@foeken fantastic, thank you so much. I will get to it right after we ship RabbitMQ 3.4.0, which should be next week.
Cool. We just deployed the code to a server. So we'll even get some actual results from it :)
I fixed two distinct things that we found in production (maybe these should be separated later):
👍
We are having some other issues on the server:
It worked with the previous version, so something is still broken. I'll continue my search Monday. If you have any ideas I would appreciate them.
Interesting. This typically means a serialisation problem (which we haven't had reported for over 6 months) or that multiple write operations on the socket were not synchronised. I suspect the…
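If unsynchronised writes are indeed the cause, the usual remedy is to funnel every socket write through a single mutex. A minimal sketch (class and variable names are illustrative, not Bunny's actual transport code):

```ruby
# Sketch only: serialise all writes to the shared socket through one mutex so
# frames written from different threads cannot interleave on the wire.
class SerializedWriter
  def initialize(socket)
    @socket      = socket
    @writes_lock = Mutex.new
  end

  def write(data)
    @writes_lock.synchronize { @socket.write(data) }
  end
end
```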
Aha! That's a good place to start ;)
I cycled the host index when we retried a connection. It was possible to get stuck in a state where the hosts would never cycle and you'd get a hanging Bunny consumer/publisher. Unfortunately I have not been able to reproduce the handshake errors. For now it seems to be very solid. I'll keep it running and see what else comes up.
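For illustration, cycling the host index on every retry might look roughly like this (names are illustrative; this is not the exact change from the PR):

```ruby
# Sketch only: advance to the next configured host on every reconnection
# attempt so a retry loop never sticks to a single dead node.
# @hosts and @host_index are illustrative names.
def next_host
  @host_index = (@host_index + 1) % @hosts.size
  @hosts[@host_index]
end
```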
@foeken thanks for the update. I'll take a look first thing after RabbitMQ 3.4.0 is out (should be tomorrow).
Merged it manually. I also discovered why some tests failed when executed together with connection recovery ones; should be fine now.
I have a question about timeouts. In the codebase there is, for example:

```ruby
Bunny::Timeout.timeout(wait_on_continuations_timeout, ClientTimeout) do
  @last_queue_bind_ok = wait_on_continuations
end
```

Note that:

```ruby
def wait_on_continuations_timeout
  @connection.transport_write_timeout
end
```

However, `wait_on_continuations` already passes a timeout to the underlying queue:

```ruby
@continuations.poll(@connection.continuation_timeout)
```

I'd like to merge these two timeouts into one if possible, and use the "native" timeout for it. It seems most logical to me to just use `continuation_timeout`. Is there any reason not to eliminate the extra timeouts here?
@jemc your idea can work quite well. Note that continuation timeouts are actually configured by users, unlike write timeouts, and assume an exception in the calling thread, not the I/O one (although if we make I/O timeouts trigger a recovery, that may be good enough). Also note that we have a single-threaded mode which is still supported. That shouldn't change anything for timeouts, but the continuations implementation may look less than obvious because of it.
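For illustration, a "native" timed poll that needs no Timeout.timeout helper thread can be built from a mutex and a condition variable. This is only a sketch of the idea, not Bunny's actual continuation queue:

```ruby
require "timeout"

# Sketch only: a timed poll built on Mutex + ConditionVariable, so waiting
# for a continuation needs no Timeout.timeout thread.
class TimedContinuationQueue
  def initialize
    @items = []
    @lock  = Mutex.new
    @cond  = ConditionVariable.new
  end

  def push(item)
    @lock.synchronize do
      @items << item
      @cond.signal
    end
  end

  def poll(timeout_in_seconds)
    deadline = Time.now + timeout_in_seconds
    @lock.synchronize do
      while @items.empty?
        remaining = deadline - Time.now
        raise Timeout::Error, "continuation timed out" if remaining <= 0
        @cond.wait(@lock, remaining)
      end
      @items.shift
    end
  end
end
```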
After some production runs the code seems to hold up properly. To create a foolproof consumer you will need to set the `recover_from_connection_close` option. If one node is removed (graceful shutdown for maintenance) the connections are closed. Without this setting the consumer will not reconnect. Perhaps it should default to true?
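For example, the option can be passed when the connection is created. A minimal sketch, where the URI and credentials are placeholders and only the `recover_from_connection_close` option name comes from this discussion:

```ruby
require "bunny"

# Enable automatic recovery when the server closes the connection
# (e.g. a node being stopped for maintenance). Placeholder URI/credentials.
conn = Bunny.new("amqp://guest:guest@localhost:5672",
                 recover_from_connection_close: true)
conn.start
```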
One issue we encountered was that in some cases the socket would block while reconnecting. I have not been able to find a specific reason why this is happening, but the result is a consumer that 'hangs'.
The code below adds the `@read_write_timeout` behaviour to reading too; previously it was only applied to writing to the socket. This will ensure that even in these cases the reconnect mechanism will kick in. We found no other issues in production.
Update: The blocking socket was probably caused by stunnel, which we use to enable SSL with client certificates.