Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsq_to_nsq: producer never reconnects in some cases #130

Closed
twmb opened this issue Mar 31, 2015 · 3 comments · Fixed by #132
Closed

nsq_to_nsq: producer never reconnects in some cases #130

twmb opened this issue Mar 31, 2015 · 3 comments · Fixed by #132

Comments

@twmb
Copy link
Contributor

twmb commented Mar 31, 2015

18:15:13 [aggregate]: finished 250 - 99th: 186.30ms - 95th: 184.76ms - avg: 184.38ms
18:19:23 [aggregate]: finished 250 - 99th: 187.25ms - 95th: 185.06ms - avg: 184.42ms
18:23:33 [aggregate]: finished 250 - 99th: 185.87ms - 95th: 184.81ms - avg: 184.38ms
18:28:38 ERR    2 (this.is.a.cname:4050) IO error - write tcp 123.45.67.890:4050: i/o timeout
18:28:38 ERR    2 (this.is.a.cname:4050) sending command - write tcp 123.45.67.890:4050: i/o timeout
18:28:38 ERR    2 (this.is.a.cname:4050) IO error - write tcp 123.45.67.890:4050: i/o timeout
18:28:38 ERR    2 (this.is.a.cname:4050) sending command - write tcp 123.45.67.890:4050: i/o timeout
18:28:38 ERR    2 (this.is.a.cname:4050) IO error - EOF
18:28:38 INF    2 (this.is.a.cname:4050) beginning close
18:28:38 INF    2 (this.is.a.cname:4050) readLoop exiting
18:28:38 INF    2 (this.is.a.cname:4050) breaking out of writeLoop
18:28:38 INF    2 (this.is.a.cname:4050) writeLoop exiting
18:28:38 INF    2 (this.is.a.cname:4050) finished draining, cleanup exiting
18:28:38 INF    2 (this.is.a.cname:4050) clean close complete
18:28:38 WRN    1 [my-logs/nsq_to_nsq] backing off for 2.0000 seconds (backoff level 1), setting all to RDY 0
18:28:38 [aggregate]: finished 250 - 99th: 60153.95ms - 95th: 6481.75ms - avg: 2178.38ms
18:28:40 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
18:29:38 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) IO error - read tcp 127.0.0.1:4050: i/o timeout
18:29:38 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) delaying close, 24 outstanding messages
18:29:38 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) readLoop exiting
18:30:03 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) IO error - write tcp 127.0.0.1:4050: connection reset by peer
18:30:03 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) error sending RDY 1 - write tcp 127.0.0.1:4050: connection reset by peer
18:30:03 WRN    1 [my-logs/nsq_to_nsq] backing off for 4.0000 seconds (backoff level 2), setting all to RDY 0
18:30:03 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) IO error - write tcp 127.0.0.1:4050: connection reset by peer
18:30:03 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) error sending command REQ 0811ec9736c5d000 90000 - write tcp 127.0.0.1:4050: connection reset by peer
18:30:03 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) beginning close
18:30:03 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) IO error - write tcp 127.0.0.1:4050: connection reset by peer
18:30:03 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) error sending command REQ 0811ec983085d000 90000 - write tcp 127.0.0.1:4050: connection reset by peer
18:30:03 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) breaking out of writeLoop
18:30:03 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) writeLoop exiting
18:30:03 INF    2 exiting router
18:30:03 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) finished draining, cleanup exiting
18:30:03 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) clean close complete
18:30:03 WRN    1 [my-logs/nsq_to_nsq] there are 0 connections left alive
18:30:03 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) re-connecting in 15 seconds...
18:30:18 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) connecting to nsqd
18:30:18 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) upgrading to TLS
19:01:43 WRN    1 [my-logs/nsq_to_nsq] backing off for 4.0000 seconds (backoff level 2), setting all to RDY 0
19:01:47 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
19:01:47 INF    2 (this.is.a.cname:4050) connecting to nsqd
19:01:48 ERR    2 (this.is.a.cname:4050) error connecting to nsqd - dial tcp 123.45.67.890:4050: i/o timeout
19:01:48 ERR    1 [my-logs/nsq_to_nsq] Handler returned error (dial tcp 123.45.67.890:4050: i/o timeout) for msg 0811f33e490d7000
19:01:48 WRN    1 [my-logs/nsq_to_nsq] backing off for 8.0000 seconds (backoff level 3), setting all to RDY 0
19:01:56 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
19:01:56 INF    2 (this.is.a.cname:4050) connecting to nsqd
19:02:00 INF    2 (this.is.a.cname:4050) upgrading to TLS
19:02:05 ERR    2 (this.is.a.cname:4050) error connecting to nsqd - failed to IDENTIFY - tls: oversized record received with length 19968
19:02:05 ERR    1 [my-logs/nsq_to_nsq] Handler returned error (failed to IDENTIFY - tls: oversized record received with length 19968) for msg 0811f33f428d7000
19:02:05 WRN    1 [my-logs/nsq_to_nsq] backing off for 15.0000 seconds (backoff level 4), setting all to RDY 0
19:02:20 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
19:02:20 INF    2 (this.is.a.cname:4050) connecting to nsqd
19:03:06 INF    2 (this.is.a.cname:4050) upgrading to TLS
19:03:06 ERR    2 (this.is.a.cname:4050) error connecting to nsqd - failed to IDENTIFY - tls: oversized record received with length 19968
19:03:06 ERR    1 [my-logs/nsq_to_nsq] Handler returned error (failed to IDENTIFY - tls: oversized record received with length 19968) for msg 0811f3403d0d7000
19:03:06 WRN    1 [my-logs/nsq_to_nsq] backing off for 15.0000 seconds (backoff level 4), setting all to RDY 0
19:03:21 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
19:03:21 INF    2 (this.is.a.cname:4050) connecting to nsqd
19:04:10 INF    2 (this.is.a.cname:4050) upgrading to TLS
19:04:10 ERR    2 (this.is.a.cname:4050) error connecting to nsqd - failed to IDENTIFY - tls: oversized record received with length 19968
19:04:10 ERR    1 [my-logs/nsq_to_nsq] Handler returned error (failed to IDENTIFY - tls: oversized record received with length 19968) for msg 0811f341364d7000
19:04:10 WRN    1 [my-logs/nsq_to_nsq] backing off for 15.0000 seconds (backoff level 4), setting all to RDY 0
19:04:25 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
19:04:25 INF    2 (this.is.a.cname:4050) connecting to nsqd
19:04:25 INF    2 (this.is.a.cname:4050) upgrading to TLS
19:04:26 WRN    1 [my-logs/nsq_to_nsq] backing off for 8.0000 seconds (backoff level 3), setting all to RDY 0
19:04:34 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
19:04:34 WRN    1 [my-logs/nsq_to_nsq] backing off for 4.0000 seconds (backoff level 2), setting all to RDY 0
19:06:30 ERR    2 (this.is.a.cname:4050) IO error - EOF
19:06:30 INF    2 (this.is.a.cname:4050) beginning close
19:06:30 INF    2 (this.is.a.cname:4050) readLoop exiting
19:06:30 INF    2 (this.is.a.cname:4050) breaking out of writeLoop
19:06:30 INF    2 (this.is.a.cname:4050) writeLoop exiting
19:06:30 ERR    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) IO error - EOF
19:06:30 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) beginning close
19:06:30 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) readLoop exiting
19:06:30 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) breaking out of writeLoop
19:06:30 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) writeLoop exiting
19:06:30 WRN    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) backoff timeout expired, sending RDY 1
19:06:30 INF    2 (this.is.a.cname:4050) finished draining, cleanup exiting
19:06:30 INF    2 (this.is.a.cname:4050) clean close complete
19:06:30 INF    2 exiting router
19:06:30 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) finished draining, cleanup exiting
19:06:30 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) clean close complete
19:06:30 WRN    1 [my-logs/nsq_to_nsq] there are 0 connections left alive
19:06:30 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) re-connecting in 15 seconds...
19:06:45 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) connecting to nsqd
19:06:45 INF    1 [my-logs/nsq_to_nsq] (127.0.0.1:4050) upgrading to TLS

After both, at the end, where it says "upgrading to TLS", the producer side was also alive. I fixed by restarting nsq_to_nsq, which spammed finished 250 logs, aka, it was down for hours.

@mreiferson
Copy link
Member

@twmb thanks for the report - are we missing some sort of timeout on the TLS handshake? Is that where it got stuck?

@mreiferson
Copy link
Member

I can't see why a handshake wouldn't timeout, since it's presumably based on the underlying connection's deadlines (which we set before handshaking).

If this happens again, kill the process via SIGABRT so that it generates a goroutine dump - we'll hopefully be able to figure out where it's getting stuck.

@mreiferson
Copy link
Member

see #132, thanks for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants