Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic: runtime error: close of closed channel #99

Merged
merged 2 commits into from
Jan 6, 2015

Conversation

mreiferson
Copy link
Member

I'm guessing this is user error, but I'm getting a panic when stopping a consumer as of this commit: 4e74fa1. Reading the comment in there, I'm trying to understand what's going on.

My shutdown code looks something like this:

func (n *NSQPeer) Teardown() {
    n.producer.Stop()
    if n.consumer != nil {
        n.consumer.DisconnectFromNSQD(n.host)
        n.consumer.Stop()
        <-n.consumer.StopChan
    }
}

Here's the error:

runtime.panic(0x509600, 0x84c1d5)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/panic.c:279 +0xf5
github.com/bitly/go-nsq.(*Consumer).exit(0xc20800ef00)
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/consumer.go:1082 +0x2e
github.com/bitly/go-nsq.func·007()
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/consumer.go:990 +0x29
created by time.goFunc
        /usr/local/Cellar/go/1.3/libexec/src/pkg/time/sleep.go:121 +0x47

Can anyone provide some insight on why I'm seeing this now?

@mreiferson
Copy link
Member

Hi @tylertreat can you paste the rest of the logs from that process, if possible?

@tylertreat
Copy link
Author

It waits several seconds and then panics. This is what it logs prior to the panic:

2014/12/29 15:37:30 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) connecting to nsqd
2014/12/29 15:37:30 INF    1 (192.168.59.103:5000) connecting to nsqd
2014/12/29 15:37:33 INF    2 stopping
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] stopping...
2014/12/29 15:37:33 ERR    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) IO error - EOF
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) beginning close
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) readLoop exiting
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) breaking out of writeLoop
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) writeLoop exiting
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) finished draining, cleanup exiting
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] (192.168.59.103:5000) clean close complete
2014/12/29 15:37:33 WRN    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] there are 0 connections left alive
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] stopping handlers
2014/12/29 15:37:33 INF    3 [test/3ycJmFzBLa9EkbRFcvSn5ZexOc9YOB7e] rdyLoop exiting
2014/12/29 15:37:33 INF    1 stopping
2014/12/29 15:37:33 INF    1 exiting router
2014/12/29 15:37:33 ERR    1 (192.168.59.103:5000) IO error - EOF
2014/12/29 15:37:33 INF    1 (192.168.59.103:5000) beginning close
2014/12/29 15:37:33 INF    1 (192.168.59.103:5000) readLoop exiting
2014/12/29 15:37:33 INF    1 (192.168.59.103:5000) breaking out of writeLoop
2014/12/29 15:37:33 INF    1 (192.168.59.103:5000) writeLoop exiting
2014/12/29 15:37:33 INF    1 (192.168.59.103:5000) finished draining, cleanup exiting
2014/12/29 15:37:33 INF    1 (192.168.59.103:5000) clean close complete

@mreiferson
Copy link
Member

Hmmm, how long does it wait, exactly? 30 seconds?

Can you share more of the code for the worker?

@tylertreat
Copy link
Author

Yeah, it looks like it's 30 seconds. Here's the producer/consumer code I'm using:

https://gist.github.com/tylertreat/23db79bdfb30e6d01762

@mreiferson
Copy link
Member

Thanks, and how about a simplified example of what's using that code? I'm trying to identify where Terminate is being called..

@tylertreat
Copy link
Author

A very simplified example of the code using it. Basically, the publisher and subscriber are running in two different goroutines.

https://gist.github.com/tylertreat/65269a28931fb0471b1a

@mreiferson
Copy link
Member

If it's stalling for 30 seconds, it means something is blocked and can't make progress to exit cleanly. Can you produce a stack trace using SIGQUIT while the process is sitting for that timeout?

@tylertreat
Copy link
Author

Here's the full, unadulterated stack trace:

2014/12/30 11:20:08 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) connecting to nsqd
2014/12/30 11:20:08 INF    1 (192.168.59.103:5000) connecting to nsqd
2014/12/30 11:20:11 INF    2 stopping
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] stopping...
2014/12/30 11:20:11 ERR    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) IO error - EOF
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) beginning close
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) readLoop exiting
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) breaking out of writeLoop
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) writeLoop exiting
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) finished draining, cleanup exiting
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] (192.168.59.103:5000) clean close complete
2014/12/30 11:20:11 WRN    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] there are 0 connections left alive
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] stopping handlers
2014/12/30 11:20:11 INF    3 [test/Ku4zatRNjyGYvesD0FVM6LMK4nfGVAPz] rdyLoop exiting
2014/12/30 11:20:11 INF    1 stopping
2014/12/30 11:20:11 INF    1 exiting router
2014/12/30 11:20:11 ERR    1 (192.168.59.103:5000) IO error - EOF
2014/12/30 11:20:11 INF    1 (192.168.59.103:5000) beginning close
2014/12/30 11:20:11 INF    1 (192.168.59.103:5000) readLoop exiting
2014/12/30 11:20:11 INF    1 (192.168.59.103:5000) breaking out of writeLoop
2014/12/30 11:20:11 INF    1 (192.168.59.103:5000) writeLoop exiting
2014/12/30 11:20:11 INF    1 (192.168.59.103:5000) finished draining, cleanup exiting
2014/12/30 11:20:11 INF    1 (192.168.59.103:5000) clean close complete

^\SIGQUIT: quit
PC=0x2db43

goroutine 17 [syscall]:
runtime.notetsleepg(0x9a7f60, 0xdf8475800)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/lock_sema.c:263 +0x71 fp=0x9a7f38 sp=0x9a7f08
runtime.MHeap_Scavenger()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/mheap.c:532 +0xa3 fp=0x9a7fa8 sp=0x9a7f38
runtime.goexit()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1445 fp=0x9a7fb0 sp=0x9a7fa8
created by runtime.main
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:207

goroutine 16 [select]:
github.com/gdamore/mangos.(*socket).RecvMsg(0xc208080000, 0x0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:199 +0x230
github.com/gdamore/mangos.(*socket).Recv(0xc208080000, 0x0, 0x0, 0x0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:218 +0x54
github.com/tylertreat/Flotilla/flotilla-server/daemon.(*Daemon).loop(0xc2080043c0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/tylertreat/Flotilla/flotilla-server/daemon/daemon.go:111 +0x6f
github.com/tylertreat/Flotilla/flotilla-server/daemon.(*Daemon).Start(0xc2080043c0, 0x251c, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/tylertreat/Flotilla/flotilla-server/daemon/daemon.go:106 +0x158
main.main()
        /Users/tylertreat/Go/src/github.com/tylertreat/Flotilla/flotilla-server/main.go:33 +0x25b

goroutine 19 [finalizer wait]:
runtime.park(0x146d0, 0x8509f8, 0x84f4a9)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x8509f8, 0x84f4a9)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1445

goroutine 20 [select]:
github.com/gdamore/mangos/protocol/rep.(*rep).sender(0xc2080041e0)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:104 +0x329
created by github.com/gdamore/mangos/protocol/rep.(*rep).Init
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:47 +0xa3

goroutine 21 [IO wait]:
net.runtime_pollWait(0x949c50, 0x72, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc20802a060, 0x72, 0x0, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc20802a060, 0x0, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).accept(0xc20802a000, 0x68d720, 0x0, 0x9483f0, 0x23)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/fd_unix.go:409 +0x343
net.(*TCPListener).AcceptTCP(0xc208036018, 0xc208088580, 0x0, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/tcpsock_posix.go:234 +0x5d
github.com/gdamore/mangos/transport/tcp.(*tcpAccepter).Accept(0xc208082580, 0x0, 0x0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/transport/tcp/tcp.go:46 +0x55
github.com/gdamore/mangos.(*socket).serve(0xc208080000, 0x949d00, 0xc208082580)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:299 +0x61
created by github.com/gdamore/mangos.(*socket).Listen
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:322 +0x2bf

goroutine 23 [select]:
github.com/gdamore/mangos/protocol/rep.(*repEp).sender(0xc20806b290)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:53 +0x143
created by github.com/gdamore/mangos/protocol/rep.(*rep).AddEndpoint
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:157 +0x156

goroutine 25 [select]:
github.com/gdamore/mangos/protocol/rep.(*repEp).sender(0xc20806b710)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:53 +0x143
created by github.com/gdamore/mangos/protocol/rep.(*rep).AddEndpoint
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:157 +0x156

goroutine 42 [chan send]:
github.com/bitly/go-nsq.(*producerConnDelegate).OnClose(0xc208036068, 0xc2080a0a20)
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/delegates.go:143 +0x52
github.com/bitly/go-nsq.(*Conn).waitForCleanup(0xc2080a0a20)
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/conn.go:669 +0xb0
created by github.com/bitly/go-nsq.func·002
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/conn.go:618 +0xe2

rax     0xe
rbx     0x1dcd9d18
rcx     0x9a7eb0
rdx     0x1dcd9d18
rdi     0xe03
rsi     0x39
rbp     0x39
rsp     0x9a7eb0
r8      0x629003
r9      0x7
r10     0x47895553
r11     0x212
r12     0x195b2e2d4521
r13     0x1a44be04e490
r14     0x13b4b618c3438000
r15     0x251c
rip     0x2db43
rflags  0x212
cs      0x7
fs      0x0
gs      0x0
exit status 2

@mreiferson
Copy link
Member

Well, this is suspicious - looks like Producer is stuck trying to delegate the OnClose "event"

goroutine 42 [chan send]:
github.com/bitly/go-nsq.(*producerConnDelegate).OnClose(0xc208036068, 0xc2080a0a20)
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/delegates.go:143 +0x52
github.com/bitly/go-nsq.(*Conn).waitForCleanup(0xc2080a0a20)
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/conn.go:669 +0xb0
created by github.com/bitly/go-nsq.func·002
        /Users/tylertreat/Go/src/github.com/bitly/go-nsq/conn.go:618 +0xe2

I'll look into it...

@tylertreat
Copy link
Author

May or may not help, but the code is actually publicly available now:

https://github.com/tylertreat/Flotilla/blob/master/flotilla-server/daemon/broker/nsq/nsq.go

@mreiferson
Copy link
Member

Cool, thanks.

P.S. now that I can see what you're building you might find this useful https://github.com/bitly/nsq/blob/master/bench/bench.py

@mreiferson
Copy link
Member

@tylertreat try this commit that I pushed here

@jehiah RFR

@mreiferson
Copy link
Member

@tylertreat PS you don't need to explicitly disconnect from nsqd in Teardown(), i.e.:

func (n *NSQPeer) Teardown() {
    n.producer.Stop()
    if n.consumer != nil {
        n.consumer.Stop()
        <-n.consumer.StopChan
    }
}

@tylertreat
Copy link
Author

Unfortunately, I still seem to be running into the same issue, even with that change:

panic: close of closed channel

goroutine 31 [running]:
github.com/bitly/go-nsq.(*Consumer).exit(0xc208052500)
        /Users/Tyler/Go/src/github.com/bitly/go-nsq/consumer.go:1082 +0x2f
github.com/bitly/go-nsq.func·007()
        /Users/Tyler/Go/src/github.com/bitly/go-nsq/consumer.go:990 +0x2a
created by time.goFunc
        /usr/local/go/src/time/sleep.go:129 +0x4b

@mreiferson
Copy link
Member

There's still something blocking, would you mind pasting the SIGQUIT goroutine dump again (looks like code has moved around on the flotilla side too, so it's hard to track without a new dump)?

Thanks for your help debugging...

@tylertreat
Copy link
Author

I'm not seeing the producer goroutine from before.

2015/01/06 09:42:10 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) connecting to nsqd
2015/01/06 09:42:10 INF    1 (192.168.59.103:5000) connecting to nsqd
2015/01/06 09:42:33 INF    2 stopping
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] stopping...
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) received CLOSE_WAIT from nsqd
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) beginning close
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) readLoop exiting
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) breaking out of writeLoop
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) writeLoop exiting
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) finished draining, cleanup exiting
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] (192.168.59.103:5000) clean close complete
2015/01/06 09:42:33 WRN    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] there are 0 connections left alive
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] stopping handlers
2015/01/06 09:42:33 INF    3 [test/mqDSylblq9LohhNCV9ftLyfaU4OPkvt2] rdyLoop exiting
2015/01/06 09:42:33 INF    1 stopping
2015/01/06 09:42:33 INF    1 exiting router
2015/01/06 09:42:33 ERR    1 (192.168.59.103:5000) IO error - EOF
2015/01/06 09:42:33 INF    1 (192.168.59.103:5000) beginning close
2015/01/06 09:42:33 INF    1 (192.168.59.103:5000) readLoop exiting
2015/01/06 09:42:33 INF    1 (192.168.59.103:5000) breaking out of writeLoop
2015/01/06 09:42:33 INF    1 (192.168.59.103:5000) writeLoop exiting
2015/01/06 09:42:33 INF    1 (192.168.59.103:5000) finished draining, cleanup exiting
2015/01/06 09:42:33 INF    1 (192.168.59.103:5000) clean close complete

^\SIGQUIT: quit
PC=0x2db43

goroutine 17 [syscall]:
runtime.notetsleepg(0x9b3f60, 0xdf8475800)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/lock_sema.c:263 +0x71 fp=0x9b3f38 sp=0x9b3f08
runtime.MHeap_Scavenger()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/mheap.c:532 +0xa3 fp=0x9b3fa8 sp=0x9b3f38
runtime.goexit()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1445 fp=0x9b3fb0 sp=0x9b3fa8
created by runtime.main
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:207

goroutine 16 [select]:
github.com/gdamore/mangos.(*socket).RecvMsg(0xc208080000, 0x0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:199 +0x230
github.com/gdamore/mangos.(*socket).Recv(0xc208080000, 0x0, 0x0, 0x0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:218 +0x54
github.com/tylertreat/Flotilla/flotilla-server/daemon.(*Daemon).loop(0xc2080043c0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/tylertreat/Flotilla/flotilla-server/daemon/daemon.go:146 +0x6f
github.com/tylertreat/Flotilla/flotilla-server/daemon.(*Daemon).Start(0xc2080043c0, 0x251c, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/tylertreat/Flotilla/flotilla-server/daemon/daemon.go:141 +0x158
main.main()
        /Users/tylertreat/Go/src/github.com/tylertreat/Flotilla/flotilla-server/main.go:33 +0x25b

goroutine 19 [finalizer wait]:
runtime.park(0x146d0, 0x859a10, 0x8584a9)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x859a10, 0x8584a9)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/proc.c:1445

goroutine 20 [select]:
github.com/gdamore/mangos/protocol/rep.(*rep).sender(0xc208004180)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:104 +0x329
created by github.com/gdamore/mangos/protocol/rep.(*rep).Init
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:47 +0xa3

goroutine 21 [IO wait]:
net.runtime_pollWait(0x955d80, 0x72, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc20802a060, 0x72, 0x0, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc20802a060, 0x0, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).accept(0xc20802a000, 0x693048, 0x0, 0x9543f0, 0x23)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/fd_unix.go:409 +0x343
net.(*TCPListener).AcceptTCP(0xc208036018, 0xc208088500, 0x0, 0x0)
        /usr/local/Cellar/go/1.3/libexec/src/pkg/net/tcpsock_posix.go:234 +0x5d
github.com/gdamore/mangos/transport/tcp.(*tcpAccepter).Accept(0xc208082580, 0x0, 0x0, 0x0, 0x0)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/transport/tcp/tcp.go:46 +0x55
github.com/gdamore/mangos.(*socket).serve(0xc208080000, 0x955e30, 0xc208082580)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:299 +0x61
created by github.com/gdamore/mangos.(*socket).Listen
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/core.go:322 +0x2bf

goroutine 23 [select]:
github.com/gdamore/mangos/protocol/rep.(*repEp).sender(0xc20806b290)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:53 +0x143
created by github.com/gdamore/mangos/protocol/rep.(*rep).AddEndpoint
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:157 +0x156

goroutine 25 [select]:
github.com/gdamore/mangos/protocol/rep.(*repEp).sender(0xc20806b710)
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:53 +0x143
created by github.com/gdamore/mangos/protocol/rep.(*rep).AddEndpoint
        /Users/tylertreat/Go/src/github.com/gdamore/mangos/protocol/rep/rep.go:157 +0x156

rax     0xe
rbx     0x0
rcx     0x9b3eb0
rdx     0x0
rdi     0xe03
rsi     0x3c
rbp     0x3c
rsp     0x9b3eb0
r8      0x15643fc
r9      0xf
r10     0x3a01b295a
r11     0x212
r12     0x4a408ba51d27
r13     0x4c4f2b7d2870
r14     0x13b6d6da66ed1c00
r15     0x251c
rip     0x2db43
rflags  0x212
cs      0x7
fs      0x0
gs      0x0
exit status 2

@mreiferson
Copy link
Member

So strange, there's no NSQ code even left in the trace 😭

What commands can I run to reproduce on my own? Also what version of Go are you using?

@tylertreat
Copy link
Author

Yeah, basically it looks like consumer.stopHandlers gets called after the consumer starts sleeping for 30 seconds which results in the channel getting closed before that goroutine wakes up. Thus, trying to close it a second time. Haven't been able to trace through all the code to figure out why though.

I've tried this with Go 1.3 and 1.4.

You can try running flotilla locally if you want. If you don't have Docker/boot2docker installed, you could just point it to NSQ running on localhost. Just comment out the orchestrator Start/Stop code and point it to localhost instead.

Then just start the daemon with go run main.go and run the client with something like:

go run main.go --broker=nsq --num-messages=10000

You should see the benchmark complete and get the results back. Then the daemon panics 30 seconds later.

@mreiferson
Copy link
Member

😆

Ok, now that I see how this thing works... it never occurred to me that the process was long-lived.

The commit here does fix the "blocking" issue that we saw in the original stack trace but there's another bug for long lived processes where go-nsq is still firing a 30second timer that causes the duplicate close, even when it's already cleaned up successfully.

I'll fix, thanks!

@mreiferson
Copy link
Member

@tylertreat pushed another commit, I can now run 2 successful runs and the daemon no longer panics:

2015/01/06 11:00:08 Started container nsqio/nsqlookupd: d551c9ecd0cad23d8508d4bd00046f954346e3fb67c83b623bc781d251d4cf5a
2015/01/06 11:00:09 Started container nsqio/nsqd: 50bc0aba50385489761632f2af0146ee6554ef13a88224990300ee057d30bad0
2015/01/06 11:00:17 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) connecting to nsqd
2015/01/06 11:00:17 INF    1 (192.168.59.103:5000) connecting to nsqd
2015/01/06 11:00:18 Publisher completed
2015/01/06 11:01:29 Subscriber completed
2015/01/06 11:01:30 Benchmark completed
2015/01/06 11:01:30 INF    2 stopping
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] stopping...
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) received CLOSE_WAIT from nsqd
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) beginning close
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) readLoop exiting
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) breaking out of writeLoop
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) writeLoop exiting
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) finished draining, cleanup exiting
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] (192.168.59.103:5000) clean close complete
2015/01/06 11:01:30 WRN    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] there are 0 connections left alive
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] stopping handlers
2015/01/06 11:01:30 INF    3 [test/aGCPdRS77nUp9Ti05jg2gQjm9H1uJTSR] rdyLoop exiting
2015/01/06 11:01:30 INF    1 stopping
2015/01/06 11:01:30 INF    1 exiting router
2015/01/06 11:01:30 ERR    1 (192.168.59.103:5000) IO error - EOF
2015/01/06 11:01:30 INF    1 (192.168.59.103:5000) beginning close
2015/01/06 11:01:30 INF    1 (192.168.59.103:5000) readLoop exiting
2015/01/06 11:01:30 INF    1 (192.168.59.103:5000) breaking out of writeLoop
2015/01/06 11:01:30 INF    1 (192.168.59.103:5000) writeLoop exiting
2015/01/06 11:01:30 INF    1 (192.168.59.103:5000) finished draining, cleanup exiting
2015/01/06 11:01:30 INF    1 (192.168.59.103:5000) clean close complete
2015/01/06 11:01:30 Stopped container nsqio/nsqlookupd: d551c9ecd0cad23d8508d4bd00046f954346e3fb67c83b623bc781d251d4cf5a
2015/01/06 11:01:31 Stopped container nsqio/nsqd: 50bc0aba50385489761632f2af0146ee6554ef13a88224990300ee057d30bad0
2015/01/06 11:02:18 Started container nsqio/nsqlookupd: 331720df5572a3ca134c68e30a497a8e45f2bcdfa7422be4f957bf7cb8923a38
2015/01/06 11:02:18 Started container nsqio/nsqd: 4076725c1e96d1f86187cc5de27f229e5450ec617473922343a3f954049d830a
2015/01/06 11:02:26 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) connecting to nsqd
2015/01/06 11:02:26 INF    4 (192.168.59.103:5000) connecting to nsqd
2015/01/06 11:02:27 Publisher completed
2015/01/06 11:02:58 Subscriber completed
2015/01/06 11:02:59 Benchmark completed
2015/01/06 11:02:59 INF    5 stopping
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] stopping...
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) received CLOSE_WAIT from nsqd
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) beginning close
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) readLoop exiting
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) breaking out of writeLoop
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) writeLoop exiting
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) finished draining, cleanup exiting
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] (192.168.59.103:5000) clean close complete
2015/01/06 11:02:59 WRN    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] there are 0 connections left alive
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] stopping handlers
2015/01/06 11:02:59 INF    6 [test/iLeyqCoOv9B2N2nxhAu1B0JBVPuesfEM] rdyLoop exiting
2015/01/06 11:02:59 INF    4 stopping
2015/01/06 11:02:59 INF    4 exiting router
2015/01/06 11:02:59 ERR    4 (192.168.59.103:5000) IO error - EOF
2015/01/06 11:02:59 INF    4 (192.168.59.103:5000) beginning close
2015/01/06 11:02:59 INF    4 (192.168.59.103:5000) readLoop exiting
2015/01/06 11:02:59 INF    4 (192.168.59.103:5000) breaking out of writeLoop
2015/01/06 11:02:59 INF    4 (192.168.59.103:5000) writeLoop exiting
2015/01/06 11:02:59 INF    4 (192.168.59.103:5000) finished draining, cleanup exiting
2015/01/06 11:02:59 INF    4 (192.168.59.103:5000) clean close complete
2015/01/06 11:02:59 Stopped container nsqio/nsqlookupd: 331720df5572a3ca134c68e30a497a8e45f2bcdfa7422be4f957bf7cb8923a38
2015/01/06 11:03:00 Stopped container nsqio/nsqd: 4076725c1e96d1f86187cc5de27f229e5450ec617473922343a3f954049d830a

@mreiferson
Copy link
Member

really ready @jehiah

jehiah added a commit that referenced this pull request Jan 6, 2015
panic: runtime error: close of closed channel
@jehiah jehiah merged commit 28c8ae1 into nsqio:master Jan 6, 2015
@tylertreat
Copy link
Author

Yup, looks good on my end. Thanks for tracking that down and fixing it @mreiferson!

@mreiferson mreiferson deleted the close_race_99 branch January 6, 2015 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants