Possible deadlock when using an aggregator #2914

R7R8 · 2017-06-13T08:05:37Z

Bug report

version: master, release 1.3
Telegraf stops sending metrics after 3 days.

Relevant telegraf.conf:

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "11s"
  flush_jitter = "5s"
  precision = ""
  debug = true
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://xxxx"] # required
  database = "xxxx" # required

  ## Retention policy to write to. Empty string writes to the default rp.
  retention_policy = "rp_30s"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  username = "xxxx"
  password = "xxxx"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512

  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false

# Keep the aggregate min/max of each metric passing through.
 [[aggregators.sum]]
   ## General Aggregator Arguments:
   ## The period on which to flush & clear the aggregator.
   period = "30s"
   ## If true, the original metric will be dropped by the
   ## aggregator and will not get sent to the output plugins.
   drop_original = true

# Influx HTTP write listener
 [[inputs.http_listener]]
   ## Address and port to host HTTP listener on
   service_address = ":8186"

   ## maximum duration before timing out read of the request
   read_timeout = "10s"
   ## maximum duration before timing out write of the response
   write_timeout = "10s"

   ## Maximum allowed http request body size in bytes.
   ## 0 means to use the default of 536,870,912 bytes (500 mebibytes)
   max_body_size = 0

   ## Maximum line size allowed to be sent in bytes.
   ## 0 means to use the default of 65536 bytes (64 kibibytes)
   max_line_size = 0

Additional info:

Goroutine dump produced by sending SIGQUIT (Ctrl+\) to the process:

goroutine 2943073 [chan send]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xca36439470, 0xe, 0xca3599be60, 0xca3599be90, 0xca36455fe0, 0x1, 0x1)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xca3646a000, 0x545, 0x10000, 0xed0d18287, 0x327c4dcf, 0x1bde240, 0x0, 0x0, 0x0, ...)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xca353f7a40, 0xca359b3200)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:268 +0x793
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xca353f7a40, 0xca359b3200)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xca353f7a40, 0xca359b3200)
	/usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xca36436e60, 0x1af1660, 0xca35ced580)
	/usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
	/usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 2943074 [IO wait]:
net.runtime_pollWait(0x7fb31d0f1400, 0x72, 0x23aee)
	/usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xca35d38d88, 0x72, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xca35d38d88, 0xca35ced5d1, 0x1)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xca35d38d20, 0xca35ced5d1, 0x1, 0x1, 0x0, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xca0b51c8a8, 0xca35ced5d1, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*connReader).backgroundRead(0xca35ced5c0)
	/usr/local/opt/go/libexec/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/opt/go/libexec/src/net/http/server.go:652 +0xdf

goroutine 2943054 [chan send]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xca363d82c0, 0xe, 0xca35fc1770, 0xca35fc17a0, 0xca363c4e60, 0x1, 0x1)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xca36488000, 0x13df, 0x10000, 0xed0d18287, 0x364b6ebc, 0x1bde240, 0x0, 0x0, 0x0, ...)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xca35317340, 0xca3516fb00)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:268 +0x793
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xca35317340, 0xca3516fb00)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xca35317340, 0xca3516fb00)
	/usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xca3626aaa0, 0x1af1660, 0xca362eed40)
	/usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
	/usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 2943055 [IO wait]:
net.runtime_pollWait(0x7fb31d0f1340, 0x72, 0x23aef)
	/usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xca3635e4c8, 0x72, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xca3635e4c8, 0xca362eed91, 0x1)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xca3635e460, 0xca362eed91, 0x1, 0x1, 0x0, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xca0b4a67c0, 0xca362eed91, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*connReader).backgroundRead(0xca362eed80)
	/usr/local/opt/go/libexec/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/opt/go/libexec/src/net/http/server.go:652 +0xdf

rax    0xca
rbx    0x1
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x1bdf6b0
rsi    0x0
rbp    0x7ffef7e5a8c0
rsp    0x7ffef7e5a878
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0xc420b204e0
r14    0x43fee0
r15    0x160c980
rip    0x45b531
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

danielnelson · 2017-06-13T18:43:18Z

Nothing stands out to me from the stack trace, anything interesting in the logfile?

R7R8 · 2017-06-14T01:02:45Z

Thank you for your help.
The beginning of stderr.log is as follows.

SIGQUIT: quit
PC=0x45b531 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x1bdf6b0, 0x0, 0x0, 0x0, 0xc400000000, 0x100000000, 0x0, 0x0, 0x7ffef7e5a8f0, 0x40e53b, ...)
        /usr/local/opt/go/libexec/src/runtime/sys_linux_amd64.s:422 +0x21
runtime.futexsleep(0x1bdf6b0, 0x0, 0xffffffffffffffff)
        /usr/local/opt/go/libexec/src/runtime/os_linux.go:45 +0x62
runtime.notesleep(0x1bdf6b0)
        /usr/local/opt/go/libexec/src/runtime/lock_futex.go:145 +0x6b
runtime.stopm()
        /usr/local/opt/go/libexec/src/runtime/proc.go:1650 +0xad
runtime.findrunnable(0xc420020c00, 0x0)
        /usr/local/opt/go/libexec/src/runtime/proc.go:2102 +0x2e4
runtime.schedule()
        /usr/local/opt/go/libexec/src/runtime/proc.go:2222 +0x14c
runtime.park_m(0xc420001ba0)
        /usr/local/opt/go/libexec/src/runtime/proc.go:2285 +0xab
runtime.mcall(0x7ffef7e5aa80)
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:269 +0x5b


goroutine 1 [semacquire, 7235 minutes]:
sync.runtime_Semacquire(0xc4201f033c)
        /usr/local/opt/go/libexec/src/runtime/sema.go:47 +0x34
sync.(*WaitGroup).Wait(0xc4201f0330)
        /usr/local/opt/go/libexec/src/sync/waitgroup.go:131 +0x7a
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc4202be008, 0xc42008c360, 0x0, 0x0)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/agent.go:400 +0x4b3
main.reloadLoop(0xc42008c900, 0x1c013b0, 0x0, 0x0, 0x1c013b0, 0x0, 0x0, 0x1c013b0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:222 +0xa06
main.main()
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:392 +0x65b

goroutine 6 [syscall, 7235 minutes]:
os/signal.signal_recv(0x0)
        /usr/local/opt/go/libexec/src/runtime/sigqueue.go:116 +0x104
os/signal.loop()
        /usr/local/opt/go/libexec/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1
        /usr/local/opt/go/libexec/src/os/signal/signal_unix.go:28 +0x41

goroutine 2717751 [IO wait, 291 minutes]:
net.runtime_pollWait(0x7fb334a1e380, 0x72, 0x79bb)
        /usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc562b55028, 0x72, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc562b55028, 0xc55ba55011, 0x1)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc562b54fc0, 0xc55ba55011, 0x1, 0x1, 0x0, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc543dd8350, 0xc55ba55011, 0x1, 0x1, 0x0, 0x0, 0x0)
        /usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*connReader).backgroundRead(0xc55ba55000)
        /usr/local/opt/go/libexec/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
        /usr/local/opt/go/libexec/src/net/http/server.go:652 +0xdf

goroutine 2757400 [chan send, 379 minutes]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xc420b14950, 0xe, 0xc4207fe570, 0xc4207fe5a0, 0xc42075abc0, 0x1, 0x1)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xc4236f6000, 0x66b0, 0x10000, 0xed0d12980, 0x3266ef58, 0x1bde240, 0x0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xc4233708c0, 0xc423106b00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:268 +0x793
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xc4233708c0, 0xc423106b00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xc4233708c0, 0xc423106b00)
        /usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc4235a6e60, 0x1af1660, 0xc4233758c0)
        /usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
        /usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 2761034 [chan send, 371 minutes]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xc4390bea20, 0x10, 0xc43582b770, 0xc43582b7a0, 0xc439ac9b00, 0x1, 0x1)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xc43e856000, 0xff09, 0x10000, 0xed0d12b63, 0x7ad47b, 0x1bde240, 0x0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xc43dd2fa40, 0xc43dbb9600)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:293 +0x48b
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xc43dd2fa40, 0xc43dbb9600)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xc43dd2fa40, 0xc43dbb9600)
        /usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc43e7e92c0, 0x1af1660, 0xc43e4d26c0)
        /usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
        /usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 96 [IO wait, 380 minutes]:
net.runtime_pollWait(0x7fb33b34c8c8, 0x72, 0x4)
        /usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4201ee378, 0x72, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4201ee378, 0xc420360000, 0x1000)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc4201ee310, 0xc420360000, 0x1000, 0x1000, 0x0, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc42017a078, 0xc420360000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*persistConn).Read(0xc4203d9c20, 0xc420360000, 0x1000, 0x1000, 0x4, 0xc4202c8260, 0x16)
        /usr/local/opt/go/libexec/src/net/http/transport.go:1316 +0x14b
bufio.(*Reader).fill(0xc4203b8960)
        /usr/local/opt/go/libexec/src/bufio/bufio.go:97 +0x117
bufio.(*Reader).Peek(0xc4203b8960, 0x1, 0x0, 0x1, 0x0, 0xc420b276e0, 0x0)
        /usr/local/opt/go/libexec/src/bufio/bufio.go:129 +0x67
net/http.(*persistConn).readLoop(0xc4203d9c20)
        /usr/local/opt/go/libexec/src/net/http/transport.go:1474 +0x196
created by net/http.(*Transport).dialConn
        /usr/local/opt/go/libexec/src/net/http/transport.go:1117 +0xa35

goroutine 97 [select, 380 minutes]:
net/http.(*persistConn).writeLoop(0xc4203d9c20)
        /usr/local/opt/go/libexec/src/net/http/transport.go:1704 +0x43a
created by net/http.(*Transport).dialConn
        /usr/local/opt/go/libexec/src/net/http/transport.go:1118 +0xa5a

goroutine 2756464 [chan send, 371 minutes]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xc434f16b90, 0xe, 0xc43745d3b0, 0xc43745d500, 0xc4329adfe0, 0x1, 0x1)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xc43e542000, 0xffc5, 0x10000, 0xed0d12b60, 0x277fd991, 0x1bde240, 0x0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xc43d57fb20, 0xc43dbb8d00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:293 +0x48b
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xc43d57fb20, 0xc43dbb8d00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xc43d57fb20, 0xc43dbb8d00)
        /usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc439e77040, 0x1af1660, 0xc439607100)
        /usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
        /usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 25 [select, 7235 minutes, locked to thread]:
runtime.gopark(0x13ec2d0, 0x0, 0x13719c1, 0x6, 0x18, 0x2)
        /usr/local/opt/go/libexec/src/runtime/proc.go:271 +0x13a
runtime.selectgoImpl(0xc420422f50, 0x0, 0x18)
        /usr/local/opt/go/libexec/src/runtime/select.go:423 +0x1364
runtime.selectgo(0xc420422f50)
        /usr/local/opt/go/libexec/src/runtime/select.go:238 +0x1c
runtime.ensureSigM.func1()
        /usr/local/opt/go/libexec/src/runtime/signal_unix.go:434 +0x2dd
runtime.goexit()
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:2197 +0x1

goroutine 26 [select, 7235 minutes]:
main.reloadLoop.func1(0xc42008c3c0, 0xc42008c360, 0xc4201fed90, 0xc42008c900)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:183 +0x24a
created by main.reloadLoop
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:197 +0x6cb

goroutine 27 [IO wait]:
net.runtime_pollWait(0x7fb33b34c988, 0x72, 0x0)
        /usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc420200298, 0x72, 0x0, 0xca363c4800)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc420200298, 0xffffffffffffffff, 0x0)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc420200230, 0x0, 0x1ae6320, 0xca363c4800)
        /usr/local/opt/go/libexec/src/net/fd_unix.go:430 +0x1e5
net.(*TCPListener).accept(0xc4202be020, 0xc420427de0, 0x685e3e, 0x456790)
        /usr/local/opt/go/libexec/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc4202be020, 0x13ebc68, 0xca3626aaa0, 0x1af1720, 0xc42015af60)
        /usr/local/opt/go/libexec/src/net/tcpsock.go:228 +0x49
net/http.(*Server).Serve(0xc42048ad10, 0x1af03e0, 0xc4202be020, 0x0, 0x0)
        /usr/local/opt/go/libexec/src/net/http/server.go:2643 +0x228
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).httpListen(0xc4201bc000, 0x13ec400, 0xc4201bc040)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:170 +0x10b
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).Start.func1(0xc4201bc000)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:133 +0x57
created by github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).Start
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:134 +0x7ce

R7R8 · 2017-06-14T01:05:59Z

Here is the whole stack trace:
stderr.log.zip

danielnelson · 2017-06-14T01:22:03Z

Wow, that is a lot of http_listener goroutines, I'll have to think about how we should deal with that...

Can you share the code for this?
/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/aggregators/sum/sum.go

R7R8 · 2017-06-14T01:31:05Z

Thank you for your reply.
Yes, I send metrics to Telegraf every 5 seconds via HTTP POST.
Is there a better way to send metrics to Telegraf?

Here is the sum code:
sum.zip

danielnelson · 2017-06-14T20:00:39Z

I've found the cause of this: the running_aggregator can be blocked during push, which in turn blocks the add function. That prevents items from being added to the output and stalls the entire process.

R7R8 · 2017-06-15T00:47:36Z

Thank you for your reply.
But why would the running_aggregator be blocked during push?
As the code shows, the running_aggregator pushes once every periodT (30s in this config).

I think that when you say the running_aggregator could be blocked during push, you mean that metricC is full (the running_aggregator pushes the aggregated metrics to metricC).
But why would metricC be full?

Or could the running_aggregator be blocked during push because the periodT ticker failed?

danielnelson · 2017-06-15T01:13:29Z

Yes, I believe the aggregator is blocked by metricC being full. I think that if metricC and the internal metrics channel both fill, you would be stuck. All the inputs use metricC as well, so I think this could happen under load.

R7R8 · 2017-06-16T06:03:43Z

Will you fix this bug in the future?
If so, how long will the fix take?
Sorry for asking these questions.

For now I plan to use socket_listener instead of http_listener.

danielnelson · 2017-06-16T18:08:39Z

Yes, I'm going to work on this in the next week. It will probably go out in the 1.4 release next month, but you could easily backport it since you already have modifications.

R7R8 · 2017-06-23T07:32:25Z

Hi, danielnelson.
4024e6b in fix-aggregator-deadlock (Use separate goroutines for push and add) seems to introduce a new problem.

The add goroutine and the push goroutine may manipulate the aggregator's map concurrently, causing fatal error: concurrent map iteration and map write.

danielnelson · 2017-06-23T18:38:09Z

Yeah, that patch is fundamentally flawed. The real fix is going to require removing the loop in metric processing, where we push metrics from the aggregator back into the processors. Unfortunately, this is too large a change for 1.3.3, so I'm going to have to push it to 1.4.

R7R8 · 2017-06-27T08:53:53Z

Hi, danielnelson.
Inspired by your code, I have fixed the problem temporarily in the following way. It seems to work.

I store the aggregator's cache in a temporary variable, then start a new goroutine to push it.

running_aggregator.go

// Run runs the running aggregator, listens for incoming metrics, and waits
// for period ticks to tell it when to push and reset the aggregator.
func (r *RunningAggregator) Run(
	acc telegraf.Accumulator,
	shutdown chan struct{},
) {
	// The start of the period is truncated to the nearest second.
	//
	// Every metric then gets its timestamp checked and is dropped if it
	// is not within:
	//
	//   start < t < end + truncation + delay
	//
	// So if we start at now = 00:00.2 with a 10s period and 0.3s delay:
	//   now = 00:00.2
	//   start = 00:00
	//   truncation = 00:00.2
	//   end = 00:10
	// 1st interval: 00:00 - 00:10.5
	// 2nd interval: 00:10 - 00:20.5
	// etc.
	//
	now := time.Now()
	r.periodStart = now.Truncate(time.Second)
	truncation := now.Sub(r.periodStart)
	r.periodEnd = r.periodStart.Add(r.Config.Period)
	time.Sleep(r.Config.Delay)
	periodT := time.NewTicker(r.Config.Period)
	defer periodT.Stop()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-shutdown:
				if len(r.metrics) > 0 {
					// wait until metrics are flushed before exiting
					continue
				}
				return
			case m := <-r.metrics:
				if IsMetricExpired(m, r.periodStart, r.periodEnd.Add(truncation).Add(r.Config.Delay)) {
					// the metric is outside the current aggregation period, so
					// skip it.
					fmt.Printf("%s, %s, %s", m.Time(), r.periodStart, r.periodEnd.Add(truncation).Add(r.Config.Delay))
					continue
				}
				r.add(m)
			case <-periodT.C:
				r.periodStart = r.periodEnd
				r.periodEnd = r.periodStart.Add(r.Config.Period)
				r.reset()
				go func() {
					r.push(acc)
				}()
			}
		}
	}()

	wg.Wait()
}

sum.go

var tempCache map[uint64]aggregate

func (m *Sum) Push(acc telegraf.Accumulator) {
	for _, aggregate := range tempCache {
		fields := map[string]interface{}{}
		for k, v := range aggregate.fields {
			fields[k] = float64(v)
		}
		acc.AddFields(aggregate.name, fields, aggregate.tags)
	}
}

func (m *Sum) Reset() {
	tempCache = m.cache
	m.cache = make(map[uint64]aggregate)
}

danielnelson · 2017-06-27T20:30:07Z

Seems like it should work to me.

danielnelson · 2017-07-13T22:41:45Z

@R7R8 I merged #3016 into master, which should allow you to use your aggregator without modification. Even though I know you have a working solution, it would be great if you could try it out.

danielnelson added the need more info label Jun 13, 2017

danielnelson added bug unexpected problem or unintended behavior and removed need more info labels Jun 14, 2017

danielnelson mentioned this issue Jun 14, 2017

Http_listener input can leak sockets #2923

Closed

danielnelson mentioned this issue Jun 16, 2017

Deadlock in statsd input plugin #2927

Closed

danielnelson self-assigned this Jun 16, 2017

danielnelson added this to the 1.3.3 milestone Jun 20, 2017

danielnelson modified the milestones: 1.4.0, 1.3.3 Jun 23, 2017

danielnelson changed the title ~~telegraf stop sending metrics after 3 days~~ Possible deadlock when using an aggregator Jul 6, 2017

danielnelson mentioned this issue Jul 6, 2017

Cgroup plugin support for aggregate metrics across multiple child cgroups #2945

Closed

danielnelson mentioned this issue Jul 13, 2017

Prevent possible deadlock when using aggregators #3016

Merged

3 tasks

danielnelson closed this as completed in #3016 Jul 13, 2017

danielnelson mentioned this issue Jan 3, 2018

Telegraf stops publishing metrics to InfluxDB; All plugins take too long to collect #3629

Closed
