Possible deadlock when using an aggregator #2914

Closed
R7R8 opened this issue Jun 13, 2017 · 15 comments · Fixed by #3016
Labels: bug (unexpected problem or unintended behavior)

@R7R8

R7R8 commented Jun 13, 2017

Bug report

Version: master, release 1.3
Telegraf stops sending metrics after 3 days.

Relevant telegraf.conf:

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "11s"
  flush_jitter = "5s"
  precision = ""
  debug = true
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://xxxx"] # required
  database = "xxxx" # required

  ## Retention policy to write to. Empty string writes to the default rp.
  retention_policy = "rp_30s"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  username = "xxxx"
  password = "xxxx"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512

  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false

# Keep the aggregate sum of each metric passing through.
 [[aggregators.sum]]
   ## General Aggregator Arguments:
   ## The period on which to flush & clear the aggregator.
   period = "30s"
   ## If true, the original metric will be dropped by the
   ## aggregator and will not get sent to the output plugins.
   drop_original = true

# Influx HTTP write listener
 [[inputs.http_listener]]
   ## Address and port to host HTTP listener on
   service_address = ":8186"

   ## maximum duration before timing out read of the request
   read_timeout = "10s"
   ## maximum duration before timing out write of the response
   write_timeout = "10s"

   ## Maximum allowed http request body size in bytes.
   ## 0 means to use the default of 536,870,912 bytes (512 mebibytes)
   max_body_size = 0

   ## Maximum line size allowed to be sent in bytes.
   ## 0 means to use the default of 65536 bytes (64 kibibytes)
   max_line_size = 0
    

Additional info:

I sent a SIGQUIT (^\) to the process. Part of the resulting goroutine dump:

goroutine 2943073 [chan send]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xca36439470, 0xe, 0xca3599be60, 0xca3599be90, 0xca36455fe0, 0x1, 0x1)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xca3646a000, 0x545, 0x10000, 0xed0d18287, 0x327c4dcf, 0x1bde240, 0x0, 0x0, 0x0, ...)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xca353f7a40, 0xca359b3200)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:268 +0x793
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xca353f7a40, 0xca359b3200)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xca353f7a40, 0xca359b3200)
	/usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xca36436e60, 0x1af1660, 0xca35ced580)
	/usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
	/usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 2943074 [IO wait]:
net.runtime_pollWait(0x7fb31d0f1400, 0x72, 0x23aee)
	/usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xca35d38d88, 0x72, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xca35d38d88, 0xca35ced5d1, 0x1)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xca35d38d20, 0xca35ced5d1, 0x1, 0x1, 0x0, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xca0b51c8a8, 0xca35ced5d1, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*connReader).backgroundRead(0xca35ced5c0)
	/usr/local/opt/go/libexec/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/opt/go/libexec/src/net/http/server.go:652 +0xdf

goroutine 2943054 [chan send]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xca363d82c0, 0xe, 0xca35fc1770, 0xca35fc17a0, 0xca363c4e60, 0x1, 0x1)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xca36488000, 0x13df, 0x10000, 0xed0d18287, 0x364b6ebc, 0x1bde240, 0x0, 0x0, 0x0, ...)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xca35317340, 0xca3516fb00)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:268 +0x793
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xca35317340, 0xca3516fb00)
	/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xca35317340, 0xca3516fb00)
	/usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xca3626aaa0, 0x1af1660, 0xca362eed40)
	/usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
	/usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 2943055 [IO wait]:
net.runtime_pollWait(0x7fb31d0f1340, 0x72, 0x23aef)
	/usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xca3635e4c8, 0x72, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xca3635e4c8, 0xca362eed91, 0x1)
	/usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xca3635e460, 0xca362eed91, 0x1, 0x1, 0x0, 0x1ae8f60, 0x1ae05e0)
	/usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xca0b4a67c0, 0xca362eed91, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*connReader).backgroundRead(0xca362eed80)
	/usr/local/opt/go/libexec/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/opt/go/libexec/src/net/http/server.go:652 +0xdf

rax    0xca
rbx    0x1
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x1bdf6b0
rsi    0x0
rbp    0x7ffef7e5a8c0
rsp    0x7ffef7e5a878
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0xc420b204e0
r14    0x43fee0
r15    0x160c980
rip    0x45b531
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
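
A dump like the one above can also be produced from inside a Go program, without sending a signal. The following is a minimal sketch using the standard `runtime/pprof` package. The helper name `dumpGoroutines` is hypothetical and is not part of Telegraf; only `pprof.Lookup` and `Profile.WriteTo` are real APIs. Debug level 2 prints the stacks in the same form the runtime uses for an unrecovered panic.

```go
package main

import (
	"bytes"
	"os"
	"runtime/pprof"
)

// dumpGoroutines (hypothetical helper) returns the stacks of all current
// goroutines as text, using the built-in "goroutine" profile.
func dumpGoroutines() string {
	var buf bytes.Buffer
	// Lookup returns nil only for unknown profile names; "goroutine" is built in.
	if p := pprof.Lookup("goroutine"); p != nil {
		// debug=2 requests the human-readable, panic-style stack format.
		_ = p.WriteTo(&buf, 2)
	}
	return buf.String()
}

func main() {
	os.Stderr.WriteString(dumpGoroutines())
}
```

In a dump like this, many goroutines parked in `[chan send]` for hundreds of minutes suggest a channel whose consumer has stopped reading.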

@danielnelson
Contributor

Nothing stands out to me from the stack trace. Is there anything interesting in the logfile?

@R7R8
Author

R7R8 commented Jun 14, 2017

Thank you for your help.
The beginning of stderr.log is as follows.

SIGQUIT: quit
PC=0x45b531 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x1bdf6b0, 0x0, 0x0, 0x0, 0xc400000000, 0x100000000, 0x0, 0x0, 0x7ffef7e5a8f0, 0x40e53b, ...)
        /usr/local/opt/go/libexec/src/runtime/sys_linux_amd64.s:422 +0x21
runtime.futexsleep(0x1bdf6b0, 0x0, 0xffffffffffffffff)
        /usr/local/opt/go/libexec/src/runtime/os_linux.go:45 +0x62
runtime.notesleep(0x1bdf6b0)
        /usr/local/opt/go/libexec/src/runtime/lock_futex.go:145 +0x6b
runtime.stopm()
        /usr/local/opt/go/libexec/src/runtime/proc.go:1650 +0xad
runtime.findrunnable(0xc420020c00, 0x0)
        /usr/local/opt/go/libexec/src/runtime/proc.go:2102 +0x2e4
runtime.schedule()
        /usr/local/opt/go/libexec/src/runtime/proc.go:2222 +0x14c
runtime.park_m(0xc420001ba0)
        /usr/local/opt/go/libexec/src/runtime/proc.go:2285 +0xab
runtime.mcall(0x7ffef7e5aa80)
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:269 +0x5b


goroutine 1 [semacquire, 7235 minutes]:
sync.runtime_Semacquire(0xc4201f033c)
        /usr/local/opt/go/libexec/src/runtime/sema.go:47 +0x34
sync.(*WaitGroup).Wait(0xc4201f0330)
        /usr/local/opt/go/libexec/src/sync/waitgroup.go:131 +0x7a
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc4202be008, 0xc42008c360, 0x0, 0x0)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/agent.go:400 +0x4b3
main.reloadLoop(0xc42008c900, 0x1c013b0, 0x0, 0x0, 0x1c013b0, 0x0, 0x0, 0x1c013b0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:222 +0xa06
main.main()
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:392 +0x65b

goroutine 6 [syscall, 7235 minutes]:
os/signal.signal_recv(0x0)
        /usr/local/opt/go/libexec/src/runtime/sigqueue.go:116 +0x104
os/signal.loop()
        /usr/local/opt/go/libexec/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1
        /usr/local/opt/go/libexec/src/os/signal/signal_unix.go:28 +0x41

goroutine 2717751 [IO wait, 291 minutes]:
net.runtime_pollWait(0x7fb334a1e380, 0x72, 0x79bb)
        /usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc562b55028, 0x72, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc562b55028, 0xc55ba55011, 0x1)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc562b54fc0, 0xc55ba55011, 0x1, 0x1, 0x0, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc543dd8350, 0xc55ba55011, 0x1, 0x1, 0x0, 0x0, 0x0)
        /usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*connReader).backgroundRead(0xc55ba55000)
        /usr/local/opt/go/libexec/src/net/http/server.go:656 +0x58
created by net/http.(*connReader).startBackgroundRead
        /usr/local/opt/go/libexec/src/net/http/server.go:652 +0xdf

goroutine 2757400 [chan send, 379 minutes]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xc420b14950, 0xe, 0xc4207fe570, 0xc4207fe5a0, 0xc42075abc0, 0x1, 0x1)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xc4236f6000, 0x66b0, 0x10000, 0xed0d12980, 0x3266ef58, 0x1bde240, 0x0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xc4233708c0, 0xc423106b00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:268 +0x793
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xc4233708c0, 0xc423106b00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xc4233708c0, 0xc423106b00)
        /usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc4235a6e60, 0x1af1660, 0xc4233758c0)
        /usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
        /usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 2761034 [chan send, 371 minutes]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xc4390bea20, 0x10, 0xc43582b770, 0xc43582b7a0, 0xc439ac9b00, 0x1, 0x1)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xc43e856000, 0xff09, 0x10000, 0xed0d12b63, 0x7ad47b, 0x1bde240, 0x0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xc43dd2fa40, 0xc43dbb9600)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:293 +0x48b
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xc43dd2fa40, 0xc43dbb9600)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xc43dd2fa40, 0xc43dbb9600)
        /usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc43e7e92c0, 0x1af1660, 0xc43e4d26c0)
        /usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
        /usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 96 [IO wait, 380 minutes]:
net.runtime_pollWait(0x7fb33b34c8c8, 0x72, 0x4)
        /usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4201ee378, 0x72, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4201ee378, 0xc420360000, 0x1000)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc4201ee310, 0xc420360000, 0x1000, 0x1000, 0x0, 0x1ae8f60, 0x1ae05e0)
        /usr/local/opt/go/libexec/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc42017a078, 0xc420360000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/opt/go/libexec/src/net/net.go:181 +0x70
net/http.(*persistConn).Read(0xc4203d9c20, 0xc420360000, 0x1000, 0x1000, 0x4, 0xc4202c8260, 0x16)
        /usr/local/opt/go/libexec/src/net/http/transport.go:1316 +0x14b
bufio.(*Reader).fill(0xc4203b8960)
        /usr/local/opt/go/libexec/src/bufio/bufio.go:97 +0x117
bufio.(*Reader).Peek(0xc4203b8960, 0x1, 0x0, 0x1, 0x0, 0xc420b276e0, 0x0)
        /usr/local/opt/go/libexec/src/bufio/bufio.go:129 +0x67
net/http.(*persistConn).readLoop(0xc4203d9c20)
        /usr/local/opt/go/libexec/src/net/http/transport.go:1474 +0x196
created by net/http.(*Transport).dialConn
        /usr/local/opt/go/libexec/src/net/http/transport.go:1117 +0xa35

goroutine 97 [select, 380 minutes]:
net/http.(*persistConn).writeLoop(0xc4203d9c20)
        /usr/local/opt/go/libexec/src/net/http/transport.go:1704 +0x43a
created by net/http.(*Transport).dialConn
        /usr/local/opt/go/libexec/src/net/http/transport.go:1118 +0xa5a

goroutine 2756464 [chan send, 371 minutes]:
github.com/influxdata/telegraf/agent.(*accumulator).AddFields(0xc4203c0760, 0xc434f16b90, 0xe, 0xc43745d3b0, 0xc43745d500, 0xc4329adfe0, 0x1, 0x1)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/agent/accumulator.go:53 +0x12e
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).parse(0xc4201bc000, 0xc43e542000, 0xffc5, 0x10000, 0xed0d12b60, 0x277fd991, 0x1bde240, 0x0, 0x0, 0x0, ...)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:310 +0x270
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).serveWrite(0xc4201bc000, 0x1af05e0, 0xc43d57fb20, 0xc43dbb8d00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:293 +0x48b
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).ServeHTTP(0xc4201bc000, 0x1af05e0, 0xc43d57fb20, 0xc43dbb8d00)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:180 +0x268
net/http.serverHandler.ServeHTTP(0xc42048ad10, 0x1af05e0, 0xc43d57fb20, 0xc43dbb8d00)
        /usr/local/opt/go/libexec/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc439e77040, 0x1af1660, 0xc439607100)
        /usr/local/opt/go/libexec/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
        /usr/local/opt/go/libexec/src/net/http/server.go:2668 +0x2ce

goroutine 25 [select, 7235 minutes, locked to thread]:
runtime.gopark(0x13ec2d0, 0x0, 0x13719c1, 0x6, 0x18, 0x2)
        /usr/local/opt/go/libexec/src/runtime/proc.go:271 +0x13a
runtime.selectgoImpl(0xc420422f50, 0x0, 0x18)
        /usr/local/opt/go/libexec/src/runtime/select.go:423 +0x1364
runtime.selectgo(0xc420422f50)
        /usr/local/opt/go/libexec/src/runtime/select.go:238 +0x1c
runtime.ensureSigM.func1()
        /usr/local/opt/go/libexec/src/runtime/signal_unix.go:434 +0x2dd
runtime.goexit()
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:2197 +0x1

goroutine 26 [select, 7235 minutes]:
main.reloadLoop.func1(0xc42008c3c0, 0xc42008c360, 0xc4201fed90, 0xc42008c900)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:183 +0x24a
created by main.reloadLoop
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:197 +0x6cb

goroutine 27 [IO wait]:
net.runtime_pollWait(0x7fb33b34c988, 0x72, 0x0)
        /usr/local/opt/go/libexec/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc420200298, 0x72, 0x0, 0xca363c4800)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc420200298, 0xffffffffffffffff, 0x0)
        /usr/local/opt/go/libexec/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc420200230, 0x0, 0x1ae6320, 0xca363c4800)
        /usr/local/opt/go/libexec/src/net/fd_unix.go:430 +0x1e5
net.(*TCPListener).accept(0xc4202be020, 0xc420427de0, 0x685e3e, 0x456790)
        /usr/local/opt/go/libexec/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc4202be020, 0x13ebc68, 0xca3626aaa0, 0x1af1720, 0xc42015af60)
        /usr/local/opt/go/libexec/src/net/tcpsock.go:228 +0x49
net/http.(*Server).Serve(0xc42048ad10, 0x1af03e0, 0xc4202be020, 0x0, 0x0)
        /usr/local/opt/go/libexec/src/net/http/server.go:2643 +0x228
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).httpListen(0xc4201bc000, 0x13ec400, 0xc4201bc040)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:170 +0x10b
github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).Start.func1(0xc4201bc000)
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:133 +0x57
created by github.com/influxdata/telegraf/plugins/inputs/http_listener.(*HTTPListener).Start
        /Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/inputs/http_listener/http_listener.go:134 +0x7ce

@R7R8
Author

R7R8 commented Jun 14, 2017

Here is the whole stack trace.
stderr.log.zip

@danielnelson
Contributor

Wow, that is a lot of http_listener goroutines, I'll have to think about how we should deal with that...

Can you share the code for this?
/Users/zj-db0743/go/src/github.com/influxdata/telegraf/plugins/aggregators/sum/sum.go

@R7R8
Author

R7R8 commented Jun 14, 2017

Thank you for your reply.
Yes, I send metrics to telegraf every 5 seconds via POST.
Is there a better way to send metrics to telegraf?

Here is the sum code:
sum.zip

@danielnelson danielnelson added bug unexpected problem or unintended behavior and removed need more info labels Jun 14, 2017
@danielnelson
Contributor

I've found the cause of this: the running_aggregator could be blocked during push, which would in turn block the add function. This would prevent items from being added to the output and stall the entire process.
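
The blocking pattern described here can be reproduced in isolation. The sketch below is not Telegraf code: `stage`, `producerStalls` and the channel names are hypothetical stand-ins, assuming a single goroutine that handles both "add" (receive from `in`) and "push" (send to `out`). Once the push blocks on a full `out`, nothing receives from `in` any more, so upstream senders stall as well.

```go
package main

import (
	"fmt"
	"time"
)

// stage is a hypothetical single-goroutine aggregator loop: it adds values
// received on in and, on every tick, pushes the running sum to out.
func stage(in <-chan int, out chan<- int, tick <-chan struct{}) {
	sum := 0
	for {
		select {
		case v := <-in:
			sum += v // "add"
		case <-tick:
			out <- sum // "push": blocks for as long as out is full
			sum = 0
		}
	}
}

// producerStalls reports whether a producer times out while the stage is
// stuck in its push.
func producerStalls() bool {
	in := make(chan int)
	out := make(chan int, 1)
	tick := make(chan struct{})
	out <- 0 // downstream buffer is already full and nobody drains it

	go stage(in, out, tick)
	tick <- struct{}{} // the stage now enters its push and blocks there

	select {
	case in <- 1:
		return false
	case <-time.After(100 * time.Millisecond):
		return true // the add path is starved by the blocked push
	}
}

func main() {
	fmt.Println("producer stalled:", producerStalls())
}
```

The stage goroutine is intentionally left blocked when the demo ends; in a long-running process that is exactly the state the `[chan send]` goroutines in the dump are in.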

@R7R8
Author

R7R8 commented Jun 15, 2017

Thank you for your reply.
But why would the running_aggregator be blocked during push?
As the code shows, running_aggregator pushes once every periodT tick (30s here).

I think that when you say the running_aggregator could be blocked during push, you mean that metricC is full (the running_aggregator pushes the aggregated metrics to metricC).
But why would metricC be full?

Or could the running_aggregator be blocked during push because the periodT ticker failed?

@danielnelson
Contributor

Yes, I believe the aggregator is blocked by metricC being full. I think that if metricC and the internal metrics channel both fill, you would be stuck. All the inputs use metricC as well, so I think it could happen under load.
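
To make "a full channel blocks the sender" concrete, here is a small sketch of a send that gives up after a timeout instead of waiting forever. This is only an illustration of channel behavior, not how Telegraf handles the problem, and dropping or delaying metrics this way would change semantics. The names `sendWithTimeout` and `demo` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithTimeout tries to send v on ch and gives up after d.
// It reports whether the value was delivered.
func sendWithTimeout(ch chan<- int, v int, d time.Duration) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case ch <- v:
		return true
	case <-timer.C:
		return false
	}
}

// demo sends twice into a channel with room for one value and no reader.
func demo() (first, second bool) {
	ch := make(chan int, 1)
	first = sendWithTimeout(ch, 1, 50*time.Millisecond)  // fits in the buffer
	second = sendWithTimeout(ch, 2, 50*time.Millisecond) // buffer is full
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first delivered:", first, "second delivered:", second)
}
```

A plain `ch <- v` in place of the second call would block indefinitely, which is the situation a full metricC creates for every sender.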

@R7R8
Author

R7R8 commented Jun 16, 2017

Will you fix this bug in the future?
If so, how long will the fix take?
Sorry for asking these questions.

For now I plan to use socket_listener instead of http_listener.

@danielnelson
Contributor

Yes, I'm going to work on this in the next week. It will probably go out in the 1.4 release next month, but you could easily backport it since you already have modifications.

@danielnelson danielnelson self-assigned this Jun 16, 2017
@danielnelson danielnelson added this to the 1.3.3 milestone Jun 20, 2017
@R7R8
Author

R7R8 commented Jun 23, 2017

Hi, danielnelson.
4024e6b in fix-aggregator-deadlock (Use separate goroutines for push and add) seems to introduce a new problem.

The problem is that the add goroutine and the push goroutine may manipulate the aggregator's map concurrently, causing fatal error: concurrent map iteration and map write.
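
For reference, the usual way to avoid this fatal error is to guard the map with a mutex and iterate over a copy. The sketch below is a generic illustration under that assumption, not the patch from the branch or the eventual fix; `sumCache` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// sumCache is a hypothetical stand-in for an aggregator's internal map.
type sumCache struct {
	mu     sync.Mutex
	totals map[string]float64
}

func newSumCache() *sumCache {
	return &sumCache{totals: make(map[string]float64)}
}

// add is called for every incoming metric.
func (c *sumCache) add(name string, v float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.totals[name] += v
}

// snapshot copies the totals while holding the lock, so the caller can
// iterate over the copy without racing against add.
func (c *sumCache) snapshot() map[string]float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]float64, len(c.totals))
	for k, v := range c.totals {
		out[k] = v
	}
	return out
}

// runConcurrent adds and snapshots from 100 goroutines at once and
// returns the final total.
func runConcurrent() float64 {
	c := newSumCache()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.add("requests", 1)
			_ = c.snapshot() // iteration happens concurrently with other adds
		}()
	}
	wg.Wait()
	return c.snapshot()["requests"]
}

func main() {
	fmt.Println("total:", runConcurrent())
}
```

Without the mutex, the same program can crash with the runtime error quoted above, because Go maps are not safe for concurrent writes and iteration.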

@danielnelson
Contributor

Yeah, that patch is fundamentally flawed. The real fix is going to require removing the loop in metric processing: where we push metrics from the aggregator back into the processors. Unfortunately, this is going to be too large of a change for 1.3.3, so I'm going to have to push it to 1.4.

@danielnelson danielnelson modified the milestones: 1.4.0, 1.3.3 Jun 23, 2017
@R7R8
Author

R7R8 commented Jun 27, 2017

Hi, danielnelson.
Inspired by your code, I fixed the problem temporarily in the following way. It seems to work.

I create a temporary variable to hold the aggregator's cache, then start a new goroutine to push.

running_aggregator.go

// Run runs the running aggregator, listens for incoming metrics, and waits
// for period ticks to tell it when to push and reset the aggregator.
func (r *RunningAggregator) Run(
	acc telegraf.Accumulator,
	shutdown chan struct{},
) {
	// The start of the period is truncated to the nearest second.
	//
	// Every metric then gets its timestamp checked and is dropped if it
	// is not within:
	//
	//   start < t < end + truncation + delay
	//
	// So if we start at now = 00:00.2 with a 10s period and 0.3s delay:
	//   now = 00:00.2
	//   start = 00:00
	//   truncation = 00:00.2
	//   end = 00:10
	// 1st interval: 00:00 - 00:10.5
	// 2nd interval: 00:10 - 00:20.5
	// etc.
	//
	now := time.Now()
	r.periodStart = now.Truncate(time.Second)
	truncation := now.Sub(r.periodStart)
	r.periodEnd = r.periodStart.Add(r.Config.Period)
	time.Sleep(r.Config.Delay)
	periodT := time.NewTicker(r.Config.Period)
	defer periodT.Stop()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-shutdown:
				if len(r.metrics) > 0 {
					// wait until metrics are flushed before exiting
					continue
				}
				return
			case m := <-r.metrics:
				if IsMetricExpired(m, r.periodStart, r.periodEnd.Add(truncation).Add(r.Config.Delay)) {
					// the metric is outside the current aggregation period, so
					// skip it.
					fmt.Printf("%s, %s, %s\n", m.Time(), r.periodStart, r.periodEnd.Add(truncation).Add(r.Config.Delay))
					continue
				}
				r.add(m)
			case <-periodT.C:
				r.periodStart = r.periodEnd
				r.periodEnd = r.periodStart.Add(r.Config.Period)
				r.reset()
				go func() {
					r.push(acc)
				}()
			}
		}
	}()

	wg.Wait()
}

sum.go

var tempCache map[uint64]aggregate

func (m *Sum) Push(acc telegraf.Accumulator) {
	for _, aggregate := range tempCache {
		fields := map[string]interface{}{}
		for k, v := range aggregate.fields {
			fields[k] = float64(v)
		}
		acc.AddFields(aggregate.name, fields, aggregate.tags)
	}
}

func (m *Sum) Reset() {
	tempCache = m.cache
	m.cache = make(map[uint64]aggregate)
}
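
One thing to watch in the workaround above is that `tempCache` is a package-level variable: it is shared by every instance of the aggregator, and `Reset` can reassign it while a previous push goroutine is still reading it. A possible variation, sketched below with hypothetical names (`sum`, `reset`, `push`, `demoSwap`) and simplified types, has `reset` return the old map so that each push goroutine owns its snapshot outright. This is an assumption-laden sketch, not the code merged upstream.

```go
package main

import (
	"fmt"
	"sync"
)

// sum is a simplified, hypothetical aggregator keyed by metric name.
type sum struct {
	mu    sync.Mutex
	cache map[string]float64
}

func (s *sum) add(name string, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cache[name] += v
}

// reset swaps in a fresh map and returns the old one. After the swap,
// nothing else writes to the returned map, so it can be read freely.
func (s *sum) reset() map[string]float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.cache
	s.cache = make(map[string]float64)
	return old
}

// push emits every entry of a snapshot; it is safe to run in its own goroutine.
func push(snapshot map[string]float64, emit func(name string, value float64)) {
	for name, value := range snapshot {
		emit(name, value)
	}
}

// demoSwap pushes one period in the background while the next period
// keeps receiving values.
func demoSwap() (pushed, remaining float64) {
	s := &sum{cache: make(map[string]float64)}
	s.add("cpu", 1)
	s.add("cpu", 2)

	snap := s.reset()
	done := make(chan struct{})
	go func() {
		defer close(done)
		push(snap, func(_ string, v float64) { pushed += v })
	}()

	s.add("cpu", 4) // lands in the new period, not in the snapshot
	<-done          // wait for the push before reading pushed
	return pushed, s.reset()["cpu"]
}

func main() {
	pushed, remaining := demoSwap()
	fmt.Println("pushed:", pushed, "remaining:", remaining)
}
```

If add and reset always run on the same goroutine, as in the Run loop above, the mutex is not strictly needed; it is kept here so the sketch stays safe if that assumption changes.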

@danielnelson
Contributor

Seems like it should work to me.

@danielnelson danielnelson changed the title telegraf stop sending metrics after 3 days Possible deadlock when using an aggregator Jul 6, 2017
@danielnelson
Contributor

@R7R8 I merged #3016 into master, which should allow you to use your aggregator without modification. Even though I know you have a working solution, it would be great if you could try it out.
