Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat stops sending logs when load balancing is enabled #1829

Closed
tokle opened this issue Jun 10, 2016 · 8 comments
Closed

Filebeat stops sending logs when load balancing is enabled #1829

tokle opened this issue Jun 10, 2016 · 8 comments

Comments

@tokle
Copy link

tokle commented Jun 10, 2016

I have a basic elastic setup where logs are harvested and published to logstash/elasticsearch using filebeat.
On several occasions I have observed that no logs are sent even though everything looks ok on elasticsearch and logstash ends.
The log pipeline will only return to a good state by restarting the filebeat process (restarting logstash or elasticsearch has no effect).
We have seen this issue in correlation with a network issue on or reboot of the elasticsearch nodes. (logstash logs "Beats input: the pipeline is blocked, temporary refusing new connection.")

After some experimenting I have been able to reproduce the issue:

  • Configure filebeat with config below.
    • 3 important settings that seems to trigger this (I have not been able to reproduce without them)
      • publish_async: true
      • loadbalance: true (logstash output)
      • 2 or more logstash hosts
  • Filebeat is connected and publishing logs successfully to logstash
  • Elasticsearch goes down (by stopping the service)
  • Logstash starts logging "Beats input: the pipeline is blocked, temporary refusing new connection"
  • After some time elasticsearch is available again. In my test I set the downtime to 2 minutes.
  • Logstash stops logging that the pipeline is blocked, but filebeat is not publishing any logs.

This seems to be related to or even the same issue as #878

I have tested this with filebeat versions 1.2.3 and 5.0.0-alpha3. Below outputs is with filebeat 1.2.3.

The filebeat config:

filebeat:
  prospectors:
    - paths:
        - /var/log/nginx/access.log
      exclude_files: [".gz$"]
      document_type: nginxproxyaccess
      fields_under_root: true
      close_older: 20m
      force_close_files: true
  publish_async: true
  registry_file: /var/lib/filebeat/registry
output:
  logstash:
    hosts: ["logstash1:5044","logstash2:5044"]
    timeout: 15
    compression_level: 3
    loadbalance: true

Logstash logs:

{:timestamp=>"2016-06-10T06:07:05.362000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.364000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.364000+0000", :message=>"CircuitBreaker::Open", :name=>"Beats input", :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.364000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:05.864000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:06.365000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-06-10T06:07:06.865000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}

Filebeat logs:

2016-06-10T06:08:25Z DBG  Start next scan
2016-06-10T06:08:25Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:25Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:25Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:25Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:25Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log
2016-06-10T06:08:26Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:26Z DBG  Try to publish 2048 events to logstash with window size 8
2016-06-10T06:08:30Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:30Z DBG  Try to publish 2048 events to logstash with window size 8
2016-06-10T06:08:35Z DBG  Start next scan
2016-06-10T06:08:35Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:35Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:35Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:35Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:35Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log
2016-06-10T06:08:41Z DBG  close connection
2016-06-10T06:08:41Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:41Z INFO Error publishing events (retrying): read tcp fbeat:56583->logstash1:5044: i/o timeout
2016-06-10T06:08:45Z DBG  close connection
2016-06-10T06:08:45Z DBG  0 events out of 2048 events sent to logstash. Continue sending ...
2016-06-10T06:08:45Z INFO Error publishing events (retrying): read tcp fbeat:34991->logstash2:5044: i/o timeout
2016-06-10T06:08:45Z DBG  Start next scan
2016-06-10T06:08:45Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:45Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:45Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:45Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:45Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Start next scan
2016-06-10T06:08:55Z DBG  scan path /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Check file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Same file as before found. Fetch the state.
2016-06-10T06:08:55Z DBG  Update existing file for harvesting: /var/log/nginx/access.log
2016-06-10T06:08:55Z DBG  Not harvesting, file didn't change: /var/log/nginx/access.log
@mpolden
Copy link

mpolden commented Jun 10, 2016

After some debugging, we have discovered what seems to be a deadlock in load balancer mode. This occurs when Logstash is rejecting connections due to some error condition, e.g. Elasticsearch being down, as described in the first commant

At balance.go#L167 each worker reads retry attempts from m.retries and at balance.go#L261 forwardEvent puts a message on ch (m.retries).

When m.retries reaches it's maximum capacity (4 in our test case, 2 * number of clients). forwardEvent will block because ch is full. cap(ch) = 4, len(ch) = 4 at balance.go#L261. All nodes needs to be down until m.retriesis filled for this to occur.

At this point everything is blocked as each worker is blocking at balance.go#L170, waiting for onMessage to complete (which it never does because forwardEvent is blocking).

I noticed that the first select in forwardEvent does not have a timeout case balance.go#L260 vs. balance.go#L260. Adding a timeout to the first select fixes the deadlock, but I'm not 100% sure about the implications of doing so.

@ruflin
Copy link
Contributor

ruflin commented Jun 10, 2016

I didn't check it in detail yet but that sounds like related to one of these?

@mpolden
Copy link

mpolden commented Jun 10, 2016

@ruflin Not sure, I can try reproducing with master.

@mpolden
Copy link

mpolden commented Jun 10, 2016

@ruflin We can reproduce the same issue on master (c8d2ccb).

@ruflin
Copy link
Contributor

ruflin commented Jun 10, 2016

@martinp Thanks a lot for testing. @urso Can you have a look here?

@urso
Copy link

urso commented Jun 10, 2016

First select has no timeout on purpose. As events/messages are not allowed to be dropped. It's used if Guaranteed flag or max_retries<0 is set.

One potential problem I see with code (receive method) is beats not enforcing any order of queue reads. The select statement chooses the channel to try/read from first by random. Better solution for load-balancer would be to not read any events from work-channel as long as there are any messages in retry-channel. The retry channel is sized to be 2*number of workers. This should help breaking the deadlock.

I will try to reproduce the issue with an unit test.

Thanks for very detailed debugging information. Helps a ton identifying these kind of issues.

Current workaround in filebeat is: disable publish_async and set filebeat.spool_size to workers * logstash endpoints * bulk_max_size. This config will get you at least some load-balancing with all logstash instance operating in lock-step.

@mpolden
Copy link

mpolden commented Jun 10, 2016

@urso Thanks, we'll try the workaround and see how it goes.

@tokle
Copy link
Author

tokle commented Jun 10, 2016

Thanks for the suggested workaround @urso. I've tried it in our environment and it looks acceptable both in terms of throughput and the fact that reconnect seems to work properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants