Filebeats not closing "old" file handles on Windows #922

Closed
sebbarg opened this issue Feb 4, 2016 · 34 comments
Labels: Filebeat


sebbarg commented Feb 4, 2016

Hi,

I'm on Windows 2012R2 / NTFS / filebeat 1.1.0.

From the log:
2016-02-04T09:12:25Z INFO Set ignore_older duration to 24h0m0s

Once a file has been recorded in the "registry", the above setting has no effect for files with modified timestamps older than 24hrs.

If I manually delete the file entry in the registry and restart filebeat, too-old files are correctly ignored (i.e. no handles to them exist).

You can verify this by running Sysinternals' handle.exe (https://technet.microsoft.com/en-us/sysinternals/handle) in both scenarios.
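
For example, a quick way to list which files filebeat still holds open (this exact invocation is only an illustration; handle.exe also accepts a PID with -p):

  handle.exe -p filebeat

Files whose modified timestamp is older than ignore_older should not show up in that output.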

Have I misunderstood the setting "ignore_older" ?

Best,
Sebastian


ruflin commented Feb 5, 2016

You understand ignore_older correctly: files older than that should no longer have an open file handle. Can you share your config? Also, if possible, it would be nice to have some debug output (enabled with the -e -d "*" flags) to see what filebeat reports for files that are older than ignore_older.
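
For example, something along these lines should produce the relevant debug output (binary name and config path are illustrative):

  filebeat.exe -c filebeat.yml -e -d "*"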


sebbarg commented Feb 5, 2016

Thanks ruflin,

Config and debug log attached.

Notice that the file request-2016-02-01.log is correctly ignored, but request-2016-02-02.log is loaded and kept open even though it is too old. This happens because it is still referenced in the registry (I manually removed the other one).

filebeat-log.txt
filebeat-config.txt

@bitfehler

Hi,

I have a similar (the same?) problem on Linux. We ship a huge amount of logs, so they are rotated frequently. For a while all seems well, but at some point Filebeat stops closing old files, causing the handles to stay around and thus preventing the disk space from being freed. I tried using ignore_older: 1m and force_close_files: true, but to no avail.

I will try to extract some meaningful debug output, but it's not easy given the amount of logs we ship (and thus the amount of debug output produced). We just read a single file and output to Logstash.

Could this possibly be related to either Filebeat or the receiving Logstash not being able to keep up with the amount of data to process?


ruflin commented Feb 5, 2016

@sebbarg Thanks a lot for these config files. I tried to create a summary here:

43: 2016-02-05T12:33:52Z: Resuming
34924: 2016-02-05T12:33:53Z: Backoff now.
127184: 2016-02-05T12:34:02Z: Not harvesting, file didn't change
194184: 2016-02-05T12:34:14Z: Not Harvesting
226044: 2016-02-05T12:34:18Z: EOF

I'm somewhat surprised by the last two lines: it seems the file is not being harvested on the second-to-last line and is then harvested a few seconds later. It somehow gives the impression that two harvesters are running on the same file. As the log file is quite big, I could also have missed something. This could potentially also explain what @bitfehler describes, namely that it happens under high load. Be aware, this is only a theory so far.

@bitfehler It would be great if you could provide some log files here.

In case one of you has the opportunity to also test the nightly that would be great. Quite a bit of refactoring happened in master to prevent potential race conditions. Nightly can be found here: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

I will investigate further in the code to see if I can find the potential issue.


ruflin commented Feb 5, 2016

@sebbarg One note about the registry file: files that were already read and now fall under ignore_older also stay in the registry file. This is intended. With the introduction of close_older we should probably discuss changing that: #718
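
For reference, a registry entry is roughly of this shape in the 1.x series (field names from memory, path and offset purely illustrative):

  {
    "C:\\logs\\request-2016-02-02.log": {
      "source": "C:\\logs\\request-2016-02-02.log",
      "offset": 12345,
      "FileStateOS": { ... }
    }
  }

The entry only records where reading stopped; it is the harvester, not the registry, that holds the file handle.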


sebbarg commented Feb 8, 2016

@ruflin running latest nightly now. I'll update when I know more.

@bitfehler

@ruflin I will try out the nightly later today. In the meantime, I managed to reproduce the issue with a reasonable amount of logs by using -d "prospector", which seems to be the most relevant selector here? I would happily add other flags, but using "*" is just too much.

Some more context: we are using Filebeat 1.1.0 to monitor a single file that is written and rotated by svlogd. Maybe this is related to their specific implementation of rotating files, but I am not aware of any super-special things they would be doing.

Here is our config file and the log.

Also, for completeness, this is the output of sudo lsof | grep deleted, showing the file handles that filebeat keeps around.

Let me know if there is anything else I can supply, I will get back after testing the nightly build. Thanks a lot!

@bitfehler

@ruflin Sorry to report that the issue still exists for me with the latest 1.2.0 nightly build from ~2 hours ago. Let me know if there is any further data I can supply to help track this down...

@bitfehler

One more thing, to save you reading it yourself: this is the relevant part from the svlogd man page that I linked to above, about how it rotates the log file (called current):

svlogd closes current, changes permission of current to 0755, renames current to @timestamp.s, and starts with a new empty current. If svlogd sees num or more old log files in the log directory, it removes the oldest one.

During peak time, current is rotated every 1-2 minutes. We keep 10 files, so the rotated files are deleted by svlogd after ~10 minutes. This seems like it should reasonably work with ignore_older set to 1 minute, right?

Note that this also happens during off-peak, when the file is only rotated every 3-5 minutes, for example. I will try to make a minimal reproduction case for this with small files, but not sure if that will work...


ruflin commented Feb 9, 2016

@bitfehler Thanks for digging deeper here. When you tried the nightly, did you use ignore_older or close_older? Master has a new config option, close_older, to resolve the double meaning of ignore_older. I would recommend setting close_older to about 1m and leaving ignore_older untouched (which defaults to infinity in master).

One thing that caught my attention reading the above is the changing of the permissions. This should not change the file identifiers, but I have to test it on my side. Another related issue could be https://github.com/elastic/filebeat/issues/271

I'm somewhat surprised by the error log lines for force_close_file, as they show nil, but they should only be printed if the error is not nil. I have to check this in more detail.

For my understanding: are no files being closed at all, or are just some kept open?

@bitfehler

@ruflin I did indeed use ignore_older, as I didn't know about the new options and just used the same config file. I will try again with close_older later today!
As for your question, from a quick check it seems that all files are affected, but I'd need to write a small script to check that.
About the issue you linked to, I am not sure it is related, but it might be. I checked the svlogd source, and the rotation works exactly as described in the man page: it's rename(), then open() a new file, so I don't immediately see how it might be reusing inodes. Will take a closer look at this as well, though.

@bitfehler

Hi,
tried the nightly build with this config:

  prospectors:
    -
      paths:
        - /etc/sv/ampelmann-logs/log/main/current
      close_older: 1m

Unfortunately, same result...


ruflin commented Feb 9, 2016

@bitfehler What is your OS?

@bitfehler

I have two different setups where this is happening:

conrad@ip-10-70-37-48:~$ uname -a
Linux ip-10-70-37-48.eu-west.s-cloud.net 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux
conrad@ip-10-70-37-48:~$ cat /etc/debian_version 
7.8

and

conrad@ip-10-33-32-38:~$ uname -a
Linux ip-10-33-32-38.m01.ams5.s-cloud.net 3.16.7-ckt20+soundcloud #1 SMP Tue Jan 19 14:32:40 UTC 2016 x86_64 GNU/Linux    
conrad@ip-10-33-32-38:~$ cat /etc/debian_version 
7.9


ruflin commented Feb 9, 2016

@bitfehler Do you get in your logs something like the following when you have close_older enabled? Did you remove the ignore_older config part?

2016/02/09 13:31:33.850804 log.go:99: DBG  Closing file: /var/log/system.log

I will try to do some more tests on debian.


ruflin commented Feb 9, 2016

Two more things:

  • Can you try what happens if you use /etc/sv/ampelmann-logs/log/main/* as the pattern instead of only the file (a sketch follows below)? If you set ignore_older low enough, it should never pick up the old files. Perhaps there is an issue with close_older if the file is never found again. But this should be handled by force_close_files :-(
  • Can you share again some log parts with the nightly build?
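
Something like this is what I have in mind for the first point (the ignore_older value is just an example):

filebeat:
  prospectors:
    -
      paths:
        - /etc/sv/ampelmann-logs/log/main/*
      ignore_older: 1m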

@bitfehler

I will try changing the pattern and then share the logs.

One thing I am not sure about: in the nightly build, should I still be using force_close_files?


ruflin commented Feb 9, 2016

If you use the pattern * it shouldn't be needed. In both cases it should work without it, but it would be interesting to see if it solves the issue.

@bitfehler

The problem still occurs with the following config:

filebeat:
  prospectors:
    -
      paths:
        - /etc/sv/ampelmann-logs/log/main/*
      ignore_older: 1m
      close_older: 1m
      force_close_files: true
output:
  logstash:
    hosts: ["ip-10-33-49-35.k02.ams5.s-cloud.net:5044", "ip-10-33-49-35.k02.ams5.s-cloud.net:5045", "ip-10-33-49-35.k02.ams5.s-cloud.net:5046", "ip-10-33-49-35.k02.ams5.s-cloud.net:5047", "ip-10-33-49-35.k02.ams5.s-cloud.net:5048", "ip-10-33-49-35.k02.ams5.s-cloud.net:5049", "ip-10-33-49-35.k02.ams5.s-cloud.net:5050", "ip-10-33-49-35.k02.ams5.s-cloud.net:5051", "ip-10-33-49-91.k02.ams5.s-cloud.net:5044", "ip-10-33-49-91.k02.ams5.s-cloud.net:5045", "ip-10-33-49-91.k02.ams5.s-cloud.net:5046", "ip-10-33-49-91.k02.ams5.s-cloud.net:5047", "ip-10-33-49-91.k02.ams5.s-cloud.net:5048", "ip-10-33-49-91.k02.ams5.s-cloud.net:5049", "ip-10-33-49-91.k02.ams5.s-cloud.net:5050", "ip-10-33-49-91.k02.ams5.s-cloud.net:5051", "ip-10-33-81-43.l02.ams5.s-cloud.net:5044", "ip-10-33-81-43.l02.ams5.s-cloud.net:5045", "ip-10-33-81-43.l02.ams5.s-cloud.net:5046", "ip-10-33-81-43.l02.ams5.s-cloud.net:5047", "ip-10-33-81-43.l02.ams5.s-cloud.net:5048", "ip-10-33-81-43.l02.ams5.s-cloud.net:5049", "ip-10-33-81-43.l02.ams5.s-cloud.net:5050", "ip-10-33-81-43.l02.ams5.s-cloud.net:5051", "ip-10-33-81-71.l02.ams5.s-cloud.net:5044", "ip-10-33-81-71.l02.ams5.s-cloud.net:5045", "ip-10-33-81-71.l02.ams5.s-cloud.net:5046", "ip-10-33-81-71.l02.ams5.s-cloud.net:5047", "ip-10-33-81-71.l02.ams5.s-cloud.net:5048", "ip-10-33-81-71.l02.ams5.s-cloud.net:5049", "ip-10-33-81-71.l02.ams5.s-cloud.net:5050", "ip-10-33-81-71.l02.ams5.s-cloud.net:5051"]
    loadbalance: true

Here is the log file. At the time the log ends there are seven open handles to deleted files. There are no lines like the one you mentioned above in #922 (comment), but if I understand correctly that is because I only ran with -d "prospector", and that line would require the harvester selector. I will add that and produce new logs, but maybe this here is helpful in the meantime.

@bitfehler

Ran again with -d "harvester,prospector", here is the result:
filebeat_nightly_harvester_prospector_log.txt
The log line you mentioned is still not present, but maybe it helps anyways...

@bitfehler

Forgot to add: this time, by the time the log ends there were two open handles.


ruflin commented Feb 9, 2016

I followed this file: 4000000056b9fdbb1af3680c.s

  • It is rotated at 2016/02/09 14:54:50
  • It is still harvested at 2016/02/09 15:00:10, even though there was no message in between indicating that the file was updated

As you have set close_older to 1 minute and more than 5 minutes have passed, it should already have been closed.

Your note that two handles are open somehow suggests that two harvesters were started for the file (which I can't find in the log files).

Are your log files on any shared or special drive?


ruflin commented Feb 9, 2016

@bitfehler Sorry to ask, but could you provide me with one more log file with "harvester, prospector" enabled, force_close_files: true, close_older set to 15 seconds, and harvesting only the "current" file (no glob)?

And can you "clean / remove" your .registrar file before starting again?

@bitfehler

@ruflin The logs are on a plain disk. But I did indeed not clear the registry file, forgot about that. I guess that could have caused some weirdness?

Anyways, I now ran the following config:

filebeat:
  prospectors:
    -
      paths:
        - /etc/sv/ampelmann-logs/log/main/current
      close_older: 15s
      force_close_files: true
output:
  logstash:
    hosts: [ <many hosts...> ]
    loadbalance: true

This time, I stopped the service, deleted the registry file, and restarted it. The issue still occurred. This is the resulting log:
filebeat_log.txt

For reference, these stale file handles showed up in lsof:

/etc/sv/ampelmann-logs/log/main/@4000000056bb07362967f3e4.s (deleted)
/etc/sv/ampelmann-logs/log/main/@4000000056bb07f226927d24.s (deleted)

Hope that helps...


ruflin commented Feb 10, 2016

Summary of the findings:

  • It seems like the current file was rotated 14 times, every 2-3 minutes
  • force_close is happening twice but not because of close_older

So it really looks like close_older does nothing. That probably means that lastTimeRead has a wrong date inside: https://github.com/elastic/beats/blob/master/filebeat/harvester/reader.go#L114 An alternative could be that for whatever reason h.file is nil here and filebeat can't close the file anymore: https://github.com/elastic/beats/blob/master/filebeat/harvester/log.go#L97
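
Not filebeat's actual code, just a minimal sketch of the check that close_older implies, to show where a stale lastTimeRead or a nil file handle would break it (all names and values illustrative):

package main

import (
	"fmt"
	"os"
	"time"
)

type harvester struct {
	file         *os.File  // nil would make closing impossible
	lastTimeRead time.Time // if this is never updated correctly, the check below never fires
}

// shouldClose reports whether the handle should be released because no data
// has been read for longer than closeOlder.
func (h *harvester) shouldClose(closeOlder time.Duration) bool {
	return time.Since(h.lastTimeRead) > closeOlder
}

func (h *harvester) maybeClose(closeOlder time.Duration) {
	if !h.shouldClose(closeOlder) {
		return
	}
	if h.file == nil {
		fmt.Println("no file handle to close") // the suspected failure mode
		return
	}
	h.file.Close()
	h.file = nil
	fmt.Println("closed file after close_older expired")
}

func main() {
	// Simulate a harvester that last read data two minutes ago, with close_older set to 1m.
	h := &harvester{lastTimeRead: time.Now().Add(-2 * time.Minute)}
	h.maybeClose(1 * time.Minute)
}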

Is the above your complete config file or did you leave parts out (except for the hosts list)?

I'm currently refactoring parts of the prospector and harvester and trying to introduce some more log messages to see what exactly is going on: #956

@bitfehler

The config I pasted is all there was. I tried to build filebeat myself, so I could just add some log lines, but unfortunately it didn't build out of the box:

conrad@tryptophan ~/go/src/github.com/elastic/beats/filebeat $ go version
go version go1.5.3 linux/amd64
conrad@tryptophan ~/go/src/github.com/elastic/beats/filebeat $ go get github.com/elastic/beats/filebeat
# github.com/elastic/beats/filebeat/harvester/encoding
harvester/encoding/encoding.go:50: cannot use encoding.Replacement.NewEncoder (type func() *encoding.Encoder) as type func() transform.Transformer in field value
harvester/encoding/encoding.go:51: cannot use encoding.Replacement.NewEncoder (type func() *encoding.Encoder) as type func() transform.Transformer in field value

Will happily run new builds or different configs. In the meantime, I have an idea for a potential reproduction case that I will try to implement...


ruflin commented Feb 10, 2016

In the meantime I did some system tests trying to reproduce it, so far without success: #958. Can you tell me more about your use case?

Can you try cloning the repository directly instead of using go get? Or, if you use go get, try the full beats repo (not only filebeat).

@bitfehler

Hi, sorry for not responding for a while; I was pretty busy yesterday. I don't have any updates on the issue at hand so far, but I did finally manage to compile Filebeat myself. The problem is that the commit golang/text@c27e06c will break your builds once you upgrade golang.org/x/text to the latest version, so be warned 😄
As such, I'd now be happy to try out any branches, custom patches or other modifications if that would help.

As for the use case, it's rather simple. The ampelmann-logs service is a simple runit service. It receives access logs from an HTTP proxy and writes them to stdout, where they are picked up by svlogd, a common scenario in the runit world. Svlogd takes care of writing the logs to disk and rotating files.

I have tried to reproduce this with a very simple runit service that just spits out a log line every couple of seconds, but the issue didn't occur. Thus I am wondering if this is related to the amount of logs being processed? Maybe Filebeat still needs to do some work when the file is closed and that triggers some kind of race condition?


ruflin commented Feb 12, 2016

Unfortunately it is not compatible with the most recent version of golang.org/x/text; that is why we pinned a specific commit: https://github.com/elastic/beats/blob/master/glide.yaml#L31
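
For illustration, the pin in glide.yaml looks roughly like this (the actual revision is whatever is tagged in the file linked above):

import:
- package: golang.org/x/text
  version: <pinned-commit>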

I'm currently working on improving the tracking of open harvesters and also the logging around it. This will make it possible to track much better what is happening: #964

Two additional ideas:

  • To make sure the issue is not related to the output, could you run it with the file output instead of the logstash output? (A config sketch follows below.)
  • Filebeat also supports stdin. So an idea could be to read directly from the service's stdout instead of first writing it to a file. That said, the goal is of course that filebeat works correctly with the files.
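
For the first point, a sketch of what I mean (config keys as I remember them for the 1.x file output; path and filename are just examples):

filebeat:
  prospectors:
    -
      paths:
        - /etc/sv/ampelmann-logs/log/main/current
      close_older: 1m
output:
  file:
    path: "/tmp/filebeat"
    filename: filebeat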

@bitfehler

Cool, I will see if I can get your branch to run and give that a go, and I will also try the file output!


ruflin commented Feb 12, 2016

Be aware that this is not stable yet, still WIP. What's interesting is the httpprof output with the harvester counter. If you start filebeat with the httpprof option you can see in your browser how many harvesters are running.
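
Something along these lines, assuming the httpprof flag behaves as in the released versions (the address is just an example):

  filebeat -e -c filebeat.yml -httpprof 127.0.0.1:6060

and then check the debug endpoints it exposes in the browser (e.g. /debug/vars for the counters).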

@andrewkroh added the Filebeat label Mar 9, 2016

ruflin commented Apr 27, 2016

@bitfehler @sebbarg Quite some time has passed since our last interaction, and a few new filebeat versions with improvements came out (the current one is 1.2.2). Is this still an ongoing issue? I actually hope that with all the recent improvements we fixed this issue, so I will close it. Please reopen if the issue still persists.

ruflin closed this as completed Apr 27, 2016
@mrunalgosar

This issue still persists. I had a log that rolled over on 25 Aug 2016, and on 27 Aug 2016, when I started Filebeat, it started reading the older one as well.
I am using Filebeat 1.2.3 with:
ignore_older: 12h
close_older: 12h


ruflin commented Aug 31, 2016

@mrunalgosar This issue is related to file handles not being closed. Your issue seems to be that ignore_older does not work as expected. Can you please open a topic on discuss to investigate it? https://discuss.elastic.co/c/beats/filebeat Please post your config and, if possible, some log files there.
