Filebeat is losing events due to file rotate #5503
I'm currently trying to reproduce this. Some notes so far:
|
Didn't manage to reproduce this yet. So far I'm not sure if this is just an issue of files rotating faster than Filebeat can read them? To make sure we are talking about the same thing:
As Beats treat the output like a log output with rotating files, the above is the expected behaviour. If I understand you correctly, what you have is more a feature request so that the file output does not rotate files but creates them with a timestamp. The side effect of this is that storage would grow over time. One important question for me is why you use the file output in the first place? Was this only to reproduce a potential bug (which is great)? Please let me know if we are talking about the same thing here or if I'm missing some points above. Thanks for putting all the effort into this, I'm happy to investigate further. |
Hi
At the beginning I tried with multiple generators, but in the end I used just one. Ignore it, as there are no more log* files.
That's not what I'm doing. Notice the "Please, notice that:" section.
I'm able to reproduce it with the generator running for a while, until I get to filebeat.log.99 (approx.). Event loss happens 100% of the time.
That's correct
That's correct
I'm reading only "one file" (filebeat.log), due to my concerns about events being duplicated.
If Beats is intelligent enough to detect files being rotated, and hence reading only "events that were unread from last known position on filebeat.log, now named filebeat.log.1, while it starts reading filebeat.log from offset 0", then I have to test again with filebeat.log*. If using filebeat.log* reads all filebeat.log.XX again (hence, duplicating events), then I'm requesting to change how rotation is done.
That's correct. We're trying to get 0% loss from a fast-growing Apache access log.
Thank you for your help |
Filebeat keeps track of rotated files, so it should never send duplicate events. Following rotation is crucial precisely to avoid data loss, because it is not guaranteed that Filebeat has finished reading when you rotate the file. Of course, you will lose events if Filebeat can never catch up before the file is actually removed from disk. I assume the above note means you should test again with filebeat.log* |
Indeed. Be right back. |
@mostolog I think / hope your issues are not related to rotation. A general thing unrelated to this issue that it brought up for me is the question of whether it makes sense to treat the file output the same as writing logs. The nice thing about log-style output with rotation is that it is possible to cap the data stored; at the same time I'm not 100% sure that is always what is expected. Normally the file output is only used for testing and not in production, but it's still a topic worth discussing in another thread. Let me know what your outcome is. |
Hi
I'm not sure if I understood this. Do you mean the Filebeat output being a log file itself? I sadly have to report that it seems it's still happening.
Dockerfile
Build image
Run and check version
Delete before testing
Run filebeat1
Run filebeat2
Run generator1
When there are ~80 filebeat.log files, stop generator container
Check for gaps in filebeat.log*
Check for gaps in final.log* (see the gap-check sketch below)
Test results looking for a specific sequence number
I have uploaded the log files in case they are somehow useful... (hope it works: http://s000.tinyupload.com/index.php?file_id=56525537947986187956) Regards |
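For reference, a minimal Python sketch of that gap check (this assumes the events contain a "sequence":N field, as in the grep commands used in this thread; everything else, including the script name, is an arbitrary choice):

```python
import glob
import re
import sys

# Extract "sequence":N values the same way the grep commands do; the optional
# backslash also matches the escaped form inside nested JSON strings.
SEQ = re.compile(r'sequence\\?":(\d+)')

def check_gaps(pattern):
    sequences = set()
    for path in glob.glob(pattern):
        with open(path) as fh:
            for line in fh:
                sequences.update(int(m) for m in SEQ.findall(line))
    if not sequences:
        print("%s: no events found" % pattern)
        return []
    missing = sorted(set(range(min(sequences), max(sequences) + 1)) - sequences)
    print("%s: %d events, %d missing" % (pattern, len(sequences), len(missing)))
    return missing

# Usage: python check_gaps.py 'filebeat.log*' 'final.log*'
for pattern in sys.argv[1:]:
    check_gaps(pattern)
```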
Thanks for sharing the log files. I did one specific check for the sequence number. In the final log I found the sequence. Not 100% sure I read the output above correctly, but I expected that the numbers you list above would have a gap in between? |
I also checked for An additional note here: The last lines of |
Checking further, it seems like the complete file is missing. Can you share the registry file that was generated during the tests? It would be interesting to see if log60 shows up or not, and if yes, what the offset is. |
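In case it helps with that check, here is a small sketch that dumps source/offset pairs from a registry file (the registry layout differs between Filebeat versions, so the two shapes handled here are an assumption, not a spec; run it as `python dump_registry.py <path-to-registry>` and grep the output for log60):

```python
import json
import sys

# Print source/offset pairs from a Filebeat registry file. Older versions keep
# a path-keyed object, newer ones a list of states, so both are handled loosely.
with open(sys.argv[1]) as fh:
    registry = json.load(fh)

entries = registry.values() if isinstance(registry, dict) else registry
for entry in entries:
    if isinstance(entry, dict):
        print(entry.get("source"), entry.get("offset"))
```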
@mostolog I just realised that this issue is actually in the logstash repo instead of the beats repo. We can keep it here for the moment, but as soon as we find the potential bug we have to open a new issue in the beats repo. |
Hi
That's correct.
With grep -nH 118489 final.log I got 2 relevant results:
which means sequence 118489 was processed twice (due to rotation), doesn't it?
Not if using filebeat* as prospector path, but it seemed buggy also when using "filebeat.log" (without *)
Using "filebeat.log" i didn't expect gaps, as it only read one file. Using filebeat.log*, as far as I can understand, it's possible to have numbers unarranged, hence having gaps inbetween. But, please, notice that: grep -nH sequence\":106980 filebeat.log*
while grep -nH 106980 final.log
final.log doesn't have sequence number 106980. That is: it's losing events. Regards |
I'll test a few times and let you know if it happens "too often"
Same on my side
I didn't notice it. Yes: generation, filebeatA and filebeatB are running at the same time.
Haven't tried. Probably won't happen.
Glad to hear "there's a chance of being a bug". Regards |
I didn't realise that it shows up twice. You are right that this definitely should not happen. Interestingly, it was crawled once as ... There is definitely the potential for a bug here :-) The problem with these race conditions is that it is really hard to find a simple test that reproduces them every time. I really appreciate all the effort you put into this. For future testing, can we make sure to have the following:
|
BTW: When you check for the first events of a new file being missing, can you also check whether the complete file is missing? I get more and more the impression that we don't have an issue with single events missing, but with complete files being missed or read twice. |
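For that check, a sketch along these lines (same "sequence" regex assumption as above) reports, per rotated filebeat.log.X file, how many of its sequence numbers never show up in final.log*, which makes a completely skipped file obvious:

```python
import glob
import re

SEQ = re.compile(r'sequence\\?":(\d+)')  # same field the grep commands look for

def sequences_per_file(pattern):
    # Map each file matching the pattern to the set of sequence numbers it contains.
    result = {}
    for path in glob.glob(pattern):
        with open(path) as fh:
            result[path] = {int(m) for line in fh for m in SEQ.findall(line)}
    return result

inputs = sequences_per_file("filebeat.log*")    # what filebeatA wrote
delivered = set()
for nums in sequences_per_file("final.log*").values():
    delivered |= nums                           # everything filebeatB forwarded

# A file that was skipped entirely will have all of its sequences missing.
for path, nums in sorted(inputs.items()):
    missing = nums - delivered
    print("%s: %d of %d sequences missing" % (path, len(missing), len(nums)))
```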
Thank YOU.
You mean directories? Please, elaborate.
Already ;)
Although the provided example states:
😜
Do you mean "before doing grep -nH 106980 final.log, check that all filebeat.log* are present in final.log"? Otherwise, I didn't understand.
And now's when I forward you to:
|
About your last two points:
|
Tomorrow (or next Monday) I'll test a few times setting different registry values, although using Docker that shouldn't matter, should it? I'll add "check if any filebeat.log* is missing from final.log" to the script.
Now my approach seems simpler to me... |
Our fingerprint approach is not meant to solve output issues with Filebeat, but to solve challenges with lots of other logging systems. Relying only on the path for file detection is definitely also a configuration option we are discussing. For Docker: assuming you run each Filebeat instance in its own Docker container and the registry directory is not a shared volume, it should not matter. TBH it would be better to do the tests without shared volumes in Docker. I used shared volumes in the past for some system tests and it led to some crazy things, as inodes and devices sometimes just randomly changed. But that was on OS X and mounting local volumes. I'm just mentioning this to make sure we don't end up discovering a Docker volume issue :-) |
Ok. I'll let you know any update/opening issue at logstash. |
@mostolog any updates from your side here? I was thinking of writing a small Python script for this test so we could even add it to our test suite. The simplified version would be to use a Python logger that generates about 100 rotated files with a sequence number; Filebeat keeps reading, and when it's finished, Python reads the files, loads all sequence numbers into memory and checks if any is missing. |
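A minimal sketch of such a generator, using the standard library's rotating handler (the file name, size limit and backup count are arbitrary choices just to force heavy rotation, not values from this thread):

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Emit JSON events with an increasing sequence number into a file that is
# rotated aggressively, mimicking the fast-rotating input in this issue.
handler = RotatingFileHandler("filebeat.log", maxBytes=100000, backupCount=100)
logger = logging.getLogger("generator")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

for sequence in range(1000000):
    logger.info(json.dumps({"sequence": sequence, "message": "x" * 100}))
```

Filebeat would then read filebeat.log*, and the gap check from the earlier sketches can be run over whatever output it produces.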
@mostolog Based on your script, I wrote some python tests. It seems that especially under heavy rotation, something goes wrong: elastic/beats#1954 Can we move this issue to the Beats repo? |
On my way! (Thanks, I have been quite busy these days) |
@mostolog Thanks, closing this in favor of elastic/beats#1956 AFAIK it is not possible to move issues :-( |
I'm not sure whether I'm throwing this comment out of context, but here goes. Please let me know if this is the wrong place to ask. In our production env we're doing log rotation through log4j, renaming json.log to json.log.n where n is between 1 and 5, with each file fixed at 100MB (disk space usage and whatnot). The Filebeat version is 1.2.2, configured to pick up the single log file which log4j generates and rotates. The output is aimed at a logstash listener over the wire. Any idea whether this issue will be hit under such conditions? |
@sacasumo Did it happen that you lost some events? There are some edge cases that can happen in the 1.x release because of the way states are stored, but events should not be lost. Please use https://discuss.elastic.co/c/beats/filebeat for questions and add additional details like your config file etc. In case you hit some issues, please try 5.0.0-alpha5, as all known issues are fixed there: https://www.elastic.co/downloads/beats/filebeat |
We haven't investigated this in a lot of detail so we can't really say whether we're losing logs through the cracks. We just stumbled across the issue and reached out for info 😄 We haven't put in an external systematic check to figure out whether this happens. What edge cases do you know of? We've pondered using 5.0.0-alpha5 however this is in production, and the alpha tag gives me the shivers 🙀 |
The issue can happen if the rotation of multiple files happens during the file system scan, exactly between reading the file info and putting it into memory. To put this into perspective: to reproduce it I had to do 100 file rotations per second and set scan_frequency to 0.1, which is not recommended at all. Obviously this "race" condition can also happen with much lower file rotation rates and scan_frequency values, but it is much less likely. That is why it was initially quite tricky to even find the problem. To prevent these kinds of state race conditions, state handling was completely rewritten and restructured for the 5.0 release. From this point of view I can recommend the alpha5 release, as all these known edge cases either do not happen anymore or are handled gracefully. But obviously we do not recommend alphas for production. If it helps, there is currently no major issue in the filebeat alpha5 version that I know of ;-) |
Thanks for the in-depth response, in such context! In our case rotation happens much less frequently, roughly once an hour, with the default scan_frequency. Thanks again! |
Hi!
Already posted on Discuss.
I hope I haven't made any mistakes, because it took me quite a long time to test this and to report it here.
Also hope it deserves pioneer's. ;)
Thanks in advance for your kind help.
Regards
Scenario
Please, notice that:
Result
FilebeatB seems to be losing events registered on FilebeatA due to rotation.
Some events registered in filebeat.log.X don't exist in final.log*
It seems FilebeatA rotates the file before FilebeatB can read the last events from it.
Details
Configuration files
logstash.yml
filebeatA.yml
filebeatB.yml
Verification
This command checks for gaps on sequence numbers used by logstash generator.
This command checks for gaps on sequence number received from filebeat.
If the last command above returns at least a pair of numbers, you can check that those exist in filebeat* but don't exist in final.log*
For example, if it printed 344356-352529:
In fact, querying for the previous sequence value (344355) shows that the filebeat.log.X filename (grep -H) is different (just truncated).
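A rough Python equivalent of that lookup (344356 is just the example value from above; the sequence regex matches the same text the grep commands search for):

```python
import glob
import re
import sys

def files_containing(pattern, sequence):
    # Return the files matching the glob pattern that contain the given sequence number.
    needle = re.compile(r'sequence\\?":%d\b' % sequence)
    hits = []
    for path in glob.glob(pattern):
        with open(path) as fh:
            if any(needle.search(line) for line in fh):
                hits.append(path)
    return hits

sequence = int(sys.argv[1]) if len(sys.argv) > 1 else 344356
print("filebeat.log*:", files_containing("filebeat.log*", sequence))
print("final.log*:", files_containing("final.log*", sequence))
```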
Proposed solution