
Filebeat is losing events due to file rotate #5503

Closed
mostolog opened this issue Jun 15, 2016 · 29 comments
@mostolog

mostolog commented Jun 15, 2016

Hi!

Already posted on Discuss.
I hope I haven't made any mistakes, because it took me quite a long time to test this and report it here.
I also hope it deserves pioneer's credit. ;)
Thanks in advance for your kind help.

Regards

Scenario

  • Logstash event generator writing to file "generator.log"
  • FilebeatA reading from "generator.log" and writing to "filebeat.log", rotating files each 1MB
  • FilebeatB reading from "filebeat.log" and writing to "final.log", rotating each 3MB (actually, rotation on this file can be disabled, but the problem persists)

Please, notice that:

  • If rotation renames the file by appending .X (eg: file.X) and FilebeatB is listening to file*, most events will be re-processed (ie: processed twice)
  • If rotation creates a new file named with a timestamp (eg: file.YYYY.MM.dd.hh.mm.ss) and FilebeatB is listening to file*, most events will be re-processed (ie: processed twice)

Result

FilebeatB seems to be losing events registered by FilebeatA due to rotation.
Some events registered in filebeat.log.X don't exist in final.log*

It seems FilebeatA rotates the file before FilebeatB can read the last events from it.

Details

logstash 5.0.0-alpha3-1
filebeat 1.2.3 and 5.0.0-alpha3

Configuration files

logstash.yml

input {
    generator{
        add_field => { "from" => "${HOSTNAME}" }
    }
}
filter {}
output {
    stdout {}
    file {
        path => "/tmp/generator.log"
        flush_interval => 0
    }
}

filebeatA.yml

filebeat:
  prospectors:
    -
      paths:
        - /tmp/log*
      input_type: log
      scan_frequency: 0s
output:
  file:
    path: "/tmp"
    filename: filebeat.log
    rotate_every_kb: 1000
    number_of_files: 100
  console:
shipper:
  name: filebeatA
  tags: ["service-X", "web-tier"]
logging:
  to_syslog: false
  to_files: false
  level: info

filebeatB.yml

filebeat:
  prospectors:
    -
      paths:
        - /tmp/filebeat.log
      input_type: log
      scan_frequency: 0s
output:
  file:
    path: "/tmp"
    filename: final.log
    rotate_every_kb: 3000
    number_of_files: 10
  console:
shipper:
  name: filebeatB
  tags: ["service-X", "web-tier"]
logging:
  to_syslog: false
  to_files: false
  level: info

Verification

This command checks for gaps on sequence numbers used by logstash generator.

ls -v filebeat.log* | xargs -i grep generator '{}' | sed 's/.*sequence\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'

This command checks for gaps on sequence number received from filebeat.

ls -v final.log* | xargs -i grep generator '{}' | sed 's/.*sequence\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'

If the last command above returns at least one pair of numbers, you can check that those sequence numbers exist in filebeat* but not in final.log*

For example, if printed 344356-352529:

grep -H generator filebeat.log* | grep 344356 prints one line.
grep -H generator final.log* | grep 344356 doesn't print anything.

In fact, querying the previous sequence value (344355) shows that the filebeat.log.X filename (via grep -H) is different (just truncated).
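For readers who prefer Python over the awk pipeline, the same gap check can be sketched roughly like this (an equivalent under the assumption that each event line carries a numeric sequence field, as in the generator output; `find_sequence_gaps` is a name made up here):

```python
import glob
import re

# Matches "sequence":123 whether or not the quotes are backslash-escaped
# (the events in final.log are JSON nested inside JSON strings).
SEQ_RE = re.compile(r'sequence\\?":(\d+)')

def find_sequence_gaps(pattern):
    """Collect every generator sequence number from files matching the
    glob pattern and return gaps as (first_missing, last_missing)."""
    seqs = set()
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:
            for line in f:
                for m in SEQ_RE.finditer(line):
                    seqs.add(int(m.group(1)))
    ordered = sorted(seqs)
    return [(a + 1, b - 1)
            for a, b in zip(ordered, ordered[1:]) if b != a + 1]

# Example: find_sequence_gaps("/tmp/final.log*")
```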

Proposed solution

  • Each FilebeatA file is created with a creation timestamp (eg: filebeat.log.YY-MM-dd-hh-mm-ss), hence FilebeatB will be able to read it until rotation happens.
  • At the moment of rotation, a new file is created with its creation timestamp, but the previous one still exists to be processed by FilebeatB, which will close it after a certain timeout.
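The proposed scheme could look roughly like this sketch (a hypothetical writer, not Filebeat's actual file output): every file gets a unique timestamped name at creation, so rotation never renames a file out from under a reader.

```python
import os
import time

class TimestampRotatingWriter:
    """Sketch of the proposal: each output file is named with its
    creation timestamp, so "rotation" is just opening a new file.
    Old files stay in place for slower readers; a separate retention
    policy would delete them later."""

    def __init__(self, directory, basename, max_bytes):
        self.directory = directory
        self.basename = basename
        self.max_bytes = max_bytes
        self._open_new()

    def _open_new(self):
        # Second-resolution stamps; rotations within the same second
        # would reopen the same file in append mode (a real version
        # would use a finer-grained or sequence-suffixed name).
        stamp = time.strftime("%y-%m-%d-%H-%M-%S")
        path = os.path.join(self.directory, f"{self.basename}.{stamp}")
        self.file = open(path, "a")

    def write(self, line):
        self.file.write(line + "\n")
        self.file.flush()
        if self.file.tell() >= self.max_bytes:
            self.file.close()
            self._open_new()
```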
@ruflin

ruflin commented Jun 15, 2016

I'm currently trying to reproduce this. Some notes so far:

  • I use scan_frequency: 1s
  • The filebeatA.yml path I use is /tmp/generator.log. You have /tmp/log*, not sure why this one is matching?
  • In filebeatB you use /tmp/filebeat.log. This should be /tmp/filebeat* to make sure it also crawls the rotated files
  • I'm trying it with a fixed number of events generated to keep it reproducible
  • Both Filebeat instances are probably using the same registry file. This should never be the case as otherwise they overwrite the state of each other. It should not be an issue here as filebeat is never restarted, but worth mentioning.
  • In your first command, the last ' is missing.

@ruflin

ruflin commented Jun 15, 2016

Didn't manage to reproduce this yet. So far I'm not sure whether this is just an issue of files rotating faster than Filebeat can read them. To make sure we are talking about the same thing:

  • filebeat A, which reads the generator.log file, persists it to disk. Each file is 1MB; max 100 files are created. If 100 files are exceeded, the oldest data is removed.
  • filebeat B reads the persisted filebeat A data. It persists to disk with 3MB per file and max 10 files. This means total storage of 30MB, compared to filebeat A with total storage of 100MB. So in case A writes more than 30MB, B will always only have a subset of the data. Over time this will be the most recent data: in the beginning it will just read all files in parallel (no priority), but then only the most recent files will be updated, so recent data is on top.

As Beats treat the output like a log output with rotating files, the above is the expected behaviour. If I understand you correctly, what you have is more a feature request so that the file output does not rotate files but creates them with a timestamp. The side effect of this is that storage would grow over time.

One important question for me is to understand why you use the file output in the first place? Was this to only reproduce a potential bug (which is great)?

Please let me know if we are talking about the same thing here or if I miss some points above. Thanks for putting all the effort into this, I'm happy to investigate further.

@mostolog
Author

Hi

The filebeatA.yml path I use is /tmp/generator.log. You have /tmp/log*, not sure why this one is matching?

At the beginning I tried with multiple generators, but in the end I used just one. Ignore it, as there are no more log* files.

In filebeatB you use /tmp/filebeat.log. This should be /tmp/filebeat* to make sure it also crawls the rotated files

That's not what I'm doing. Notice the "Please, notice that:" section.
If filebeatB is "listening" to filebeat.log* and filebeat.log is rotated (creating a new filebeat.XX file), wouldn't filebeatB "read all events from the newly created filebeat.XX"? Wouldn't that mean all events get duplicated? (Hope I explained myself properly)

I'm trying it with a fixed number of events generated to keep it reproducible

I'm able to reproduce it with the generator running for a while, until I get to filebeat.log.99 (approx). Event loss happens 100% of the time.

Both Filebeat instances are probably using the same registry file. This should never be the case as otherwise they overwrite the state of each other. It should not be an issue here as filebeat is never restarted, but worth mentioning.
Using Docker, so there's less to worry about.

In your first command, the last ' is missing.
Corrected. Thanks

filebeat A which reads the generator.log file, persists it to disk. Each file is 1MB, max 100 files are created. If 100 files are exceeded, the oldest data is removed.

That's correct

filebeat B reads the persisted filebeat A data. It persists to disk with 3MB per file and max 10 files. This means total storage of 30MB compared to filebeat A with total storage of 100MB. So in case A writes more than 30MB, B will always only have a subset of the data.

That's correct

Over time this will be the most recent data as in the beginning it will just read all files in parallel (no priority) but then only the most recent files will be updated, so recent data is on top.

I'm reading only "one file" (filebeat.log), due to my concerns about events being duplicated.

As Beats treat the output like a log output with rotating files, the above is the expected behaviour. If I understand you correctly, what you have is more a feature request so that the file output does not rotate files but creates them with a timestamp. The side effect of this is that storage would grow over time.

If Beats is intelligent enough to detect files being rotated, and hence reads only "the events that were unread from the last known position in filebeat.log, now named filebeat.log.1, while it starts reading filebeat.log from offset 0", then I have to test again with filebeat.log*.
Otherwise, you should be able to reproduce this.

If using filebeat.log* reads all filebeat.log.XX again (hence duplicating events), then I'm requesting a change to how rotation is done.
If using filebeat.log* reads all events in the truncated filebeat.log plus the last events that were not yet read from filebeat.log.1, then that's great and I should test again.

One important question for me is to understand why you use the file output in the first place? Was this to only reproduce a potential bug (which is great)?

That's correct. We're trying to get 0% loss from a fast-growing apache access file.

Please let me know if we are talking about the same thing here or if I miss some points above. Thanks for putting all the effort into this, I'm happy to investigate further.

Thank you for your help

@ruflin

ruflin commented Jun 15, 2016

Filebeat keeps track of rotated files, so it should never send duplicate events. So in case filebeat.log is renamed to filebeat.log.1, Filebeat detects that and just continues at the old position. It is important that the file is renamed and not that a new one is created, but that is the case with all log rotation algorithms I know of so far.

Following rotation is crucial exactly to avoid data loss, because it is not guaranteed that Filebeat has finished reading when you rotate the file. Of course, you will lose data if Filebeat can never catch up before the file is actually removed from disk.
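The rename-tracking described above can be illustrated with a small sketch: keying read state on (device, inode), roughly as Filebeat's registry does, means a rename leaves the stored offset intact (`file_id` and `read_new_lines` are hypothetical helpers, not Filebeat code):

```python
import os

# registry maps (device, inode) -> byte offset already read
registry = {}

def file_id(path):
    """Identify a file by (device, inode); a rename on the same
    filesystem keeps both, so the file is still recognised."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def read_new_lines(path):
    """Read only the bytes appended since the last call for this
    file identity, surviving renames such as log rotation."""
    key = file_id(path)
    offset = registry.get(key, 0)
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
        registry[key] = f.tell()
    return data.decode().splitlines()
```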

I assume the above note means, you should test again with filebeat.*?

@mostolog
Author

I assume the above note means, you should test again with filebeat.*?

Indeed.
On the other hand, if rotation were done using creation timestamps (as I proposed at the beginning of this topic), it could solve any issues related to rotation, so perhaps it's worth considering.

Be right back.
Thanks ;)

@ruflin

ruflin commented Jun 15, 2016

@mostolog I think / hope your issues are not related to rotation.

A general thing unrelated to this issue that this brought up for me is the question of whether it makes sense to treat the file output the same as writing logs. The nice thing about log-format output with rotation is that it is possible to set a maximum on the data stored; at the same time, I'm not 100% sure that is always what is expected. Normally the file output is only used for testing and not in production, but it's still a topic worth discussing in another thread.

Let me know what your outcome is.

@mostolog
Author

Hi

if it makes sense to treat the file output the same as writing logs

I'm not sure if I understood this. Do you mean filebeat output being a log file itself?

I'm sad to report that it seems it's still happening.
To detail as much as possible, here's a trace of what I'm doing:

Dockerfile
filebeat.dockerfile.txt

Build image

docker build -f filebeat.dockerfile.txt -t filebeat .

Run and check version

docker run --rm -it filebeat /usr/share/filebeat/bin/filebeat -version
filebeat version 5.0.0-alpha3 (amd64), libbeat 5.0.0-alpha3

Delete before testing

sudo rm filebeat.log* final.log log*

Run filebeat1
filebeat1.yml.txt

docker run -d --name filebeat1 -h filebeat1 -v ~/beat/filebeat.yml:/usr/share/filebeat/bin/filebeat1.yml -v /tmp:/tmp/filebeat filebeat /usr/share/filebeat/bin/filebeat -c /usr/share/filebeat/bin/filebeat.yml

Run filebeat2
filebeat2.yml.txt

docker run -d --name filebeat2 -h filebeat2 -v ~/beat/filebeat2.yml:/usr/share/filebeat/bin/filebeat.yml -v /tmp:/tmp/filebeat filebeat /usr/share/filebeat/bin/filebeat -c /usr/share/filebeat/bin/filebeat.yml

Run generator1
generator.yml.txt

docker run -d --name generator1 -h generator1 -v ~/logstash/_generator/config:/etc/logstash/config -v /tmp:/tmp/filebeat logstash:5.0.0-alpha3-1 /usr/share/logstash/bin/logstash --path.settings=/etc/logstash/config

When there are ~80 filebeat.log files, stop generator container

docker rm -v -f generator1

Check for gaps on filebeat.log*

ls -v filebeat.log* | xargs -i grep generator1 '{}' | sed 's/.*sequence\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'

Check for gaps on final.log*

ls -v final.log* | xargs -i grep generator1 '{}' | sed 's/.*sequence\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'

106980-109856
118489-118487
118490-118488
118491-118489
118492-11849
...

Test results looking for a specific sequence number

grep -H 106980, filebeat.log*
filebeat.log.60:{"@timestamp":"2016-06-15T17:27:22.393Z","beat":{"hostname":"filebeat1","name":"filebeat1"},"input_type":"log","message":"{"sequence":106980,"@timestamp":"2016-06-15T17:27:21.802Z","@Version":"1","host":"generator1","from":"generator1","message":"Hello world!"}","offset":14866230,"source":"/tmp/filebeat/loggenerator1","type":"log"}

grep -H 106980, final.log
(No results)

I have uploaded the log files in case that's somehow useful... (hope it works: http://s000.tinyupload.com/index.php?file_id=56525537947986187956)

Regards

@ruflin

ruflin commented Jun 16, 2016

Thanks for sharing the log files. I did one specific check for the sequence number 118489-118487 you listed above. 118487 is the last entry in log file 57, 118488, 118489 are in log file 56.

In the final log I found sequence 118487 on line 115494 and sequence 118488 and following at line 104388. Did you expect the sequence numbers to also be in order in final.log?

Not 100% sure I read the output above correctly, but I expected the numbers you list above to have a gap in between?

@ruflin

ruflin commented Jun 16, 2016

I also checked for 106980, and this one I couldn't find in final.log. It seems the events that are potentially missing are the first ones in a new file?

An additional note here: the last lines of the final.log I downloaded have some weird lines inside where it seems something went wrong with the encoding. Same on your side?

@ruflin

ruflin commented Jun 16, 2016

Checking further, it seems like the complete log.60 is missing. I'm curious whether this could be some internal race condition. You run the generation and crawling in parallel, correct? If you run them separately, meaning first run filebeatA and then run filebeatB, do you still have missing events?

Can you share the registry file that was generated during the tests? It would be interesting to see whether log.60 shows up or not, and if yes, what the offset is.

@ruflin

ruflin commented Jun 16, 2016

@mostolog I just realised now, that this issue is actually in the logstash repo instead of the beats. We can keep it here for the moment but as soon as we find the potential bug, we have to open a new issue in the beats repo.

@mostolog
Author

Hi

Thanks for sharing the log files. I did one specific check for the sequence number 118489-118487 you listed above. 118487 is the last entry in log file 57, 118488, 118489 are in log file 56.

That's correct.

In the final log I found sequence 118487 on line 115494 and sequence 118488 and following at line 104388.

With grep -nH 118489 final.log I got 2 relevant results:

final.log:104388:{"@timestamp":"2016-06-15T17:27:24.238Z","beat":{"hostname":"filebeat2","name":"filebeat2"},"input_type":"log","message":"{"@timestamp":"2016-06-15T17:27:23.956Z","beat":{"hostname":"filebeat1","name":"filebeat1"},"input_type":"log","message":"{\"sequence\":118488,\"@timestamp\":\"2016-06-15T17:27:22.380Z\",\"@Version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}","offset":16477350,"source":"/tmp/filebeat/loggenerator1","type":"log"}","offset":356,"source":"/tmp/filebeat/filebeat.log","type":"log"}

final.log:115836:{"@timestamp":"2016-06-15T17:27:25.239Z","beat":{"hostname":"filebeat2","name":"filebeat2"},"input_type":"log","message":"{"@timestamp":"2016-06-15T17:27:23.956Z","beat":{"hostname":"filebeat1","name":"filebeat1"},"input_type":"log","message":"{\"sequence\":118488,\"@timestamp\":\"2016-06-15T17:27:22.380Z\",\"@Version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}","offset":16477350,"source":"/tmp/filebeat/loggenerator1","type":"log"}","offset":356,"source":"/tmp/filebeat/filebeat.log.8","type":"log"}

which means sequence 118489 was processed twice (due to rotation), wasn't it?

Did you expect the sequence numbers to be in order also in the final.log?

Not if using filebeat* as prospector path, but it seemed buggy also when using "filebeat.log" (without *)

Not 100% sure I read the output above correctly but I expected that the numbers you list above have a gap inbetween?

Using "filebeat.log" I didn't expect gaps, as it only reads one file. Using filebeat.log*, as far as I can understand, it's possible to have numbers out of order, hence having gaps in between.

But, please, notice that:

grep -nH sequence\":106980 filebeat.log*

filebeat.log.60:1:{"@timestamp":"2016-06-15T17:27:22.393Z","beat":{"hostname":"filebeat1","name":"filebeat1"},"input_type":"log","message":"{"sequence":106980,"@timestamp":"2016-06-15T17:27:21.802Z","@Version":"1","host":"generator1","from":"generator1","message":"Hello world!"}","offset":14866230,"source":"/tmp/filebeat/loggenerator1","type":"log"}

while grep -nH 106980 final.log

final.log:100660:{"@timestamp":"2016-06-15T17:27:24.075Z","beat":{"hostname":"filebeat2","name":"filebeat2"},"input_type":"log","message":"{"@timestamp":"2016-06-15T17:27:20.270Z","beat":{"hostname":"filebeat1","name":"filebeat1"},"input_type":"log","message":"{\"sequence\":76918,\"@timestamp\":\"2016-06-15T17:27:20.212Z\",\"@Version\":\"1\",\"host\":\"generator1\",\"from\":\"generator1\",\"message\":\"Hello world!\"}","offset":10698006,"source":"/tmp/filebeat/loggenerator1","type":"log"}","offset":631900,"source":"/tmp/filebeat/filebeat.log","type":"log"}

final.log doesn't have sequence number 106980. That is: it is losing events.

Regards

@mostolog
Author

It seems the events that are potentially missing, are the first ones in a new file?

I'll test a few times and let you know if it happens "too often"

An additional note here: The last lines of final.log which I downloaded has some wiered lines inside where it seems something went wrong with the encoding. Same on your side?

Same on my side

Checking further, it seems like the complete log.60 is missing. I'm curious if this could be some internal race condition. Do you run the generation and crawling in parallel, correct?

I didn't notice it. Yes: generation, filebeatA and filebeatB are running at the same time.

If you run it separately, means first run filebeatA and then run filebeatB, you still have missing events?

Haven't tried. Probably won't happen.

I just realised now, that this issue is actually in the logstash repo instead of the beats. We can keep it here for the moment but as soon as we find the potential bug, we have to open a new issue in the beats repo

Glad to hear "there's a chance of being a bug".

Regards

@ruflin

ruflin commented Jun 16, 2016

I didn't realise that it shows up twice. You are right that this definitely should not happen. Interestingly, it was crawled once as filebeat.log and once as filebeat.log.8, which means it was picked up again at a later stage. This could point to a race condition, meaning two harvesters picked it up independently of each other.

There is definitely the potential for a bug here :-) The problem with these race conditions is that it is really hard to find a simple test that reproduces them every time. I really appreciate all the effort you put into this.

For future testing, can we make sure to have the following:

  • Define separate registry files for each filebeat instance
  • Use Filebeat 5.0.0-alpha3 for testing
  • Use scan_frequency > 0

@ruflin

ruflin commented Jun 16, 2016

BTW: When you check for the first events in a new file to be missing, can you check if then the complete file is missing? I get more and more the impression we don't have an issue related to single events missing, but actually complete files missing or read twice.

@mostolog
Author

I really appreciate all the effort you put into this.

Thank YOU.

Define separate registry files for each filebeat instance

You mean directories? Please, elaborate.

Use Filebeat 5.0.0-alpha3 for testing

Already ;)

Use scan_frequency > 0

Although provided example states:

Scan frequency in seconds.

How often these files should be checked for changes. In case it is set

to 0s, it is done as often as possible. Default: 10s

😜

BTW: When you check for the first events in a new file to be missing, can you check if then the complete file is missing?

Do you mean "before doing grep -nH 106980 final.log, check that all filebeat.log* files are present in final.log"? Otherwise, I didn't understand.

I get more and more the impression we don't have an issue related to single events missing, but actually complete files missing or read twice.

And now's when I forward you to:

  • Review whether Filebeat is using inodes/file descriptors/whatever to "bypass" the file renaming that happens on rotation
  • Consider my wonderful proposal of rotating using creation timestamps (although logrotate does it the standard way and my proposal is reinventing the wheel)

@ruflin

ruflin commented Jun 16, 2016

  • Registry file: just make sure the path is different in filebeatA.yml and filebeatB.yml: https://www.elastic.co/guide/en/beats/filebeat/master/configuration-filebeat-options.html#_registry_file
  • scan_frequency: That is not nice. I even plan to throw an error in the future if it is set to 0. That means 0.001s is still possible (but not recommended) ;-) I put it on my list to be updated.
  • Missing event: What I meant is that if you notice an event is missing, you check whether all events from that file are missing in final.log, same as log 60 above, which is completely missing. This is to confirm my theory that we are not talking about missing events, but missing files.

About your last two points:

  • Identifiers: We are discussing using fingerprints for files in the future instead of only inode/device. This should also improve some problems like inode reuse. See Registry info cleanup beats#1022
  • File output: That would not be an issue if Filebeat actually worked as expected ;-) But I like the suggestion that, when using the file output, a config option could assign timestamps to files instead of rotating them.
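The fingerprint idea can be sketched as hashing the first bytes of a file, an identity that survives both renames and inode reuse (a rough illustration of the direction discussed in beats#1022, not the actual Beats implementation):

```python
import hashlib

def fingerprint(path, length=1024):
    """Identify a file by a hash of its first `length` bytes.
    Unlike inode/device identity, this survives inode reuse, but it
    only works once the file has at least `length` bytes (otherwise
    two short files could collide). Hypothetical sketch only."""
    with open(path, "rb") as f:
        head = f.read(length)
    if len(head) < length:
        return None  # not enough data yet to identify reliably
    return hashlib.sha256(head).hexdigest()
```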

@mostolog
Author

Tomorrow (or next Monday) I'll test a few times with different registry values, although using Docker that shouldn't matter, should it?

I'll include "check if any filebeat.log* is missing on final.log" to the script.

Identifiers: We are discussing to use fingerprints for files in the future and not only inode /device. This should also improve some problems like inode reuse. See elastic/beats#1022

Now my approach seems simpler to me...

@ruflin

ruflin commented Jun 16, 2016

Our approach of fingerprinting is not meant to solve output issues with Filebeat, but to solve challenges with lots of other logging systems. Relying only on the path for file detection is definitely also a configuration option we are discussing.

For Docker: assuming you run each Filebeat instance in its own Docker container and the registry directory is not a shared volume, it should not matter. TBH it would be better to do the tests without shared volumes in Docker. I used shared volumes in the past for some system tests and it led to some crazy things, as inodes and devices sometimes just randomly changed. But that was on OS X mounting local volumes. I'm just mentioning this to make sure we don't discover a Docker volume issue in the end :-)

@mostolog
Author

Ok.

I'll let you know any update/opening issue at logstash.
Thanks a lot!

@ruflin

ruflin commented Jun 21, 2016

@mostolog any updates from your side here?

I was thinking of writing a small Python script for this test so we could even add it to our test suite. The simplified version would be: a Python logger generates about 100 rotated files with a sequence number, Filebeat keeps reading, and when it's finished, Python reads the output file, loads all sequence numbers into memory, and checks if any are missing.
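The generate-and-verify part of that test could be sketched like this (hypothetical helpers; the real version would run Filebeat between the generate and check steps and verify its output instead of the input):

```python
import glob
import os

def generate_rotated_logs(directory, events=1000, per_file=100):
    """Write sequence-numbered lines across rotated files
    (test.log, test.log.1, ...), mimicking a rotating logger."""
    seq, file_no = 0, 0
    while seq < events:
        name = "test.log" if file_no == 0 else f"test.log.{file_no}"
        with open(os.path.join(directory, name), "w") as f:
            for _ in range(min(per_file, events - seq)):
                f.write(f"{seq}\n")
                seq += 1
        file_no += 1

def missing_sequences(directory, events):
    """Load every sequence number found in the files and report
    which of 0..events-1 never showed up."""
    seen = set()
    for path in glob.glob(os.path.join(directory, "test.log*")):
        with open(path) as f:
            seen.update(int(line) for line in f if line.strip())
    return sorted(set(range(events)) - seen)
```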

@ruflin

ruflin commented Jul 4, 2016

@mostolog Based on your script, I wrote some python tests. It seems that especially under heavy rotation, something goes wrong: elastic/beats#1954 Can we move this issue to the Beats repo?

@mostolog
Author

mostolog commented Jul 4, 2016

On my way! (Thanks, I have been quite busy these days)

@ruflin

ruflin commented Jul 4, 2016

@mostolog Thanks, closing this in favor of elastic/beats#1956 AFAIK it is not possible to move issues :-(

@ruflin ruflin closed this as completed Jul 4, 2016
@stv-io

stv-io commented Aug 17, 2016

I'm not sure whether I'm throwing this comment out of context, but here goes. Please let me know if this is the wrong place to ask. In our production env we're doing log rotation through log4j, renaming json.log to json.log.n where n is between 1 and 5, with files fixed at 100MB — disk space usage and whatnot. The Filebeat version is 1.2.2, configured to pick up the single log file which log4j generates and rotates. The output is aimed at a Logstash listener over the wire. Any idea whether this issue will be hit under such conditions?

@ruflin

ruflin commented Aug 18, 2016

@sacasumo Did it happen that you lost some events? There are some edge cases that can occur in the 1.x release because of the way states are stored, but events should not be lost. Please use https://discuss.elastic.co/c/beats/filebeat for questions and add additional details like the config file etc. In case you hit some issues, please try 5.0.0-alpha5, as all known issues are fixed there: https://www.elastic.co/downloads/beats/filebeat

@stv-io

stv-io commented Aug 18, 2016

We haven't investigated this in a lot of detail, so we can't really say whether we're losing logs through the cracks. We just stumbled across the issue and reached out for info 😄 We haven't put in an external systematic check to figure out whether this happens. What edge cases do you know of? We've pondered using 5.0.0-alpha5, however this is in production, and the alpha tag gives me the shivers 🙀

@ruflin

ruflin commented Aug 18, 2016

The issue can happen if the rotation of multiple files happens during the file system scan, exactly between reading the file info and putting it into memory. To put this into perspective: to reproduce it I had to do 100 file rotations per second with scan_frequency set to 0.1, which is not recommended at all. Obviously this "race" condition can also happen with much lower rotation and scan_frequency rates, but it is much less likely. That is why it was initially quite tricky to even find the problem. To prevent these kinds of state race conditions, state handling was completely rewritten and restructured for the 5.0 release. From this point of view I can recommend the alpha5 release, as all these known edge cases either no longer happen or are handled gracefully. But obviously we do not recommend alphas for production. If it helps, there is currently no major issue in the Filebeat alpha5 version that I know of ;-)

@stv-io

stv-io commented Aug 18, 2016

Thanks for the in-depth response, in such context! In our case rotation happens much less frequently, roughly once an hour, with the default scan_frequency.

thanks again!
