Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Registry info cleanup #1022

Closed
smaniero opened this issue Feb 21, 2016 · 22 comments
Closed

Registry info cleanup #1022

smaniero opened this issue Feb 21, 2016 · 22 comments

Comments

@smaniero
Copy link

Why when log file is deleted registry file entries isn't updated? (remove keys of deleted files)

@ruflin
Copy link
Contributor

ruflin commented Feb 22, 2016

That is currently the expected behaviour as it is hard to define if a file actually was removed or just disappeared for some time (renamed, volume not available, ...). We discussed a solution for this some time ago here: https://github.com/elastic/filebeat/issues/269

Is the issue you have that the registrar file is growing over time?

@ruflin ruflin added the Filebeat Filebeat label Feb 22, 2016
@smaniero
Copy link
Author

In my case log file name policy is based on date:

logfile.log.<YYYYMMDD>

So day by day registry file grow and I need some tecnique to purge old info in a safe way

@ruflin
Copy link
Contributor

ruflin commented Feb 23, 2016

We should definitively provide a method to cleanup the registrar file in the long run.

As you have only one entry per day you shouldn't run into issues too soon.

@smaniero
Copy link
Author

Good

@ruflin
Copy link
Contributor

ruflin commented Feb 24, 2016

@smaniero I suggest to keep this issue open and closing https://github.com/elastic/filebeat/issues/269 so we keep track of it in this repo.

@ruflin ruflin reopened this Feb 24, 2016
@smaniero
Copy link
Author

Ok

@ruflin
Copy link
Contributor

ruflin commented Mar 1, 2016

I would suggest the following solution: As in the next release, close_older is introduced and ignore_older is set to infinity by default, I suggest to use ignore_older for the cleanup of the registrar file.

That means every time a file hits ignore_older it will also be removed from the registrar file. By default that never happens as ignore_older is infinity. Is ignore_older set to 12h, every file older then 12hours will be removed from the registrar file. Should it be updated afterwards again, the reading would start at the end of the file as no state is persisted.

Would that work for you @vinod8427 @smaniero

@sprysprocket
Copy link

This is a feature I am anxiously awaiting. We have several high volume log sources that feature hourly log slices which are created with a time-stamp suffix. We have no control over the file names, but we need to ship them regardless. The registry quickly grows into the MBs, and we need to knock it back. This involves shutting down the filebeat daemon, hand modifying the file, and starting it back up. Not the highlight of my week. Yes, we'll end up automating this, but still...

Since nobody replied to your last comment, I will. It seems like a reasonable approach to me.

One point that could use some clarification... You mention that the file will be purged from the registry if is older than ignore_older. The filebeat registry (at least for 1.1.2) does not appear to track any timestamps related to the file inode, implying that you would need access to the file to determine its age. If the file has been removed (inode gone... not simply moved to a new filename), would you also consider this as eligible for removal? Has the format already been changed in 1.2.x to support this?

The discussion you linked to above (https://github.com/elastic/filebeat/issues/269) seems to indicate that a timestamp would be added for this purpose, but wanted to confirm. It is entirely likely that we will have files that are purged from the host before the configured ignore_older window, and we wouldn't want those to continue to linger around.

@driskell
Copy link

driskell commented Apr 7, 2016

This goes way back to forwarder days.

The purpose of registry is to maintain the current file system snapshot (inside, name, etc) and record progress (offset.) Currently, it is not doing the former and only receives updates that imply an addition. I've not fully tested things but based on the fixes I created for Log Courier a long time ago I'd say some of the following still remains and hopefully it'll offer some guidance.

  • New empty files don't appear in registrar until they receive write and ACK. If shipper stopped, log written, and shipper started, it would start from end and skip (contrast this to behaviour with a non-empty file - nothing skipped)
  • Renames do not persist to registrar unless harvester stops and starts again
  • Log rotation corrupts registrar if the old rotation updates due to missing rename propagation and incorrect use of path as key. This is because the harvester for the previous rotation still saves to registrar with old path, overwriting the key entry for the new file with the old file's offset. I've seen this trigger truncation detection or event loss.
  • File deletions are not propagated meaning inode reuse causes problems that get worse the longer you use the shipper.
  • Sometimes after a restart for whatever reason if insides have been reused on new files created while shipper was offline it will incorrectly think it needs to resume

Generally I would suggest the following to help nip this one in the bud. All are in the registrar/prospector code for Log Courier and work really well from what I hear.

  • When new file is detected, ship a new file event to registrar to log the start offset so it is not lost, before starting harvester
  • Do not use the path as key in registrar. Use a unique identifier such as UUID or a memory pointer (it gets rewritten completely each run anyway).
  • When rename is detected, ship a rename event to registrar to set new name. You could even remove the file name completely as long as initial prospector can deal with inode lookups like that. But that impacts diagnostic capability so it's nice to have it
  • Rare log rotation detection issues should be resolved by the above. Sending new file event after the rename event ensures synchronisation of the registrar shuffling. Using the unique identifier still allows harvester updates to be received without needing a path.
  • If a file is not seen after a prospector scan, remove it from registrar. Prospector likely has forgotten everything about it anyway so whatever is held in registrar isn't going to be used until restart and impacts startup behaviour.

I suppose overall the idea is that registrar should reflect the prospector view state, plus harvester offsets. At the moment it only truly reflects harvester states - yet it is used during startup to seed the prospector state. So you can imagine after restart the prospector state is very often completely different to what it was before restart.

@sprysprocket
Copy link

@driskell That makes a lot of sense to me.

@ruflin
Copy link
Contributor

ruflin commented Apr 11, 2016

Linking this issue here as I think if we start to refactor the registrar file and potentially have to break BC, we should also take something like this into account: jordansissel/ruby-filewatch#79 (comment) Not necessarly directly implement it in filebeat but having the structure of the registry file ready to support more then just inode and device.

@ruflin
Copy link
Contributor

ruflin commented Apr 11, 2016

@driskell Thanks a lot for the detailed analysis. There is not much to add and it is definitively the direction we should follow.

The only point I'm not sure about is your last point about removing a file from registrar if it is not found after a scan. Couldn't this have the affect that assuming a volume disappears for a second and reappears again, that it will reread all files? I was thinking using ignore_older for the cleanup based on a timestamp in the registry file. Like this it could be handled directly by the "registry writer" and harvester / prospector would not have to deal with it.

@guyboertje
Copy link

@ruflin - as part of my fingerprint update in filewatch I took the opportunity to add an expires timestamp into the sincedb record, when an expired entry is read it is not put into the in-memory db and therefore is not written out when the db is persisted to the sincedb file.

@driskell
Copy link

@ruflin Disappearing volumes could cause issue I guess - I had not considered that. If file system disappears then it likely all closed files. Depending on close timing settings if files vanished and were seen again it would see them as too old to open (ignore older) though of course if you close earlier it might do that.

Guess it is a tuning ppl may need to do as longer the record kept the more likely inode conflict on sets of fast rotating files. Both edge cases really :)

ruflin added a commit to ruflin/beats that referenced this issue Apr 26, 2016
Before there were 2 different channels to report state so also prospector and registrar could report the state. State is now always reported by events from the harvesters. This means that events can also be empty to only update state. This means if 3 events were processed, this does not necessarily mean that 3 events were sent, as some of the events can also be state information.

* Introduction of channel to prospector for events. This allows for actions on the event on a prospector level
* Cleanup variable naming

Part of elastic#1022
andrewkroh pushed a commit that referenced this issue Apr 27, 2016
* Simplify state handling in Filebeat

Before there were 2 different channels to report state so also prospector and registrar could report the state. State is now always reported by events from the harvesters. This means that events can also be empty to only update state. This means if 3 events were processed, this does not necessarily mean that 3 events were sent, as some of the events can also be state information.

* Introduction of channel to prospector for events. This allows for actions on the event on a prospector level
* Cleanup variable naming

Part of #1022

* Prevent race in state
ruflin added a commit to ruflin/beats that referenced this issue Apr 29, 2016
The Registrar should only persist the state but should not have to fetch information from the file itself before writing the state. First this is not the responsibility of the registrar to have knowledge about the different harvester implementation, second the information of a file could already have changed when the registrar writes. With this PR the file state is decoupled from the Registrar.

Currently there is duplicated information in FileEvent. This will be cleaned up in a second step to keep the changes as minimal as possible per PR.

Related to elastic#1022
tsg pushed a commit that referenced this issue May 2, 2016
The Registrar should only persist the state but should not have to fetch information from the file itself before writing the state. First this is not the responsibility of the registrar to have knowledge about the different harvester implementation, second the information of a file could already have changed when the registrar writes. With this PR the file state is decoupled from the Registrar.

Currently there is duplicated information in FileEvent. This will be cleaned up in a second step to keep the changes as minimal as possible per PR.

Related to #1022
ruflin added a commit to ruflin/beats that referenced this issue May 3, 2016
Previously the file state was loaded from disk every time a new file was found. The state from the registry file is now only loaded once during startup and from the one the prospector internal in memory state is used. This is more efficient and prevents race conditions.

This can have one side affect. In case a shared file system is not available during a restart and becomes available later, all files are read from the start again as the state for these files was not loaded. Only the state for files which are found during startup (rotated or not) are loaded.

This is related to elastic#1022

Changes:
* Registry only loaded once during startup
* File Stats are only read once and reuse during one scan
* No state for empty files is written
* Additional tests to confirm changes were added
tsg pushed a commit that referenced this issue May 4, 2016
Previously the file state was loaded from disk every time a new file was found. The state from the registry file is now only loaded once during startup and from the one the prospector internal in memory state is used. This is more efficient and prevents race conditions.

This can have one side affect. In case a shared file system is not available during a restart and becomes available later, all files are read from the start again as the state for these files was not loaded. Only the state for files which are found during startup (rotated or not) are loaded.

This is related to #1022

Changes:
* Registry only loaded once during startup
* File Stats are only read once and reuse during one scan
* No state for empty files is written
* Additional tests to confirm changes were added
@ruflin
Copy link
Contributor

ruflin commented May 10, 2016

I created under #1600 a suggestion on how we could proceed here.

@driskell Would be nice to get your feedback on this.

@ruflin
Copy link
Contributor

ruflin commented May 26, 2016

Brief update here: Quite a few changes were implemented in Filebeat recently to tackle the problems above (see above linked commits / PR's). Registrar is not decoupled from Prospector, Registrar stores it state as an array without the path as index and registrar is only used for the state on startup. There are still a few open issues as #1600 but things are moving forward.

@zxy12
Copy link

zxy12 commented May 30, 2016

@ruflin will this feature be possible support with filebeat-1.2(cleanup registry states of deleted files )?

@ruflin
Copy link
Contributor

ruflin commented May 30, 2016

@zxy12 Probably this feature will only be available in 5.x. The reason is that to make this feature available the inner workings of prospector and registrar had to be heavily changed.

@zxy12
Copy link

zxy12 commented May 30, 2016

@ruflin Well, I will upgrade it.

@ruflin
Copy link
Contributor

ruflin commented Jul 18, 2016

I'm going to close this issue as all the necessary changes to separate prospector and registrar state handling are now in master. In addition new clean_* variables were introduce to clean up the registry file either on removal or after a certain time. All the discussions and related PR's can be found in #1600. Some small changes related to naming and docs are still tracked in #2012. All these changes should be release with beta1 soonish.

Thanks everyone for contributing in this issue. Special thanks goes to @driskell as most of the work done is very similar on what he initially suggested.

Please test the new features and let us know if you find some issues.

@ruflin ruflin closed this as completed Jul 18, 2016
@doingitbigdata
Copy link

I am currently experiencing this inode reuse issue in a production system using filebeat v1.2.3. Can someone advise me if the only current solution is to upgrade to v5.0.0 alpha5? Will the clean_* variables be backported to a release version as this is a pretty critical issue?

@ruflin
Copy link
Contributor

ruflin commented Sep 21, 2016

@doingitbigdata Backporting the clean_* options to the 1.x release is not possible because of the inner workings of filebeat and the registry file structure in 1.x. Your only way is to upgrade to the 5.0.0 releases. Beta1 is coming soon if that helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants