Registry info cleanup #1022

smaniero · 2016-02-21T22:59:41Z

Why when log file is deleted registry file entries isn't updated? (remove keys of deleted files)

ruflin · 2016-02-22T05:29:56Z

That is currently the expected behaviour as it is hard to define if a file actually was removed or just disappeared for some time (renamed, volume not available, ...). We discussed a solution for this some time ago here: https://github.com/elastic/filebeat/issues/269

Is the issue you have that the registrar file is growing over time?

smaniero · 2016-02-22T20:36:40Z

In my case log file name policy is based on date:

logfile.log.<YYYYMMDD>

So day by day registry file grow and I need some tecnique to purge old info in a safe way

ruflin · 2016-02-23T17:40:59Z

We should definitively provide a method to cleanup the registrar file in the long run.

As you have only one entry per day you shouldn't run into issues too soon.

smaniero · 2016-02-23T20:23:30Z

Good

ruflin · 2016-02-24T01:19:17Z

@smaniero I suggest to keep this issue open and closing https://github.com/elastic/filebeat/issues/269 so we keep track of it in this repo.

smaniero · 2016-02-24T09:25:25Z

Ok

ruflin · 2016-03-01T10:27:07Z

I would suggest the following solution: As in the next release, close_older is introduced and ignore_older is set to infinity by default, I suggest to use ignore_older for the cleanup of the registrar file.

That means every time a file hits ignore_older it will also be removed from the registrar file. By default that never happens as ignore_older is infinity. Is ignore_older set to 12h, every file older then 12hours will be removed from the registrar file. Should it be updated afterwards again, the reading would start at the end of the file as no state is persisted.

Would that work for you @vinod8427 @smaniero

sprysprocket · 2016-04-05T21:09:25Z

This is a feature I am anxiously awaiting. We have several high volume log sources that feature hourly log slices which are created with a time-stamp suffix. We have no control over the file names, but we need to ship them regardless. The registry quickly grows into the MBs, and we need to knock it back. This involves shutting down the filebeat daemon, hand modifying the file, and starting it back up. Not the highlight of my week. Yes, we'll end up automating this, but still...

Since nobody replied to your last comment, I will. It seems like a reasonable approach to me.

One point that could use some clarification... You mention that the file will be purged from the registry if is older than ignore_older. The filebeat registry (at least for 1.1.2) does not appear to track any timestamps related to the file inode, implying that you would need access to the file to determine its age. If the file has been removed (inode gone... not simply moved to a new filename), would you also consider this as eligible for removal? Has the format already been changed in 1.2.x to support this?

The discussion you linked to above (https://github.com/elastic/filebeat/issues/269) seems to indicate that a timestamp would be added for this purpose, but wanted to confirm. It is entirely likely that we will have files that are purged from the host before the configured ignore_older window, and we wouldn't want those to continue to linger around.

driskell · 2016-04-07T07:28:43Z

This goes way back to forwarder days.

The purpose of registry is to maintain the current file system snapshot (inside, name, etc) and record progress (offset.) Currently, it is not doing the former and only receives updates that imply an addition. I've not fully tested things but based on the fixes I created for Log Courier a long time ago I'd say some of the following still remains and hopefully it'll offer some guidance.

New empty files don't appear in registrar until they receive write and ACK. If shipper stopped, log written, and shipper started, it would start from end and skip (contrast this to behaviour with a non-empty file - nothing skipped)
Renames do not persist to registrar unless harvester stops and starts again
Log rotation corrupts registrar if the old rotation updates due to missing rename propagation and incorrect use of path as key. This is because the harvester for the previous rotation still saves to registrar with old path, overwriting the key entry for the new file with the old file's offset. I've seen this trigger truncation detection or event loss.
File deletions are not propagated meaning inode reuse causes problems that get worse the longer you use the shipper.
Sometimes after a restart for whatever reason if insides have been reused on new files created while shipper was offline it will incorrectly think it needs to resume

Generally I would suggest the following to help nip this one in the bud. All are in the registrar/prospector code for Log Courier and work really well from what I hear.

When new file is detected, ship a new file event to registrar to log the start offset so it is not lost, before starting harvester
Do not use the path as key in registrar. Use a unique identifier such as UUID or a memory pointer (it gets rewritten completely each run anyway).
When rename is detected, ship a rename event to registrar to set new name. You could even remove the file name completely as long as initial prospector can deal with inode lookups like that. But that impacts diagnostic capability so it's nice to have it
Rare log rotation detection issues should be resolved by the above. Sending new file event after the rename event ensures synchronisation of the registrar shuffling. Using the unique identifier still allows harvester updates to be received without needing a path.
If a file is not seen after a prospector scan, remove it from registrar. Prospector likely has forgotten everything about it anyway so whatever is held in registrar isn't going to be used until restart and impacts startup behaviour.

I suppose overall the idea is that registrar should reflect the prospector view state, plus harvester offsets. At the moment it only truly reflects harvester states - yet it is used during startup to seed the prospector state. So you can imagine after restart the prospector state is very often completely different to what it was before restart.

sprysprocket · 2016-04-07T12:50:23Z

@driskell That makes a lot of sense to me.

ruflin · 2016-04-11T07:03:38Z

Linking this issue here as I think if we start to refactor the registrar file and potentially have to break BC, we should also take something like this into account: jordansissel/ruby-filewatch#79 (comment) Not necessarly directly implement it in filebeat but having the structure of the registry file ready to support more then just inode and device.

ruflin · 2016-04-11T07:10:13Z

@driskell Thanks a lot for the detailed analysis. There is not much to add and it is definitively the direction we should follow.

The only point I'm not sure about is your last point about removing a file from registrar if it is not found after a scan. Couldn't this have the affect that assuming a volume disappears for a second and reappears again, that it will reread all files? I was thinking using ignore_older for the cleanup based on a timestamp in the registry file. Like this it could be handled directly by the "registry writer" and harvester / prospector would not have to deal with it.

guyboertje · 2016-04-11T14:18:22Z

@ruflin - as part of my fingerprint update in filewatch I took the opportunity to add an expires timestamp into the sincedb record, when an expired entry is read it is not put into the in-memory db and therefore is not written out when the db is persisted to the sincedb file.

driskell · 2016-04-11T21:26:09Z

@ruflin Disappearing volumes could cause issue I guess - I had not considered that. If file system disappears then it likely all closed files. Depending on close timing settings if files vanished and were seen again it would see them as too old to open (ignore older) though of course if you close earlier it might do that.

Guess it is a tuning ppl may need to do as longer the record kept the more likely inode conflict on sets of fast rotating files. Both edge cases really :)

Before there were 2 different channels to report state so also prospector and registrar could report the state. State is now always reported by events from the harvesters. This means that events can also be empty to only update state. This means if 3 events were processed, this does not necessarily mean that 3 events were sent, as some of the events can also be state information. * Introduction of channel to prospector for events. This allows for actions on the event on a prospector level * Cleanup variable naming Part of elastic#1022

* Simplify state handling in Filebeat Before there were 2 different channels to report state so also prospector and registrar could report the state. State is now always reported by events from the harvesters. This means that events can also be empty to only update state. This means if 3 events were processed, this does not necessarily mean that 3 events were sent, as some of the events can also be state information. * Introduction of channel to prospector for events. This allows for actions on the event on a prospector level * Cleanup variable naming Part of #1022 * Prevent race in state

The Registrar should only persist the state but should not have to fetch information from the file itself before writing the state. First this is not the responsibility of the registrar to have knowledge about the different harvester implementation, second the information of a file could already have changed when the registrar writes. With this PR the file state is decoupled from the Registrar. Currently there is duplicated information in FileEvent. This will be cleaned up in a second step to keep the changes as minimal as possible per PR. Related to elastic#1022

The Registrar should only persist the state but should not have to fetch information from the file itself before writing the state. First this is not the responsibility of the registrar to have knowledge about the different harvester implementation, second the information of a file could already have changed when the registrar writes. With this PR the file state is decoupled from the Registrar. Currently there is duplicated information in FileEvent. This will be cleaned up in a second step to keep the changes as minimal as possible per PR. Related to #1022

Previously the file state was loaded from disk every time a new file was found. The state from the registry file is now only loaded once during startup and from the one the prospector internal in memory state is used. This is more efficient and prevents race conditions. This can have one side affect. In case a shared file system is not available during a restart and becomes available later, all files are read from the start again as the state for these files was not loaded. Only the state for files which are found during startup (rotated or not) are loaded. This is related to elastic#1022 Changes: * Registry only loaded once during startup * File Stats are only read once and reuse during one scan * No state for empty files is written * Additional tests to confirm changes were added

Previously the file state was loaded from disk every time a new file was found. The state from the registry file is now only loaded once during startup and from the one the prospector internal in memory state is used. This is more efficient and prevents race conditions. This can have one side affect. In case a shared file system is not available during a restart and becomes available later, all files are read from the start again as the state for these files was not loaded. Only the state for files which are found during startup (rotated or not) are loaded. This is related to #1022 Changes: * Registry only loaded once during startup * File Stats are only read once and reuse during one scan * No state for empty files is written * Additional tests to confirm changes were added

ruflin · 2016-05-10T06:38:27Z

I created under #1600 a suggestion on how we could proceed here.

@driskell Would be nice to get your feedback on this.

ruflin · 2016-05-26T08:12:55Z

Brief update here: Quite a few changes were implemented in Filebeat recently to tackle the problems above (see above linked commits / PR's). Registrar is not decoupled from Prospector, Registrar stores it state as an array without the path as index and registrar is only used for the state on startup. There are still a few open issues as #1600 but things are moving forward.

zxy12 · 2016-05-30T11:27:03Z

@ruflin will this feature be possible support with filebeat-1.2(cleanup registry states of deleted files )?

ruflin · 2016-05-30T11:33:51Z

@zxy12 Probably this feature will only be available in 5.x. The reason is that to make this feature available the inner workings of prospector and registrar had to be heavily changed.

zxy12 · 2016-05-30T11:47:58Z

@ruflin Well, I will upgrade it.

ruflin · 2016-07-18T11:06:21Z

I'm going to close this issue as all the necessary changes to separate prospector and registrar state handling are now in master. In addition new clean_* variables were introduce to clean up the registry file either on removal or after a certain time. All the discussions and related PR's can be found in #1600. Some small changes related to naming and docs are still tracked in #2012. All these changes should be release with beta1 soonish.

Thanks everyone for contributing in this issue. Special thanks goes to @driskell as most of the work done is very similar on what he initially suggested.

Please test the new features and let us know if you find some issues.

doingitbigdata · 2016-09-21T13:55:43Z

I am currently experiencing this inode reuse issue in a production system using filebeat v1.2.3. Can someone advise me if the only current solution is to upgrade to v5.0.0 alpha5? Will the clean_* variables be backported to a release version as this is a pretty critical issue?

ruflin · 2016-09-21T14:02:33Z

@doingitbigdata Backporting the clean_* options to the 1.x release is not possible because of the inner workings of filebeat and the registry file structure in 1.x. Your only way is to upgrade to the 5.0.0 releases. Beta1 is coming soon if that helps.

ruflin added the Filebeat Filebeat label Feb 22, 2016

ruflin added the enhancement label Feb 23, 2016

smaniero closed this as completed Feb 23, 2016

ruflin reopened this Feb 24, 2016

driskell mentioned this issue Apr 7, 2016

filebeat - inode re-use on OS leading to loss of beats data #1341

Closed

ruflin mentioned this issue Apr 25, 2016

Simplify state handling in Filebeat #1473

Merged

ruflin mentioned this issue Apr 28, 2016

Move state creation to harvester #1515

Merged

ruflin mentioned this issue May 3, 2016

Refactoring of FileState loading for Prospector #1558

Merged

ruflin mentioned this issue Jun 16, 2016

Filebeat is losing events due to file rotate elastic/logstash#5503

Closed

ruflin closed this as completed Jul 18, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Registry info cleanup #1022

Registry info cleanup #1022

smaniero commented Feb 21, 2016

ruflin commented Feb 22, 2016

smaniero commented Feb 22, 2016

ruflin commented Feb 23, 2016

smaniero commented Feb 23, 2016

ruflin commented Feb 24, 2016

smaniero commented Feb 24, 2016

ruflin commented Mar 1, 2016

sprysprocket commented Apr 5, 2016

driskell commented Apr 7, 2016

sprysprocket commented Apr 7, 2016

ruflin commented Apr 11, 2016

ruflin commented Apr 11, 2016

guyboertje commented Apr 11, 2016

driskell commented Apr 11, 2016

ruflin commented May 10, 2016

ruflin commented May 26, 2016

zxy12 commented May 30, 2016

ruflin commented May 30, 2016 •

edited

Loading

zxy12 commented May 30, 2016 •

edited

Loading

ruflin commented Jul 18, 2016

doingitbigdata commented Sep 21, 2016

ruflin commented Sep 21, 2016

Registry info cleanup #1022

Registry info cleanup #1022

Comments

smaniero commented Feb 21, 2016

ruflin commented Feb 22, 2016

smaniero commented Feb 22, 2016

ruflin commented Feb 23, 2016

smaniero commented Feb 23, 2016

ruflin commented Feb 24, 2016

smaniero commented Feb 24, 2016

ruflin commented Mar 1, 2016

sprysprocket commented Apr 5, 2016

driskell commented Apr 7, 2016

sprysprocket commented Apr 7, 2016

ruflin commented Apr 11, 2016

ruflin commented Apr 11, 2016

guyboertje commented Apr 11, 2016

driskell commented Apr 11, 2016

ruflin commented May 10, 2016

ruflin commented May 26, 2016

zxy12 commented May 30, 2016

ruflin commented May 30, 2016 • edited Loading

zxy12 commented May 30, 2016 • edited Loading

ruflin commented Jul 18, 2016

doingitbigdata commented Sep 21, 2016

ruflin commented Sep 21, 2016

ruflin commented May 30, 2016 •

edited

Loading

zxy12 commented May 30, 2016 •

edited

Loading