Temporary files filling up _delta_log folder - increasing table load time #2351
Comments
You could likely include .json.tmp in the regex parsing of files that can be removed, as long as they have a creation time before a checkpoint.
Also, in the future, if we switch to put-if-absent (#2296), those tmp files won't be written anyway.
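Roughly, the suggestion above amounts to treating leftover temporary commits as cleanup candidates alongside expired log entries. A minimal sketch in Python for illustration only; the actual cleanup lives in the delta-rs Rust crate, and the pattern names and helper below are hypothetical:

```python
import re

# Hypothetical patterns and helper, for illustration only; the real cleanup
# logic is implemented in the delta-rs Rust crate.
LOG_COMMIT = re.compile(r"^\d{20}\.json$")                       # e.g. 00000000000000000042.json
TMP_COMMIT = re.compile(r"^_commit_[0-9a-f\-]{36}\.json\.tmp$")  # e.g. _commit_<uuid>.json.tmp

def is_cleanup_candidate(name: str, predates_checkpoint: bool) -> bool:
    """A file can be removed if it was created before the last checkpoint and
    is either an expired commit file or an orphaned temporary commit."""
    return predates_checkpoint and bool(LOG_COMMIT.match(name) or TMP_COMMIT.match(name))
```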
@mortnstak added a PR to fix the issue
# Description
We didn't clean up failed commits when possible.
# Related Issue(s)
- fixes #2351
Hi - I updated to Python v0.16.4 today, which explicitly refers to #2356 in the release notes. However, the issue does not appear to be fixed: tmp files are still being generated and are not removed by cleanup_metadata().
@mortnstak they are only removed up to a checkpoint.
@ion-elgreco - in our streaming data, the last checkpoint was created at 14:10. When running a metadata_cleanup, all tmp files generated prior to 14:10 should be deleted, right? This does not happen.
@mortnstak did you also set a logRetentionDuration? The metadata cleanup respects that as well.
@ion-elgreco - yes, this is set to 10 minutes, and JSON files are deleted according to this value.
@mortnstak in that case, please provide a reproducible example so I can look into it, because I added a test that checks that the .tmp files are deleted, and it passed.
Environment
Delta-rs version: 0.17.1
Binding: Python
Environment:
Bug
What happened:
We are streaming data from distributed writers. The main issue is that the time to run a DeltaTable() call for reading or writing grows over time, eventually reaching several minutes.
First - the source of the increasing load time was identified as the growing number of JSON files in the _delta_log folder. The DeltaTable call apparently does a file listing operation, which takes longer as the number of files increases. This was fixed by setting delta.logRetentionDuration to 1 hour and running cleanup_metadata() on schedule.
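For context, a rough sketch of that mitigation with the deltalake Python package; the table URI and sample data are placeholders, and it assumes the delta.logRetentionDuration configuration is applied when the table is first written:

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

table_uri = "s3://bucket/path/to/table"                   # placeholder location
data = pa.table({"id": [1, 2, 3], "value": list("abc")})  # placeholder rows

# The retention window is table configuration; here it is passed at write time.
# An existing table can instead be reconfigured through an engine that supports
# ALTER TABLE.
write_deltalake(
    table_uri,
    data,
    mode="append",
    configuration={"delta.logRetentionDuration": "interval 1 hours"},
)

# Scheduled metadata cleanup: removes log entries that are older than the
# retention window and precede the last checkpoint.
dt = DeltaTable(table_uri)
dt.cleanup_metadata()
```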
The second identified issue was a high number of temporary files filling up the _delta_log folder. The effect is the same: after three days of streaming, the number of .tmp files was above 30,000, and running DeltaTable() consequently takes 5 minutes. Manually deleting the files reduces the load time to 5 seconds.
I do not know why these files are in some cases generated and not deleted. I manually triggered a CommitFailedError due to concurrent compact operations, but this did not produce a tmp file.
What you expected to happen:
I would expect the files to be removed by the cleanup_metadata() operation. And if they are not needed, they should be deleted as part of the operation that generates them.
How to reproduce it:
More details:
The files have names of the following pattern: _commit_00014eab-4847-44c3-91a1-3f06a2f7f5f5.json.tmp. They appear to be part of compact operations; a snippet from one of the files: {"commitInfo":{"timestamp":1711099511948,"operation":"OPTIMIZE" ...
The current workaround is to delete the temporary files on schedule.
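A minimal sketch of such a scheduled cleanup for a locally mounted table; the path and age threshold are placeholders, and for object stores the same idea would use the store's list/delete API instead:

```python
import time
from pathlib import Path

# Placeholder path and age threshold; adjust to the actual table location.
delta_log = Path("/data/my_table/_delta_log")
max_age_seconds = 3600  # only remove temp commits older than one hour

# Delete leftover temporary commit files that are old enough to be orphaned.
for tmp_file in delta_log.glob("_commit_*.json.tmp"):
    if time.time() - tmp_file.stat().st_mtime > max_age_seconds:
        tmp_file.unlink()
```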
Some more info on the workload: We are generating checkpoints about every 10th write. We are running frequent compact operations on individual partitions, some of which conflict and generate CommitFailedErrors due to concurrent deletes.
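A rough sketch of that workload shape with the Python binding; the table URI and the "date" partition column are assumptions, not the actual setup:

```python
from deltalake import DeltaTable
from deltalake.exceptions import CommitFailedError

table_uri = "s3://bucket/path/to/table"  # placeholder

dt = DeltaTable(table_uri)

# Compact a single partition; with concurrent writers this occasionally fails.
try:
    dt.optimize.compact(partition_filters=[("date", "=", "2024-03-22")])
except CommitFailedError:
    # The conflicting commit is abandoned; per this issue, its
    # _commit_*.json.tmp file can be left behind in _delta_log.
    pass

# Checkpoint roughly every 10th write.
if dt.version() % 10 == 0:
    dt.create_checkpoint()
```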