Logstore issues on AWS Lambda #2410
Comments
@timon-schmelzer-gcx check the _delta_log files, and share those contents here
Sure, hope this helps!
I don't see anything wrong here. You overwrote the table, so it now also has a remove action, and that file later got vacuumed
Wait a second, I do not want to overwrite when adding new data, only when the table does not exist yet. Is the problem that the write config cannot be changed (or is ignored) once the table is created?

```python
def create_or_append_delta(data: pd.DataFrame, s3_bucket: str, s3_path: str, schema: pa.Schema):
    """Create a new delta table with reasonable configuration.

    If the delta table already exists, append data to it.
    """
    write_config = {
        "mode": "append",
    }
    if not folder_exists(s3_bucket, s3_path):
        print("Writing new table")
        write_config = {
            "mode": "overwrite",
            "configuration": {"delta.logRetentionDuration": "interval 0 hour"},
        }
    write_deltalake(s3_path, data, schema=schema, **write_config)
```
@timon-schmelzer-gcx the writer mode is not part of the table configuration. There is likely something wrong in how you check whether the table exists or not; you need to look into that
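One way the existence check above can go wrong: testing whether the S3 prefix merely exists is not the same as testing whether a committed Delta table lives there. A minimal local-filesystem sketch of a stricter check (the name `delta_table_exists` is illustrative, not a deltalake API; for S3 you would list the `_delta_log/` prefix instead of using `os`):

```python
import os
import tempfile

def delta_table_exists(table_path: str) -> bool:
    """Treat a path as a Delta table only if _delta_log holds a commit file.

    Checking that the parent prefix exists (as a folder_exists-style helper
    does) can misfire: stray objects or leftover data files make the prefix
    "exist" even though no table was ever committed there, and the writer
    then appends; conversely, a wrong prefix match can flip the writer into
    overwrite mode on a live table.
    """
    log_dir = os.path.join(table_path, "_delta_log")
    if not os.path.isdir(log_dir):
        return False
    return any(name.endswith(".json") for name in os.listdir(log_dir))

# usage sketch
with tempfile.TemporaryDirectory() as root:
    table = os.path.join(root, "bronze")
    os.makedirs(os.path.join(table, "_delta_log"))
    print(delta_table_exists(table))  # no commit file yet -> False
    open(os.path.join(table, "_delta_log", "00000000000000000000.json"), "w").close()
    print(delta_table_exists(table))  # -> True
```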
Environment
Delta-rs version: 0.16.4
Binding: python
Environment:
Bug
What happened:
I am currently working on a simple AWS data pipeline, mainly built on AWS Lambda functions + delta-rs. There are two steps:
a. One special case: if the table does not exist yet, create one and set the table parameters correctly
As explained here, we also configured DynamoDB as the logstore. The corresponding environment variables are set directly for both Lambda functions.
Here is the code:
Optimize code:
What you expected to happen:
I would expect OPTIMIZE to merge multiple small files into bigger ones and VACUUM to remove the now outdated small files. Instead, the OPTIMIZE code does nothing and VACUUM removes every file besides the most recent one! As written in the comment, I understood that VACUUM will not remove parts of the table that are currently in use, even if you set `retention_hours=0`.
How to reproduce it:
I expect a logstore issue here, as the optimize / vacuum code is working fine locally.
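The retention rule described under "What you expected to happen" can be modeled in a few lines: VACUUM only considers files that are both tombstoned (no longer in the current snapshot) and older than the retention window. This is an illustrative stand-alone model, not the deltalake API:

```python
from datetime import datetime, timedelta, timezone

def vacuum_candidates(live_files, tombstones, retention_hours):
    """Model of which files a VACUUM may delete.

    `tombstones` maps a removed file's path to the time its remove action
    was committed. Files still referenced by the current snapshot are never
    candidates, even with retention_hours=0; tombstoned files only become
    candidates once they are older than the retention window.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=retention_hours)
    return sorted(
        path for path, removed_at in tombstones.items()
        if path not in live_files and removed_at <= cutoff
    )

now = datetime.now(timezone.utc)
live = {"part-002.parquet"}
dead = {"part-001.parquet": now - timedelta(hours=1)}
print(vacuum_candidates(live, dead, retention_hours=0))   # ['part-001.parquet']
print(vacuum_candidates(live, dead, retention_hours=24))  # []
```

Under this model, a VACUUM that deletes files still referenced by the latest version (as observed here) points at the snapshot itself being wrong, i.e. the log, rather than the retention math.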
More details:
This is the situation before the second lambda function is called:
Output of the second lambda function:
So even if there are two files available, `bronze_delta_table.file_uris()` only returns one. Also, the OPTIMIZE only considers one file instead of two. I ran this code on a table containing hundreds of small files, with the result that all files besides the most recent ones were removed.

There is also a warning about the lock client (`[2024-04-12T08:44:16Z WARN deltalake_aws::logstore] LockClientError::VersionAlreadyExists(2)`), which I do not really understand. When looking at DynamoDB, it seems like the second lambda function created two entries at the same time, but I am not sure why. Could that be the reason for this strange behaviour?
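For context on the warning: as I understand the S3/DynamoDB logstore design, `VersionAlreadyExists` is the expected signal when two writers race to commit the same log version. Each commit is recorded with a conditional put in DynamoDB, the loser of the race sees the conflict, re-reads the log, and retries at the next version. A toy model of that protocol (illustrative only, not the delta-rs implementation):

```python
class VersionAlreadyExists(Exception):
    pass

class CommitLog:
    """Toy model of a DynamoDB-backed commit log: one entry per version,
    written with a conditional put that fails if the version exists."""

    def __init__(self):
        self._entries = {}

    def latest_version(self):
        return max(self._entries, default=-1)

    def try_commit(self, version, payload):
        if version in self._entries:
            # analogous to PutItem with an attribute_not_exists(...) condition
            raise VersionAlreadyExists(version)
        self._entries[version] = payload

def commit_with_retry(log, payload, max_attempts=5):
    # A well-behaved writer retries at the next version when it loses the race.
    for _ in range(max_attempts):
        version = log.latest_version() + 1
        try:
            log.try_commit(version, payload)
            return version
        except VersionAlreadyExists:
            continue  # someone else won; re-read the log and retry
    raise RuntimeError("gave up after repeated conflicts")

log = CommitLog()
log.try_commit(0, "create table")
log.try_commit(1, "append A")
# Two concurrent writers both target version 2; the second sees the
# conflict and lands at version 3 instead:
log.try_commit(2, "append B")
print(commit_with_retry(log, "optimize"))  # 3
```

In this model the warning is harmless by itself; it only becomes a problem if a writer commits without the lock client configured, or commits a stale snapshot, which would match the "two entries at the same time" observation.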