Generic S3 error: Error after 0 retries ... Broken pipe (os error 32) #2403
Comments
@t1g0rz does this happen every time? Can you see the logs in S3 during the execution? What are the storage options you are passing?
Yes, this is a persistent problem with this table. I managed to work around it by dividing one whole update into smaller, partition-level updates. Here are the storage options:

```python
storage_options = {
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log",
    "AWS_REGION": "us-east-1",
    "DELTA_DYNAMO_REGION": "us-east-1",
    "AWS_ACCESS_KEY_ID": credentials.access_key,
    "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
    "AWS_SESSION_TOKEN": credentials.token,
}
```

Let me try to find out if there are logs in S3.
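For reference, here is a minimal sketch of the per-partition workaround described above, i.e. splitting one large merge into one merge per partition. The table URI and the `df` DataFrame are placeholders for illustration; only `storage_options` comes from the thread.

```python
# Sketch only: run one merge per partition so each commit touches a single
# partition. Assumes a pandas DataFrame `df` with a "part" partition column
# and an "id" key column, plus the storage_options dict shown above.
from deltalake import DeltaTable

for part, part_df in df.groupby("part"):
    dt = DeltaTable("s3://lake/test3", storage_options=storage_options)
    (
        dt.merge(
            part_df,
            # Pinning the target partition in the predicate lets the scan
            # prune files belonging to all other partitions.
            predicate=f"s.id = t.id and t.part = '{part}'",
            source_alias="s",
            target_alias="t",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )
```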
No, I cannot. The folder _delta_log contains only
I found a combination that throws this error on EC2 (r6a.2xlarge: 64 GiB RAM, 8 vCPUs):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable.create(
    "s3://lake/test3",
    schema=pa.schema(
        [("id", pa.int64()), ("part", pa.string())]
        + [(f"c{i}", pa.float64()) for i in range(250)]
    ),
    storage_options=storage_options,
    partition_by=["part"],
)

r = []
for p in "ABCDEFJ":
    part_df = pd.DataFrame(np.random.random((280_000, 250)))
    part_df.columns = [f"c{i}" for i in range(250)]
    part_df.insert(0, "id", range(280_000))
    part_df.insert(1, "part", p)
    r.append(part_df)
df = pd.concat(r)

dt = DeltaTable("s3://lake/test3", storage_options=storage_options)
dt.merge(
    df, predicate="s.id = t.id and s.part = t.part", source_alias="s", target_alias="t"
).when_not_matched_insert_all().when_matched_update_all().execute()
```

It seems to me like a memory problem. Otherwise, I cannot explain why I had to increase the number of rows to reproduce it. However, the peak memory consumption was 38.7 GB.
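As a rough sanity check on the memory hypothesis (back-of-envelope arithmetic based on the snippet above, not a measurement from the thread): the raw float data of the source DataFrame alone is about 3.9 GB, and a merge also materializes the scanned target data plus intermediate batches, so a peak well above that figure is plausible.

```python
# Back-of-envelope size of the source DataFrame built in the repro above.
rows_per_part = 280_000
n_parts = 7            # the partitions "ABCDEFJ"
float_cols = 250
bytes_per_float64 = 8

source_bytes = rows_per_part * n_parts * float_cols * bytes_per_float64
print(f"source float data alone: {source_bytes / 1e9:.1f} GB")  # ~3.9 GB
```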
@t1g0rz this is not a supported option in delta-rs, is it?
@Dobatymo, |
@t1g0rz I could not find a single mention of it in the code or the docs. I also tried using it and it had zero effect.
Environment
Delta-rs version: 0.16.4
Binding: python
Bug
What happened:
I was updating a fairly large table on S3, but during the update I encountered the error below. I monitored memory, and it was sufficient. However, for some reason, Delta Lake only writes the first two files (100 MB each) to each partition; when it attempts to write the third file, it crashes with the error below. The table parameters are as follows: 8 partitions, each with 250 columns and 180,000 rows. I have no idea where to start debugging; I would appreciate your help.
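One possible starting point for debugging (a suggestion, not something established in this thread) is to inspect what actually made it into the table before the failed commit: the last committed version, the files the table currently references, and the recent commit history. A sketch, assuming the same `storage_options` and a placeholder table URI:

```python
# Sketch: inspect the table state after the failed write.
from deltalake import DeltaTable

dt = DeltaTable("s3://lake/table", storage_options=storage_options)
print(dt.version())         # last successfully committed version
print(len(dt.files()))      # data files referenced by the current log
print(dt.history(limit=5))  # metadata of the most recent commits
```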
What you expected to happen:
Normal completion of writing to the table