fix: add buffer flushing to filesystem writes #1911

Merged (2 commits) on Nov 29, 2023

Conversation

r3stl355
Contributor

Description

The current implementation of ObjectOutputStream does not invoke flush when writing files to Azure storage, which seems to cause intermittent failures where write_deltalake hangs with no progress and no error.

I'm adding a periodic flush to the write process, triggered by the size of the written buffer. The threshold can be set via the storage_options parameter (I could not find another way without changing the interface). I don't know if this is an acceptable approach; it also requires the value to be passed as a string.
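To illustrate the idea of flushing based on written size, here is a minimal Python sketch. It is not the code in this PR: the class name `FlushingWriter`, its attributes, and the `>=` threshold comparison are assumptions made for illustration, and an in-memory `io.BytesIO` stands in for the real output stream.

```python
import io

# 4 MB, the default size named in this PR description.
DEFAULT_MAX_BUFFER_SIZE = 4 * 1024 * 1024


class FlushingWriter:
    """Hypothetical wrapper that flushes its sink once enough bytes were written."""

    def __init__(self, sink, max_buffer_size=DEFAULT_MAX_BUFFER_SIZE):
        self.sink = sink
        self.max_buffer_size = max_buffer_size
        self.buffered = 0      # bytes written since the last flush
        self.flush_count = 0   # number of flushes, kept only for inspection

    def write(self, data: bytes) -> int:
        written = self.sink.write(data)
        self.buffered += written
        # Assumption: flush as soon as the threshold is reached or exceeded.
        if self.buffered >= self.max_buffer_size:
            self.flush()
        return written

    def flush(self) -> None:
        self.sink.flush()
        self.buffered = 0
        self.flush_count += 1


# Usage: a tiny threshold so the flush is easy to observe.
writer = FlushingWriter(io.BytesIO(), max_buffer_size=10)
writer.write(b"12345")    # 5 bytes buffered, below the threshold
writer.write(b"123456")   # 11 bytes buffered, threshold reached, flush runs
```

A smaller threshold means more frequent flushes and less data held back at any one time, at the cost of more calls to the underlying store.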

Setting "max_buffer_size": f"{100 * 1024}" in the storage_options passed to write_deltalake resolves the issue for me when writing a dataset to Azure, which was otherwise failing consistently.
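As a rough sketch of how that option might be assembled, the snippet below builds a `storage_options` dict with the size encoded as a string. The helper `with_max_buffer_size` is hypothetical, the account name is a placeholder, and the `write_deltalake` call is left as a comment because the table URI and data are not part of this description.

```python
def with_max_buffer_size(storage_options, size_bytes):
    """Hypothetical helper: copy the options and set max_buffer_size as a string."""
    options = dict(storage_options)
    # The option value must be a string, so the integer is converted here.
    options["max_buffer_size"] = str(size_bytes)
    return options


# Placeholder credentials; real Azure options depend on your setup.
base_options = {"account_name": "<storage-account>"}
storage_options = with_max_buffer_size(base_options, 100 * 1024)

# The options would then be passed along, for example:
# write_deltalake(table_uri, data, storage_options=storage_options)
```

Copying the dict keeps the caller's original options untouched, which is convenient when the same base options are reused for reads that do not need the setting.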

The default max buffer size is 4MB, which looks reasonable and is used by other implementations I've seen (e.g. https://github.com/fsspec/filesystem_spec/blob/3c247f56d4a4b22fc9ffec9ad4882a76ee47237d/fsspec/spec.py#L1577)

Related Issue(s)

Can help with resolving #1770

Documentation

If the approach is accepted, I need to find the best way of adding this to the docs.

@github-actions github-actions bot added the binding/python Issues for the Python package label Nov 25, 2023
@ion-elgreco ion-elgreco enabled auto-merge (squash) November 29, 2023 18:06
@ion-elgreco ion-elgreco merged commit 6628493 into delta-io:main Nov 29, 2023
24 checks passed
ion-elgreco pushed a commit to ion-elgreco/delta-rs that referenced this pull request Dec 1, 2023
Signed-off-by: Nikolay Ulmasov <[email protected]>