This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Parquet: Snappy buffer size limitation. #932

Closed
ritchie46 opened this issue Apr 4, 2022 · 3 comments · Fixed by #946
Labels: bug, no-changelog

Comments

@ritchie46
Collaborator

Writing large parquet files can lead to:

ArrowError(ExternalFormat("External format error: underlying snap error: snappy: input buffer (size = 68397383472) is larger than allowed (size = 4294967295)

I believe we already discussed this earlier in a polars issue, but I cannot reproduce it. I am opening this issue to explore ways to work around it.
I assume we can prevent this by passing smaller Chunks to the writer?

Is there a way to estimate up front the size of a buffer that will be sent to snappy?

@jorgecarleitao
Owner

We need to support splitting the arrays into multiple parquet pages when writing, since each page is compressed independently.

@jorgecarleitao
Owner

Alternatively, split the chunk into smaller parts ^^
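
A minimal sketch of what "smaller parts" could look like on the caller side, assuming the chunk is cut by rows. The function `row_ranges` and its `max_rows` parameter are hypothetical names for illustration and are not part of the library's API. It only computes the `(offset, length)` ranges; slicing the actual arrays and handing each slice to the writer is left to the caller.

```rust
/// Splits `total_rows` into consecutive `(offset, length)` ranges that each
/// cover at most `max_rows` rows.
///
/// Hypothetical helper: `max_rows` is an illustrative tuning knob chosen by
/// the caller, not a parameter of the parquet writer.
pub fn row_ranges(total_rows: usize, max_rows: usize) -> Vec<(usize, usize)> {
    assert!(max_rows > 0, "max_rows must be positive");
    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < total_rows {
        // The last range may be shorter than `max_rows`.
        let length = max_rows.min(total_rows - offset);
        ranges.push((offset, length));
        offset += length;
    }
    ranges
}
```

Each returned range would then be used to slice every column of the chunk at the same offset and length, so that the parts stay row-aligned.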

@ritchie46
Collaborator Author

> Alternatively, split the chunk into smaller parts ^^

I have been thinking about this solution. Is what we send to snappy the whole row group, or separate columns?

I am thinking of a proxy to determine the needed chunk size.

Something like this? estimated_size / (snappy_limit * n_columns)
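
A rough sketch of that proxy, reading the expression as "how many parts to split the chunk into". It assumes that columns are compressed independently and are all roughly the same size, which will not hold when one column dominates. The names `SNAPPY_LIMIT`, `ceil_div` and `estimated_parts` are made up for this example; the limit value is the one from the error message above.

```rust
/// Largest input snappy accepts according to the error message
/// (4294967295, i.e. `u32::MAX`).
pub const SNAPPY_LIMIT: u64 = u32::MAX as u64;

/// Integer division rounding up, written so it cannot overflow.
fn ceil_div(a: u64, b: u64) -> u64 {
    a / b + u64::from(a % b != 0)
}

/// Estimates into how many parts a chunk should be split so that each
/// per-column buffer stays under `SNAPPY_LIMIT`.
///
/// Assumption (not confirmed by the library): `estimated_size` is the total
/// uncompressed size in bytes and is spread evenly over `n_columns`.
pub fn estimated_parts(estimated_size: u64, n_columns: u64) -> u64 {
    assert!(n_columns > 0, "n_columns must be positive");
    let per_column = ceil_div(estimated_size, n_columns);
    // Always write at least one part, even for tiny inputs.
    ceil_div(per_column, SNAPPY_LIMIT).max(1)
}
```

With the size from the error message treated as a single column, `estimated_parts(68397383472, 1)` gives 16. A safer variant would use the size of the largest column instead of the average.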

@jorgecarleitao added the bug and no-changelog labels on Apr 27, 2022