Unable to read parquet files with double nested float list #852
Comments
Sorry, @Igosuki; I am looking into this. It is related to reading Dremel-encoded data.
@jorgecarleitao Thank you, I was trying to debug it yesterday. Most of the arrow2 code works flawlessly and manages to read the file, until it decides that there are no more bytes to read even though rows remain, because the decoding didn't infer the proper offsets (or something like that).
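For readers unfamiliar with how offsets relate to Dremel repetition levels, here is a minimal sketch for a doubly nested list (maximum repetition level 2). It is not arrow2's implementation: the function name `offsets_from_rep_levels` is hypothetical, and the sketch assumes every list is non-empty and nothing is null, so definition levels can be ignored.

```rust
/// Hypothetical helper, not part of arrow2 or parquet2.
/// Derives (outer_offsets, inner_offsets) for a List<List<T>> column from
/// repetition levels, assuming no nulls and no empty lists:
///   0 = value starts a new row (new outer list),
///   1 = value starts a new inner list in the same row,
///   2 = value continues the current inner list.
fn offsets_from_rep_levels(rep: &[u32]) -> (Vec<i32>, Vec<i32>) {
    let mut outer = vec![0i32]; // offsets into the inner lists
    let mut inner = vec![0i32]; // offsets into the leaf values
    let mut n_inner = 0i32; // inner lists closed so far
    let mut n_values = 0i32; // leaf values seen so far

    for (i, &r) in rep.iter().enumerate() {
        if i > 0 && r < 2 {
            // A level below 2 closes the inner list that was open.
            inner.push(n_values);
            n_inner += 1;
        }
        if i > 0 && r == 0 {
            // Level 0 also closes the row that was open.
            outer.push(n_inner);
        }
        n_values += 1;
    }

    if !rep.is_empty() {
        // Close the last inner list and the last row.
        inner.push(n_values);
        n_inner += 1;
        outer.push(n_inner);
    }
    (outer, inner)
}

fn main() {
    // Rows: [[1.0, 2.0], [3.0]] and [[4.0]]
    let (outer, inner) = offsets_from_rep_levels(&[0, 2, 1, 0]);
    println!("outer = {:?}, inner = {:?}", outer, inner);
}
```

If the repetition levels are decoded incorrectly, the offsets derived this way no longer match the number of values in the page, which is consistent with the symptom described above.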
I think that the reason is that we assume RLE-encoded rep and def levels, but this file has repetition levels encoded using BIT_PACKED. I haven't had the time to support it because it is quite old, but it should be supported for compatibility reasons (we need a decoder in parquet2).
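To illustrate what such a decoder involves, here is a minimal sketch of decoding the deprecated BIT_PACKED layout as described in the parquet-format encoding notes: values are packed back to back with a fixed bit width, from the most significant bit to the least significant bit. This is not the parquet2 implementation; the function name `decode_bit_packed_msb` is hypothetical.

```rust
/// Hypothetical helper, not part of parquet2.
/// Decodes `count` values of `bit_width` bits each from the deprecated
/// BIT_PACKED layout, where bits are consumed MSB-first within each byte.
/// Returns None if the width is unsupported or the buffer is too short.
fn decode_bit_packed_msb(data: &[u8], bit_width: u32, count: usize) -> Option<Vec<u32>> {
    if bit_width == 0 {
        // A zero width means every level is 0.
        return Some(vec![0; count]);
    }
    if bit_width > 32 {
        return None;
    }
    let needed_bits = count.checked_mul(bit_width as usize)?;
    if needed_bits > data.len() * 8 {
        return None;
    }

    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize;
    for _ in 0..count {
        let mut value = 0u32;
        for _ in 0..bit_width {
            let byte = data[bit_pos / 8];
            // Read the bit at position `bit_pos`, counting from the MSB.
            let bit = (byte >> (7 - (bit_pos % 8))) & 1;
            value = (value << 1) | u32::from(bit);
            bit_pos += 1;
        }
        out.push(value);
    }
    Some(out)
}

fn main() {
    // The values 0 through 7 at bit width 3 pack into these three bytes.
    let data = [0b0000_0101u8, 0b0011_1001, 0b0111_0111];
    let levels = decode_bit_packed_msb(&data, 3, 8).expect("buffer holds 8 values");
    println!("{:?}", levels);
}
```

The RLE/bit-packing hybrid packs bits in the opposite order (least significant bit first), so a reader that assumes the hybrid encoding would misread levels stored this way.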
Yeah, unfortunately the Arrow parquet implementation is the only one that allows setting the encoding at such a level; the old https://github.com/apache/parquet-format implementation, which is used on many platforms, seems to be using the older encoding.
On Thu, Mar 3, 2022 at 11:13 PM Jorge Leitao wrote:
> I think that the reason is that we assume RLE-encoded rep and def levels, but this file has repetition levels encoded using BIT_PACKED (https://github.com/apache/parquet-format/blob/master/Encodings.md#bit-packed-deprecated-bit_packed--4).
> I haven't had the time to support it because it is quite old, but it should be supported for compatibility reasons (we need a decoder in parquet2).
I have a proposed fix here: #884. (It was not BIT_PACKED encoding; it was something else.) Checking it via
Very cool, thank you!
I produced a simple Parquet file using Spark and started monkey-patching arrow2 to try to get it to read the file, but it doesn't seem to work.
Any idea?
The file:
https://github.com/Igosuki/arrow2/blob/main/part-00000-b4749aa1-94e4-4ddb-bab2-954c4d3a290f.c000.snappy.parquet
The schema:
The patches: https://github.com/Igosuki/arrow2/tree/tests_nesting