This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Unable to read parquet files with double nested float list #852

Closed
Igosuki opened this issue Feb 18, 2022 · 6 comments · Fixed by #884
Labels: bug (Something isn't working), no-changelog (Issues whose changes are covered by a PR and thus should not be shown in the changelog)

Comments

@Igosuki
Contributor

Igosuki commented Feb 18, 2022

I produced a simple parquet file using Spark and started monkey-patching arrow2 to try to get it to read the file, but it doesn't seem to work.
Any ideas?

The file:
https://github.com/Igosuki/arrow2/blob/main/part-00000-b4749aa1-94e4-4ddb-bab2-954c4d3a290f.c000.snappy.parquet

The schema:

schema = Schema {
    fields: [
        Field { name: "pr", data_type: Utf8, is_nullable: true, metadata: {} },
        Field { name: "asks", data_type: List(Field { name: "element", data_type: List(Field { name: "element", data_type: Float64, is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} },
        Field { name: "bids", data_type: List(Field { name: "element", data_type: List(Field { name: "element", data_type: Float64, is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} }), is_nullable: true, metadata: {} },
        Field { name: "event_ms", data_type: Int64, is_nullable: true, metadata: {} },
    ],
    metadata: {
        "org.apache.spark.sql.parquet.row.metadata": "{\"type\":\"struct\",\"fields\":[{\"name\":\"pr\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"asks\",\"type\":{\"type\":\"array\",\"elementType\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":true},\"containsNull\":true},\"nullable\":true,\"metadata\":{}},{\"name\":\"bids\",\"type\":{\"type\":\"array\",\"elementType\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":true},\"containsNull\":true},\"nullable\":true,\"metadata\":{}},{\"name\":\"event_ms\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}",
        "org.apache.spark.version": "3.2.0",
    },
}

The patches: https://github.com/Igosuki/arrow2/tree/tests_nesting

@jorgecarleitao
Owner

Sorry, @Igosuki; I am looking into this. It is related to reading the Dremel (repetition/definition) levels.
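For context, Dremel stores a nested column such as asks: list<list<double>> as a flat array of values plus repetition levels: a repetition level r says at which nesting depth the current value continues the previous one. A minimal sketch of the idea (hypothetical helper, not arrow2's actual code; nulls and definition levels are ignored for simplicity):

```rust
/// Rebuild rows of type list<list<double>> from a flat value column and its
/// Dremel repetition levels. Toy sketch: assumes every slot holds a real
/// value, i.e. definition levels are ignored.
fn decode_rep_levels(values: &[f64], rep_levels: &[u8]) -> Vec<Vec<Vec<f64>>> {
    let mut rows: Vec<Vec<Vec<f64>>> = Vec::new();
    for (&v, &r) in values.iter().zip(rep_levels) {
        match r {
            // rep level 0: the value starts a brand-new row
            0 => rows.push(vec![vec![v]]),
            // rep level 1: the value starts a new inner list in the current row
            1 => rows.last_mut().unwrap().push(vec![v]),
            // rep level 2: the value extends the current inner list
            _ => rows.last_mut().unwrap().last_mut().unwrap().push(v),
        }
    }
    rows
}

fn main() {
    // Two rows: [[1.0, 2.0], [3.0]] and [[4.0, 5.0]]
    let rows = decode_rep_levels(&[1.0, 2.0, 3.0, 4.0, 5.0], &[0, 2, 1, 0, 2]);
    assert_eq!(
        rows,
        vec![vec![vec![1.0, 2.0], vec![3.0]], vec![vec![4.0, 5.0]]]
    );
    println!("{:?}", rows);
}
```

Getting these boundaries wrong is exactly the kind of failure that surfaces as "ran out of bytes" or misplaced offsets while reading.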

@Igosuki
Contributor Author

Igosuki commented Mar 3, 2022

@jorgecarleitao Thank you, I was trying to debug it yesterday. Most of the arrow2 code actually works flawlessly and manages to read the file, until it decides there are no more bytes to read despite there being more rows, because the decoding didn't infer the proper offsets (or something like that).

@jorgecarleitao jorgecarleitao self-assigned this Mar 3, 2022
@jorgecarleitao
Owner

I think the reason is that we assume RLE-encoded repetition and definition levels, but this file has repetition levels encoded as BIT_PACKED.

I haven't had the time to support it because it is quite old, but it should be supported for compatibility reasons (we need a decoder in parquet2).
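For reference, the deprecated BIT_PACKED encoding in the Parquet format spec packs each level value bit by bit starting from the most significant bit of each byte, unlike the LSB-first RLE/bit-packing hybrid used for levels today. A hypothetical decoder sketch (not parquet2's actual code):

```rust
/// Decode `count` level values of `bit_width` bits each, packed MSB-first
/// within each byte (the deprecated Parquet BIT_PACKED layout). Toy sketch.
fn decode_bit_packed(data: &[u8], bit_width: usize, count: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0;
    for _ in 0..count {
        let mut v: u32 = 0;
        for _ in 0..bit_width {
            let byte = data[bit_pos / 8];
            let bit = 7 - (bit_pos % 8); // most significant bit first
            v = (v << 1) | u32::from((byte >> bit) & 1);
            bit_pos += 1;
        }
        out.push(v);
    }
    out
}

fn main() {
    // One byte 0b00_01_10_11 holds the 2-bit levels 0, 1, 2, 3.
    assert_eq!(decode_bit_packed(&[0b0001_1011], 2, 4), vec![0, 1, 2, 3]);
}
```

The MSB-first bit order is the key difference: reading the same byte with the hybrid encoding's LSB-first rules would yield different level values.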

@Igosuki
Contributor Author

Igosuki commented Mar 3, 2022 via email

@jorgecarleitao
Owner

jorgecarleitao commented Mar 5, 2022

I have a proposed fix here: #884. (It turned out not to be BIT_PACKED encoding; it was something else.)

Checked it via:

cargo run --features io_parquet,io_parquet_compression --example parquet_read -- part-00000-b4749aa1-94e4-4ddb-bab2-954c4d3a290f.c000.snappy.parquet

@Igosuki
Contributor Author

Igosuki commented Mar 5, 2022

Very cool, thank you!

@jorgecarleitao added the bug and no-changelog labels on Mar 6, 2022