duckdb fails to read generated files #75

Closed
FauxFaux opened this issue Jan 7, 2022 · 5 comments
Labels
question Further information is requested

Comments

FauxFaux commented Jan 7, 2022

duckdb won't read files generated by this library that have OPTIONAL/ZSTD columns.

I think it's attempting to read the validity map as page data; it appears to try to decompress a vec![0xffu8; 3000] as if it were compressed page data.

I haven't had a go at reproducing this with a reasonably sized file, nor checked whether it's duckdb or parquet2 that deviates from the specification, as I have absolutely no idea what I'm doing!

I will attempt to fill this bug report with some vaguely useful information later, but I thought this was better than nothing.


That is, duckdb prints:

Error: ZSTD Decompression failure

...for any column in my file, all of which are Utf8Array<i32> with (optional) nulls.

FauxFaux commented Jan 7, 2022

Okay, it reproduces with the arrow2 example if you turn compression on:

--- a/examples/parquet_write_record.rs
+++ b/examples/parquet_write_record.rs
@@ -14,7 +14,7 @@ use arrow2::{
 fn write_batch(path: &str, schema: Schema, columns: Chunk<Arc<dyn Array>>) -> Result<()> {
     let options = WriteOptions {
         write_statistics: true,
-        compression: Compression::Uncompressed,
+        compression: Compression::Zstd,
         version: Version::V2,
     };

cargo run --features io_parquet,io_parquet_compression --example parquet_write_record
% duckdb                                                                               
v0.3.1 88aa81c6b
D select * from parquet_scan(['test.parquet']);
Error: ZSTD Decompression failure

pip's parquet-tools doesn't seem to have the same problem, so maybe duckdb is in the wrong here?

FauxFaux commented Jan 7, 2022

duckdb can round-trip parquet files with the same shape.

D create table foo (c1 bigint null);
D insert into foo select * from generate_series(1,1000);
D copy foo to 'oot.parquet' (format 'parquet', codec 'zstd');

...and arrow2 can read this generated file.

I notice arrow2's file has the validity map... raw? Unencoded? As a list of 1 bits. The spec says it should be RLE:

Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.

Are we violating the spec, or is duckdb being silly? I can't see where in their code they read or apply the validity map; https://github.com/duckdb/duckdb/blob/b701ecd1dd2540f6680c8996a001726b279eeb28/extension/parquet/column_reader.cpp#L408 runs after decompression?

jorgecarleitao (Owner) commented

So, first of all, thank you so much for using parquet2/arrow2, for this issue, and for the detailed analysis here.

The RLE encoding has been deprecated for some time; I think we should be looking at the RLE/bit-packed hybrid, which mixes RLE runs with bit-packed runs and is how def and rep levels are encoded in most implementations. Our RLE/bit-packed hybrid encoder only writes bit-packed runs at the moment, which is why we write 1s in the def levels of a nullable array even when the array itself has no validity (the bit-packed representation is all ones apart from the run header at the beginning).
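To make the two encodings concrete, here is a small standalone sketch in plain Python (not arrow2's or duckdb's code, just an illustration of the format) of the RLE/bit-packed hybrid for definition levels with bit width 1. It shows the 3-byte RLE run the spec quote above describes and the bit-packed-only form (a run header followed by 0xff bytes) that a bit-packed-only writer emits for the same levels:

def uleb128(n):
    # Run headers in the hybrid encoding are ULEB128 varints.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def rle_run(value, count):
    # RLE run: header = count << 1 (LSB 0), then the repeated value,
    # stored in ceil(bit_width / 8) = 1 byte for bit width 1.
    return uleb128(count << 1) + bytes([value])

def bitpacked_ones(count):
    # Bit-packed run: header = (groups << 1) | 1 (LSB 1), where each group
    # packs 8 values; with bit width 1 and every level equal to 1, each
    # group is a 0xff byte.
    groups = count // 8  # assumes count is a multiple of 8
    return uleb128((groups << 1) | 1) + b"\xff" * groups

print(rle_run(1, 1000).hex())     # d00f01: 1000 def levels of 1 in 3 bytes
print(rle_run(0, 1000).hex())     # d00f00: the spec's 1000-NULLs example
print(len(bitpacked_ones(1000)))  # 127: header + 125 bytes of 0xff, also valid

Both forms decode to the same 1000 levels, so a reader that handles the hybrid encoding should accept either.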

As a follow-up to this, I filed jorgecarleitao/arrow2#740, where we write compressed parquet files with arrow2 and read them from pyarrow and (py)spark (pyspark does not ship zstd codecs). The PR checks that

  • pyarrow reads arrow2-written zstd-compressed parquet
  • both pyarrow and pyspark read arrow2-written snappy-compressed parquet

Based on this evidence, my current hypothesis is that something is going on with duckdb's reader: it does not accept the "zstd format" written by arrow2 even though pyarrow accepts it.

Could you check if pyarrow can read the file that you generated with arrow2?

Maybe an idea is to post the parquet file there, together with a pyarrow snippet that reads it, and see if they can understand why duckdb is not able to read it?
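For reference, such a check can be a couple of lines of pyarrow (assuming the file is the test.parquet written by the modified example above):

import pyarrow.parquet as pq

# Read the arrow2-written, zstd-compressed file back with pyarrow.
table = pq.read_table("test.parquet")
print(table.schema)
print(table.num_rows)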


FauxFaux commented Jan 7, 2022

Yup, I can see both pyarrow and arrow2 read both files.

Here are the files as generated by arrow2 and duckdb, with the same data and schema, and snappy compression. I'll go raise a bug with duckdb, having again completely failed to make sense of their validity handling code.

pq.zip

jorgecarleitao (Owner) commented

Closing as not an issue.

jorgecarleitao added the "question" label on Apr 15, 2022