duckdb fails to read generated files #75

Closed
FauxFaux opened this issue Jan 7, 2022 · 5 comments
Labels
question Further information is requested

Comments

FauxFaux commented Jan 7, 2022

duckdb won't read files generated by this library that have OPTIONAL/ZSTD columns.

I think it's attempting to read the validity map as page data; it appears to try to decompress a vec![0xffu8; 3000] as if it were compressed page data.

I haven't had a go at reproducing this with a reasonably sized file, nor checked whether it's duckdb or parquet2 that deviates from the specification, as I have absolutely no idea what I'm doing!

I will attempt to fill this bug report with some vaguely useful information later, but I thought this was better than nothing.


That is, duckdb prints:

Error: ZSTD Decompression failure

...for any column in my file, all of which are Utf8Array<i32> with (optional) nulls.

FauxFaux commented Jan 7, 2022

Okay, it reproduces with the arrow2 example if you turn compression on:

--- a/examples/parquet_write_record.rs
+++ b/examples/parquet_write_record.rs
@@ -14,7 +14,7 @@ use arrow2::{
 fn write_batch(path: &str, schema: Schema, columns: Chunk<Arc<dyn Array>>) -> Result<()> {
     let options = WriteOptions {
         write_statistics: true,
-        compression: Compression::Uncompressed,
+        compression: Compression::Zstd,
         version: Version::V2,
     };

cargo run --features io_parquet,io_parquet_compression --example parquet_write_record
% duckdb                                                                               
v0.3.1 88aa81c6b
D select * from parquet_scan(['test.parquet']);
Error: ZSTD Decompression failure

pip's parquet-tools doesn't seem to have the same problem, so maybe duckdb is in the wrong here?

FauxFaux commented Jan 7, 2022

duckdb can round-trip parquet files with the same shape.

D create table foo (c1 bigint null);
D insert into foo select * from generate_series(1,1000);
D copy foo to 'oot.parquet' (format 'parquet', codec 'zstd');

...and arrow2 can read this generated file.

I notice arrow2's file has the validity map... raw? Unencoded? As a list of 1 bits. The spec says it should be RLE:

Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.

Are we violating the spec, or is duckdb being silly? I can't see where in their code they read or apply the validity map; https://github.com/duckdb/duckdb/blob/b701ecd1dd2540f6680c8996a001726b279eeb28/extension/parquet/column_reader.cpp#L408 runs after decompression?

jorgecarleitao (Owner) commented

So, first of all, thank you so much for using parquet2/arrow2, for this issue, and for the detailed analysis here.

The RLE encoding has been deprecated for some time; I think we should be looking at the RLE/bit-packed hybrid, which mixes RLE runs with bit-packed runs and is how def and rep levels are encoded in most implementations. Our RLE/bit-packed hybrid encoder only writes bit-packed runs at the moment, which is why we write 1s in the def levels of a nullable array even when the array itself has no validity (the bit-packed representation is all ones apart from the run header at the beginning).
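To make the two encodings concrete, here is a small standalone sketch in plain Python (not arrow2's or duckdb's code, just an illustration of the format) of the RLE/bit-packed hybrid for definition levels with bit width 1. It shows the 3-byte RLE run the spec quote above describes and the bit-packed-only form (a run header followed by 0xff bytes) that a bit-packed-only writer emits for the same levels:

def uleb128(n):
    # Run headers in the hybrid encoding are ULEB128 varints.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def rle_run(value, count):
    # RLE run: header = count << 1 (LSB 0), then the repeated value,
    # stored in ceil(bit_width / 8) = 1 byte for bit width 1.
    return uleb128(count << 1) + bytes([value])

def bitpacked_ones(count):
    # Bit-packed run: header = (groups << 1) | 1 (LSB 1), where each group
    # packs 8 values; with bit width 1 and every level equal to 1, each
    # group is a 0xff byte.
    groups = count // 8  # assumes count is a multiple of 8
    return uleb128((groups << 1) | 1) + b"\xff" * groups

print(rle_run(1, 1000).hex())     # d00f01: 1000 def levels of 1 in 3 bytes
print(rle_run(0, 1000).hex())     # d00f00: the spec's 1000-NULLs example
print(len(bitpacked_ones(1000)))  # 127: header + 125 bytes of 0xff, also valid

Both forms decode to the same 1000 levels, so a reader that handles the hybrid encoding should accept either.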

As a follow-up to this, I filed jorgecarleitao/arrow2#740, where we write compressed parquet files with arrow2 and read them from pyarrow and (py)spark (pyspark does not ship zstd codecs). The PR checks that

  • pyarrow reads arrow2-written zstd-compressed parquet
  • both pyarrow and pyspark read arrow2-written snappy-compressed parquet

Based on this evidence, my current hypothesis is that something is going on with duckdb's reader: it does not accept the "zstd format" written by arrow2 even though pyarrow accepts it.

Could you check if pyarrow can read the file that you generated with arrow2?

Maybe an idea is to post the parquet file there, together with a pyarrow snippet that reads it, and see if they can understand why duckdb is not able to read it?
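For reference, such a check can be a couple of lines of pyarrow (assuming the file is the test.parquet written by the modified example above):

import pyarrow.parquet as pq

# Read the arrow2-written, zstd-compressed file back with pyarrow.
table = pq.read_table("test.parquet")
print(table.schema)
print(table.num_rows)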


FauxFaux commented Jan 7, 2022

Yup, I can see both pyarrow and arrow2 read both files.

Here are the files as generated by arrow2 and duckdb, with the same data and schema, and snappy compression. I'll go raise a bug with duckdb, having again completely failed to make sense of their validity handling code.

pq.zip

jorgecarleitao (Owner) commented

Closing as not an issue.

jorgecarleitao added the "question" label on Apr 15, 2022