duckdb fails to read generated files #75
Okay, it does it with the arrow2 example, if you turn compression on:

```diff
--- a/examples/parquet_write_record.rs
+++ b/examples/parquet_write_record.rs
@@ -14,7 +14,7 @@ use arrow2::{
 fn write_batch(path: &str, schema: Schema, columns: Chunk<Arc<dyn Array>>) -> Result<()> {
     let options = WriteOptions {
         write_statistics: true,
-        compression: Compression::Uncompressed,
+        compression: Compression::Zstd,
         version: Version::V2,
     };
```
duckdb can round-trip parquet files with the same shape.
...and arrow2 can read this generated file. I notice arrow2's file has the validity map stored .. raw? Unencoded? As a plain list of bytes? Are we violating the spec, or is duckdb?
So, first of all, thank you so much for using parquet2/arrow2, for this issue, and for the detailed analysis here.

The RLE encoding has been deprecated for some time; I think we should be looking into the RLE/bit-packed hybrid, which mixes RLE runs with bit-packed runs and is what def and rep levels are encoded with in most implementations. Our RLE/bit-packed hybrid encoder only writes bit-packed runs at the moment, which is why we write 1s in the def levels for a nullable array even when the array itself carries no validity (the bit-packed representation is all ones apart from the run header at the beginning; see the sketch after this comment).

As a follow-up, I filed jorgecarleitao/arrow2#740, where we write compressed parquet files with arrow2 and read them back from pyarrow and (py)spark (pyspark does not ship zstd codecs). The PR checks that those readers can read the files arrow2 generates.
Based on this evidence, my current hypothesis is that there may be something going on with duckdb's reader: it is not accepting the "zstd format" written by arrow2, even though pyarrow accepts it. Could you check whether pyarrow can read the file that you generated with arrow2? Maybe an idea is to post the parquet file on duckdb's tracker together with a pyarrow snippet that reads it, and see if they can understand why duckdb is not able to read it?
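To make the level-encoding discussion above concrete, here is a minimal sketch of the two run types in the Parquet RLE/bit-packed hybrid, for definition levels with bit width 1 over an all-valid column. The function names are illustrative rather than parquet2's actual API, and the headers assume the varint fits in a single byte.

```rust
// Sketch of the two run types in the Parquet RLE/bit-packed hybrid for
// definition levels with bit width 1 (nullable column, max def level 1).
// Function names are illustrative; this is not parquet2's real encoder.

/// All-valid levels as a single bit-packed run, which is what the comment
/// above describes parquet2 writing: the payload is all 1-bits (0xff bytes).
fn bitpacked_all_ones(num_values: usize) -> Vec<u8> {
    let num_groups = (num_values + 7) / 8; // bit-packed runs count groups of 8 values
    // Header: varint of (num_groups << 1 | 1); the low bit set marks a
    // bit-packed run. Assumes num_groups fits in one varint byte (< 64).
    let header = ((num_groups as u8) << 1) | 1;
    let mut out = vec![header];
    out.extend(std::iter::repeat(0xffu8).take(num_groups)); // one byte per 8 levels
    out
}

/// The same levels as a single RLE run: a header with the low bit clear and
/// the run length in the upper bits, then the repeated value (one byte here).
fn rle_all_ones(num_values: usize) -> Vec<u8> {
    let header = (num_values as u8) << 1; // assumes the run length fits in 7 bits
    vec![header, 0x01]
}

fn main() {
    // 24 all-valid values: 3 groups of 8, so a 1-byte header plus 3 bytes of 0xff.
    assert_eq!(bitpacked_all_ones(24), vec![(3 << 1) | 1, 0xff, 0xff, 0xff]);
    // The RLE form carries the same information in just 2 bytes.
    assert_eq!(rle_all_ones(24), vec![24 << 1, 0x01]);
}
```

Both run types are legal under the spec's hybrid encoding, so a conforming reader has to accept either; the question in this thread is whether the writer or the reader deviates.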
Yup, I can see both pyarrow and arrow2 read both files. Here are the files as generated by arrow2 and duckdb, with the same data and schema, and snappy compression. I'll go raise a bug with duckdb, having again completely failed to make sense of their code around validity.
Closing as not an issue.
`duckdb` won't read files generated by this library with OPTIONAL/ZSTD columns. I think it's attempting to read the validity map as page data; it appears to try to decompress an array of `vec![0xffu8; 3000]` as if it were compressed page data.

I haven't had a go at reproducing this with a reasonably sized file, nor checked whether it's `duckdb` or `parquet2` that deviates from the specification, as I have absolutely no idea what I'm doing! I will attempt to fill this bug report with some vaguely useful information later, but I thought this was better than nothing.

That is, `duckdb` prints `Error: ZSTD Decompression failure` for any column in my file, all of which are `Utf8Array<i32>` with (optional) nulls.
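A minimal sketch (assuming the `zstd` crate) of the failure mode hypothesized above: a raw all-ones validity buffer is not a ZSTD frame, since frames begin with the magic number 0xFD2FB528, so handing it to a decoder fails immediately with an error like the one duckdb prints.

```rust
// Minimal sketch: feeding a raw all-ones validity buffer straight to a
// ZSTD decoder fails, because the buffer lacks the ZSTD frame magic number.
// Assumes the `zstd` crate; this illustrates the hypothesis, not duckdb's code.

fn main() {
    let not_a_frame = vec![0xffu8; 3000]; // the buffer described in the report
    match zstd::decode_all(&not_a_frame[..]) {
        Ok(_) => unreachable!("an all-0xff buffer is not a valid ZSTD frame"),
        Err(e) => println!("decompression failure, as duckdb reports: {e}"),
    }
}
```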