Writing structs nested in lists produces an incorrect output #1184

Closed
helgikrs opened this issue Jan 15, 2022 · 0 comments · Fixed by #1185
Labels
bug parquet Changes to the parquet crate

Describe the bug
Writing an Arrow record batch that contains structs nested within lists with the parquet writer produces a parquet file with incorrect values when null or empty lists are present.

To Reproduce
The following program produces a parquet file out.parquet.

use std::sync::Arc;

use arrow::array::{Int32Builder, ListBuilder, StructBuilder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    // define schema
    let int_field = Field::new("a", DataType::Int32, true);
    let item_field = Field::new("item", DataType::Struct(vec![int_field.clone()]), true);
    let list_field = Field::new("list", DataType::List(Box::new(item_field)), true);

    let int_builder = Int32Builder::new(10);
    let struct_builder = StructBuilder::new(vec![int_field], vec![Box::new(int_builder)]);
    let mut list_builder = ListBuilder::new(struct_builder);

    // [{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]
    //
    // [{a: 1}]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_value(1)
        .unwrap();
    values.append(true).unwrap();
    list_builder.append(true).unwrap();

    // []
    list_builder.append(true).unwrap();

    // null
    list_builder.append(false).unwrap();

    // [null, null]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_null()
        .unwrap();
    values.append(false).unwrap();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_null()
        .unwrap();
    values.append(false).unwrap();
    list_builder.append(true).unwrap();

    // [{a: null}]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_null()
        .unwrap();
    values.append(true).unwrap();
    list_builder.append(true).unwrap();

    // [{a: 2}]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_value(2)
        .unwrap();
    values.append(true).unwrap();
    list_builder.append(true).unwrap();

    let array = Arc::new(list_builder.finish());

    let schema = Arc::new(Schema::new(vec![list_field]));

    let rb = RecordBatch::try_new(schema, vec![array]).unwrap();

    let out = std::fs::File::create("out.parquet").unwrap();
    let mut writer = parquet::arrow::ArrowWriter::try_new(out, rb.schema(), None).unwrap();
    writer.write(&rb).unwrap();
    writer.close().unwrap();
}

Running parquet-dump on out.parquet produces the following output:

value 1: R:0 D:4 V:1
value 2: R:0 D:1 V:<null>
value 3: R:0 D:0 V:<null>
value 4: R:0 D:2 V:<null>
value 5: R:1 D:2 V:<null>
value 6: R:0 D:3 V:<null>
value 7: R:0 D:4 V:0

Expected behavior
The last value (value 7) should be 2, not 0:

value 1: R:0 D:4 V:1
value 2: R:0 D:1 V:<null>
value 3: R:0 D:0 V:<null>
value 4: R:0 D:2 V:<null>
value 5: R:1 D:2 V:<null>
value 6: R:0 D:3 V:<null>
value 7: R:0 D:4 V:2
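
To make the dump easier to compare against the comment in the reproduction program, here is a minimal standalone sketch that turns (R, D, V) entries back into rows. It does not use the parquet crate. The meaning assigned to each definition level (0 = list is null, 1 = list is empty, 2 = struct element is null, 3 = field a is null, 4 = value present) is an assumption inferred from the schema and the dump above, and `render_rows` is a hypothetical helper name.

```rust
/// One entry of the dump: (repetition level, definition level, value).
type Entry = (i16, i16, Option<i32>);

/// Rebuilds the rows of a nullable list<nullable struct<a: nullable int32>>
/// column and renders each row as text.
/// Assumed level meanings: D0 = null list, D1 = empty list,
/// D2 = null struct element, D3 = null field `a`, D4 = value present.
fn render_rows(entries: &[Entry]) -> Vec<String> {
    let mut rows: Vec<String> = Vec::new();
    let mut items: Vec<String> = Vec::new(); // elements of the list being built
    let mut open = false; // true while a non-empty list is being collected

    for &(rep, def, value) in entries {
        if rep == 0 {
            // A repetition level of 0 starts a new row: flush the previous list.
            if open {
                rows.push(format!("[{}]", items.join(", ")));
                items.clear();
                open = false;
            }
            match def {
                0 => {
                    rows.push("null".to_string());
                    continue;
                }
                1 => {
                    rows.push("[]".to_string());
                    continue;
                }
                _ => open = true,
            }
        }
        // Every remaining entry is one element of the current list.
        items.push(match (def, value) {
            (2, _) => "null".to_string(),
            (3, _) => "{a: null}".to_string(),
            (_, Some(v)) => format!("{{a: {}}}", v),
            (_, None) => "{a: ?}".to_string(), // malformed input for this sketch
        });
    }
    if open {
        rows.push(format!("[{}]", items.join(", ")));
    }
    rows
}
```

Fed the seven expected entries, this yields the rows `[{a: 1}]`, `[]`, `null`, `[null, null]`, `[{a: null}]`, `[{a: 2}]`, which matches the comment in the program. Fed the actual dump, only the last row differs: `[{a: 0}]`. So the levels are written correctly and only the value selection is off.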

Additional context
The filter_array_indices function in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L760 produces incorrect indices when the immediate parent of a field is not a list. The writer calls it at https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer.rs#L244 and then uses those indices to select the values to write at https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer.rs#L284, which causes the incorrect output described above.
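
To illustrate how wrong indices lead to a 0 being written, here is a simplified standalone model of index filtering over definition levels. It is not the crate's code. The names `leaf_indices` and `gather`, and the `slot_level` parameter, are hypothetical. It also assumes that the leaf Int32 buffer for this reproduction is `[1, 0, 0, 0, 2]`, i.e. that the three null slots hold 0.

```rust
/// Returns the positions in the leaf values array that hold non-null values.
///
/// `def_levels`: one definition level per entry written to the column.
/// `max_def`:    level at which the leaf value itself is present.
/// `slot_level`: lowest level at which an entry still occupies a slot in the
///               leaf array (hypothetical parameter for this sketch).
fn leaf_indices(def_levels: &[i16], max_def: i16, slot_level: i16) -> Vec<usize> {
    let mut indices = Vec::new();
    let mut slot = 0usize; // next position in the leaf array
    for &def in def_levels {
        if def == max_def {
            // Non-null leaf value: remember where it lives.
            indices.push(slot);
        }
        if def >= slot_level {
            // This entry consumed one slot of the leaf array.
            slot += 1;
        }
    }
    indices
}

/// Picks the values at `indices` out of the leaf buffer.
fn gather(buffer: &[i32], indices: &[usize]) -> Vec<i32> {
    indices.iter().map(|&i| buffer[i]).collect()
}
```

With the definition levels from the dump, `[4, 1, 0, 2, 2, 3, 4]`, a `slot_level` of 2 (null struct elements inside a list still occupy a leaf slot) gives indices `[0, 4]` and values `[1, 2]`. A `slot_level` of 3 (only the immediate parent level is counted) gives indices `[0, 2]` and values `[1, 0]`, which is the output reported above. Whether the real function miscounts in exactly this way is an assumption, but the model is consistent with the observed result.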
