Sliced ListArrays written to parquet include more data than necessary #1251

tjwilson90 · 2022-09-14T19:54:21Z

If a ListArray is sliced and the slice is written to parquet, the entire values array backing the list will be written out, not just the portion of the list in the slice.

Sliced primitive arrays do not exhibit this behavior.

Repro (on 0.14.0 and current head of main (commit 96df5e1))

use arrow2::array::{MutableListArray, MutableUtf8Array, TryPush};
use arrow2::chunk::Chunk;
use arrow2::datatypes::{DataType, Field, Schema};
use arrow2::io::parquet::write::{
    transverse, CompressionOptions, Encoding, FileWriter, RowGroupIterator, Version, WriteOptions,
};
use std::fs::File;

#[test]
fn write_list_slice_to_parquet() {
    let mut array = MutableListArray::<i32, MutableUtf8Array<i32>>::new();
    array
        .try_push(Some([Some("include me"), None, Some("include me too")]))
        .unwrap();
    array
        .try_push(Some([Some("skip this"), Some("should not be written")]))
        .unwrap();
    let array = array.into_box();
    let array = array.slice(0, 1);
    let schema = Schema::from(vec![Field::new(
        "lists",
        DataType::List(Box::new(Field::new("strings", DataType::Utf8, true))),
        false,
    )]);
    let chunk = Chunk::new(vec![array]);

    let options = WriteOptions {
        write_statistics: true,
        compression: CompressionOptions::Uncompressed,
        version: Version::V2,
    };

    let iter = vec![Ok(chunk)];

    let encodings = schema
        .fields
        .iter()
        .map(|f| transverse(&f.data_type, |_| Encoding::Plain))
        .collect();

    let row_groups =
        RowGroupIterator::try_new(iter.into_iter(), &schema, options, encodings).unwrap();

    let mut file = File::create("sliced-list.parquet").unwrap();
    let mut writer = FileWriter::try_new(&mut file, schema, options).unwrap();
    for group in row_groups {
        writer.write(group.unwrap()).unwrap();
    }
    writer.end(None).unwrap();
}

will create a file with this contents

$ strings sliced-list.parquet 
PAR1
skip this
include me
include me
include me too
skip this
should not be written
lists
list
strings
skip this
include me
include me
skip this
root
lists
list
strings%
lists
list
strings
skip this
include me

ARROW:schema
/////64AAAAEAAAA8v///xQAAAAEAAEAAAAKAAsACAAKAAQA+P///wwAAAAIAAgAAAAEAAEAAAAEAAAA7P///2wAAABgAAAAGAAAAAwAAAAQABEABAAAABAACAAAAAwAAQAAAAQAAADs////LAAAACAAAAAYAAAAAQUAABAAEgAEABAAEQAIAAAADAAAAAAA/P///wQABAAHAAAAc3RyaW5ncwD8////BAAEAAUAAABsaXN0cwA=
,Arrow2 - Native Rust implementation of Arrow

The text was updated successfully, but these errors were encountered:

freeformstu · 2022-12-05T21:42:24Z

I had the same issue which I discovered because I kept running out of RAM.

The issue is surfaced here:
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/write/pages.rs#L170

But the actual problem is in the slice_unchecked() function or the values() function for ListArray.

slice_unchecked() just clones in the values when slicing. I think this makes sense because the offsets must be indices into this array. But that means that the values() function should be returning only the values contained within the slice.

Some alternative solutions:

Slice the values and adjust all of the offsets accordingly.
Provide and use a different method for accessing the values within the slice.

Here's a simple reproduction of the issue:

    let rows = 10;
    let items_per_row = 20;
    let mut data: Vec<u8> = vec![0_u8; rows * items_per_row];
    let values = PrimitiveArray::<u8>::from_data(DataType::UInt8, Buffer::from(data), None);

    let list_array = ListArray::<i64>::from_data(
        ListArray::<i64>::default_datatype(DataType::UInt8),
        Buffer::from(
            (0..=rows)
                .into_iter()
                .map(|x| (x * items_per_row) as i64)
                .collect::<Vec<_>>(),
        ),
        Box::new(values),
        None,
    )
    .boxed();

    let sub_array = list_array.slice(1, 3);

    let list_array = list_array.as_any().downcast_ref::<ListArray<i64>>().unwrap();
    println!("list_array.len(): {}", list_array.len());
    println!("list_array.values().len(): {}", list_array.values().len());
    let sub_array = sub_array.as_any().downcast_ref::<ListArray<i64>>().unwrap();
    println!("sub_array.len(): {}", sub_array.len());
    println!("sub_array.values().len(): {}", sub_array.values().len());

list_array.len(): 10
list_array.values().len(): 200
sub_array.len(): 3
sub_array.values().len(): 200

tjwilson90 · 2023-01-12T04:38:38Z

This appears to have been fixed by #1326

tjwilson90 closed this as completed Jan 12, 2023

jorgecarleitao added bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog labels Feb 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sliced ListArrays written to parquet include more data than necessary #1251

Sliced ListArrays written to parquet include more data than necessary #1251

tjwilson90 commented Sep 14, 2022

freeformstu commented Dec 5, 2022

tjwilson90 commented Jan 12, 2023

Sliced ListArrays written to parquet include more data than necessary #1251

Sliced ListArrays written to parquet include more data than necessary #1251

Comments

tjwilson90 commented Sep 14, 2022

freeformstu commented Dec 5, 2022

tjwilson90 commented Jan 12, 2023