Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Sliced ListArrays written to parquet include more data than necessary #1251

Closed
tjwilson90 opened this issue Sep 14, 2022 · 2 comments
Closed
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@tjwilson90
Copy link

If a ListArray is sliced and the slice is written to parquet, the entire values array backing the list will be written out, not just the portion of the list in the slice.

Sliced primitive arrays do not exhibit this behavior.

Repro (on 0.14.0 and current head of main (commit 96df5e1))

use arrow2::array::{MutableListArray, MutableUtf8Array, TryPush};
use arrow2::chunk::Chunk;
use arrow2::datatypes::{DataType, Field, Schema};
use arrow2::io::parquet::write::{
    transverse, CompressionOptions, Encoding, FileWriter, RowGroupIterator, Version, WriteOptions,
};
use std::fs::File;

#[test]
fn write_list_slice_to_parquet() {
    let mut array = MutableListArray::<i32, MutableUtf8Array<i32>>::new();
    array
        .try_push(Some([Some("include me"), None, Some("include me too")]))
        .unwrap();
    array
        .try_push(Some([Some("skip this"), Some("should not be written")]))
        .unwrap();
    let array = array.into_box();
    let array = array.slice(0, 1);
    let schema = Schema::from(vec![Field::new(
        "lists",
        DataType::List(Box::new(Field::new("strings", DataType::Utf8, true))),
        false,
    )]);
    let chunk = Chunk::new(vec![array]);

    let options = WriteOptions {
        write_statistics: true,
        compression: CompressionOptions::Uncompressed,
        version: Version::V2,
    };

    let iter = vec![Ok(chunk)];

    let encodings = schema
        .fields
        .iter()
        .map(|f| transverse(&f.data_type, |_| Encoding::Plain))
        .collect();

    let row_groups =
        RowGroupIterator::try_new(iter.into_iter(), &schema, options, encodings).unwrap();

    let mut file = File::create("sliced-list.parquet").unwrap();
    let mut writer = FileWriter::try_new(&mut file, schema, options).unwrap();
    for group in row_groups {
        writer.write(group.unwrap()).unwrap();
    }
    writer.end(None).unwrap();
}

will create a file with this contents

$ strings sliced-list.parquet 
PAR1
skip this
include me
include me
include me too
skip this
should not be written
lists
list
strings
skip this
include me
include me
skip this
root
lists
list
strings%
lists
list
strings
skip this
include me

ARROW:schema
/////64AAAAEAAAA8v///xQAAAAEAAEAAAAKAAsACAAKAAQA+P///wwAAAAIAAgAAAAEAAEAAAAEAAAA7P///2wAAABgAAAAGAAAAAwAAAAQABEABAAAABAACAAAAAwAAQAAAAQAAADs////LAAAACAAAAAYAAAAAQUAABAAEgAEABAAEQAIAAAADAAAAAAA/P///wQABAAHAAAAc3RyaW5ncwD8////BAAEAAUAAABsaXN0cwA=
,Arrow2 - Native Rust implementation of Arrow
@freeformstu
Copy link

I had the same issue which I discovered because I kept running out of RAM.

The issue is surfaced here:
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/write/pages.rs#L170

But the actual problem is in the slice_unchecked() function or the values() function for ListArray.

slice_unchecked() just clones in the values when slicing. I think this makes sense because the offsets must be indices into this array. But that means that the values() function should be returning only the values contained within the slice.

Some alternative solutions:

  • Slice the values and adjust all of the offsets accordingly.
  • Provide and use a different method for accessing the values within the slice.

Here's a simple reproduction of the issue:

    let rows = 10;
    let items_per_row = 20;
    let mut data: Vec<u8> = vec![0_u8; rows * items_per_row];
    let values = PrimitiveArray::<u8>::from_data(DataType::UInt8, Buffer::from(data), None);

    let list_array = ListArray::<i64>::from_data(
        ListArray::<i64>::default_datatype(DataType::UInt8),
        Buffer::from(
            (0..=rows)
                .into_iter()
                .map(|x| (x * items_per_row) as i64)
                .collect::<Vec<_>>(),
        ),
        Box::new(values),
        None,
    )
    .boxed();

    let sub_array = list_array.slice(1, 3);

    let list_array = list_array.as_any().downcast_ref::<ListArray<i64>>().unwrap();
    println!("list_array.len(): {}", list_array.len());
    println!("list_array.values().len(): {}", list_array.values().len());
    let sub_array = sub_array.as_any().downcast_ref::<ListArray<i64>>().unwrap();
    println!("sub_array.len(): {}", sub_array.len());
    println!("sub_array.values().len(): {}", sub_array.values().len());
list_array.len(): 10
list_array.values().len(): 200
sub_array.len(): 3
sub_array.values().len(): 200

@tjwilson90
Copy link
Author

This appears to have been fixed by #1326

@jorgecarleitao jorgecarleitao added bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog labels Feb 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants