Arrow Flight Encodes a Slice's List Offsets If the slice offset starts with zero #6803

Closed
HawaiianSpork opened this issue Nov 26, 2024 · 1 comment · Fixed by #6805
Labels
arrow Changes to the arrow crate bug

Comments

@HawaiianSpork
Contributor

Describe the bug
When Arrow Flight encodes a record batch slice whose first list offset is zero, it reuses the offsets of the unsliced data. This not only causes offset arrays larger than the slice to be encoded, it also makes the encoded offsets incorrect, because they can still include rows that lie outside the slice.
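To make the failure condition concrete, here is a minimal sketch using plain Rust slices rather than Arrow buffers. The helper `offset_window` and the sample offsets are hypothetical and exist only for illustration; they are not part of the arrow crates. The sketch shows that a slice can start at a non-zero row and still see a first offset of zero when the preceding lists are empty.

```rust
/// Hypothetical helper: returns the offsets covering `len` list rows
/// starting at row `row_offset`. A list array with n rows has n + 1 offsets.
fn offset_window(offsets: &[i32], row_offset: usize, len: usize) -> &[i32] {
    &offsets[row_offset..row_offset + len + 1]
}

fn main() {
    // Four list rows: two empty lists, then two lists of four values each.
    let offsets = [0, 0, 0, 4, 8];

    // A slice covering rows 1 and 2 (one empty list, one list of four values).
    let window = offset_window(&offsets, 1, 2);
    println!("{:?}", window); // [0, 0, 4]

    // The first offset of the window is zero even though the slice does not
    // start at row 0, so "first offset is zero" does not mean "not sliced".
    assert_eq!(window[0], 0);

    // Reusing the unsliced offsets would encode 5 entries instead of 3, and
    // the first 3 of those are [0, 0, 0], which describes two empty lists.
    assert_ne!(offsets.len(), window.len());
}
```

Under these assumptions, reading the first three unsliced offsets in place of the window yields only empty lists, which is consistent with the symptom reported in this issue.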

To Reproduce

    fn generate_nested_list_data_starting_at_zero<O: OffsetSizeTrait>() -> GenericListArray<O> {
        let mut ls =
            GenericListBuilder::<O, _>::new(GenericListBuilder::<O, _>::new(UInt32Builder::new()));

        // 999 outer rows, each holding a single empty inner list, so the
        // inner list offsets stay at zero well past the start of the array.
        for _i in 0..999 {
            ls.values().append(true);
            ls.append(true);
        }

        // One outer row holding 10 inner lists of 4 values each.
        for j in 0..10 {
            for value in [j, j, j, j] {
                ls.values().values().append_value(value);
            }
            ls.values().append(true);
        }
        ls.append(true);

        // 9,000 more outer rows, each holding 10 inner lists of 4 values.
        for i in 0..9_000 {
            for j in 0..10 {
                for value in [i + j, i + j, i + j, i + j] {
                    ls.values().values().append_value(value);
                }
                ls.values().append(true);
            }
            ls.append(true);
        }

        ls.finish()
    }

    #[test]
    fn encode_nested_lists_starting_at_zero() {
        let inner_int = Arc::new(Field::new("item", DataType::UInt32, true));
        let inner_list_field = Arc::new(Field::new("item", DataType::List(inner_int), true));
        let list_field = Field::new("val", DataType::List(inner_list_field), true);
        let schema = Arc::new(Schema::new(vec![list_field]));

        let values = Arc::new(generate_nested_list_data_starting_at_zero::<i32>());

        let in_batch = RecordBatch::try_new(schema, vec![values]).unwrap();
        roundtrip_ensure_sliced_smaller(in_batch, 1);
    }

This results in an error in which all lists are empty.

Expected behavior
No error is thrown and list offsets are properly encoded.

Additional context
This line seems to be the problem:

0 => offsets.clone(),

Changing that line to 0 => offset_slice.iter().map(|x| *x).collect(), fixes the problem.
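The difference between the two match arms can be sketched with plain vectors. This is a simplified model under stated assumptions, not the arrow crate's actual code: the function names `reencode_buggy` and `reencode_fixed`, their signatures, and the sample offsets are hypothetical, and `i32` stands in for the generic offset type.

```rust
/// Hypothetical model of the reported behaviour: when the window's first
/// offset is zero, the whole unsliced offsets buffer is returned.
fn reencode_buggy(offsets: &[i32], row_offset: usize, len: usize) -> Vec<i32> {
    let offset_slice = &offsets[row_offset..row_offset + len + 1];
    let start = offset_slice[0];
    match start {
        0 => offsets.to_vec(), // mirrors `0 => offsets.clone(),`
        _ => offset_slice.iter().map(|x| *x - start).collect(),
    }
}

/// Hypothetical model of the proposed change: the zero arm copies only the
/// window, so the result always has `len + 1` entries.
fn reencode_fixed(offsets: &[i32], row_offset: usize, len: usize) -> Vec<i32> {
    let offset_slice = &offsets[row_offset..row_offset + len + 1];
    let start = offset_slice[0];
    match start {
        0 => offset_slice.iter().map(|x| *x).collect(),
        _ => offset_slice.iter().map(|x| *x - start).collect(),
    }
}

fn main() {
    // Two empty lists followed by two lists of four values each.
    let offsets = [0, 0, 0, 4, 8];

    // Slice of rows 1..3: the window is [0, 0, 4], so its first offset is zero.
    println!("{:?}", reencode_buggy(&offsets, 1, 2)); // [0, 0, 0, 4, 8]
    println!("{:?}", reencode_fixed(&offsets, 1, 2)); // [0, 0, 4]

    // A window with a non-zero first offset is rebased the same way by both.
    println!("{:?}", reencode_fixed(&offsets, 3, 1)); // [0, 4]
}
```

In this model the buggy arm returns five offsets for a two-row slice, while the changed arm returns exactly the three offsets that describe the slice.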

@alamb
Contributor

alamb commented Dec 17, 2024

label_issue.py automatically added labels {'arrow'} from #6805

emilk added a commit to rerun-io/rerun that referenced this issue Jan 13, 2025