Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Column with empty dictionary but contains values causes panic #976

Open
TurnOfACard opened this issue May 4, 2022 · 2 comments
Open

Column with empty dictionary but contains values causes panic #976

TurnOfACard opened this issue May 4, 2022 · 2 comments

Comments

@TurnOfACard
Copy link
Contributor

I found an issue where a small dictionary offset (length of 1) in arrow2::io::parquet::read::deserialize::binary::basic::State::OptionalDictionary causes the parquet reader to panic.

I imagine restricting the slice bound is a fix. In general, there doesn't appear to be a way to propagate errors without some API changes:

  • the function the code is located in (extend_from_state) does not return a Result
  • either does its caller: extend_from_new_page

I assume that we should modify the return values of extend_from_state and extend_from_new_page?

State::OptionalDictionary(page_validity, page_values) => {
let dict_values = page_values.dict.values();
let dict_offsets = page_values.dict.offsets();
let op = move |index: u32| {
let index = index as usize;
let dict_offset_i = dict_offsets[index] as usize;
let dict_offset_ip1 = dict_offsets[index + 1] as usize;
&dict_values[dict_offset_i..dict_offset_ip1]
};

@jorgecarleitao
Copy link
Owner

Thanks for the report!

The invariant is that dict_offsets.len() equals page.num_values() + 1, from here. So, for offset.len() == 1, the dict page is empty? In that case the values page should also also be empty?

Do you have the parquet file available?

@TurnOfACard
Copy link
Contributor Author

TurnOfACard commented May 5, 2022

Ah, it seems that helps explain part of it.

Long story short, I was using a file produced by arrow2. In a single column, the first PageType was PageType::DictionaryPage, and the second was PageType::DataPageV2. For the DictionaryPage, the compressed_page_size was == 9, but uncompressed_page_size was == 0 (which may be another problem). Encoding was RleDictionary.

As a result of this, parquet2::page::page_dict::deserialize was passed an empty buf with num_values == 0 (plausible if the uncompressed_page_size is zero) it will return (vec![], vec![0]).

This is ultimately passed to the code below, which will always try to slice dict_values which potentially may be zero-length.

State::OptionalDictionary(page_validity, page_values) => {
let dict_values = page_values.dict.values();
let dict_offsets = page_values.dict.offsets();
let op = move |index: u32| {
let index = index as usize;
let dict_offset_i = dict_offsets[index] as usize;
let dict_offset_ip1 = dict_offsets[index + 1] as usize;
&dict_values[dict_offset_i..dict_offset_ip1]
};

I am not sure why parquet2 exported a page header with a compressed_page_size of 9, and an uncompressed_page_size of 0, I will take a look at the code that produced it.

If I am interpreting this correctly, there is no guard for a malformed parquet file with an empty dictionary but provided values.

@TurnOfACard TurnOfACard changed the title ValueDictionary offset length < 2 potential panic Column with empty dictionary but values causes panic May 5, 2022
@TurnOfACard TurnOfACard changed the title Column with empty dictionary but values causes panic Column with empty dictionary but contains values causes panic May 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants