Overly Pessimistic RLE Size Estimation #2889

Closed
tustvold opened this issue Oct 18, 2022 · 0 comments · Fixed by #2890
Labels
bug parquet Changes to the parquet crate

tustvold commented Oct 18, 2022

Describe the bug

The size of RLE encoded data is routinely estimated as

RleEncoder::min_buffer_size(bit_width)
            + RleEncoder::max_buffer_size(bit_width, self.indices.len())

Where RleEncoder::min_buffer_size is defined as

let max_bit_packed_run_size = 1 + bit_util::ceil(
    (MAX_VALUES_PER_BIT_PACKED_RUN * bit_width as usize) as i64,
    8,
);
let max_rle_run_size =
    bit_util::MAX_VLQ_BYTE_LEN + bit_util::ceil(bit_width as i64, 8) as usize;
std::cmp::max(max_bit_packed_run_size as usize, max_rle_run_size)

In practice this will almost always be about 64 * bit_width bytes, because the bit-packed term dominates for any non-zero bit_width.
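To check that arithmetic, here is a minimal standalone sketch of the snippet above. The constant values (`MAX_VALUES_PER_BIT_PACKED_RUN = 512`, `MAX_VLQ_BYTE_LEN = 10`) and the local `ceil` helper are assumptions standing in for the crate's own definitions, which are not quoted in this issue.

```rust
// Assumed constant values; the issue does not state them.
const MAX_VALUES_PER_BIT_PACKED_RUN: usize = 512;
const MAX_VLQ_BYTE_LEN: usize = 10;

/// Ceiling division, a hypothetical stand-in for `bit_util::ceil`.
fn ceil(value: usize, divisor: usize) -> usize {
    (value + divisor - 1) / divisor
}

/// Standalone restatement of the `min_buffer_size` body quoted above.
fn min_buffer_size(bit_width: u8) -> usize {
    // One header byte plus the largest possible bit-packed run.
    let max_bit_packed_run_size =
        1 + ceil(MAX_VALUES_PER_BIT_PACKED_RUN * bit_width as usize, 8);
    // VLQ-encoded run length plus one value padded to whole bytes.
    let max_rle_run_size = MAX_VLQ_BYTE_LEN + ceil(bit_width as usize, 8);
    std::cmp::max(max_bit_packed_run_size, max_rle_run_size)
}

fn main() {
    for bit_width in [0u8, 1, 8, 16, 32] {
        println!("bit_width={bit_width:>2} -> {} bytes", min_buffer_size(bit_width));
    }
}
```

Under those assumed constants the result is 1 + 64 * bit_width for every non-zero bit_width; only bit_width = 0 falls back to the RLE term.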

RleEncoder::max_buffer_size is defined as

let bytes_per_run = bit_width;
let num_runs = bit_util::ceil(num_values as i64, 8) as usize;
let bit_packed_max_size = num_runs + num_runs * bytes_per_run as usize;

let min_rle_run_size = 1 + bit_util::ceil(bit_width as i64, 8) as usize;
let rle_max_size =
    bit_util::ceil(num_values as i64, 8) as usize * min_rle_run_size;
std::cmp::max(bit_packed_max_size, rle_max_size) as usize

To Reproduce

It is unclear why min_buffer_size is included in the size estimation at all.
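To put rough numbers on the overhead, the sketch below restates both quoted function bodies as free functions and sums them the way the estimate does. It takes the second snippet to be the body of RleEncoder::max_buffer_size, and the constant values, the `ceil` helper, and the `estimated_size` name are assumptions made for illustration, not the crate's API.

```rust
// Assumed constant values; the issue does not state them.
const MAX_VALUES_PER_BIT_PACKED_RUN: usize = 512;
const MAX_VLQ_BYTE_LEN: usize = 10;

/// Ceiling division, a hypothetical stand-in for `bit_util::ceil`.
fn ceil(value: usize, divisor: usize) -> usize {
    (value + divisor - 1) / divisor
}

fn min_buffer_size(bit_width: u8) -> usize {
    let max_bit_packed_run_size =
        1 + ceil(MAX_VALUES_PER_BIT_PACKED_RUN * bit_width as usize, 8);
    let max_rle_run_size = MAX_VLQ_BYTE_LEN + ceil(bit_width as usize, 8);
    std::cmp::max(max_bit_packed_run_size, max_rle_run_size)
}

fn max_buffer_size(bit_width: u8, num_values: usize) -> usize {
    // Worst case if everything is bit-packed: one header byte per group of 8
    // values plus `bit_width` bytes of payload per group.
    let num_runs = ceil(num_values, 8);
    let bit_packed_max_size = num_runs + num_runs * bit_width as usize;
    // Worst case if everything is emitted as minimal RLE runs.
    let min_rle_run_size = 1 + ceil(bit_width as usize, 8);
    let rle_max_size = num_runs * min_rle_run_size;
    std::cmp::max(bit_packed_max_size, rle_max_size)
}

/// The estimate described in this issue (hypothetical helper name).
fn estimated_size(bit_width: u8, num_values: usize) -> usize {
    min_buffer_size(bit_width) + max_buffer_size(bit_width, num_values)
}

fn main() {
    let bit_width = 8u8;
    for num_values in [10usize, 100, 10_000] {
        let worst_case = max_buffer_size(bit_width, num_values);
        let estimate = estimated_size(bit_width, num_values);
        println!("num_values={num_values}: max_buffer_size={worst_case}, estimate={estimate}");
    }
}
```

With bit_width = 8 and 10 values, max_buffer_size alone is 18 bytes while the combined estimate is 531 bytes, so for small inputs the min_buffer_size term accounts for nearly all of the estimate.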

Expected behavior

A more accurate size estimation of written RLE encoded data

Additional context
