This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Float32 Dictionary Encoding #777

Closed
renderbot opened this issue Jan 18, 2022 · 4 comments · Fixed by #778
Labels
feature A new feature no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@renderbot

Hi,

First of all, awesome library! Way faster than the official parquet crate in my tests. One question: am I right in understanding that there's no support for Float32 dictionary pages?

My situation is that I'm writing a program to create large files of floats with many repeated values, so historically dictionary encoding has been super helpful. The main Parquet crate does support this, but is too slow for my use case.

New to Rust and Parquet, so I may be missing something. Also I'm not committed to using dictionary encoding if there's another way to save space with many repeated float values...

Thank you!

@jorgecarleitao jorgecarleitao added no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog question Further information is requested labels Jan 19, 2022
@jorgecarleitao
Owner

Thanks!

It is indeed not supported; it slipped through the cracks of the feature matrix. Fixed in #778. Sorry about that. We can release a 0.9.1 with this fix if you need it on crates.io.

Note that dictionary arrays are the only arrays atm that we support writing to parquet using dictionary encoding.

We do not have a mutable dictionary array for floats because floats do not implement Hash. However, creating a dictionary array from scratch is not very difficult imo:

use arrow2::array::{PrimitiveArray, DictionaryArray};

let indices = PrimitiveArray::from_values((0..100u64).map(|x| x % 3));
let values = PrimitiveArray::from_slice([1.0f32, 3.0, 2.0]);
let array = DictionaryArray::from_data(indices, std::sync::Arc::new(values));
// [1.0, 3.0, 2.0, 1.0, 3.0, 2.0, 1.0, 3.0, 2.0, ...]

This array can be plugged into the parquet writing API (after the PR above).
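
For real data, where the repeated values are not known ahead of time, the indices and values first have to be derived from the raw floats. The following is a minimal sketch using only the standard library, not part of arrow2: the helper name `dictionary_encode` is hypothetical, and it works around the missing `Hash` implementation by keying a `HashMap` on each float's bit pattern via `f32::to_bits`. One consequence of that choice is that `0.0` and `-0.0` (and NaNs with different payloads) are treated as distinct dictionary entries.

```rust
use std::collections::HashMap;

/// Hypothetical helper: dictionary-encodes a slice of `f32` values.
/// Returns `(keys, values)` such that `values[keys[i] as usize]`
/// has the same bit pattern as `data[i]`.
fn dictionary_encode(data: &[f32]) -> (Vec<u32>, Vec<f32>) {
    // `f32` implements neither `Hash` nor `Eq`, so the map is keyed
    // on the raw bit pattern of each float instead of the float itself.
    let mut seen: HashMap<u32, u32> = HashMap::new();
    let mut values: Vec<f32> = Vec::new();
    let mut keys: Vec<u32> = Vec::with_capacity(data.len());

    for &x in data {
        // Index the value would receive if it has not been seen before.
        let next = values.len() as u32;
        let key = *seen.entry(x.to_bits()).or_insert_with(|| {
            values.push(x);
            next
        });
        keys.push(key);
    }
    (keys, values)
}

fn main() {
    let data = [1.0f32, 3.0, 2.0, 1.0, 3.0, 2.0, 1.0];
    let (keys, values) = dictionary_encode(&data);
    println!("keys = {:?}", keys); // keys = [0, 1, 2, 0, 1, 2, 0]
    println!("values = {:?}", values); // values = [1.0, 3.0, 2.0]
}
```

The resulting `keys` and `values` could then presumably be wrapped with `PrimitiveArray::from_slice` and combined with `DictionaryArray::from_data`, as in the snippet above.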

@jorgecarleitao jorgecarleitao added feature A new feature and removed question Further information is requested labels Jan 19, 2022
@renderbot
Author

Wow, thanks for the fast and helpful response! A 0.9.1 version would be awesome if possible. And understood, no worries on creating a DictionaryArray. As you say above, it's not so difficult.

Thanks again

@renderbot
Author

Just commenting again that I tested the feature branch and it works great! Thank you again.

@jorgecarleitao
Owner

Thanks for testing it out. Released in 0.9.1.

2 participants