Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay deserialization of dictionary pages #160

Merged
merged 1 commit into from
Aug 4, 2022
Merged

Conversation

jorgecarleitao
Copy link
Owner

@jorgecarleitao jorgecarleitao commented Jul 23, 2022

This crate:

  • assumes a specific in-memory format for (decoded) dictionary pages
  • decompresses and deserializes the dictionary pages during reading

These are problematic because:

  • in-memory formats can't deserialize parquet's dictionary pages directly to their own format, having to perform &[u8] -> this crate DictPage -> their format. Because this crate DictPage -> their format may depend on logical types, this has to be done on every array (instead of once per dictionary page).
  • reading pages is IO-bounded, but deserializing the dictionary page is CPU-bounded. In async, this mix causes the reader to be blocked by this (potentially very expensive) operation

This PR

This PR modifies the page readers to output the enum Page instead of a DataPage, no longer attaching a decoded dictionary page (Arc<dyn DictPage>) to every page. Users must now handle the deserialization of the dictionary page themselves by matching the enum variants. This allows them to

  • deserialize the dict page to their target in-memory format, thus replacing &[u8] -> this crate DictPage -> their in-memory format by &[u8] -> their in-memory format
  • no longer have blocking CPU-bounded on their async context.

The change to the example and integration tests shows the required migration

@jorgecarleitao jorgecarleitao changed the title Delayed dictionary deserialization Delay dictionary deserialization Jul 23, 2022
@jorgecarleitao jorgecarleitao changed the title Delay dictionary deserialization Delay deserialization of dictionary pages Jul 23, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #160 (3079ef8) into main (82b1115) will increase coverage by 0.34%.
The diff coverage is 76.53%.

@@            Coverage Diff             @@
##             main     #160      +/-   ##
==========================================
+ Coverage   73.55%   73.90%   +0.34%     
==========================================
  Files          78       74       -4     
  Lines        3778     3652     -126     
==========================================
- Hits         2779     2699      -80     
+ Misses        999      953      -46     
Impacted Files Coverage Δ
src/compression.rs 84.78% <ø> (ø)
src/read/mod.rs 82.14% <ø> (ø)
src/read/page/stream.rs 61.70% <0.00%> (-0.12%) ⬇️
src/write/compression.rs 58.33% <0.00%> (-2.33%) ⬇️
src/write/page.rs 75.58% <ø> (ø)
src/deserialize/boolean.rs 60.00% <33.33%> (ø)
src/read/page/indexed_reader.rs 62.12% <47.05%> (+6.71%) ⬆️
src/page/mod.rs 72.72% <91.66%> (+2.56%) ⬆️
src/deserialize/binary.rs 70.37% <100.00%> (-2.05%) ⬇️
src/deserialize/fixed_len.rs 69.23% <100.00%> (-1.51%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82b1115...3079ef8. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants