Delay deserialization of dictionary pages #160

jorgecarleitao · 2022-07-23T20:34:21Z

This crate:

assumes a specific in-memory format for (decoded) dictionary pages
decompresses and deserializes the dictionary pages during reading

These are problematic because:

in-memory formats can't deserialize parquet's dictionary pages directly to their own format, having to perform &[u8] -> this crate DictPage -> their format. Because this crate DictPage -> their format may depend on logical types, this has to be done on every array (instead of once per dictionary page).
reading pages is IO-bounded, but deserializing the dictionary page is CPU-bounded. In async, this mix causes the reader to be blocked by this (potentially very expensive) operation

This PR

This PR modifies the page readers to output the enum Page instead of a DataPage, no longer attaching a decoded dictionary page (Arc<dyn DictPage>) to every page. Users must now handle the deserialization of the dictionary page themselves by matching the enum variants. This allows them to

deserialize the dict page to their target in-memory format, thus replacing &[u8] -> this crate DictPage -> their in-memory format by &[u8] -> their in-memory format
no longer have blocking CPU-bounded on their async context.

The change to the example and integration tests shows the required migration

codecov-commenter · 2022-07-23T20:40:53Z

Codecov Report

Merging #160 (3079ef8) into main (82b1115) will increase coverage by 0.34%.
The diff coverage is 76.53%.

@@            Coverage Diff             @@
##             main     #160      +/-   ##
==========================================
+ Coverage   73.55%   73.90%   +0.34%     
==========================================
  Files          78       74       -4     
  Lines        3778     3652     -126     
==========================================
- Hits         2779     2699      -80     
+ Misses        999      953      -46

Impacted Files	Coverage Δ
src/compression.rs	`84.78% <ø> (ø)`
src/read/mod.rs	`82.14% <ø> (ø)`
src/read/page/stream.rs	`61.70% <0.00%> (-0.12%)`	⬇️
src/write/compression.rs	`58.33% <0.00%> (-2.33%)`	⬇️
src/write/page.rs	`75.58% <ø> (ø)`
src/deserialize/boolean.rs	`60.00% <33.33%> (ø)`
src/read/page/indexed_reader.rs	`62.12% <47.05%> (+6.71%)`	⬆️
src/page/mod.rs	`72.72% <91.66%> (+2.56%)`	⬆️
src/deserialize/binary.rs	`70.37% <100.00%> (-2.05%)`	⬇️
src/deserialize/fixed_len.rs	`69.23% <100.00%> (-1.51%)`	⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82b1115...3079ef8. Read the comment docs.

jorgecarleitao added the backwards-incompatible label Jul 23, 2022

jorgecarleitao changed the title ~~Delayed dictionary deserialization~~ Delay dictionary deserialization Jul 23, 2022

jorgecarleitao changed the title ~~Delay dictionary deserialization~~ Delay deserialization of dictionary pages Jul 23, 2022

Delayed dictionary

6610f7b

jorgecarleitao mentioned this pull request Jul 25, 2022

Delayed dict jorgecarleitao/arrow2#1185

Merged

jorgecarleitao force-pushed the delay_dict branch from 3079ef8 to 6610f7b Compare July 25, 2022 14:50

jorgecarleitao merged commit ac729ef into main Aug 4, 2022

jorgecarleitao deleted the delay_dict branch August 8, 2022 16:46

jorgecarleitao mentioned this pull request Aug 8, 2022

Some way to compare dictionary pages "by reference" #167

Closed

elferherrera mentioned this pull request Aug 14, 2022

Updating parquet-tools #177

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delay deserialization of dictionary pages #160

Delay deserialization of dictionary pages #160

jorgecarleitao commented Jul 23, 2022 •

edited

Loading

codecov-commenter commented Jul 23, 2022

Delay deserialization of dictionary pages #160

Delay deserialization of dictionary pages #160

Conversation

jorgecarleitao commented Jul 23, 2022 • edited Loading

This PR

codecov-commenter commented Jul 23, 2022

Codecov Report

jorgecarleitao commented Jul 23, 2022 •

edited

Loading