From 72472e8ff575a37dd2d50e49880c37192ab36d1b Mon Sep 17 00:00:00 2001 From: "Jorge C. Leitao" Date: Sat, 25 Jun 2022 18:42:05 -0700 Subject: [PATCH] Improved internal docs --- src/io/parquet/read/deserialize/README.md | 70 +++++++++++++++++++++++ src/io/parquet/read/deserialize/simple.rs | 2 +- 2 files changed, 71 insertions(+), 1 deletion(-) create mode 100644 src/io/parquet/read/deserialize/README.md diff --git a/src/io/parquet/read/deserialize/README.md b/src/io/parquet/read/deserialize/README.md new file mode 100644 index 00000000000..971ed424784 --- /dev/null +++ b/src/io/parquet/read/deserialize/README.md @@ -0,0 +1,70 @@ +# Design + +## Non-nested types + +Let's start with the design used for non-nested arrays. The (private) entry point of this +module for non-nested arrays is `simple::page_iter_to_arrays`. + +This function expects + +* a (fallible) streaming iterator of decompressed and encoded pages, `DataPages` +* the source (parquet) column type, including its logical information +* the target (arrow) `DataType` +* the chunk size + +and returns an iterator of `Array`, `ArrayIter`. + +This design is shared among _all_ `(parquet, arrow)` implemented tuples. Their main +difference is how they are deserialized, which depends on the source and target types. + +When the array iterator is pulled the first time, the following happens: +* a page from `DataPages` is pulled +* a `PageState<'a>` is built from the page +* the `PageState` is consumed into a mutable array: + * if `chunk_size` is larger than the number of rows in the page, the mutable array state is preserved and a new page is pulled and the process repeated until we fill a chunk. + * if `chunk_size` is smaller than the number of rows in the page, the mutable array state + is returned and the remaining of the page is consumed into multiple mutable arrays of length `chunk_size` into a FIFO queue. + +Subsequent pulls of arrays will first try to pull from the FIFO queue. Once the queue is empty, the +a new page is pulled. + +### `PageState` + +As mentioned above, the iterator leverages the idea that we attach a state to a page. Recall +that a page is essentially `[header][data]`. The `data` part contains encoded +`[rep levels][def levels][non-null values]`. Some pages have an associated dictionary page, +in which case the `non-null values` represent the indices. + +Irrespectively of the physical type, the main idea is to split the page in two iterators: + +* An iterator over `def levels` +* An iterator over `non-null values` + +and progress the iterators as needed. In particular, for non-nested types, `def levels` is +a bitmap with the same representation as Arrow, in which case the validity is extended directly. + +The `non-null values` are "expanded" by filling null values with the default value of each physical +type. + +## Nested types + +For nested type with N+1 levels (1 is the primitive), we need to build the nest information of each +N levels + the non-nested Arrow array. + +This is done by first transversing the parquet types and using it to initialize, per chunk, the N levels. + +The per-chunk execution is then similar but `chunk_size` only drives the number of retrieved +rows from the outermost parquet group (the field). Each of these pulls knows how many items need +to be pulled from the inner groups, all the way to the primitive type. This works because +in parquet a row cannot be split between two pages and thus each page is guaranteed +to contain a full row. + +The `PageState` of nested types is composed by 4 iterators: + +* A (zipped) iterator over `rep levels` and `def levels` +* An iterator over `def levels` +* An iterator over `non-null values` + +The idea is that an iterator of `rep, def` contain all the information to decode the +nesting structure of an arrow array. The other two iterators are equivalent to the non-nested +types with the exception that `def levels` are no equivalent to arrow bitmaps. diff --git a/src/io/parquet/read/deserialize/simple.rs b/src/io/parquet/read/deserialize/simple.rs index 8a8ef653da6..aaab8aee9f5 100644 --- a/src/io/parquet/read/deserialize/simple.rs +++ b/src/io/parquet/read/deserialize/simple.rs @@ -55,7 +55,7 @@ where } /// An iterator adapter that maps an iterator of DataPages into an iterator of Arrays -/// of [`DataType`] `data_type` and `chunk_size`. +/// of [`DataType`] `data_type` and length `chunk_size`. pub fn page_iter_to_arrays<'a, I: 'a + DataPages>( pages: I, type_: &PrimitiveType,