Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Commit

Permalink
Improved internal docs (#1107)
Browse files Browse the repository at this point in the history
  • Loading branch information
jorgecarleitao authored Jun 26, 2022
1 parent 406eb3f commit aadc6fe
Show file tree
Hide file tree
Showing 2 changed files with 71 additions and 1 deletion.
70 changes: 70 additions & 0 deletions src/io/parquet/read/deserialize/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
# Design

## Non-nested types

Let's start with the design used for non-nested arrays. The (private) entry point of this
module for non-nested arrays is `simple::page_iter_to_arrays`.

This function expects

* a (fallible) streaming iterator of decompressed and encoded pages, `DataPages`
* the source (parquet) column type, including its logical information
* the target (arrow) `DataType`
* the chunk size

and returns an iterator of `Array`, `ArrayIter`.

This design is shared among _all_ `(parquet, arrow)` implemented tuples. Their main
difference is how they are deserialized, which depends on the source and target types.

When the array iterator is pulled the first time, the following happens:
* a page from `DataPages` is pulled
* a `PageState<'a>` is built from the page
* the `PageState` is consumed into a mutable array:
* if `chunk_size` is larger than the number of rows in the page, the mutable array state is preserved and a new page is pulled and the process repeated until we fill a chunk.
* if `chunk_size` is smaller than the number of rows in the page, the mutable array state
is returned and the remaining of the page is consumed into multiple mutable arrays of length `chunk_size` into a FIFO queue.

Subsequent pulls of arrays will first try to pull from the FIFO queue. Once the queue is empty, the
a new page is pulled.

### `PageState`

As mentioned above, the iterator leverages the idea that we attach a state to a page. Recall
that a page is essentially `[header][data]`. The `data` part contains encoded
`[rep levels][def levels][non-null values]`. Some pages have an associated dictionary page,
in which case the `non-null values` represent the indices.

Irrespectively of the physical type, the main idea is to split the page in two iterators:

* An iterator over `def levels`
* An iterator over `non-null values`

and progress the iterators as needed. In particular, for non-nested types, `def levels` is
a bitmap with the same representation as Arrow, in which case the validity is extended directly.

The `non-null values` are "expanded" by filling null values with the default value of each physical
type.

## Nested types

For nested type with N+1 levels (1 is the primitive), we need to build the nest information of each
N levels + the non-nested Arrow array.

This is done by first transversing the parquet types and using it to initialize, per chunk, the N levels.

The per-chunk execution is then similar but `chunk_size` only drives the number of retrieved
rows from the outermost parquet group (the field). Each of these pulls knows how many items need
to be pulled from the inner groups, all the way to the primitive type. This works because
in parquet a row cannot be split between two pages and thus each page is guaranteed
to contain a full row.

The `PageState` of nested types is composed by 4 iterators:

* A (zipped) iterator over `rep levels` and `def levels`
* An iterator over `def levels`
* An iterator over `non-null values`

The idea is that an iterator of `rep, def` contain all the information to decode the
nesting structure of an arrow array. The other two iterators are equivalent to the non-nested
types with the exception that `def levels` are no equivalent to arrow bitmaps.
2 changes: 1 addition & 1 deletion src/io/parquet/read/deserialize/simple.rs
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ where
}

/// An iterator adapter that maps an iterator of DataPages into an iterator of Arrays
/// of [`DataType`] `data_type` and `chunk_size`.
/// of [`DataType`] `data_type` and length `chunk_size`.
pub fn page_iter_to_arrays<'a, I: 'a + DataPages>(
pages: I,
type_: &PrimitiveType,
Expand Down

0 comments on commit aadc6fe

Please sign in to comment.