Make Parquet read sync and async apis consistent #669

mdrach · 2021-12-09T07:14:55Z

In v0.7.0 I could stream in pages of a Parquet column chunk in an async context, then move the data into a dedicated thread pool to perform the CPU-intensive work.

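        // Async (I/O-bound): fetch the compressed pages of one column chunk over HTTP.
        let mut reader = RangedHttpStreamer::new(http_client, url, shard_size);
        let stream = get_page_stream(&column_chunk_metadata, &mut reader, None, vec![])
            .await
            .map_err(Error::internal)?;
        let pages = stream.collect::<Vec<_>>().await;

        // Blocking pool (CPU-bound): decompress and decode the pages into an Arrow array.
        let array: Result<Box<dyn arrow2::array::Array>> = spawn_blocking(move || {
            let mut basic_decompressor = BasicDecompressor::new(pages.into_iter(), vec![]);
            page_iter_to_array(
                &mut basic_decompressor,
                &column_chunk_metadata,
                field.data_type.clone(),
            )
            .map_err(Error::internal)
        })
        .await
        // Surface a panicked or cancelled blocking task as an internal error.
        .map_err(Error::internal)?;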

However, as of v0.8.0, page_iter_to_array has been replaced by column_iter_to_array, while the async API does not expose a corresponding get_column_stream (only get_page_stream). Is there a better way to load and parse a Parquet file from S3? Or are the APIs just out of sync?
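For anyone reading along without the surrounding code, the hand-off I am describing can be sketched with the standard library alone. This is only an illustration under stated assumptions: `RawPage`, `decode_pages`, and `decode_off_thread` are hypothetical names, not arrow2 or parquet2 API, and the "decoding" is a placeholder for real decompression and decoding.

```rust
use std::thread;

/// Hypothetical stand-in for a fetched (still encoded) Parquet page: raw bytes only.
#[derive(Debug, Clone, PartialEq)]
pub struct RawPage(pub Vec<u8>);

/// Placeholder for the CPU-bound step: widens every byte to an i32.
/// A real decoder would decompress and decode the pages here instead.
pub fn decode_pages(pages: Vec<RawPage>) -> Vec<i32> {
    pages
        .into_iter()
        .flat_map(|page| page.0.into_iter().map(i32::from))
        .collect()
}

/// Moves the already-fetched pages onto a dedicated thread and waits for the result.
/// A panic in the worker is reported as an error instead of being propagated.
pub fn decode_off_thread(pages: Vec<RawPage>) -> Result<Vec<i32>, String> {
    thread::spawn(move || decode_pages(pages))
        .join()
        .map_err(|_| "decoding thread panicked".to_string())
}
```

The point is that the pages are fully collected before they are moved, so the worker thread never touches the network; in an async runtime the `thread::spawn` call would typically be replaced by the runtime's blocking-pool equivalent.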


jorgecarleitao · 2021-12-09T07:42:24Z

The APIs are out of sync.

Note that the reason for column_iter is that it allows for nested Parquet types. An alternative is to offer a page stream per Parquet column and have users assemble the columns themselves into the corresponding Arrow type, but I think that would require us to expose a larger (currently private) API and write more documentation.
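To illustrate why nested types complicate a per-column stream: Parquet stores each leaf of a nested field as its own column, so one Arrow field may need several Parquet columns before it can be built. A rough sketch of the grouping step users would have to do themselves, using hypothetical types (`LeafColumn` and `group_by_field` are not part of parquet2 or arrow2):

```rust
use std::collections::BTreeMap;

/// Hypothetical decoded leaf column: its path in the schema plus its values.
pub struct LeafColumn {
    /// For example ["point", "x"] for the `x` child of a struct field `point`.
    pub path: Vec<String>,
    pub values: Vec<i64>,
}

/// Groups leaf columns by their top-level field name, keeping the
/// original order of the leaves within each field.
pub fn group_by_field(leaves: Vec<LeafColumn>) -> BTreeMap<String, Vec<LeafColumn>> {
    let mut fields: BTreeMap<String, Vec<LeafColumn>> = BTreeMap::new();
    for leaf in leaves {
        // The first path segment names the top-level field.
        let root = leaf.path.first().cloned().unwrap_or_default();
        fields.entry(root).or_default().push(leaf);
    }
    fields
}
```

A flat field such as `id` ends up with a single leaf, while a struct field such as `point` collects both `point.x` and `point.y`. Turning each group into an actual Arrow array would additionally need the definition and repetition levels, which this sketch leaves out.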

jorgecarleitao · 2021-12-14T20:18:08Z

Would you like to tackle this one, or do you think I should prioritize it?

mdrach · 2021-12-16T21:37:37Z

If you could prioritize it, that would be great. I may be able to get to this, but likely not in the short term.

jorgecarleitao · 2022-01-03T22:17:54Z

I have started working on this. The first change is in parquet2, since that is where we declare these APIs.

jorgecarleitao added the enhancement label ("An improvement to an existing feature") Dec 9, 2021

mdrach mentioned this issue Dec 13, 2021

Fixed error in reading negative decimals from parquet #679

Merged

jorgecarleitao self-assigned this Jan 3, 2022
