Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Make Parquet read sync and async apis consistent #669

Open
mdrach opened this issue Dec 9, 2021 · 4 comments
Open

Make Parquet read sync and async apis consistent #669

mdrach opened this issue Dec 9, 2021 · 4 comments
Assignees
Labels
enhancement An improvement to an existing feature

Comments

@mdrach
Copy link
Contributor

mdrach commented Dec 9, 2021

In v0.7.0 I could stream in pages of a Parquet column chunk in an async context, then move the data into a dedicated thread pool to perform the CPU-intensive work.

        let mut reader = RangedHttpStreamer::new(http_client, url, shard_size);
        let stream = get_page_stream(&column_chunk_metadata, &mut reader, None, vec![])
            .await
            .map_err(Error::internal)?;
        let pages = stream.collect::<Vec<_>>().await;

        let array: Result<Box<dyn arrow2::array::Array>> = spawn_blocking(move || {
            let mut basic_decompressor = BasicDecompressor::new(pages.into_iter(), vec![]);
            page_iter_to_array(
                &mut basic_decompressor,
                &column_chunk_metadata,
                field.data_type.clone(),
            )
            .map_err(Error::internal)
        })
        .await

However, as of v0.8.0 page_iter_to_array has been replaced by column_iter_to_array while the async api does not expose a corresponding get_column_stream (only get_page_stream). Is there a better way to load and parse a parquet file from S3? Or, are APIs just out of sync?

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Dec 9, 2021
@jorgecarleitao
Copy link
Owner

The APIs are out of sync.

Note that the reason for the column_iter is that it allows for nested parquet types. An alternative is to offer a page stream per parquet column and have the users assemble the columns themselves into the corresponding Arrow type, but I think that that requires us to expose a larger (currently private) API and more documentation.

@jorgecarleitao
Copy link
Owner

Would you like to tackle this one, or, do you think I should prioritize it?

@mdrach
Copy link
Contributor Author

mdrach commented Dec 16, 2021

If you could prioritize that would be great. I may be able to get to this, but likely not in the short term.

@jorgecarleitao
Copy link
Owner

I have started working on this. The first change is on parquet2, since there is where we declare these APIs.

@jorgecarleitao jorgecarleitao self-assigned this Jan 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement An improvement to an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants