bump arrow2 to main #4212

Merged · 5 commits · Feb 23, 2022
Conversation

@youngsofun (Member) commented Feb 21, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

  • The first commit bumps to arrow2 0.9.1 with parquet2 0.9. Main changes:
  1. replace RecordBatch with Chunk
  2. new traits for simd8 compare ops
  • The second commit (todo) bumps to the HEAD of main, to use the new FileWriter/FileStreamer API, on which I am building the implementation of encryption for parquet2.

Is there any other consideration, given that the tests pass?

Changelog

  • Not for changelog (changelog entry is not required)

Related Issues

Fixes #3746

Test Plan

@youngsofun (Member, Author) commented Feb 23, 2022

Some notes:

First, the new async reader interface read_columns_many_async has two improvements:
1. it reads the whole column chunk in one read_exact, instead of page by page
2. it uses a reader factory, which allows reading the column chunks of a row group in parallel

However, the current ParquetSource brings a reader with it and thus cannot use that function, so I wrote a similar version of read_columns_many_async for this situation.

Maybe we can adopt the reader factory pattern later.

Second, some arrow2 functions return Chunk<Box<dyn Array>> while we need Chunk<Arc<dyn Array>>, so I wrote a translation function for it, `box_chunk_to_arc_chunk` (a sketch of the idea follows). Any better ways?
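A minimal sketch of such a translation, assuming arrow2's Chunk::new and into_arrays APIs (the actual helper in this PR may differ):

```rust
use std::sync::Arc;

use arrow2::array::Array;
use arrow2::chunk::Chunk;

/// Rebuild a chunk of boxed arrays as a chunk of Arc'd arrays.
/// `Arc::from(Box<T>)` is implemented for unsized `T`, so each array
/// struct is moved into a fresh Arc; the underlying data buffers are
/// reference-counted inside arrow2 and are not deep-copied.
fn box_chunk_to_arc_chunk(chunk: Chunk<Box<dyn Array>>) -> Chunk<Arc<dyn Array>> {
    let arrays: Vec<Arc<dyn Array>> = chunk
        .into_arrays()
        .into_iter()
        .map(Arc::from)
        .collect();
    Chunk::new(arrays)
}
```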

@dantengsky (Member) commented Feb 23, 2022

> second, some arrow2 functions return Chunk<Box<dyn Array>> while we need Chunk<Arc<dyn Array>>, I wrote a translation function for it, `box_chunk_to_arc_chunk`. Any better ways?

Hi, here is a rough idea, FYI.

  • the idea:

How about, instead of converting Box<dyn Array> to Arc<dyn Array>, letting the callee DataBlock::from_chunk be generic:

from pub fn from_chunk(schema: &DataSchemaRef, chuck: &Chunk<ArrayRef>) -> Result<DataBlock>
to pub fn from_chunk<A>(schema: &DataSchemaRef, chuck: &Chunk<A>) -> Result<DataBlock> where A: AsRef<dyn Array>

  • the code:

dantengsky@71e3fef
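A minimal, self-contained illustration of the suggested bound, using plain arrow2 types (DataBlock and DataSchemaRef omitted): the same generic function accepts both Chunk<Box<dyn Array>> and Chunk<Arc<dyn Array>>.

```rust
use std::sync::Arc;

use arrow2::array::{Array, Int32Array};
use arrow2::chunk::Chunk;

// One generic function serves both ownership styles, since each column
// can be viewed uniformly as &dyn Array through AsRef.
fn column_lens<A: AsRef<dyn Array>>(chunk: &Chunk<A>) -> Vec<usize> {
    chunk.arrays().iter().map(|a| a.as_ref().len()).collect()
}

fn main() {
    let array = Int32Array::from_slice([1, 2, 3]);

    let boxed: Chunk<Box<dyn Array>> = Chunk::new(vec![Box::new(array.clone()) as _]);
    let arced: Chunk<Arc<dyn Array>> = Chunk::new(vec![Arc::new(array) as _]);

    // Both calls compile against the same signature.
    assert_eq!(column_lens(&boxed), vec![3]);
    assert_eq!(column_lens(&arced), vec![3]);
}
```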

@youngsofun (Member, Author) commented Feb 23, 2022

Looks good to me. My VPN has been down since noon; I'll come back later to make the changes. @dantengsky

let col_metas = get_field_columns(columns, field_name);
let mut cols = Vec::with_capacity(col_metas.len());
for meta in col_metas {
    cols.push((meta, _read_single_column_async(reader, meta).await?))
@sundy-li (Member) commented Feb 23, 2022

How can we make the IO run in parallel?

Member:

How about accepting an Operator or Object here to create different readers for different columns?

@youngsofun (Member, Author) commented Feb 23, 2022

My version of read_columns_many_async is not parallel. It is needed because SourceFactory passes a Reader to ParquetSource:

https://github.com/datafuselabs/databend/blob/b41d39b05dcf281db921377d8b027477d333f8b6/common/streams/src/sources/source_factory.rs#L32

The one in arrow2 is parallel; it is fine to use it in block_reader:

https://github.com/jorgecarleitao/arrow2/blob/3d528c99589e96f0539de4c07b11843fa22f23ac/src/io/parquet/read/row_group.rs#L166

Member:

So let's export _read_single_column_async as pub, so we can use it in block_reader?

@youngsofun (Member, Author):

No, block_reader already uses it. I mean maybe SourceFactory could accept something like (data_accessor, path), or even use a factory just like arrow2 does, with a struct of (data_accessor, path) as an impl of it.

@@ -106,7 +106,7 @@ async fn test_interpreter_interceptor_for_insert() -> Result<()> {
"| log_type | handler_type | cpu_usage | scan_rows | scan_bytes | scan_partitions | written_rows | written_bytes | result_rows | result_bytes | query_kind | query_text | sql_user | sql_user_quota |",
"+----------+--------------+-----------+-----------+------------+-----------------+--------------+---------------+-------------+--------------+-----------------+----------------------------------------------------+----------+---------------------------------------------------------------------------+",
"| 1 | TestSession | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | CreateTablePlan | create table t as select number from numbers_mt(1) | root | UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 } |",
"| 2 | TestSession | 8 | 1 | 8 | 0 | 1 | 1090 | 0 | 0 | CreateTablePlan | create table t as select number from numbers_mt(1) | root | UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 } |",
"| 2 | TestSession | 8 | 1 | 8 | 0 | 1 | 1330 | 0 | 0 | CreateTablePlan | create table t as select number from numbers_mt(1) | root | UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 } |",
@youngsofun (Member, Author):

written_bytes changed from 1090 to 1330; I don't know why.

Member:

I think it's related to the HashMap --> BTreeMap change.

let fields_to_read = self
    .projection
    .clone()
    .into_iter()
    .map(|idx| arrow_fields[idx].clone())
    .map(|idx| {
@youngsofun (Member, Author) commented Feb 23, 2022

The test test_fuse_table_normal_case failed because the parquet file is written with field names 'a', 'b', ... (from the select operator), while the block reader later reads with the field name 'id' (from the table schema).


let stream = futures::stream::iter(cols).map(|(col_meta, idx)| {
    let factory = || {
@youngsofun (Member, Author):

We use arrow2's parallel read_columns_many_async here.

@youngsofun (Member, Author) commented Feb 23, 2022

@Xuanwo: (data_accessor, path) as an impl of F: Fn() -> BoxFuture<'b, std::io::Result<R>> + Clone (from arrow2; a sketch follows): https://github.com/jorgecarleitao/arrow2/blob/3d528c99589e96f0539de4c07b11843fa22f23ac/src/io/parquet/read/row_group.rs#L248
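A self-contained sketch of that shape; DataAccessor and its open method below are hypothetical stand-ins for the real accessor API, not existing Databend types:

```rust
use std::io;

use futures::future::{BoxFuture, FutureExt};

/// Hypothetical stand-in for a data accessor (e.g. an object-store handle).
#[derive(Clone)]
struct DataAccessor;

impl DataAccessor {
    /// Hypothetical: a real impl would open the object at `path`.
    async fn open(&self, _path: &str) -> io::Result<io::Cursor<Vec<u8>>> {
        Ok(io::Cursor::new(Vec::new()))
    }
}

/// Package (data_accessor, path) as a cloneable factory matching arrow2's
/// bound F: Fn() -> BoxFuture<'b, io::Result<R>> + Clone. Each call yields
/// an independent reader, so column chunks can be fetched in parallel.
fn reader_factory(
    data_accessor: DataAccessor,
    path: String,
) -> impl Fn() -> BoxFuture<'static, io::Result<io::Cursor<Vec<u8>>>> + Clone {
    move || {
        let accessor = data_accessor.clone();
        let path = path.clone();
        async move { accessor.open(&path).await }.boxed()
    }
}
```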

@dantengsky (Member) commented Feb 23, 2022

Some thoughts here:

  1. Shall we control the number of futures that simultaneously read column data? (A small sketch follows this comment.)

  2. About the try_join_all used in read_columns_many_async:
     at least, "join_all will switch to the more powerful FuturesOrdered for performance reasons", according to
     https://docs.rs/futures/0.3.21/futures/future/fn.join_all.html#see-also
     Not sure if it's worth bothering to replace it.

  3. Reading a column chunk in one go might be a huge benefit (as @youngsofun already mentioned).

Right now, read_page_header reads bytes piecemeal from an in-memory Cursor:
https://github.com/jorgecarleitao/parquet2/blob/85d1f01597907c0cc30a234a6d6209c3d7ef17cf/src/read/page_iterator.rs#L65-L68

Maybe we can eliminate the BufReader used when reading a column:
https://github.com/datafuselabs/databend/blob/b41d39b05dcf281db921377d8b027477d333f8b6/query/src/storages/fuse/io/block_reader.rs#L108-L111
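On thought 1, a small sketch of bounding the number of in-flight column reads with futures' buffered combinator instead of try_join_all; MAX_CONCURRENT_COLUMN_READS is an assumed knob, not an existing constant:

```rust
use std::io;

use futures::stream::{self, StreamExt, TryStreamExt};

/// Assumed tuning knob: how many column reads may be in flight at once.
const MAX_CONCURRENT_COLUMN_READS: usize = 8;

/// Drive the column-read futures with bounded concurrency while
/// preserving column order in the output.
async fn read_all_columns<F>(col_reads: Vec<F>) -> io::Result<Vec<Vec<u8>>>
where
    F: std::future::Future<Output = io::Result<Vec<u8>>>,
{
    stream::iter(col_reads)
        .buffered(MAX_CONCURRENT_COLUMN_READS)
        .try_collect()
        .await
}
```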

@youngsofun (Member, Author) commented Feb 23, 2022

cursor is only for file metadata
Member:

> cursor is only for file metadata

Oh, I mean the Cursor used in arrow2's to_deserializer:

https://github.com/jorgecarleitao/arrow2/blob/3d528c99589e96f0539de4c07b11843fa22f23ac/src/io/parquet/read/row_group.rs#L173-L185

PR #4230

👍

@youngsofun changed the title from "bump arrow2 to 0.9.1." to "bump arrow2 to main" on Feb 23, 2022
@youngsofun marked this pull request as ready for review on Feb 23, 2022, 12:35
@dantengsky (Member):

/lgtm
