bump arrow2 to main #4212

Merged · 5 commits · Feb 23, 2022
Conversation

@youngsofun (Member) commented Feb 21, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

  • The first commit bumps to arrow2 0.9.1 with parquet2 0.9. Main changes:
  1. replace RecordBatch with Chunk
  2. new traits for simd8 compare ops
  • The second commit (todo) bumps to the HEAD of main, to use the new FileWriter/FileStreamer API, on which I am building the implementation of encryption for parquet2.

Is there any other consideration, given that the tests pass?

Changelog

  • Not for changelog (changelog entry is not required)

Related Issues

Fixes #3746

Test Plan

@youngsofun (Member, Author) commented Feb 23, 2022

Some notes:

First, the new async reader interface read_columns_many_async has two improvements:
1. it reads the whole column chunk in one read_exact, instead of page by page
2. it uses a reader factory, which allows reading the column chunks of a row group in parallel

However, the current ParquetSource brings a reader with it and thus cannot use that function, so I wrote a similar version of read_columns_many_async for this situation.

Maybe we can adopt the reader factory pattern later.

Second, some arrow2 functions return Chunk<Box<dyn Array>> while we need Chunk<Arc<dyn Array>>, so I wrote a translation function for it, `box_chunk_to_arc_chunk` (a sketch of the idea follows). Any better ways?
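A minimal sketch of such a translation, assuming arrow2's Chunk::new and into_arrays APIs (the actual helper in this PR may differ):

```rust
use std::sync::Arc;

use arrow2::array::Array;
use arrow2::chunk::Chunk;

/// Rebuild a chunk of boxed arrays as a chunk of Arc'd arrays.
/// `Arc::from(Box<T>)` is implemented for unsized `T`, so each array
/// struct is moved into a fresh Arc; the underlying data buffers are
/// reference-counted inside arrow2 and are not deep-copied.
fn box_chunk_to_arc_chunk(chunk: Chunk<Box<dyn Array>>) -> Chunk<Arc<dyn Array>> {
    let arrays: Vec<Arc<dyn Array>> = chunk
        .into_arrays()
        .into_iter()
        .map(Arc::from)
        .collect();
    Chunk::new(arrays)
}
```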

@dantengsky (Member) commented Feb 23, 2022

> second, some arrow2 functions return Chunk<Box<dyn Array>> while we need Chunk<Arc<dyn Array>>, I wrote a translation function for it, `box_chunk_to_arc_chunk`. Any better ways?

Hi, here is a rough idea, FYI.

  • the idea:

How about, instead of converting Box<dyn Array> to Arc<dyn Array>, letting the callee DataBlock::from_chunk be generic:

from pub fn from_chunk(schema: &DataSchemaRef, chuck: &Chunk<ArrayRef>) -> Result<DataBlock>
to pub fn from_chunk<A>(schema: &DataSchemaRef, chuck: &Chunk<A>) -> Result<DataBlock> where A: AsRef<dyn Array>

  • the code:

dantengsky@71e3fef
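A minimal, self-contained illustration of the suggested bound, using plain arrow2 types (DataBlock and DataSchemaRef omitted): the same generic function accepts both Chunk<Box<dyn Array>> and Chunk<Arc<dyn Array>>.

```rust
use std::sync::Arc;

use arrow2::array::{Array, Int32Array};
use arrow2::chunk::Chunk;

// One generic function serves both ownership styles, since each column
// can be viewed uniformly as &dyn Array through AsRef.
fn column_lens<A: AsRef<dyn Array>>(chunk: &Chunk<A>) -> Vec<usize> {
    chunk.arrays().iter().map(|a| a.as_ref().len()).collect()
}

fn main() {
    let array = Int32Array::from_slice([1, 2, 3]);

    let boxed: Chunk<Box<dyn Array>> = Chunk::new(vec![Box::new(array.clone()) as _]);
    let arced: Chunk<Arc<dyn Array>> = Chunk::new(vec![Arc::new(array) as _]);

    // Both calls compile against the same signature.
    assert_eq!(column_lens(&boxed), vec![3]);
    assert_eq!(column_lens(&arced), vec![3]);
}
```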

@youngsofun (Member, Author) commented Feb 23, 2022

Looks good to me. My VPN has been down since noon; I'll come back later to make the changes. @dantengsky

let col_metas = get_field_columns(columns, field_name);
let mut cols = Vec::with_capacity(col_metas.len());
for meta in col_metas {
    cols.push((meta, _read_single_column_async(reader, meta).await?))
@sundy-li (Member) commented Feb 23, 2022

How can we make the IO run in parallel?

Member:

How about accepting an Operator or Object here to create different readers for different columns?

@youngsofun (Member, Author) commented Feb 23, 2022

My version of read_columns_many_async is not parallel. It is needed because SourceFactory passes a Reader to ParquetSource:

https://github.com/datafuselabs/databend/blob/b41d39b05dcf281db921377d8b027477d333f8b6/common/streams/src/sources/source_factory.rs#L32

The one in arrow2 is parallel; it is fine to use it in block_reader:

https://github.com/jorgecarleitao/arrow2/blob/3d528c99589e96f0539de4c07b11843fa22f23ac/src/io/parquet/read/row_group.rs#L166

Member:

So let's export _read_single_column_async as pub, so we can use it in block_reader?

@youngsofun (Member, Author):

No, block_reader already uses it. I mean maybe SourceFactory could accept something like (data_accessor, path), or even use a factory just like arrow2 does, with a struct of (data_accessor, path) as an impl of it.

@@ -106,7 +106,7 @@ async fn test_interpreter_interceptor_for_insert() -> Result<()> {
"| log_type | handler_type | cpu_usage | scan_rows | scan_bytes | scan_partitions | written_rows | written_bytes | result_rows | result_bytes | query_kind | query_text | sql_user | sql_user_quota |",
"+----------+--------------+-----------+-----------+------------+-----------------+--------------+---------------+-------------+--------------+-----------------+----------------------------------------------------+----------+---------------------------------------------------------------------------+",
"| 1 | TestSession | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | CreateTablePlan | create table t as select number from numbers_mt(1) | root | UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 } |",
"| 2 | TestSession | 8 | 1 | 8 | 0 | 1 | 1090 | 0 | 0 | CreateTablePlan | create table t as select number from numbers_mt(1) | root | UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 } |",
"| 2 | TestSession | 8 | 1 | 8 | 0 | 1 | 1330 | 0 | 0 | CreateTablePlan | create table t as select number from numbers_mt(1) | root | UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 } |",
@youngsofun (Member, Author):

written_bytes changed from 1090 to 1330; I don't know why.

Member:

I think it's related to the HashMap --> BTreeMap change.

let fields_to_read = self
    .projection
    .clone()
    .into_iter()
    .map(|idx| arrow_fields[idx].clone())
    .map(|idx| {
@youngsofun (Member, Author) commented Feb 23, 2022

The test test_fuse_table_normal_case failed because the parquet file is written with field names 'a', 'b', ... (from the select operator), while the block reader later reads with the field name 'id' (from the table schema).


let stream = futures::stream::iter(cols).map(|(col_meta, idx)| {
    let factory = || {
@youngsofun (Member, Author):

We use arrow2's parallel read_columns_many_async here.

@youngsofun (Member, Author) commented Feb 23, 2022

@Xuanwo: (data_accessor, path) as an impl of F: Fn() -> BoxFuture<'b, std::io::Result<R>> + Clone (from arrow2; a sketch follows): https://github.com/jorgecarleitao/arrow2/blob/3d528c99589e96f0539de4c07b11843fa22f23ac/src/io/parquet/read/row_group.rs#L248
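A self-contained sketch of that shape; DataAccessor and its open method below are hypothetical stand-ins for the real accessor API, not existing Databend types:

```rust
use std::io;

use futures::future::{BoxFuture, FutureExt};

/// Hypothetical stand-in for a data accessor (e.g. an object-store handle).
#[derive(Clone)]
struct DataAccessor;

impl DataAccessor {
    /// Hypothetical: a real impl would open the object at `path`.
    async fn open(&self, _path: &str) -> io::Result<io::Cursor<Vec<u8>>> {
        Ok(io::Cursor::new(Vec::new()))
    }
}

/// Package (data_accessor, path) as a cloneable factory matching arrow2's
/// bound F: Fn() -> BoxFuture<'b, io::Result<R>> + Clone. Each call yields
/// an independent reader, so column chunks can be fetched in parallel.
fn reader_factory(
    data_accessor: DataAccessor,
    path: String,
) -> impl Fn() -> BoxFuture<'static, io::Result<io::Cursor<Vec<u8>>>> + Clone {
    move || {
        let accessor = data_accessor.clone();
        let path = path.clone();
        async move { accessor.open(&path).await }.boxed()
    }
}
```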

@dantengsky (Member) commented Feb 23, 2022

Some thoughts here:

  1. Shall we control the number of futures that simultaneously read column data? (A small sketch follows this comment.)

  2. About the try_join_all used in read_columns_many_async:
     at least, "join_all will switch to the more powerful FuturesOrdered for performance reasons", according to
     https://docs.rs/futures/0.3.21/futures/future/fn.join_all.html#see-also
     Not sure if it's worth bothering to replace it.

  3. Reading a column chunk in one go might be a huge benefit (as @youngsofun already mentioned).

Right now, read_page_header reads bytes piecemeal from an in-memory Cursor:
https://github.com/jorgecarleitao/parquet2/blob/85d1f01597907c0cc30a234a6d6209c3d7ef17cf/src/read/page_iterator.rs#L65-L68

Maybe we can eliminate the BufReader used when reading a column:
https://github.com/datafuselabs/databend/blob/b41d39b05dcf281db921377d8b027477d333f8b6/query/src/storages/fuse/io/block_reader.rs#L108-L111
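On thought 1, a small sketch of bounding the number of in-flight column reads with futures' buffered combinator instead of try_join_all; MAX_CONCURRENT_COLUMN_READS is an assumed knob, not an existing constant:

```rust
use std::io;

use futures::stream::{self, StreamExt, TryStreamExt};

/// Assumed tuning knob: how many column reads may be in flight at once.
const MAX_CONCURRENT_COLUMN_READS: usize = 8;

/// Drive the column-read futures with bounded concurrency while
/// preserving column order in the output.
async fn read_all_columns<F>(col_reads: Vec<F>) -> io::Result<Vec<Vec<u8>>>
where
    F: std::future::Future<Output = io::Result<Vec<u8>>>,
{
    stream::iter(col_reads)
        .buffered(MAX_CONCURRENT_COLUMN_READS)
        .try_collect()
        .await
}
```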

@youngsofun (Member, Author) commented Feb 23, 2022

cursor is only for file metadata
Member:

> cursor is only for file metadata

Oh, I mean the Cursor used in arrow2's to_deserializer:

https://github.com/jorgecarleitao/arrow2/blob/3d528c99589e96f0539de4c07b11843fa22f23ac/src/io/parquet/read/row_group.rs#L173-L185

PR #4230

👍

@youngsofun changed the title from "bump arrow2 to 0.9.1." to "bump arrow2 to main" on Feb 23, 2022
@youngsofun marked this pull request as ready for review on Feb 23, 2022, 12:35
@dantengsky (Member):

/lgtm
