[SparkLoad] dpp read parquet/orc optimization #14292

wyb · 2022-11-29T10:10:54Z

Enhancement

Currently, SparkDpp executes spark.read().orc/parquet(single_file) and sourceData.count() once per file. Each file generates its own Spark job, and the jobs run serially, so a load with many files takes a long time.

Suggested optimization:
Read the parquet files in a file group together. This needs to handle the scenario where partition columns are read from the file path.
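
One possible way to approach the partition-column scenario is sketched below. It is not taken from the SparkDpp code: the class and method names (`PartitionPathGrouper`, `partitionValues`, `groupByPartitionValues`) are hypothetical, and it assumes the partition columns appear in the path as Hive-style `key=value` directory segments. The idea is to group files that share the same path-derived values, so that each group could be handed to a single `spark.read().parquet(paths...)` or `.orc(paths...)` call and counted once, instead of one job per file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of SparkDpp.
final class PartitionPathGrouper {

    // Returns the values of the requested columns, taken from key=value
    // directory segments of the path (assumed layout). The file name itself
    // (the last segment) is not inspected.
    static List<String> partitionValues(String filePath, List<String> columnsFromPath) {
        String[] segments = filePath.split("/");
        Map<String, String> found = new LinkedHashMap<>();
        for (int i = 0; i < segments.length - 1; i++) {
            int eq = segments[i].indexOf('=');
            if (eq > 0) {
                found.put(segments[i].substring(0, eq), segments[i].substring(eq + 1));
            }
        }
        List<String> values = new ArrayList<>();
        for (String column : columnsFromPath) {
            String value = found.get(column);
            if (value == null) {
                throw new IllegalArgumentException(
                        "column " + column + " not found in path " + filePath);
            }
            values.add(value);
        }
        return values;
    }

    // Groups file paths by their path-derived column values, preserving input
    // order. Files in one group share the same partition values, so they can
    // be read together and the values attached as constant columns afterwards.
    static Map<List<String>, List<String>> groupByPartitionValues(
            List<String> filePaths, List<String> columnsFromPath) {
        Map<List<String>, List<String>> groups = new LinkedHashMap<>();
        for (String path : filePaths) {
            List<String> key = partitionValues(path, columnsFromPath);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(path);
        }
        return groups;
    }
}
```

With this grouping, the number of Spark read jobs would scale with the number of distinct partition-value combinations rather than the number of files. When no columns come from the path, all files fall into a single group keyed by an empty list.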

wyb · 2022-11-29T10:47:49Z

Anyone who is interested is welcome to optimize it.

github-actions · 2023-05-29T11:00:45Z

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

MaxWk · 2023-07-20T08:25:59Z

assign it to me pls.

wyb added the type/enhancement Make an enhancement to StarRocks label Nov 29, 2022

wyb changed the title ~~[SparkLoad] read parquet/orc optimization~~ [SparkLoad] dpp read parquet/orc optimization Nov 29, 2022

github-actions bot added the no-issue-activity label May 29, 2023

github-actions bot added the X-stale label Jun 12, 2023

github-actions bot closed this as completed Jun 12, 2023

wyb reopened this Nov 13, 2023

wyb mentioned this issue Nov 13, 2023 in [Enhancement] combined count when loadfile in dpp #34787 (merged)

wyb assigned MaxWk Nov 13, 2023

github-actions bot removed X-stale no-issue-activity labels Nov 13, 2023

wyb closed this as completed in #34787 Nov 21, 2023
