[SparkLoad] dpp read parquet/orc optimization #14292

Closed
wyb opened this issue Nov 29, 2022 · 3 comments · Fixed by #34787
Labels
type/enhancement Make an enhancement to StarRocks

Comments


wyb commented Nov 29, 2022

Enhancement

Currently, SparkDpp calls spark.read().orc/parquet (single_file) followed by sourceData.count() for each parquet/orc file. Each file generates its own Spark job, and the jobs run serially, so a load with many files takes a long time.

Suggested optimization:
Read all the parquet/orc files in a file group together. This needs to account for the scenario where partition column values are read from the file path.
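
One possible way to handle the path-partition scenario is sketched below. It is not the SparkDpp implementation: the class `PathPartitionGrouper`, its methods, and the sample paths are hypothetical, and it assumes partition values appear in the path as `key=value` segments. Files that share the same partition values are grouped, so each group could be read with a single multi-path call (Spark's `DataFrameReader.parquet(String... paths)` and `orc(String... paths)` accept several paths) and the partition values attached once per group instead of once per file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of StarRocks. Assumes key=value path segments.
final class PathPartitionGrouper {

    // Returns the values of the requested partition columns found in the path,
    // in the same order as columnsFromPath.
    static List<String> partitionValues(String filePath, List<String> columnsFromPath) {
        Map<String, String> found = new LinkedHashMap<>();
        for (String segment : filePath.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                found.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        List<String> values = new ArrayList<>();
        for (String column : columnsFromPath) {
            String value = found.get(column);
            if (value == null) {
                throw new IllegalArgumentException(
                        "Partition column " + column + " not found in path " + filePath);
            }
            values.add(value);
        }
        return values;
    }

    // Groups file paths by their partition values, preserving input order.
    // Each group is a candidate for one multi-path read.
    static Map<List<String>, List<String>> groupByPartitionValues(
            List<String> filePaths, List<String> columnsFromPath) {
        Map<List<String>, List<String>> groups = new LinkedHashMap<>();
        for (String path : filePaths) {
            groups.computeIfAbsent(partitionValues(path, columnsFromPath),
                    key -> new ArrayList<>()).add(path);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample paths are made up for illustration.
        List<String> files = Arrays.asList(
                "hdfs://nn/tbl/dt=2022-11-28/a.parquet",
                "hdfs://nn/tbl/dt=2022-11-28/b.parquet",
                "hdfs://nn/tbl/dt=2022-11-29/c.parquet");
        groupByPartitionValues(files, Arrays.asList("dt"))
                .forEach((values, paths) -> System.out.println(values + " -> " + paths));
    }
}
```

With this grouping, the number of read jobs would scale with the number of distinct partition value combinations rather than with the number of files. When a file group has no columns from the path, every file falls into a single group.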

wyb added the type/enhancement label Nov 29, 2022
wyb changed the title from "[SparkLoad] read parquet/orc optimization" to "[SparkLoad] dpp read parquet/orc optimization" Nov 29, 2022
wyb commented Nov 29, 2022

Anyone who is interested is welcome to optimize it.

github-actions commented

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

MaxWk commented Jul 20, 2023

Please assign it to me.
