Currently, SparkDpp executes spark.read().orc()/parquet() on a single file, followed by sourceData.count(), for every Parquet file. Each file generates its own Spark job, and those jobs run serially, so a load with many files takes a long time.
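For context, here is a minimal sketch of the per-file pattern described above. The names (loadFileGroup, filePaths) are illustrative, not the actual SparkDpp code:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PerFileReadSketch {
    // Each iteration launches a read job plus a count() job, and the jobs
    // execute one after another, which is the bottleneck for many files.
    static void loadFileGroup(SparkSession spark, List<String> filePaths) {
        for (String path : filePaths) {
            Dataset<Row> sourceData = spark.read().parquet(path); // one job per file
            long rows = sourceData.count();                       // another job per file
            // ... per-file transform and write would follow here ...
        }
    }
}
```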
Suggested optimizations:
Read all the parquet files under a file group together in a single call, while still handling the case where the partition column must be derived from the file path (see the sketch below).
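A hedged sketch of what the batched variant could look like: pass every path of the file group to one read so Spark plans a single job, then recover the partition value from the file path. The column name dt and the path layout (.../dt=2023-01-01/part-0000.parquet) are assumptions for illustration:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.input_file_name;
import static org.apache.spark.sql.functions.regexp_extract;

public class BatchedReadSketch {
    // Hypothetical batched read: one job covers the whole file group instead
    // of one job per file.
    static Dataset<Row> loadFileGroup(SparkSession spark, List<String> filePaths) {
        Dataset<Row> all = spark.read()
                .parquet(filePaths.toArray(new String[0]));
        // Recover the (assumed) partition column "dt" from the directory name
        // embedded in each row's source path.
        return all.withColumn("dt",
                regexp_extract(input_file_name(), "dt=([^/]+)", 1));
    }
}
```

If the files share a common partition directory layout, reading the root directory with option("basePath", rootDir) should also let Spark's built-in partition discovery materialize the dt column without the regex.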