[Enhancement] combined count when loadfile in dpp #34787

Merged
merged 1 commit into from
Nov 21, 2023

Conversation

MaxWk
Contributor

@MaxWk MaxWk commented Nov 10, 2023

Fixes #14292

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@CLAassistant

CLAassistant commented Nov 10, 2023

CLA assistant check
All committers have signed the CLA.

@@ -1297,7 +1318,7 @@ private void writeDppResult(DppResult dppResult) throws Exception {
URI uri = new URI(outputPath);
Path filePath = new Path(resultFilePath);
try (FileSystem fs = FileSystem.get(uri, serializableHadoopConf.value());
FSDataOutputStream outputStream = fs.create(filePath)) {
Gson gson = new Gson();
outputStream.write(gson.toJson(dppResult).getBytes());
outputStream.write('\n');

The riskiest issue in this code:
scannedRowsAcc.add(fileGroupDataframe.count()); triggers a costly Spark action while the DataFrame is still being built. count() forces evaluation of every transformation applied to fileGroupDataframe up to that point, which is highly inefficient for a large DataFrame because it requires a full scan of the data.

You can modify the code like this:

// Move the row count until after the transformations are complete and an action is required.
// Thus, avoiding triggering multiple actions unnecessarily.
// The exact spot to place the count action depends on the broader context of how the produced DataFrame is utilized downstream.
// Ensure there's a valid action where the result of the dataframe needs to be materialized before counting the rows.

Note that without additional context on how fileGroupDataframe is used later on, I cannot provide an exact location in the code where .count() should be placed. It's important that you review the logic to ensure scannedRowsAcc.add(...) is only called at a point where evaluating the DataFrame is unavoidable or beneficial for subsequent operations (like caching or writing to disk).
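
To make the concern concrete, here is a minimal sketch in plain Java, not Spark code and not the change made in this PR. It assumes a hypothetical helper, transformAndCount, and uses java.util.concurrent.atomic.LongAdder as a stand-in for an accumulator such as scannedRowsAcc. The row counter is updated during the same pass that transforms the rows, so no second scan is needed just to obtain the count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

// Hypothetical illustration only: plain collections stand in for a DataFrame,
// and LongAdder stands in for a Spark accumulator.
final class SinglePassCount {

    // Transforms every row and bumps the counter in the same loop,
    // so the source is traversed exactly once.
    static <A, B> List<B> transformAndCount(Iterable<A> source,
                                            Function<A, B> transform,
                                            LongAdder scannedRows) {
        List<B> out = new ArrayList<>();
        for (A row : source) {
            scannedRows.increment();        // count as a side effect of the pass
            out.add(transform.apply(row));  // the actual work on the row
        }
        return out;
    }
}
```

Whether the same shape is achievable in the real job depends on where the DataFrame is finally materialized, which this page does not show.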

@MaxWk MaxWk changed the title from "combined count when loadfile" to "[Enhancement] combined count when loadfile in dpp" Nov 10, 2023
@MaxWk MaxWk requested a review from wyb November 14, 2023 02:06
} catch (Exception e) {
LOG.warn("parse path failed:" + filePath);
throw e;
}
}

if (fileGroup.fileFormat != null &&

nit:

Set<String> formats = new HashSet<>(Arrays.asList("orc", "parquet"));
if (fileGroup.fileFormat != null && formats.contains(fileGroup.fileFormat.toLowerCase())) {
    scannedRowsAcc.add(fileGroupDataframe.count());
}
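
A self-contained variant of the suggested check might look like the sketch below. The class and method names (FileFormatCheck, isCountableFormat) are hypothetical and not part of the PR; the set is hoisted into a constant so it is not rebuilt for every file group.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the reviewer's nit; names are assumptions.
final class FileFormatCheck {

    // Formats named in the suggestion above; built once and shared.
    private static final Set<String> COUNTABLE_FORMATS =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("orc", "parquet")));

    // Returns true only for a non-null format that matches case-insensitively.
    static boolean isCountableFormat(String fileFormat) {
        return fileFormat != null
                && COUNTABLE_FORMATS.contains(fileFormat.toLowerCase(Locale.ROOT));
    }
}
```

The call site would then read if (FileFormatCheck.isCountableFormat(fileGroup.fileFormat)) { ... }. Locale.ROOT keeps the lowercasing independent of the JVM's default locale.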

@Astralidea Astralidea enabled auto-merge (squash) November 15, 2023 04:33
auto-merge was automatically disabled November 16, 2023 07:23

Head branch was pushed to by a user without write access

@MaxWk MaxWk force-pushed the combine-count-opt-in-dpp branch 2 times, most recently from 7f2b4f0 to 0f4ef16 Compare November 16, 2023 07:30
@wyb wyb enabled auto-merge (squash) November 20, 2023 10:36

sonarcloud bot commented Nov 20, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 1 (rating A)

0.0% 0.0% Coverage
0.0% 0.0% Duplication

warning The version of Java (11.0.21) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

@wyb wyb merged commit 154ae93 into StarRocks:main Nov 21, 2023
37 checks passed
@wanpengfei-git
Collaborator

@Mergifyio backport branch-3.2

@github-actions github-actions bot removed the 3.2 label Nov 21, 2023
@wanpengfei-git
Collaborator

@Mergifyio backport branch-3.1

@github-actions github-actions bot removed the 3.1 label Nov 21, 2023
@wanpengfei-git
Collaborator

@Mergifyio backport branch-3.0

@github-actions github-actions bot removed the 3.0 label Nov 21, 2023
@wanpengfei-git
Collaborator

@Mergifyio backport branch-2.5

@github-actions github-actions bot removed the 2.5 label Nov 21, 2023
Contributor

mergify bot commented Nov 21, 2023

backport branch-3.2

✅ Backports have been created

Contributor

mergify bot commented Nov 21, 2023

backport branch-3.1

✅ Backports have been created

Contributor

mergify bot commented Nov 21, 2023

backport branch-3.0

✅ Backports have been created

Contributor

mergify bot commented Nov 21, 2023

backport branch-2.5

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Nov 21, 2023
Signed-off-by: kevin wan <[email protected]>
Co-authored-by: mingge <[email protected]>
(cherry picked from commit 154ae93)

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

wanpengfei-git pushed a commit that referenced this pull request Nov 21, 2023
Signed-off-by: kevin wan <[email protected]>
Co-authored-by: mingge <[email protected]>
(cherry picked from commit 154ae93)
Development

Successfully merging this pull request may close these issues.

[SparkLoad] dpp read parquet/orc optimization
5 participants