
[BugFix] Fix column overflow in RowsetUpdateState and CompactionState #33246

Merged
merged 2 commits into from
Oct 24, 2023

Conversation

decster
Contributor

@decster decster commented Oct 20, 2023

Fixes #33247

When the compaction output file is very large and contains many rows, the buffer holding all encoded primary keys for that segment file can grow huge and overflow its 32-bit offsets. This PR fixes the issue by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding overflow checks, and refactoring some of the compaction memory logs.
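As a rough illustration of the idea (not the actual patch; the make_pk_column helper and the size estimate below are hypothetical, while BinaryColumn and LargeBinaryColumn are the StarRocks column types this PR switches between, assumed to come from the column/binary_column.h header and the starrocks namespace):

    #include <cstdint>
    #include <memory>

    #include "column/binary_column.h"  // assumed header for BinaryColumn / LargeBinaryColumn

    // Hypothetical sketch: pick 64-bit offsets when the estimated total size of the
    // encoded primary keys could exceed what 32-bit offsets can address.
    std::unique_ptr<Column> make_pk_column(size_t num_rows, size_t avg_encoded_pk_bytes) {
        const uint64_t estimated_bytes = static_cast<uint64_t>(num_rows) * avg_encoded_pk_bytes;
        if (estimated_bytes >= UINT32_MAX) {
            return std::make_unique<LargeBinaryColumn>();  // 64-bit offsets
        }
        return std::make_unique<BinaryColumn>();  // 32-bit offsets
    }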

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels for the target branches to which this PR will be auto-backported
    • 3.2
    • 3.1
    • 3.0
    • 2.5

            *pcolumn = std::make_unique<LargeBinaryColumn>();
        } else {
            *pcolumn = std::make_unique<BinaryColumn>();
        }
    }
    return Status::OK();
}

This code seems good overall, but there are a few potential improvements that could be made:

  1. Code Redundancy: The condition on whether large_column is true or false is repeated for both TYPE_VARCHAR and some other column type(s). You could extract the assignment into a separate function or method to reduce redundancy.

  2. Function Documentation/Comments: There are no comments or documentation for the modified functions. It's important to explain what each parameter does and what the function returns; this helps others who read your code understand it more quickly and thoroughly.

  3. Error Handling: In the create_column function, it's unclear what happens if the type is not one of the pre-specified ones. A default case or an additional return status for unsupported types might be useful (a sketch follows this comment).

  4. Variable Naming: Although this is subjective, bool variables are often more readable when named as a question or statement of state; e.g., is_large_column might be clearer than large_column.

  5. Magic Constants: Constants like TYPE_INT128, TYPE_VARCHAR, etc. would benefit from documentation where they are declared, so developers know what these values mean and what they are used for.

  6. Unit Testing: Ensure that these changes are adequately tested, particularly around the handling of large columns.

Here's an example refactor addressing point 1:

std::unique_ptr<Column> define_type(bool large_column) {
    if (large_column) {
        return std::make_unique<LargeBinaryColumn>();
    } else {
        return std::make_unique<BinaryColumn>();
    }
}

// In the switch-case and the else block:
*pcolumn = define_type(large_column);

Remember that code review suggestions can vary between reviewers based on their experiences, coding styles, and the specific requirements of the project.
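For point 3, a minimal sketch of what a guarded default case could look like (a hypothetical helper, not the PR's code; LogicalType, TYPE_VARCHAR, Status::NotSupported, and the column types are assumed from the StarRocks code base):

    // Hypothetical sketch: surface unsupported key types as an error status
    // instead of leaving *pcolumn untouched.
    static Status create_binary_pk_column(LogicalType type, bool large_column,
                                          std::unique_ptr<Column>* pcolumn) {
        switch (type) {
        case TYPE_VARCHAR:
            if (large_column) {
                *pcolumn = std::make_unique<LargeBinaryColumn>();
            } else {
                *pcolumn = std::make_unique<BinaryColumn>();
            }
            return Status::OK();
        default:
            return Status::NotSupported("create_column: unsupported primary key type");
        }
    }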

static Status create_column(const Schema& schema, std::unique_ptr<Column>* pcolumn,
-                            const std::vector<ColumnId>& key_idxes);
+                            const std::vector<ColumnId>& key_idxes, bool large_column = false);

static void encode(const Schema& schema, const Chunk& chunk, size_t offset, size_t len, Column* dest);


The code you've provided seems well-structured and follows decent practices. However, a few suggestions might improve its readability, maintainability, and flexibility for future changes:

  1. Consistent documentation: The added comments are welcome, but the documentation style could be more consistent. Right now, comments describe inputs inline but use new lines for the static method descriptions. You might also consider a documentation standard like Doxygen or Javadoc to make the comments more structured, clear, and usable.

  2. Overloads of create_column: There are two versions of create_column with different parameters. If the functionality is mostly the same between them, consider refactoring to remove the duplication while preserving the flexibility of the overloading. Also ensure that both methods behave consistently for the same initial set of parameters, to prevent bugs and confusion.

  3. Boolean trap: In the create_column function, the last parameter is a boolean. This can lead to a "boolean trap", where the meaning of true/false is unclear at the call site. An enum with clearly named values could be used instead to denote what this flag represents (see the sketch after this comment).

  4. Error checking: There's no clear sign of how potential errors are managed. It would be helpful to have some error checking, at least for critical parts. For instance, before creating the column, it might be useful to verify whether the given schema is valid.

  5. Usage of std::unique_ptr<Column>* pcolumn: Passing the smart pointer by reference might be preferable to passing it by pointer, while still signaling that this argument can be modified.

Remember, these points are meta-suggestions based on the part of the code you have shared. The actual improvements may differ based on the complete context and implementation of these methods. Further assessment should also consider performance and testing behavior.
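For point 3 (the boolean trap), a small sketch of the enum alternative (the ColumnWidth enum and the adjusted signature are hypothetical, not part of this PR):

    // Hypothetical alternative to the bool flag: the intent is visible at every call site.
    enum class ColumnWidth { kNormal, kLarge };

    static Status create_column(const Schema& schema, std::unique_ptr<Column>* pcolumn,
                                const std::vector<ColumnId>& key_idxes,
                                ColumnWidth width = ColumnWidth::kNormal);

    // A call site then reads unambiguously, e.g.:
    // RETURN_IF_ERROR(create_column(schema, &pk_column, key_idxes, ColumnWidth::kLarge));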

@decster decster force-pushed the bugfix-compaction-state-overflow branch from 2f9786a to 2ff0c20 on October 20, 2023 08:08
@decster decster force-pushed the bugfix-compaction-state-overflow branch 2 times, most recently from 1fae91c to da81594 on October 23, 2023 05:12
Signed-off-by: Binglin Chang <[email protected]>
@decster decster force-pushed the bugfix-compaction-state-overflow branch from da81594 to 9082438 on October 23, 2023 06:55
@wanpengfei-git
Collaborator

[FE Incremental Coverage Report]

😍 pass : 0 / 0 (0%)

@wanpengfei-git
Collaborator

[BE Incremental Coverage Report]

😞 fail : 98 / 166 (59.04%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/storage/update_compaction_state.cpp 1 7 14.29% [130, 131, 132, 138, 139, 140]
🔵 be/src/storage/primary_key_encoder.cpp 89 151 58.94% [516, 517, 518, 519, 520, 521, 522, 525, 528, 529, 530, 531, 532, 534, 535, 538, 571, 572, 573, 574, 575, 576, 577, 578, 580, 652, 653, 654, 655, 656, 657, 658, 664, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 719, 720]
🔵 be/src/storage/persistent_index.cpp 2 2 100.00% []
🔵 be/src/storage/lake/rowset_update_state.cpp 1 1 100.00% []
🔵 be/src/storage/lake/update_compaction_state.cpp 1 1 100.00% []
🔵 be/src/storage/primary_index.cpp 2 2 100.00% []
🔵 be/src/column/binary_column.cpp 1 1 100.00% []
🔵 be/src/storage/rowset_update_state.cpp 1 1 100.00% []

@decster decster merged commit 4a2c85a into StarRocks:main Oct 24, 2023
43 of 44 checks passed
@github-actions

@Mergifyio backport branch-3.2

@github-actions github-actions bot removed the 3.2 label Oct 24, 2023
@github-actions

@Mergifyio backport branch-3.1

@github-actions

@Mergifyio backport branch-3.0

@github-actions

@Mergifyio backport branch-2.5

@github-actions github-actions bot removed the 2.5 label Oct 24, 2023
@mergify
Contributor

mergify bot commented Oct 24, 2023

backport branch-3.2

✅ Backports have been created

@mergify
Contributor

mergify bot commented Oct 24, 2023

backport branch-3.1

✅ Backports have been created

@mergify
Contributor

mergify bot commented Oct 24, 2023

backport branch-3.0

✅ Backports have been created

@mergify
Contributor

mergify bot commented Oct 24, 2023

backport branch-2.5

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Oct 24, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
mergify bot pushed a commit that referenced this pull request Oct 24, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)

# Conflicts:
#	be/src/storage/lake/update_compaction_state.cpp
mergify bot pushed a commit that referenced this pull request Oct 24, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)

# Conflicts:
#	be/src/storage/lake/update_compaction_state.cpp
mergify bot pushed a commit that referenced this pull request Oct 24, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)

# Conflicts:
#	be/src/storage/lake/rowset_update_state.cpp
#	be/src/storage/lake/update_compaction_state.cpp
#	be/src/storage/primary_key_encoder.cpp
#	be/src/storage/primary_key_encoder.h
#	be/src/storage/rowset_update_state.cpp
#	be/src/storage/update_compaction_state.cpp
wanpengfei-git pushed a commit that referenced this pull request Oct 25, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
decster added a commit that referenced this pull request Oct 26, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
wanpengfei-git pushed a commit that referenced this pull request Oct 26, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
decster added a commit that referenced this pull request Oct 26, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
wanpengfei-git pushed a commit that referenced this pull request Oct 26, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
decster added a commit that referenced this pull request Oct 26, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
wanpengfei-git pushed a commit that referenced this pull request Oct 27, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
@luohaha
Contributor

luohaha commented Nov 28, 2023

https://github.com/Mergifyio backport branch-3.1-cs

Contributor

mergify bot commented Nov 28, 2023

backport branch-3.1-cs

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Nov 28, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)

# Conflicts:
#	be/src/storage/lake/update_compaction_state.cpp
mergify bot pushed a commit that referenced this pull request Nov 28, 2023
…#33246)

When the compaction output file is too large and contains many rows, the resulting buffer containing all encoded pk in this segment file may be huge and overflow. This PR fixes this by changing BinaryColumn<uint32> to BinaryColumn<int64>, adding some checks for overflow, and refactoring some compaction memory logs.

Fixes #33247

Signed-off-by: Binglin Chang <[email protected]>
(cherry picked from commit 4a2c85a)
(cherry picked from commit 707c816)

Successfully merging this pull request may close these issues.

actual row size changed after compaction
5 participants