-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bugfix] ManagedLedger: move to FENCED state in case of BadVersionException #17736
Conversation
I am adding test cases in order to have examples of the problems and to prevent regressions in the future |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I didn't realize that we can leverage the fence state to let the topic reload :)
It is a great way to resolve the bad version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work
LGTM but it broke testTruncateCorruptDataLedger
|
a2c39b8
to
136839c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work @eolivelli !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@dlg99 I have fixed the test. I removed "delete", the test is still meaningful, because it tests that the truncation works in case of missing ledgers (due to concurrent execution or previous deletion) |
We already handled the BadVersion in this way before https://github.com/apache/pulsar/pull/17736/files#diff-f6a849bd8fdb782ef6c17a2e07a2c54c3dc7d1655c00ec3546d5f3b3fc61e970L1537 Looks good to me. Approved the PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good solution for handling the ZK BadVersion exception.
@@ -3639,6 +3664,7 @@ private void checkManagedLedgerIsOpen() throws ManagedLedgerException { | |||
} | |||
|
|||
synchronized void setFenced() { | |||
log.info("{} Moving to Fenced state", name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to change the log level to warn
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a "problem".
it may happen, and we are handling it safely.
there is nothing that the sysadmin should be afraid of.
we should log "WARN" or "ERROR" when there is something bad, and you have to take extra care
@@ -3842,12 +3868,21 @@ private void scheduleTimeoutTask() { | |||
? Math.max(config.getAddEntryTimeoutSeconds(), config.getReadEntryTimeoutSeconds()) | |||
: timeoutSec; | |||
this.timeoutTask = this.scheduledExecutor.scheduleAtFixedRate(safeRun(() -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest change
safeRun(() -> {
checkTimeouts();
})
to safeRun(this::checkTimeouts)
final State state = STATE_UPDATER.get(this); | ||
if (state == State.Closed | ||
|| state == State.Fenced) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add a warn
log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are already logs that say that we fenced or closed the topic.
I don't think it is worth to add something more
log.debug("[{}] Ignoring trimming request since the managed ledger was already closed", name); | ||
trimmerMutex.unlock(); | ||
promise.completeExceptionally(new ManagedLedgerAlreadyClosedException("Can't trim closed ledger")); | ||
return; | ||
} | ||
if (currentState == State.Fenced) { | ||
log.debug("[{}] Ignoring trimming request since the managed ledger was already fenced", name); | ||
trimmerMutex.unlock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't need to unlock the trimmerMutex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure.
I prefer to keep the ML in a clean status.
I did the same it works for a "closed" ML.
it is very like to being "Closed" in this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, after checking the code, I think we should release the lock, or else the scheduled task will always try to get the lock per 100 milliseconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…tion (apache#17915) (#152) (cherry picked from commit 0854032) Fixes apache#9962 ### Motivation Offloaded ledgers can be orphaned on topic deletion. This is a redo of apache#15914 which conflicted with concurrently merged apache#17736 thus resulting in apache#17889 . apache#17736 made a decision to not allow managed ledger trimming for the fenced mledgers because in many case fencing indicates a problems that should stop all operations on mledger. At the same time fencing is used before deletion starts, so trimming added to the deletion process cannot proceed. After discussion with @eolivelli I introduced new state, FencedForDeletion, which acts as Fenced state except for the trimming/deletion purposes. ### Modifications Topic to be truncated before deletion to delete offloaded ledgers properly and fail if truncation fails. ### Verifying this change local fork tests: dlg99#1 - [ ] Make sure that the change passes the CI checks. This change added integration tests ### Does this pull request potentially affect one of the following parts: *If `yes` was chosen, please highlight the changes* Nothing changed in the options but admin CLI will implicitly run truncate before topic delete. - Dependencies (does it add or upgrade a dependency): (yes / no) - The public API: (yes / no) - The schema: (yes / no / don't know) - The default values of configurations: (yes / no) - The wire protocol: (yes / no) - The rest endpoints: (yes / no) - The admin cli options: (yes / no) - Anything that affects deployment: (yes / no / don't know) ### Documentation Check the box below or label this PR directly. Need to update docs? - [ ] `doc-required` (Your PR needs to update docs and you will update later) - [x] `doc-not-needed` (Please explain why) - [ ] `doc` (Your PR contains doc changes) - [ ] `doc-complete` (Docs have been already added)
…est license header (#18640) 1. cherry-pick: #17736 problem fix 2. fix cherry-pick 17526 bugs https://github.com/poorbarcode/pulsar/blob/4db04f509451dc1f17b96009b2f5e268c1ea644b/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L2015 it should change to !createTopicFuture.isPresent() not createTopicFuture.isPresent() 3. and fix import 4. fix cherry-pick https://github.com/apache/pulsar/pull18283 testDeleteTopicAndSchemaForV1, throw exception messages error 5. run branch-2.9 test ### Motivation fix branch-2.9 test ### Documentation - [x] `doc-not-needed` <!-- Your PR changes do not impact docs --> ### Matching PR in forked repository
… race conditions - solution was introduced in apache#17526 - however, it was accidentially replaced with a call to the incorrect method signature in apache#17736
Motivation
It may happen that a ManagedLedger continues to work even in case of seeing a BadVersionException.
For instance background activities like trimming ledgers, or offloading will continue to work.
This is very dangerous as it leads to some kind of split brain (it is not actually a split brain): two brokers continue to process the metadata (and data) of the same ledger.
Modifications
Verifying this change
I added tests that cover the changes
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc-required
(Your PR needs to update docs and you will update later)
doc-not-needed
(Please explain why)
doc
(Your PR contains doc changes)
doc-complete
(Docs have been already added)
Matching PR in forked repository
PR in forked repository: eolivelli#16
After opening this PR, the build in apache/pulsar will fail and instructions will
be provided for opening a PR in the PR author's forked repository.
apache/pulsar pull requests should be first tested in your own fork since the
apache/pulsar CI based on GitHub Actions has constrained resources and quota.
GitHub Actions provides separate quota for pull requests that are executed in
a forked repository.
The tests will be run in the forked repository until all PR review comments have
been handled, the tests pass and the PR is approved by a reviewer.
-->