Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] ManagedLedger: move to FENCED state in case of BadVersionException #17736

Merged
merged 2 commits into from
Sep 22, 2022

Conversation

eolivelli
Copy link
Contributor

@eolivelli eolivelli commented Sep 20, 2022

Motivation

It may happen that a ManagedLedger continues to work even in case of seeing a BadVersionException.
For instance background activities like trimming ledgers, or offloading will continue to work.
This is very dangerous as it leads to some kind of split brain (it is not actually a split brain): two brokers continue to process the metadata (and data) of the same ledger.

Modifications

  • Force moving to Fenced state the ManagedLedger in case of BadVersionException
  • Fail the "offload loop" in case of Fenced state
  • Better cleanup of the PersistentTopic

Verifying this change

I added tests that cover the changes

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

Matching PR in forked repository

PR in forked repository: eolivelli#16

After opening this PR, the build in apache/pulsar will fail and instructions will
be provided for opening a PR in the PR author's forked repository.

apache/pulsar pull requests should be first tested in your own fork since the
apache/pulsar CI based on GitHub Actions has constrained resources and quota.
GitHub Actions provides separate quota for pull requests that are executed in
a forked repository.

The tests will be run in the forked repository until all PR review comments have
been handled, the tests pass and the PR is approved by a reviewer.

-->

@eolivelli
Copy link
Contributor Author

I am adding test cases in order to have examples of the problems and to prevent regressions in the future

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 20, 2022
@eolivelli eolivelli added type/bug The PR fixed a bug or issue reported a bug area/tieredstorage area/broker labels Sep 20, 2022
@eolivelli eolivelli self-assigned this Sep 20, 2022
@eolivelli eolivelli changed the title [bugfix] ManagedLedger: move to FENCED state in case of BadVersionException (DRAFT) [bugfix] ManagedLedger: move to FENCED state in case of BadVersionException Sep 20, 2022
Copy link
Member

@zymap zymap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I didn't realize that we can leverage the fence state to let the topic reload :)

It is a great way to resolve the bad version

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work

@dlg99
Copy link
Contributor

dlg99 commented Sep 20, 2022

LGTM but it broke testTruncateCorruptDataLedger

  Error:  Tests run: 9, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 45.675 s <<< FAILURE! - in org.apache.pulsar.broker.service.BrokerBkEnsemblesTests
  Error:  testTruncateCorruptDataLedger(org.apache.pulsar.broker.service.BrokerBkEnsemblesTests)  Time elapsed: 3.13 s  <<< FAILURE!
  org.apache.pulsar.client.admin.PulsarAdminException$ServerSideErrorException: 
  
   --- An unexpected error occurred in the server ---
  
  Message: Can't trim fenced ledger
  
  Stacktrace:
  
  org.apache.bookkeeper.mledger.ManagedLedgerException$ManagedLedgerAlreadyClosedException: Can't trim fenced ledger

@eolivelli eolivelli force-pushed the fix/badversion-fence-ml branch from a2c39b8 to 136839c Compare September 20, 2022 19:30
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @eolivelli !

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eolivelli
Copy link
Contributor Author

@dlg99 I have fixed the test.
The test was trying do something that cannot happen.
it was calling "ledgerManager.delete()" and then "truncate()".

I removed "delete", the test is still meaningful, because it tests that the truncation works in case of missing ledgers (due to concurrent execution or previous deletion)

@codelipenghui
Copy link
Contributor

We already handled the BadVersion in this way before https://github.com/apache/pulsar/pull/17736/files#diff-f6a849bd8fdb782ef6c17a2e07a2c54c3dc7d1655c00ec3546d5f3b3fc61e970L1537

Looks good to me. Approved the PR

Copy link
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good solution for handling the ZK BadVersion exception.

@@ -3639,6 +3664,7 @@ private void checkManagedLedgerIsOpen() throws ManagedLedgerException {
}

synchronized void setFenced() {
log.info("{} Moving to Fenced state", name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to change the log level to warn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a "problem".
it may happen, and we are handling it safely.
there is nothing that the sysadmin should be afraid of.

we should log "WARN" or "ERROR" when there is something bad, and you have to take extra care

@@ -3842,12 +3868,21 @@ private void scheduleTimeoutTask() {
? Math.max(config.getAddEntryTimeoutSeconds(), config.getReadEntryTimeoutSeconds())
: timeoutSec;
this.timeoutTask = this.scheduledExecutor.scheduleAtFixedRate(safeRun(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest change

safeRun(() -> {
                checkTimeouts();
            })

to safeRun(this::checkTimeouts)

final State state = STATE_UPDATER.get(this);
if (state == State.Closed
|| state == State.Fenced) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a warn log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are already logs that say that we fenced or closed the topic.
I don't think it is worth to add something more

log.debug("[{}] Ignoring trimming request since the managed ledger was already closed", name);
trimmerMutex.unlock();
promise.completeExceptionally(new ManagedLedgerAlreadyClosedException("Can't trim closed ledger"));
return;
}
if (currentState == State.Fenced) {
log.debug("[{}] Ignoring trimming request since the managed ledger was already fenced", name);
trimmerMutex.unlock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need to unlock the trimmerMutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure.
I prefer to keep the ML in a clean status.
I did the same it works for a "closed" ML.
it is very like to being "Closed" in this point

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, after checking the code, I think we should release the lock, or else the scheduled task will always try to get the lock per 100 milliseconds.

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eolivelli eolivelli merged commit 63d4cf2 into apache:master Sep 22, 2022
Jason918 pushed a commit that referenced this pull request Sep 26, 2022
dlg99 added a commit to datastax/pulsar that referenced this pull request Oct 31, 2022
…tion (apache#17915) (#152)

(cherry picked from commit 0854032)

Fixes apache#9962 

### Motivation

Offloaded ledgers can be orphaned on topic deletion. 

This is a redo of apache#15914 which conflicted with concurrently merged apache#17736 thus resulting in apache#17889 .

apache#17736 made a decision to not allow managed ledger trimming for the fenced mledgers because in many case fencing indicates a problems that should stop all operations on mledger. At the same time fencing is used before deletion starts, so trimming added to the deletion process cannot proceed.
After discussion with @eolivelli I introduced new state, FencedForDeletion, which acts as Fenced state except for the trimming/deletion purposes.

### Modifications

Topic to be truncated before deletion to delete offloaded ledgers properly and fail if truncation fails.

### Verifying this change

local fork tests: dlg99#1

- [ ] Make sure that the change passes the CI checks.

This change added integration tests

### Does this pull request potentially affect one of the following parts:

*If `yes` was chosen, please highlight the changes*

Nothing changed in the options but admin CLI will implicitly run truncate before topic delete. 

  - Dependencies (does it add or upgrade a dependency): (yes / no)
  - The public API: (yes / no)
  - The schema: (yes / no / don't know)
  - The default values of configurations: (yes / no)
  - The wire protocol: (yes / no)
  - The rest endpoints: (yes / no)
  - The admin cli options: (yes / no)
  - Anything that affects deployment: (yes / no / don't know)

### Documentation

Check the box below or label this PR directly.

Need to update docs? 

- [ ] `doc-required` 
(Your PR needs to update docs and you will update later)
  
- [x] `doc-not-needed` 
(Please explain why)
  
- [ ] `doc` 
(Your PR contains doc changes)

- [ ] `doc-complete`
(Docs have been already added)
congbobo184 pushed a commit that referenced this pull request Nov 14, 2022
@congbobo184 congbobo184 added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Nov 14, 2022
congbobo184 pushed a commit that referenced this pull request Nov 26, 2022
congbobo184 pushed a commit that referenced this pull request Nov 26, 2022
congbobo184 added a commit that referenced this pull request Nov 27, 2022
…est license header (#18640)

1. cherry-pick: #17736 problem fix

2. fix cherry-pick 17526 bugs
https://github.com/poorbarcode/pulsar/blob/4db04f509451dc1f17b96009b2f5e268c1ea644b/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L2015
it should change to !createTopicFuture.isPresent() not createTopicFuture.isPresent()


3. and fix import
4. fix cherry-pick https://github.com/apache/pulsar/pull18283 testDeleteTopicAndSchemaForV1, throw exception messages error
5. run branch-2.9 test

### Motivation
fix branch-2.9 test

### Documentation

- [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->

### Matching PR in forked repository
lhotari added a commit to lhotari/pulsar that referenced this pull request Jun 7, 2023
… race conditions

- solution was introduced in apache#17526
- however, it was accidentially replaced with a call to the incorrect
  method signature in apache#17736
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.