Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][metadata] Cleanup state when lock revalidation gets LockBusyException #17700

Merged
merged 17 commits into from
Sep 20, 2022
Merged

[fix][metadata] Cleanup state when lock revalidation gets LockBusyException #17700

merged 17 commits into from
Sep 20, 2022

Conversation

mattisonchao
Copy link
Member

@mattisonchao mattisonchao commented Sep 17, 2022

Motivation

In the production environment, we found two brokers holding the same valid locks. and one has an exceptional revalidate future with lockBusyException. after reading the code, there may forget the reset the cache and complete expire exception when getting lockBusyException.

Snapshot: broker A

image

Snapshot: broker B

image

Modifications

  • Cleanup state when lock revalidation gets LockBusyException
  • Introduce retryWhenConnectionLost parameter to avoid revalidating unused lock when acquiring fail.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@mattisonchao mattisonchao self-assigned this Sep 17, 2022
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 17, 2022
@mattisonchao mattisonchao changed the title [fix][metadata] Complete expire future when revalidate got LockBusyException [fix][metadata] Cleanup state when lock revalidation gets LockBusyException Sep 17, 2022
Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this right, the root cause this case is that we missed handling revalidate failure in revalidateIfNeededAfterReconnection. Can we just add it there, like lockWasInvalidated did?

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have suggestions, but I think that we must cover this kind of changes with tests

@michaeljmarshall @lhotari you were working on this part last week PTAL

@mattisonchao
Copy link
Member Author

If I understand this right, the root cause this case is that we missed handling revalidate failure in revalidateIfNeededAfterReconnection. Can we just add it there, like lockWasInvalidated did?

I don't want to invoke exception handling everywhere.

@mattisonchao mattisonchao marked this pull request as draft September 19, 2022 01:13
@mattisonchao
Copy link
Member Author

Add testing, convert to draft.

@mattisonchao mattisonchao marked this pull request as ready for review September 19, 2022 02:52
@mattisonchao mattisonchao marked this pull request as draft September 19, 2022 03:01

ResourceLock<String> lock1 = lm1.acquireLock(path1, "value-1").join();
AtomicReference<ResourceLock<String>> lock2 = new AtomicReference<>();
// lock 2 will steal the distributed lock first.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we need the steal lock operation.
But this is a potential risk. It will introduce too many small ledgers if we encounter this problem

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codelipenghui codelipenghui merged commit 955ae34 into apache:master Sep 20, 2022
@codelipenghui codelipenghui added this to the 2.12.0 milestone Sep 20, 2022
@codelipenghui codelipenghui added release/2.9.4 release/2.11.1 release/2.10.3 type/bug The PR fixed a bug or issue reported a bug labels Sep 20, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Sep 20, 2022
…ception` (apache#17700)

### Motivation

In the production environment,  we found two brokers holding the same valid locks. and one has an exceptional revalidate future with `lockBusyException`. after reading the code, there may forget the reset the cache and complete expire exception when getting lockBusyException.

(cherry picked from commit 955ae34)
codelipenghui pushed a commit that referenced this pull request Sep 20, 2022
…ception` (#17700)

### Motivation

In the production environment,  we found two brokers holding the same valid locks. and one has an exceptional revalidate future with `lockBusyException`. after reading the code, there may forget the reset the cache and complete expire exception when getting lockBusyException.

(cherry picked from commit 955ae34)
codelipenghui pushed a commit that referenced this pull request Sep 20, 2022
…ception` (#17700)

### Motivation

In the production environment,  we found two brokers holding the same valid locks. and one has an exceptional revalidate future with `lockBusyException`. after reading the code, there may forget the reset the cache and complete expire exception when getting lockBusyException.

(cherry picked from commit 955ae34)
codelipenghui pushed a commit that referenced this pull request Sep 20, 2022
…ception` (#17700)

### Motivation

In the production environment,  we found two brokers holding the same valid locks. and one has an exceptional revalidate future with `lockBusyException`. after reading the code, there may forget the reset the cache and complete expire exception when getting lockBusyException.

(cherry picked from commit 955ae34)
@codelipenghui codelipenghui modified the milestones: 2.12.0, 2.11.0 Sep 20, 2022
aloyszhang pushed a commit to aloyszhang/pulsar that referenced this pull request Jan 9, 2023
…est !188)

Squash merge branch 'release_2.8.1.4_fix_metadata' into 'release-2.8.1.4'
Fixes #<xyz>

### Motivation

chery pick了三个PR:

PIP-45: Handle session events and invalidations from single thread (apache#12184)
[fix][metadata] Set revalidateAfterReconnection true for certain failures apache#17664
[fix][metadata] Cleanup state when lock revalidation gets LockBusyException apache#17700

TAPD: --story=881235137
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants