Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix unexpected subscription deletion caused by the cursor last active time not updated in time #17573

Merged
merged 2 commits into from
Sep 16, 2022

Conversation

dragonls
Copy link
Contributor

@dragonls dragonls commented Sep 9, 2022

Fixes #17572

Motivation

The lastActive in ManagedCursorImpl only be updated in 3 places:

  • org.apache.pulsar.broker.service.persistent.PersistentSubscription#addConsumer
  • org.apache.pulsar.broker.service.persistent.PersistentSubscription#removeConsumer
  • org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#internalResetCursor

If there are brokers shutdown with OOM or other situation, which make the broker not shutdown gracefully, the new broker load the topics and then do subscription expiry check soon before consumer reconnected, then the subscription may be deleted unexpectedly.

Need to update lastActive in ManagedCursorImpl while consuming stably, such as consumer acknowledged messages, which makes lastActive closer to the last active meaning.

Note

This PR mainly fix the bug that subscription may be deleted unexpectedly caused by lastActive not being updated in time, which depends on the lastActive persistence in zk.

lastActive is saved in zk in org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#persistPositionMetaStore:

private void persistPositionMetaStore(long cursorsLedgerId, PositionImpl position, Map<String, Long> properties,
MetaStoreCallback<Void> callback, boolean persistIndividualDeletedMessageRanges) {
if (state == State.Closed) {
ledger.getExecutor().execute(safeRun(() -> callback.operationFailed(new MetaStoreException(
new CursorAlreadyClosedException(name + " cursor already closed")))));
return;
}
// When closing we store the last mark-delete position in the z-node itself, so we won't need the cursor ledger,
// hence we write it as -1. The cursor ledger is deleted once the z-node write is confirmed.
ManagedCursorInfo.Builder info = ManagedCursorInfo.newBuilder() //
.setCursorsLedgerId(cursorsLedgerId) //
.setMarkDeleteLedgerId(position.getLedgerId()) //
.setMarkDeleteEntryId(position.getEntryId()) //
.setLastActive(lastActive); //

However, the persistence is triggered only in 4 places:
image

All of them are not high frequency triggered. The new broker may recover the cursor with lastActive from zk which is not the real last active time. The unexpectedly subscription deletion still can appear during two persistence having broker shutdown not gracefully.

The better solution is to save the lastActive into cursor ledger instead of only save to zk, but this is not the problem this PR solving, and need further discussion.

Modifications

  • Add cursor.updateLastActive() in org.apache.pulsar.broker.service.persistent.PersistentSubscription#acknowledgeMessage.
  • Add some logs about cursor last active timestamp.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • org.apache.pulsar.broker.service.persistent.PersistentTopicTest#testUpdateCursorLastActive
  • org.apache.pulsar.broker.service.persistent.PersistentSubscriptionTest#testAcknowledgeUpdateCursorLastActive

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 9, 2022
Copy link
Contributor

@liangyepianzhou liangyepianzhou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this issue does not relate to the broker restart, right?
It is that the subscriptionExpirationTimeMinutes should be the last ack time of the consumers in the subscription.
Instead of the last time that the consumer was modified (consumer add or remove) in the subscription.

@dragonls
Copy link
Contributor Author

dragonls commented Sep 13, 2022

If I understand correctly, this issue does not relate to the broker restart, right? It is that the subscriptionExpirationTimeMinutes should be the last ack time of the consumers in the subscription. Instead of the last time that the consumer was modified (consumer add or remove) in the subscription.

If the broker shutdown gracefully, everything is fine. Because it will go through the RemoveConsumer, the lastActive will be updated.

As for the check of subscriptionExpirationTimeMinutes, see

if (sub.dispatcher != null && sub.dispatcher.isConsumerConnected() || sub.isReplicated()) {
return;
}
if (System.currentTimeMillis() - sub.cursor.getLastActive() > expirationTimeMillis) {
sub.delete().thenAccept(v -> log.info("[{}][{}] The subscription was deleted due to expiration",
topic, subName));
}
});

If the consumer is connected, it will skip the check.

@dragonls dragonls requested review from HQebupt and liangyepianzhou and removed request for HQebupt and liangyepianzhou September 15, 2022 03:26
@mattisonchao mattisonchao added this to the 2.12.0 milestone Sep 15, 2022
@mattisonchao mattisonchao added area/broker type/bug The PR fixed a bug or issue reported a bug labels Sep 15, 2022
@mattisonchao mattisonchao requested review from codelipenghui, eolivelli and nicoloboschi and removed request for HQebupt September 15, 2022 11:32
Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are brokers shutdown with OOM or other situation, which make the broker not shutdown gracefully, the new broker load the topics and then do subscription expiry check soon before consumer reconnected, then the subscription may be deleted unexpectedly.

If the topic is loaded by another broker, the lastActive will be reset to clock.millis()

How the above situation can happen?

@dragonls
Copy link
Contributor Author

If there are brokers shutdown with OOM or other situation, which make the broker not shutdown gracefully, the new broker load the topics and then do subscription expiry check soon before consumer reconnected, then the subscription may be deleted unexpectedly.

If the topic is loaded by another broker, the lastActive will be reset to clock.millis()

How the above situation can happen?

lastActive is set to clock.millis() while initilized, but will be covered by recover() soon.

lastActive = info.getLastActive() != 0 ? info.getLastActive() : lastActive;

@codelipenghui
Copy link
Contributor

lastActive is set to clock.millis() while initilized, but will be covered by recover() soon.

I see.

Actually, only ledger rollover and updating the cursor Ledger failed will cause the lastActive update to zookeeper right?
It also has a chance to delete the subscription unexpectedly if no ledger rollover and cursor ledger update failure happens.

@codelipenghui
Copy link
Contributor

If we can persistent the lastActive to cursor Ledger, we will get better for this situation.

@dragonls
Copy link
Contributor Author

If we can persistent the lastActive to cursor Ledger, we will get better for this situation.

I will take a look at this and try to persistent the lastActive to cursor ledger.

@dragonls
Copy link
Contributor Author

dragonls commented Sep 16, 2022

After discussed with @codelipenghui , this PR mainly fix the bug that subscription may be deleted unexpectedly caused by lastActive not being updated in time, which depends on the lastActive persistence in zk.

lastActive is saved in zk in org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#persistPositionMetaStore:

private void persistPositionMetaStore(long cursorsLedgerId, PositionImpl position, Map<String, Long> properties,
MetaStoreCallback<Void> callback, boolean persistIndividualDeletedMessageRanges) {
if (state == State.Closed) {
ledger.getExecutor().execute(safeRun(() -> callback.operationFailed(new MetaStoreException(
new CursorAlreadyClosedException(name + " cursor already closed")))));
return;
}
// When closing we store the last mark-delete position in the z-node itself, so we won't need the cursor ledger,
// hence we write it as -1. The cursor ledger is deleted once the z-node write is confirmed.
ManagedCursorInfo.Builder info = ManagedCursorInfo.newBuilder() //
.setCursorsLedgerId(cursorsLedgerId) //
.setMarkDeleteLedgerId(position.getLedgerId()) //
.setMarkDeleteEntryId(position.getEntryId()) //
.setLastActive(lastActive); //

However, the persistence is triggered only in 4 places:
image

All of them are not high frequency triggered. The new broker may recover the cursor with lastActive from zk which is not the real last active time. The unexpectedly subscription deletion still can appear during two persistence having broker shutdown not gracefully.

The better solution is to save the lastActive into cursor ledger instead of only save to zk, but this is not the problem this PR solving, and need further discussion.

@dragonls dragonls changed the title [fix][broker] Fix unexpected subscription deletion [fix][broker] Fix unexpected subscription deletion caused by the cursor last active time not updated in time Sep 16, 2022
@codelipenghui
Copy link
Contributor

@dragonls I have approved the PR. Please also update your last comment to the PR description. So that we don't to go through all the comment to understand what is the scope of this PR and what is not the scope of this PR

@codelipenghui codelipenghui merged commit 8a9d70a into apache:master Sep 16, 2022
@dragonls dragonls deleted the fix-subscription-last-active branch September 19, 2022 10:05
nodece pushed a commit to ascentstream/pulsar that referenced this pull request Sep 4, 2024
…or last active time not updated in time (apache#17573)

(cherry picked from commit 8a9d70a)
Signed-off-by: Zixuan Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker doc-not-needed Your PR changes do not impact docs type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Subscription is deleted unexpectedly
6 participants