Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] Prevent long deduplication cursor backlog so that topic loading wouldn't timeout #22479

Merged
merged 8 commits into from
Apr 14, 2024

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Apr 11, 2024

Motivation

Background 1

  • When a topic is loading up, it will wait for the deduplication recovery to be finished if deduplication is enabled.
    • users can enable deduplication on the broker level, namespace level or topic level.
  • There is a scheduled task to acknowledge messages for the deduplication cursor.

Issue 1
(Highlight) the scheduled task that acknowledges messages for the deduplication cursor can only be created when the broker level deduplication is enabled, leading to an infinite backlog of deduplication cursor; the next deduplication recovery after a topic unloading will take a long time, which leads to topic load timeout.


Background 2

  • MessageDeduplication maintains a counter that indicates how many messages have not been taken snapshot yet, it increases after per add entry. Broker triggers a snapshot building once the counter is greater than the config brokerDeduplicationEntriesInterval(default value is 1000).

Issue 2, thanks for @mattisonchao 's reminding ❤️

  • send 999 messages, the counter increases to 999, does not reach brokerDeduplicationEntriesInterval(1000) now.
  • unload the topic, the counter will be reset to 0,
  • send 999 messages, the counter increases to 999, does not reach brokerDeduplicationEntriesInterval(1000) now.
  • ...
  • So the snapshot will never be built, resulting the same issue as the scenario 1.

A topic internal stats
topic-internal-stats.txt

Modifications

  • Always start the scheduled task.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost release/2.10.7 release/2.11.5 release/3.2.3 release/3.0.5 labels Apr 11, 2024
@poorbarcode poorbarcode added this to the 3.3.0 milestone Apr 11, 2024
@poorbarcode poorbarcode self-assigned this Apr 11, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Apr 11, 2024
@lhotari
Copy link
Member

lhotari commented Apr 11, 2024

The title "Deduplication cursor has infinite backlog causing topic load timeout" is misleading since the backlog isn't infinite.
This fix doesn't really address the case when there's a long backlog at the time of topic loading. It's trying to prevent the long backlog from building up. Perhaps "Prevent long deduplication cursor backlog so that topic loading wouldn't timeout" would be a better title?

@poorbarcode poorbarcode changed the title [fix] [broker] Deduplication cursor has infinite backlog causing topic load timeout [fix] [broker] Prevent long deduplication cursor backlog so that topic loading wouldn't timeout Apr 11, 2024
@poorbarcode
Copy link
Contributor Author

The title "Deduplication cursor has infinite backlog causing topic load timeout" is misleading since the backlog isn't infinite.
This fix doesn't really address the case when there's a long backlog at the time of topic loading. It's trying to prevent the long backlog from building up. Perhaps "Prevent long deduplication cursor backlog so that topic loading wouldn't timeout" would be a better title?

Done

@kristgpt
Copy link

👍

@poorbarcode poorbarcode merged commit 837f8bc into apache:master Apr 14, 2024
50 of 52 checks passed
poorbarcode added a commit that referenced this pull request Apr 15, 2024
…c loading wouldn't timeout (#22479)

(cherry picked from commit 837f8bc)
poorbarcode added a commit that referenced this pull request Apr 15, 2024
…c loading wouldn't timeout (#22479)

(cherry picked from commit 837f8bc)
poorbarcode added a commit that referenced this pull request Apr 15, 2024
…c loading wouldn't timeout (#22479)

(cherry picked from commit 837f8bc)
poorbarcode added a commit that referenced this pull request Apr 15, 2024
…c loading wouldn't timeout (#22479)

(cherry picked from commit 837f8bc)
poorbarcode added a commit that referenced this pull request Apr 15, 2024
…hat topic loading wouldn't timeout (#22479)"

This reverts commit 7cec82f.
poorbarcode added a commit that referenced this pull request Apr 15, 2024
…c loading wouldn't timeout (#22479)

(cherry picked from commit 837f8bc)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 16, 2024
…c loading wouldn't timeout (apache#22479)

(cherry picked from commit 837f8bc)
(cherry picked from commit 0fbcbb2)
@poorbarcode
Copy link
Contributor Author

This may leave BrokerDeduplicationEnabled not working, and we also have admin-API for getting/setting this value

Do not worry, it only affects the scheduled task.

mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…c loading wouldn't timeout (apache#22479)

(cherry picked from commit 837f8bc)
(cherry picked from commit 0fbcbb2)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…c loading wouldn't timeout (apache#22479)

(cherry picked from commit 837f8bc)
(cherry picked from commit 0fbcbb2)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 19, 2024
…c loading wouldn't timeout (apache#22479)

(cherry picked from commit 837f8bc)
(cherry picked from commit 0fbcbb2)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 23, 2024
…c loading wouldn't timeout (apache#22479)

(cherry picked from commit 837f8bc)
(cherry picked from commit 0fbcbb2)
nodece pushed a commit to ascentstream/pulsar that referenced this pull request May 13, 2024
…c loading wouldn't timeout (apache#22479)

(cherry picked from commit 837f8bc)
@lhotari
Copy link
Member

lhotari commented May 14, 2024

@poorbarcode DeduplicationDisabledBrokerLevelTest.testSnapshotCounterAfterUnload fails in branch-3.2 and blocks 3.2.3 release. do you have a chance to fix the problem in branch-3.2 ?

@poorbarcode
Copy link
Contributor Author

@poorbarcode DeduplicationDisabledBrokerLevelTest.testSnapshotCounterAfterUnload fails in branch-3.2 and blocks 3.2.3 release. do you have a chance to fix the problem in branch-3.2 ?

@lhotari found the issue. PR 22034 exists in branch-3.2 so the test is wrong. resolved with commit c2532b9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants