
[improve][broker] Optimize high CPU usage when consuming from topics with ongoing txn #23189

Merged
merged 2 commits into apache:master on Aug 20, 2024

Conversation

coderzc
Member

@coderzc coderzc commented Aug 17, 2024

Motivation

We found the broker's CPU busy calling ManagedLedgerImpl.internalReadFromLedger: the broker checked readPosition > maxPosition and then triggered readMoreEntries again, so it kept looping on readMoreEntries. However, ManagedCursorImpl.asyncReadEntriesWithSkipOrWait already checks via hasMoreEntries() whether there is more data to read. I think this case may be caused by maxReadPosition < lastConfirmedPosition when the topic has an ongoing txn. So when maxPosition <= readPosition we should not read entries immediately; instead, we should delay the readEntries call.


Modifications

If maxPosition < readPosition, trigger readEntries with a delay instead of immediately.
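
The following is a minimal, self-contained sketch of the delayed-retry idea only, not the actual patch: the class, field, and method names are illustrative, positions are simplified to longs, and the delay constant is a hypothetical stand-in for managedLedgerNewEntriesCheckDelayInMillis.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch: when readPosition has passed maxPosition (e.g. because
    // maxReadPosition is held back by an ongoing transaction), re-schedule the read
    // after a short delay instead of retrying immediately, which is what spins the CPU.
    public class DelayedReadSketch {
        // Hypothetical stand-in for managedLedgerNewEntriesCheckDelayInMillis.
        private static final long NEW_ENTRIES_CHECK_DELAY_MS = 10;

        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        void readEntries(long readPosition, long maxPosition) {
            if (readPosition > maxPosition) {
                // Nothing is readable yet: retry later instead of calling readMoreEntries right away.
                scheduler.schedule(() -> readEntries(readPosition, maxPosition),
                        NEW_ENTRIES_CHECK_DELAY_MS, TimeUnit.MILLISECONDS);
                return;
            }
            // ... read entries up to maxPosition from the ledger and dispatch them ...
        }
    }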

Test Code:

    @Test
    public void testSlowTxn() throws Exception {
        String topic = NAMESPACE1 + "/testSlowTxn";
        @Cleanup
        ProducerImpl<byte[]> producer = (ProducerImpl<byte[]>) pulsarClient.newProducer()
                .topic(topic)
                .sendTimeout(1, TimeUnit.SECONDS)
                .create();

        @Cleanup
        Consumer<byte[]> consumer = pulsarClient.newConsumer()
                .topic(topic)
                .subscriptionName("test")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        Transaction transaction = pulsarClient.newTransaction().withTransactionTimeout(10, TimeUnit.MINUTES)
                .build().get();

        // Publish one transactional message but leave the transaction open,
        // so maxReadPosition stays behind the last confirmed entry.
        producer.newMessage(transaction).value("Hello Pulsar!".getBytes()).send();

        // Keep the transaction open for 10 minutes to observe broker CPU usage
        // while the consumer waits for the message.
        Thread.sleep(10 * 60 * 1000);

        transaction.commit().get();
        producer.close();
        admin.topics().delete(topic, true);
    }

CPU usage before applying this change:

flamegraph: https://drive.google.com/file/d/1nNb4MOdbZB7mO4fWitts2UpjzutdT22O/view?usp=sharing

CPU usage after applying this change:

flamegraph: https://drive.google.com/file/d/1AndMJuSMXhOImf3T0hg_E7YeCyslTaNI/view?usp=sharing

Verifying this change

  • Make sure that the change passes the CI checks.


Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete


@coderzc coderzc marked this pull request as draft August 17, 2024 11:25
@github-actions github-actions bot added the doc-not-needed label (Your PR changes do not impact docs) Aug 17, 2024
@coderzc coderzc marked this pull request as ready for review August 19, 2024 02:26
@coderzc coderzc requested review from lhotari and shibd August 19, 2024 02:38
@thetumbled
Member

Same problem as #22944?

@coderzc
Member Author

coderzc commented Aug 19, 2024

Same problem as #22944?

Looks like yes, I will review #22944

@lhotari
Member

@lhotari lhotari left a comment

LGTM. Good catch @coderzc!

@coderzc
Member Author

coderzc commented Aug 20, 2024

This is a quick fix; it is only effective when managedLedgerNewEntriesCheckDelayInMillis > 0. We can merge this PR first and continue to review #22944.
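
For reference, this is roughly how that setting appears in broker.conf; the value shown is illustrative (the shipped default is believed to be 10), not a recommendation:

    # Delay, in milliseconds, before the broker re-checks for new entries.
    # The quick fix above only takes effect when this value is greater than 0.
    managedLedgerNewEntriesCheckDelayInMillis=10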

@coderzc coderzc merged commit 94e1341 into apache:master Aug 20, 2024
54 of 57 checks passed
@coderzc coderzc added the type/bug (The PR fixed a bug or issue reported a bug), area/broker, release/3.0.7, and release/3.3.2 labels Aug 20, 2024
coderzc added a commit that referenced this pull request Aug 21, 2024
coderzc added a commit that referenced this pull request Aug 21, 2024
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 22, 2024
…with ongoing txn (apache#23189)

(cherry picked from commit 94e1341)
(cherry picked from commit b7ffa73)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 23, 2024
…with ongoing txn (apache#23189)

(cherry picked from commit 94e1341)
(cherry picked from commit b7ffa73)
grssam pushed a commit to grssam/pulsar that referenced this pull request Sep 4, 2024
@lhotari lhotari added this to the 4.0.0 milestone Oct 14, 2024