
[improvement][client-java] Avoid too large memory preallocation for batch message. #15033

Merged: 10 commits merged into apache:master on Sep 19, 2022

Conversation

@tjiuming (Contributor) commented Apr 6, 2022

Fixes (#14943)

Motivation

Allocate a small buffer for the batch message first, and let it grow (with no memory resizes or copies).
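As a rough illustration only (a hedged sketch using plain Netty types, not the exact BatchMessageContainerImpl change), the difference is roughly:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.CompositeByteBuf;

class LazyBatchBufferSketch {
    // Before: one large up-front allocation sized for the worst-case batch.
    ByteBuf preallocated(int maxBatchSize) {
        return ByteBufAllocator.DEFAULT.buffer(maxBatchSize);
    }

    // After: start empty and claim memory lazily, one component per message payload.
    CompositeByteBuf lazy() {
        return ByteBufAllocator.DEFAULT.compositeBuffer();
    }

    void append(CompositeByteBuf batch, ByteBuf payload) {
        // increaseWriterIndex = true appends the payload as a new component without copying it.
        batch.addComponent(true, payload);
    }
}
```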

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

  • doc-required

    (If you need help on updating docs, create a doc issue)

  • no-need-doc

    (Please explain why)

  • doc

    (If this PR contains doc changes)

@github-actions github-actions bot added the doc-not-needed label Apr 6, 2022
@github-actions (bot) commented:

The PR had no activity for 30 days, marking with the Stale label.

@Jason918 (Contributor) left a comment:

LGTM

@tjiuming (Contributor, Author) commented:

@codelipenghui seems the PR is ready to merge

@github-actions (bot) commented:

The PR had no activity for 30 days, marking with the Stale label.

@github-actions github-actions bot added the Stale label Jun 26, 2022
@tjiuming (Contributor, Author) commented Jul 2, 2022

@codelipenghui Please help merge the PR

@github-actions github-actions bot removed the Stale label Jul 3, 2022
@Jason918 (Contributor) commented Jul 4, 2022

> @codelipenghui Please help merge the PR

Please fix the CI failure @tjiuming

@codelipenghui codelipenghui modified the milestones: 2.11.0, 2.12.0 Jul 26, 2022
@github-actions github-actions bot added the doc-label-missing label and removed the doc-not-needed label Sep 17, 2022
@github-actions (bot) commented:

@tjiuming Please provide a correct documentation label for your PR.
For instructions, see the Pulsar Documentation Label Guide.

@Jason918 Jason918 added the doc-not-needed label and removed the doc-label-missing label Sep 17, 2022
@github-actions github-actions bot added the doc-label-missing label and removed the doc-not-needed label Sep 17, 2022
@AnonHxy (Contributor) commented Sep 18, 2022

This patch broke the org.apache.pulsar.client.impl.BatchMessageContainerImplTest.recoveryAfterOom test. Please help fix it, thanks @tjiuming

@tjiuming (Contributor, Author) commented:

> This patch broke the org.apache.pulsar.client.impl.BatchMessageContainerImplTest.recoveryAfterOom test. Please help fix it, thanks @tjiuming

@AnonHxy Yes, I've fixed it, PTAL

@AnonHxy AnonHxy closed this Sep 19, 2022
@AnonHxy AnonHxy reopened this Sep 19, 2022
@AnonHxy (Contributor) commented Sep 19, 2022

> This patch broke the org.apache.pulsar.client.impl.BatchMessageContainerImplTest.recoveryAfterOom test. Please help fix it, thanks @tjiuming
>
> @AnonHxy Yes, I've fixed it, PTAL

There are also some checkstyle failures. Please make sure CI passes in your own fork repo first @tjiuming

@AnonHxy AnonHxy merged commit 996666a into apache:master Sep 19, 2022
@tjiuming tjiuming deleted the dev/batch_msg_allocator branch September 19, 2022 07:51
@merlimat (Contributor) commented:

@tjiuming @Jason918 @AnonHxy

> Allocate a small buffer for the batch message first, and let it grow (with no memory resizes or copies).

How can you say there are no memory resizes or copies?

I'm still not convinced that this change is not bringing a significant performance regression.

@AnonHxy (Contributor) commented Sep 20, 2022

Hi @merlimat, compositeBuffer will not lead to memory copies until the componentCount reaches maxNumComponents, whose default value is 16.

If we want to avoid any memory copies, I think we could set the value of maxNumComponents equal to batchingMaxMessages, whose default is 1000. WDYT
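A minimal sketch of that suggestion, assuming the producer's batchingMaxMessages value can be passed down to the allocation (illustrative names, not the actual Pulsar code):

```java
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.CompositeByteBuf;

class BatchAllocationSketch {
    // Assumed: the producer's batchingMaxMessages (default 1000) is available here.
    CompositeByteBuf newBatchBuffer(int batchingMaxMessages) {
        // With maxNumComponents == batchingMaxMessages, adding one component per batched
        // message stays below the consolidation threshold, so no merge/copy while batching.
        return ByteBufAllocator.DEFAULT.compositeBuffer(batchingMaxMessages);
    }
}
```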

@Jason918 (Contributor) commented:

@AnonHxy Can we do a quick test to validate this? Focus on the case with a large number of messages in one batch.

@AnonHxy (Contributor) commented Sep 20, 2022

> @AnonHxy Can we do a quick test to validate this? Focus on the case with a large number of messages in one batch.

Sure

@merlimat (Contributor) commented:

@AnonHxy If we are writing bytes into the composite buffer, it either needs to have space in its own last buffer, or it has to allocate a new buffer and copy data into it.

@tjiuming (Contributor, Author) commented Sep 20, 2022

@merlimat @AnonHxy @Jason918
For CompositeByteBuf, if the number of components in the buffer does not exceed maxComponent, it will not resize or copy memory. But if the capacity is not enough and a new component is needed beyond that limit, all the components will be merged (copied) into a new component.
Here I use the default value of maxComponent, which is 16, because I was not sure what exact value to set.
But on the whole, the buffer is allocated lazily, which means less wasted memory. To some extent, it achieves the purpose of saving memory.
To reduce copies/resizes, maybe we can calculate a suitable value of maxComponent from maxBatchSize; is that required?
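A small standalone illustration of the behavior described above, assuming plain Netty (not Pulsar code): components accumulate copy-free while the count stays within maxComponent, and crossing that limit is where the merge/copy can happen.

```java
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class CompositeGrowthDemo {
    public static void main(String[] args) {
        // Default limit used in the patch: 16 components.
        CompositeByteBuf batch = ByteBufAllocator.DEFAULT.compositeBuffer(16);
        for (int i = 1; i <= 20; i++) {
            // Each added component is referenced, not copied, while the count stays
            // within the limit; beyond it, Netty consolidates (merge-copies) components.
            batch.addComponent(true, Unpooled.wrappedBuffer(new byte[64]));
            System.out.println("after add " + i + ": numComponents = " + batch.numComponents());
        }
        batch.release();
    }
}
```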

@merlimat (Contributor) commented:

> Here I use the default value of maxComponent, which is 16, because I was not sure what exact value to set.
> But on the whole, the buffer is allocated lazily, which means less wasted memory. To some extent, it achieves the purpose of saving memory.

Allocating many small buffers will have a significantly larger impact than allocating a single large buffer:

  1. Netty allocator synchronization
  2. Scattered write on the socket from many small buffers

@tjiuming (Contributor, Author) commented Sep 20, 2022

@merlimat
Yes, there is allocator synchronization, but in ProducerImpl#sendAsync(Message<?> message, SendCallback callback):

```java
try {
    synchronized (this) {
        // ......
    }
} catch (PulsarClientException e) {
    // ......
} catch (Throwable t) {
    // ......
}
```

which means that for a Producer, only one thread can send messages at the same time. I think even though there is allocator synchronization, it wouldn't have a significant impact.

> Scattered write on the socket from many small buffers

For this part, I don't think it will make a big difference. Yes, reading bytes from a CompositeByteBuf is indeed slower than from a single large buffer, because it needs to resolve the readerIndex across components, but when writing to the channel, the CompositeByteBuf is transformed into a ByteBuffer[], which is not much slower than a single ByteBuffer.

I think the only problem is that when the batch is too large, it will lead to memory copies/resizes.
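For the scattered-write point, a rough sketch (illustrative only; the real write path lives inside Netty's channel implementation) of how a CompositeByteBuf becomes a ByteBuffer[] for a single gathering write:

```java
import io.netty.buffer.CompositeByteBuf;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;

class GatheringWriteSketch {
    // Writes the batch without first merging its components into one contiguous buffer.
    static long writeBatch(GatheringByteChannel channel, CompositeByteBuf batch) throws IOException {
        ByteBuffer[] components = batch.nioBuffers(); // one ByteBuffer per readable component
        return channel.write(components);             // single writev-style gathering write
    }
}
```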

@AnonHxy (Contributor) commented Sep 25, 2022

I have tested this feature using a standalone cluster. It seems that this PR may bring a performance regression. The reason is that:
When using ByteBufAllocator#buffer(int), the ByteBuf writing took only a little CPU time:
[flame graph screenshot: ByteBuf write with ByteBufAllocator#buffer(int)]

And when using ByteBufAllocator#compositeBuffer, the ByteBuf write took more CPU time:
[flame graph screenshot: ByteBuf write with ByteBufAllocator#compositeBuffer]

I think another solution is that we can introduce a way to shrink maxBatchSize. For example, if the current batch size is less than 3/4 * maxBatchSize, we let maxBatchSize = 3/4 * maxBatchSize.
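A minimal sketch of this shrinking idea, with hypothetical names (nothing here is existing Pulsar code): after each batch is flushed, decay the size estimate whenever the batch used noticeably less than it.

```java
class MaxBatchSizeShrinkSketch {
    // Hypothetical starting estimate for the pre-allocated batch buffer, in bytes.
    private int maxBatchSize = 4 * 1024 * 1024;

    // Called after a batch is flushed, with the number of bytes it actually used.
    void onBatchFlushed(int currentBatchSize) {
        int threshold = maxBatchSize / 4 * 3; // 3/4 * maxBatchSize
        if (currentBatchSize < threshold) {
            // The batch stayed well below the estimate, so shrink the next pre-allocation.
            maxBatchSize = threshold;
        }
    }
}
```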

@merlimat @tjiuming @Jason918 @codelipenghui

@Jason918 (Contributor) commented:

@AnonHxy Is there a significant throughput difference between these two?

@AnonHxy (Contributor) commented Sep 26, 2022

> Is there a significant throughput difference between these two?

OK. @Jason918

  1. First I set up a standalone cluster using apache-pulsar-2.10.1
  2. Then I started a producer like the one below to publish messages, launched with the command ~/houxiaoyu/jdk-17.0.4/bin/java -Xms2000m -Xmx2000m -XX:MaxDirectMemorySize=4000m -jar pulsar-buffer.jar:
```java
public static void main(String[] args) throws Exception {
    String topic = "hxy-test-mem";
    byte[] smallMsg = new byte[64];
    PulsarClient client = PulsarClient.builder()
            .serviceUrl("pulsar://127.0.0.1:6650")
            .memoryLimit(2000, SizeUnit.MEGA_BYTES)  // memory limit 2000M
            .build();
    Producer<byte[]> producer = client.newProducer()
            .topic(topic)
            .enableBatching(true)
            .create();
    for (int i = 0; i < Integer.MAX_VALUE; ++i) {
        producer.sendAsync(smallMsg);
    }
}
```
  3. I have compared the publish throughput three times, and all of them came to the same conclusion: throughput drops when using compositeBuffer. Here is the result:
  • buffer(msg/s): means before applying this patch
  • compositeBuffer(msg/s): means after applying this patch
| Run | buffer (msg/s) | compositeBuffer (msg/s) |
| --- | --- | --- |
| first test | 1,034,217.59 | 930,746.97 |
| | 1,024,092.14 | 939,542.21 |
| | 1,024,092.14 | 940,831.47 |
| second test | 977,871.08 | 907,644.68 |
| | 972,081.04 | 903,371.45 |
| | 1,000,225.66 | 903,853.17 |
| third test | 1,031,195.02 | 890,254.75 |
| | 1,036,956.93 | 890,347.82 |
| | 1,024,760.17 | 889,036.17 |
| avg | 1,013,943.52 | 910,625.4 |

@AnonHxy (Contributor) commented Sep 26, 2022

And here are the Flame Graph attachments for the first test:
buffer.html.zip
compositeBuffer.html.zip

Labels: doc-label-missing, type/enhancement (the enhancements for the existing features or docs, e.g. reduce memory usage of the delayed messages)