[improve][pip] PIP-393: Improve performance of Negative Acknowledgement #23601

thetumbled · 2024-11-15T07:53:00Z

PIP: 393
Implementation PR: #23600.

Motivation

The current implementation of Negative Acknowledgement in Pulsar has several problems:

memory usage is high.
code execution efficiency is low.
the redelivery time is not accurate.
multiple negative acks for messages in the same entry (batch) interfere with each other.
All of these problems are severe and need to be solved.

Modifications

Refactor the NegativeAcksTracker to solve the above problems.

Space complexity of new data structure

The following theoretical space complexity analysis shows how much space the new data structure saves.

Space complexity of `ConcurrentLongLongPairHashMap`

Before analyzing the new data structure, we need to know how much space the existing one takes before this PIP. We need to store 4 long fields (ledgerId, entryId, partitionIndex, timestamp) for each entry, which takes 4*8=32 bytes.
As ConcurrentLongLongPairHashMap uses open addressing with linear probing to handle hash collisions, it keeps redundant space to avoid a high collision rate. Two configurations control how much redundant space is reserved: fill factor and idle factor. When the space utilization rises to the fill factor, the backing array doubles in size; when the space utilization drops to the idle factor, the backing array is halved.
The default fill factor is 0.66 and the default idle factor is 0.15, which means the minimum space occupation of ConcurrentLongLongPairHashMap is 32/0.66*N bytes ≈ 48N bytes and the maximum space occupation is 32/0.15*N bytes ≈ 213N bytes, where N is the number of entries.

Some test data to verify this:

With 1,000,000 entries in the map, the payload is 32*1000000/1024/1024 ≈ 30MB; the space utilization is 30/64≈0.47, which is in the range [0.15, 0.66].
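The bounds above can be reproduced with a few lines of arithmetic. This is a minimal sketch that only restates the numbers from the text (32 bytes per entry, fill factor 0.66, idle factor 0.15, a 64MB footprint); the class and method names are hypothetical and it does not touch the real `ConcurrentLongLongPairHashMap`.

```java
public final class OpenAddressingFootprint {
    // Payload per entry: four long fields (ledgerId, entryId, partitionIndex, timestamp).
    static final int BYTES_PER_ENTRY = 4 * Long.BYTES;

    /** Bytes of backing array per stored entry at a given utilization (0 < utilization <= 1). */
    public static double bytesPerEntry(double utilization) {
        return BYTES_PER_ENTRY / utilization;
    }

    /** Payload size in MB for a given number of entries. */
    public static double payloadMb(long entries) {
        return BYTES_PER_ENTRY * (double) entries / 1024.0 / 1024.0;
    }

    public static void main(String[] args) {
        // Best case: the table is as full as the fill factor allows.
        System.out.printf("fill factor 0.66: %.1f bytes/entry%n", bytesPerEntry(0.66));
        // Worst case: the table is as empty as the idle factor allows.
        System.out.printf("idle factor 0.15: %.1f bytes/entry%n", bytesPerEntry(0.15));
        // Utilization for 1,000,000 entries against the 64MB figure quoted in the text.
        System.out.printf("utilization: %.2f%n", payloadMb(1_000_000) / 64.0);
    }
}
```

The utilization printed for the 1,000,000-entry case falls between the idle factor and the fill factor, as the text expects.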

Space complexity of new data structure

New data structure:

// timestamp -> ledgerId -> entryId
Long2ObjectSortedMap<Long2ObjectMap<Roaring64Bitmap>> map2 = new Long2ObjectAVLTreeMap<>();
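To make the shape of the nested index concrete, here is a minimal sketch that uses only JDK types as stand-ins: `TreeMap` for `Long2ObjectAVLTreeMap`, `HashMap` for the inner map, and `BitSet` for `Roaring64Bitmap`. The class name, the method names, and the `pollDue` behaviour (draining every bucket whose timestamp is not later than `now`) are assumptions made for illustration and are not taken from the implementation PR.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class NackIndexSketch {
    // timestamp -> ledgerId -> set of entry ids (BitSet stands in for Roaring64Bitmap)
    private final TreeMap<Long, Map<Long, BitSet>> index = new TreeMap<>();

    public void add(long timestamp, long ledgerId, int entryId) {
        index.computeIfAbsent(timestamp, k -> new HashMap<>())
             .computeIfAbsent(ledgerId, k -> new BitSet())
             .set(entryId);
    }

    /** Removes every bucket with timestamp <= now and returns its entries as {ledgerId, entryId}. */
    public List<long[]> pollDue(long now) {
        List<long[]> due = new ArrayList<>();
        Map<Long, Map<Long, BitSet>> head = index.headMap(now, true);
        for (Map<Long, BitSet> ledgers : head.values()) {
            for (Map.Entry<Long, BitSet> e : ledgers.entrySet()) {
                BitSet ids = e.getValue();
                for (int id = ids.nextSetBit(0); id >= 0; id = ids.nextSetBit(id + 1)) {
                    due.add(new long[] {e.getKey(), id});
                }
            }
        }
        head.clear(); // clearing the view also removes the buckets from the backing TreeMap
        return due;
    }

    public int bucketCount() {
        return index.size();
    }
}
```

Because the outer map is sorted by timestamp, all due buckets sit at the head of the map, and entries that share a timestamp and a ledger cost one bit each rather than one map slot each.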

The space used by the new data structure depends on several factors: the message rate, the time deviation the user accepts, and the maximum number of entries written to one ledger.

The Pulsar configuration managedLedgerMaxEntriesPerLedger=50000 determines the maximum number of entries that can be written to one ledger; we use the default value in this analysis.
The time deviation the user accepts: when the user accepts a 1024ms deviation in delivery time, we can trim the lower 10 bits of the timestamp in ms, which groups 1024 consecutive timestamps into one bucket.
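The bucketing step can be illustrated in isolation. The sketch below uses the same bit mask as the `trimLowerBit` helper in the test code of this description; the `sameBucket` helper and the class name are hypothetical additions for the example.

```java
public final class TimestampBucketing {
    /** Clears the lowest {@code bits} bits, so 2^bits consecutive milliseconds share one key. */
    public static long trimLowerBit(long timestamp, int bits) {
        return timestamp & (-1L << bits);
    }

    /** True when both timestamps fall into the same bucket (hypothetical helper). */
    public static boolean sameBucket(long a, long b, int bits) {
        return trimLowerBit(a, bits) == trimLowerBit(b, bits);
    }

    public static void main(String[] args) {
        int bits = 10; // 2^10 = 1024 ms per bucket
        long now = System.currentTimeMillis();
        long key = trimLowerBit(now, bits);
        // The bucket key is never later than the original timestamp and differs by less than 1024 ms.
        System.out.println("timestamp=" + now + " bucket=" + key + " delta=" + (now - key));
    }
}
```

With 10 bits trimmed, the bucket key is at most 1023 ms below the original timestamp, which is the deviation the user has to accept.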

We will analyze the space used by one bucket, and calculate the average space used by one entry.
Assume that the message rate is x msg/ms and that we trim y bits of the timestamp. One bucket then covers 2**y ms and contains M=x*2**y messages.

For a single bucket, we only need to store one timestamp, which takes 8 bytes.
Then we need to store the ledgerId. When M is greater than 50000 (managedLedgerMaxEntriesPerLedger), the ledger switches. There are L=ceil(M/50000) ledgers, which take 8*L bytes.
Next, we analyze how much space the entry ids take. As there are L=ceil(M/50000) ledgers, there are L bitmaps to store, which take L*size(bitmap). The total space consumed by the new data structure is 8 bytes + 8L bytes + L*size(bitmap).

As size(bitmap) is far greater than 8 bytes, we can ignore the first two terms. We then get the formula for the space consumed by one bucket: D=L*size(bitmap)=ceil(M/50000)*size(bitmap).

Entry ids are stored in a Roaring64Bitmap; for simplicity we can replace it with RoaringBitmap in this analysis, as the maximum entry id is 49999, which is smaller than 4294967296 (2^32, about 2 * Integer.MAX_VALUE), the number of distinct integers a RoaringBitmap can hold. The space consumed by a RoaringBitmap depends on how many elements it contains: when the bitmap holds fewer than 4096 elements, the space is 4 bytes per element; when it holds more than 4096 elements, the consumed space is a fixed 8KB.
Then we get the final result:

when M>50000, D = ceil(M/50000)*size(bitmap) ~= M/50000 * 8KB = M/50000 * 8 * 1024 bytes = 0.163*M bytes, so each entry takes 0.163 bytes on average.
when 4096<M<50000, D = ceil(M/50000)*size(bitmap) = 1 * 8KB = 8KB, so each entry takes 8*1024/M=8192/M bytes on average.
when M<4096, D = ceil(M/50000)*size(bitmap) = 1 * 4*M bytes = 4*M bytes, so each entry takes 4 bytes on average.
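The three cases can be written as one function of M. This is a sketch of the formula D=ceil(M/50000)*size(bitmap) exactly as stated above, using the 4-bytes-per-element and 8KB figures from the text; the class and method names are hypothetical. Unlike the first case above, it keeps the ceiling instead of approximating ceil(M/50000) by M/50000.

```java
public final class BucketSpaceEstimate {
    static final long MAX_ENTRIES_PER_LEDGER = 50_000; // managedLedgerMaxEntriesPerLedger default
    static final long LARGE_BITMAP_BYTES = 8 * 1024;   // fixed size quoted in the text for > 4096 elements
    static final long SMALL_BITMAP_LIMIT = 4096;

    /** M = x * 2^y: messages per bucket for x msg/ms and y trimmed bits. */
    public static long messagesPerBucket(long x, int y) {
        return x << y;
    }

    /** Estimated average bytes per entry for a bucket holding m messages. */
    public static double bytesPerEntry(long m) {
        if (m < SMALL_BITMAP_LIMIT) {
            return 4.0; // one small bitmap at 4 bytes per element
        }
        long ledgers = (m + MAX_ENTRIES_PER_LEDGER - 1) / MAX_ENTRIES_PER_LEDGER; // ceil(M/50000)
        return (double) (ledgers * LARGE_BITMAP_BYTES) / m;
    }

    public static void main(String[] args) {
        for (long x : new long[] {1, 50, 500}) {
            long m = messagesPerBucket(x, 10);
            System.out.printf("x=%d y=10 M=%d -> %.3f bytes/entry%n", x, m, bytesPerEntry(m));
        }
    }
}
```

With the ceiling kept, M=51200 evaluates to 0.32 bytes per entry and M=512000 to 0.176, so the estimate only approaches 0.163 when M is much larger than 50000.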

Conclusion

The space complexity of ConcurrentLongLongPairHashMap is 48N bytes in the best case and 213N bytes in the worst case, where N is the number of entries.
The space complexity of the new data structure is determined by the total number of messages in one bucket, M.
- when M>50000, space complexity is 0.163N bytes.
- when 4096<M<50000, space complexity is 8192/M * N bytes.
- when M<4096, space complexity is 4N bytes.

test data

The following experiment data verifies the analysis above.
Test code:

static long trimLowerBit(long timestamp, int bits) {
    return timestamp & (-1L << bits);
}

public static void main(String[] args) throws IOException {
    ConcurrentLongLongPairHashMap map1 = ConcurrentLongLongPairHashMap.newBuilder()
            .autoShrink(true)
            .concurrencyLevel(16)
            .build();
    // timestamp -> ledgerId -> entryId, no need to batch index, if different messages have
    // different timestamp, there will be multiple entries in the map
    // AVL Tree -> LongOpenHashMap -> Roaring64Bitmap
    // there are many timestamp, a few ledgerId, many entryId
    Long2ObjectSortedMap<Long2ObjectMap<Roaring64Bitmap>> map2 = new Long2ObjectAVLTreeMap<>();

    int trimLowerBits = 10;
    long numMessages = 1000000, entriesPerLedger = 1000, numLedgers = numMessages / entriesPerLedger;
    long ledgerId, entryId, timestamp = System.currentTimeMillis(), tmp = 0;
    for (long i = 0; i < numLedgers; i++) {
        ledgerId = 10000 + i;
        for (long j = 0; j < entriesPerLedger; j++) {
            entryId = j;
            // 1ms per message
            timestamp++;
            // queue.add(timestamp, ledgerId, entryId);
            map1.put(ledgerId, entryId, 0L, timestamp);

            tmp = trimLowerBit(timestamp, trimLowerBits);
            map2.computeIfAbsent(tmp, k -> new Long2ObjectOpenHashMap<>())
                .computeIfAbsent(ledgerId, k -> new Roaring64Bitmap())
                .add(entryId);
        }
    }
}

x=1, y=10

Let x=1 (1 msg/ms) and y=10 (trim 10 bits of the timestamp). Then M=1*2**10=1024<4096. According to the result above, we predict that the space consumed by 1,000,000 entries is 4*1000000/1024/1024=3.81MB.

The actual space consumed is 3.35MB, which is quite close to the theoretical value.

x=50, y=10

We try to reach the best space complexity. M=50*2**10=51200>50000, so we predict that the average space consumed by one entry is 0.163 bytes.

int trimLowerBits = 10, messagePerMs = 50, tick=0;
long numMessages = 1000000, entriesPerLedger = 50000, numLedgers = numMessages / entriesPerLedger;

But the experiment result is 0.33*1024*1024/1000000≈0.35 bytes per entry, about twice the theoretical value of 0.163.

We can print the size of each bitmap to see why.

There are still many bitmaps whose size is far smaller than 50000, which results in lower space utilization.
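The effect can be reproduced without the real bitmaps by counting how many entry ids land in each (bucket, ledger) pair. This is a sketch under stated assumptions: the modified test loop is not shown in this description, so the timestamp is assumed to start at 0 and advance by 1 ms every `messagesPerMs` messages, ledger ids are assumed to start at 0, and plain counters replace `Roaring64Bitmap`. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public final class BitmapCardinalitySketch {
    /** Returns bucket -> ledgerId -> number of entry ids that would land in that bitmap. */
    public static TreeMap<Long, Map<Long, Integer>> simulate(
            long numMessages, long entriesPerLedger, long messagesPerMs, int trimLowerBits) {
        TreeMap<Long, Map<Long, Integer>> sizes = new TreeMap<>();
        for (long i = 0; i < numMessages; i++) {
            long timestamp = i / messagesPerMs;               // assumption: clock starts at 0
            long bucket = timestamp & (-1L << trimLowerBits); // same mask as trimLowerBit
            long ledgerId = i / entriesPerLedger;             // ledger switches every entriesPerLedger
            sizes.computeIfAbsent(bucket, k -> new HashMap<>()).merge(ledgerId, 1, Integer::sum);
        }
        return sizes;
    }

    /** Number of (bucket, ledger) pairs, i.e. the number of bitmaps. */
    public static int bitmapCount(Map<Long, Map<Long, Integer>> sizes) {
        int n = 0;
        for (Map<Long, Integer> ledgers : sizes.values()) {
            n += ledgers.size();
        }
        return n;
    }

    /** Cardinality of the smallest bitmap. */
    public static int smallest(Map<Long, Map<Long, Integer>> sizes) {
        int min = Integer.MAX_VALUE;
        for (Map<Long, Integer> ledgers : sizes.values()) {
            for (int c : ledgers.values()) {
                min = Math.min(min, c);
            }
        }
        return min;
    }
}
```

Under these assumptions, a 51200-message bucket almost always straddles a 50000-entry ledger boundary, so most buckets hold two bitmaps and neither of them is full.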

x=500, y=10

int trimLowerBits = 10, messagePerMs = 500, tick=0;
long numMessages = 1000000, entriesPerLedger = 50000, numLedgers = numMessages / entriesPerLedger;

All bitmaps contain almost 50000 entries.

Each entry takes 0.18*1024*1024/1000000≈0.19 bytes, which is quite close to the theoretical value.

Documentation

doc
doc-required
doc-not-needed
doc-complete

pip/pip-393.md

thetumbled · 2024-11-26T04:14:23Z

I have added the space complexity analysis of the new data structure; please review it again, thanks. @lhotari @nodece @BewareMyPower @poorbarcode @codelipenghui @dao-jun

lhotari · 2024-11-26T08:30:16Z

> I have added the space complexity analysis of the new data structure; please review it again, thanks. @lhotari @nodece @BewareMyPower @poorbarcode @codelipenghui @dao-jun

Great analysis @thetumbled . Please move the analysis from the PR description to the PIP document itself.

One small detail (which doesn't impact the analysis or solution): "Entry id is stored in a Roaring64Bitmap, for simplicity we can replace it with RoaringBitmap, as the max entry id is 49999, which is smaller than 65535."
Isn't the value 65535 irrelevant since RoaringBitmap supports storing 4294967296 (2 * Integer.MAX_VALUE) integers, explained in https://github.com/RoaringBitmap/RoaringBitmap/blob/cca90c986d5c0096bbeabb5f968833bf12c28c0e/roaringbitmap/src/main/java/org/roaringbitmap/RoaringBitmap.java#L46-L49 . Roaring64Bitmap can store up to 9223372036854775807 long integers (2 * Long.MAX_VALUE).

lhotari · 2024-11-26T08:47:52Z

@thetumbled The title of any PR containing PIP documentation should include [pip] to distinguish it from other types of PRs. I made that change to the title.

lhotari

The PIP-393 document should include a high-level plan for avoiding an increase in the size of the Pulsar client by the size of the fastutil jar file. The fastutil jar file is very large, 23MB. We use only a few classes of fastutil. There's the fastutil-core library, which is smaller, about 6MB. However, that is also relatively large, and using fastutil-core will introduce another problem on the broker side since there's already a fastutil jar which also includes the fastutil-core jar classes. It's necessary to design a proper shading solution as part of this PIP design and implementation.
More details in the thread #23600 (comment)

thetumbled · 2024-11-26T09:19:46Z

> > I have added the space complexity analysis of the new data structure; please review it again, thanks. @lhotari @nodece @BewareMyPower @poorbarcode @codelipenghui @dao-jun
>
> Great analysis @thetumbled . Please move the analysis from the PR description to the PIP document itself.
>
> One small detail (which doesn't impact the analysis or solution): "Entry id is stored in a Roaring64Bitmap, for simplicity we can replace it with RoaringBitmap, as the max entry id is 49999, which is smaller than 65535." Isn't the value 65535 irrelevant since RoaringBitmap supports storing 4294967296 (2 * Integer.MAX_VALUE) integers, explained in https://github.com/RoaringBitmap/RoaringBitmap/blob/cca90c986d5c0096bbeabb5f968833bf12c28c0e/roaringbitmap/src/main/java/org/roaringbitmap/RoaringBitmap.java#L46-L49 . Roaring64Bitmap can store up to 9223372036854775807 long integers (2 * Long.MAX_VALUE).

You are right, not 65535, but 4294967296 (2 * Integer.MAX_VALUE).

thetumbled · 2024-11-26T10:11:00Z

> The PIP-393 document should include a high-level plan for avoiding an increase in the size of the Pulsar client by the size of the fastutil jar file. The fastutil jar file is very large, 23MB. We use only a few classes of fastutil. There's the fastutil-core library, which is smaller, about 6MB. However, that is also relatively large, and using fastutil-core will introduce another problem on the broker side since there's already a fastutil jar which also includes the fastutil-core jar classes. It's necessary to design a proper shading solution as part of this PIP design and implementation. More details in the thread #23600 (comment)

Thanks for the review; I have added it to the High-Level Design section.

thetumbled · 2024-12-02T08:11:45Z

The vote is completed, please review this pr again, thanks. @lhotari @nodece @eolivelli

pip/pip-393.md

thetumbled added 2 commits November 15, 2024 15:05

add pip-393.

d3f58ea

add doc.

ab9e6aa

github-actions bot added PIP doc-not-needed Your PR changes do not impact docs labels Nov 15, 2024

thetumbled mentioned this pull request Nov 15, 2024

[improve][client] PIP-393: Improve performance of Negative Acknowledgement #23600

Open

15 tasks

add discussion link.

61898ae

lhotari reviewed Nov 15, 2024

thetumbled added 3 commits November 15, 2024 18:00

add doc.

38c7dca

add effect.

6dc7daa

update doc.

b023eed

thetumbled mentioned this pull request Nov 19, 2024

[improve][broker] Reduce memory occupation of the delayed message queue #23611

Merged

15 tasks

thetumbled added doc-required Your PR changes impact docs and you will update later. release/4.0.1 labels Nov 22, 2024

github-actions bot removed the doc-required Your PR changes impact docs and you will update later. label Nov 22, 2024

update doc.

b414f6f

lhotari changed the title ~~[improve][client] PIP-393: Improve performance of Negative Acknowledgement~~ [improve][pip] PIP-393: Improve performance of Negative Acknowledgement Nov 26, 2024

lhotari requested changes Nov 26, 2024

thetumbled added 2 commits November 26, 2024 17:37

add space complexity analysis.

dd32203

add High-Level Design to handle dependency.

e08f5ea

add vote link.

22e51d6

thetumbled requested a review from lhotari December 2, 2024 08:08

lhotari reviewed Dec 2, 2024

lhotari reviewed Dec 2, 2024

lhotari reviewed Dec 2, 2024

fix unit.

f725d42

thetumbled requested a review from lhotari December 2, 2024 12:10
