
Introduce a slab-based memory allocator and optimize SHAMapItem #4218

Merged (6 commits) on Apr 11, 2023

Conversation

@nbougalis (Contributor) commented Jun 29, 2022

Introduce support for a slab-based allocator:

When instantiating a large number of fixed-size objects on the heap, the overhead that dynamic memory allocation APIs impose will quickly become significant.

In some cases, allocating a large block of memory up front and using a slab allocator to carve it into fixed-size units from which requests for memory are serviced can significantly reduce memory fragmentation and, potentially, improve overall performance.

This commit introduces a new SlabAllocator<> class that hides this complexity and exposes an API similar to the C++ Allocator concept.

This class is not a general-purpose allocator and should not be used unless profiling and analysis of specific memory allocation patterns indicate that the additional complexity will improve overall system performance, and subsequent profiling confirms that it does.
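
(For illustration only, not part of this PR: a minimal sketch of the slab idea. The class name, sizes, and members below are hypothetical; the real SlabAllocator<> is thread-safe and handles alignment and growth, which this sketch does not.)

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Carve one large allocation into fixed-size chunks and hand them out
    // from an intrusive free list threaded through the unused chunks.
    class TinySlab
    {
        std::size_t const itemSize_;
        std::byte* block_;           // the single large allocation
        void* freeList_ = nullptr;   // singly-linked list of free chunks

    public:
        TinySlab(std::size_t itemSize, std::size_t itemCount)
            : itemSize_(itemSize < sizeof(void*) ? sizeof(void*) : itemSize)
            , block_(static_cast<std::byte*>(std::malloc(itemSize_ * itemCount)))
        {
            if (!block_)
                throw std::bad_alloc();

            // Thread every chunk onto the free list.
            for (std::size_t i = 0; i != itemCount; ++i)
            {
                void* chunk = block_ + i * itemSize_;
                *static_cast<void**>(chunk) = freeList_;
                freeList_ = chunk;
            }
        }

        ~TinySlab()
        {
            std::free(block_);
        }

        void* allocate()
        {
            if (!freeList_)
                return nullptr;  // slab exhausted; caller falls back to the heap
            void* chunk = freeList_;
            freeList_ = *static_cast<void**>(chunk);
            return chunk;
        }

        void deallocate(void* chunk)
        {
            *static_cast<void**>(chunk) = freeList_;
            freeList_ = chunk;
        }
    };

Because every allocation and deallocation is just a pointer swap on the free list, there is no per-object heap metadata and no fragmentation within the slab, which is where the savings described above come from.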

Optimize SHAMapItem and leverage new slab allocator:

The SHAMapItem class contains a variable-sized buffer that holds the serialized data associated with a particular item inside a SHAMap.

Prior to this commit, the buffer for the serialized data was allocated separately. That, coupled with the fact that most instances of SHAMapItem were wrapped in a std::shared_ptr, meant that a single instantiation might result in up to three separate memory allocations.

This commit switches away from std::shared_ptr for SHAMapItem and uses boost::intrusive_ptr instead, allowing the reference count for an instance to live inside the instance itself. Coupled with a slab-based allocator that optimizes memory allocation for the most commonly sized buffers, the net result is significant memory savings. In testing, the reduction in memory usage hovers between 400MB and 650MB. Other scenarios might result in larger savings.

I expect a reduction in overall memory usage of at least 15% on mainnet servers, but it depends on a number of factors. On my test bed, the reduction is closer to 22%. Performance should also improve.
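
(A rough illustration of why the switch to boost::intrusive_ptr saves an allocation. The Item class below is hypothetical; it is not the actual SHAMapItem layout.)

    #include <atomic>
    #include <boost/intrusive_ptr.hpp>

    // The reference count lives inside the object, so boost::intrusive_ptr
    // needs no separate control block the way std::shared_ptr does.
    class Item
    {
        mutable std::atomic<int> refcount_{0};

        friend void intrusive_ptr_add_ref(Item const* x)
        {
            x->refcount_.fetch_add(1, std::memory_order_relaxed);
        }

        friend void intrusive_ptr_release(Item const* x)
        {
            if (x->refcount_.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete x;
        }

    public:
        Item() = default;
    };

    // Usage: one allocation for the object (which can itself come from a
    // slab), and no separate control block:
    //   boost::intrusive_ptr<Item> p(new Item);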

High Level Overview of Change

Context of Change

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)
  • Documentation Updates
  • Release

@mtrippled (Collaborator) left a comment

In performance testing with NFTs, this commit reduces memory size by about 15%, sustained over a long duration.
LGTM 👍

Review comment on the snippet:

    assert(data.size() <= megabytes<std::size_t>(64));

    std::uint8_t* raw = [&data]() -> std::uint8_t* {
        if (data.size() < 128)

Collaborator: Should these be <= instead of <? If not, what's the extra byte in the slab for?

@scottschurr (Collaborator) left a comment

Just a couple of notes:

  1. The clang macOS library does not support std::aligned_alloc yet. As you noted, that can be patched by #include <stdlib.h> and using aligned_alloc. That patch is required in two files: SlabAllocator.h and SHAMapItem.h
  2. Once that patch is made, you can remove the #include <cstdlib> in SlabAllocator.h, at least with clang on macOS.
  3. Clang warns me that the std::move on line 101 of SHAMapSync_test.cpp is a pessimization. I removed that std::move, since I build with warnings as errors.
  4. Once those changes are made, immediately upon invoking the unit tests I get an assert:
Assertion failed: (p_ != nullptr), function SlabAllocator, file /Users/scott/Development/rippled/src/ripple/basics/SlabAllocator.h, line 126.
Abort trap: 6

That assert suggests using non-std aligned_alloc may not work as well as hoped.
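
(For context, roughly the kind of portability shim being discussed; the helper names below are hypothetical and the actual patch may differ. Note that aligned_alloc requires the requested size to be a multiple of the alignment, which is one plausible reason for a null return and the failed assert.)

    #include <cstddef>
    #include <stdlib.h>  // ::aligned_alloc / ::posix_memalign / ::free

    // Hypothetical helper: round the size up to a multiple of the
    // (power-of-two) alignment, and fall back to posix_memalign on Apple
    // platforms where aligned_alloc may be unavailable or stricter.
    inline void* alignedAlloc(std::size_t align, std::size_t size)
    {
        size = (size + align - 1) & ~(align - 1);
    #if defined(__APPLE__)
        void* p = nullptr;
        if (::posix_memalign(&p, align, size) != 0)
            return nullptr;
        return p;
    #else
        return ::aligned_alloc(align, size);
    #endif
    }

    inline void alignedFree(void* p)
    {
        ::free(p);
    }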

That's all I have for now. I'll look more once those comments are addressed.

@HowardHinnant (Contributor) left a comment

Not yet working on macOS, see comments.

@scottschurr (Collaborator)

I thought I saw a fairly recent push to this pull request. So I tried building it. The previously observed build failures on macOS are still present.

@greg7mdp (Contributor) commented Aug 9, 2022

@scottschurr, the following is Nik's branch with my additional changes, which should build: https://github.com/greg7mdp/rippled/tree/slab_alloc

@mtrippled (Collaborator)

@greg7mdp I think you should open a new PR with all of the changes, and @nbougalis should close this one. What do you think, @HowardHinnant & @scottschurr?

@HowardHinnant (Contributor)

I have a slight preference for keeping this one open so I can remember what I said 19 days ago. But I won't object if everyone wants to move to a new PR.

@greg7mdp (Contributor) commented Aug 9, 2022

@mtrippled I also think it would make sense to create a new PR. I suggested it a few days back, but @nbougalis said he would take care of it; he probably wants to review the branch first.
@scottschurr I just verified that this branch builds fine on macOS (with some unrelated warnings).

@scottschurr (Collaborator)

@greg7mdp, I'm fine either way: updating this pull request or creating a new one.

@scottschurr (Collaborator)

And, @greg7mdp, I tried building your branch locally. Yup, your version works way better on macOS than the previous version did. Thanks! I think I will wait until your version is in an official pull request, however, before I start a full-fledged review of the revised code.

@greg7mdp (Contributor)

Thanks, @scottschurr, for trying the mac version. I did build my branch on all three platforms (Linux, Windows, macOS).
@nbougalis, we should either update that PR or start a new one, so the review can proceed.

@nbougalis (Contributor, Author)

I've addressed the concerns and issues highlighted by reviewers. I have also incorporated some of @greg7mdp's ideas into SlabAllocator and refined the memory allocation engine.

I believe that this is ready for another review and, provided it's signed-off, a good candidate for merging.

I expect a reduction in overall memory usage of at least 15% on mainnet servers, but it depends on a number of factors. On my test bed, the reduction is closer to 22%. Performance should also improve.

Thanks.

When instantiating a large number of fixed-size objects on the heap,
the overhead that dynamic memory allocation APIs impose will quickly
become significant.

In some cases, allocating a large block of memory up front and using
a slab allocator to carve it into fixed-size units from which
requests for memory are serviced can significantly reduce memory
fragmentation and, potentially, improve overall performance.

This commit introduces a new `SlabAllocator<>` class that exposes an
API that is _similar_ to the C++ concept of an `Allocator` but it is
not meant to be a general-purpose allocator.

A helper class, `SlabAllocatorSet<>` simplifies handling of variably
sized objects that benefit from slab allocations.

This commit contains significant improvements over the original, all
of which were suggested by Greg Popovitch (@greg7mdp on GitHub).
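
(Illustrative sketch only of what "a set of slabs" might look like; the member functions itemSize(), owns(), allocate(), and deallocate() assumed on the slab type are not the actual SlabAllocator<> interface.)

    #include <cstddef>
    #include <new>
    #include <vector>

    // Route each request to the smallest slab whose chunk size fits it;
    // anything too large, or any request a slab cannot satisfy, falls
    // back to the regular heap.
    template <typename Slab>
    class TinySlabSet
    {
        std::vector<Slab> slabs_;  // assumed sorted by ascending itemSize()

    public:
        explicit TinySlabSet(std::vector<Slab> slabs)
            : slabs_(std::move(slabs))
        {
        }

        void* allocate(std::size_t n)
        {
            for (auto& s : slabs_)
                if (n <= s.itemSize())
                    if (void* p = s.allocate())
                        return p;
            return ::operator new(n);
        }

        void deallocate(void* p, std::size_t n)
        {
            (void)n;  // size is not needed once ownership is known
            for (auto& s : slabs_)
                if (s.owns(p))
                {
                    s.deallocate(p);
                    return;
                }
            ::operator delete(p);
        }
    };
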
The `SHAMapItem` class contains a variable-sized buffer that
holds the serialized data associated with a particular item
inside a `SHAMap`.

Prior to this commit, the buffer for the serialized data was
allocated separately. That, coupled with the fact that most
instances of `SHAMapItem` were wrapped in a `std::shared_ptr`,
meant that a single instantiation might result in up to three
separate memory allocations.

This commit switches away from `std::shared_ptr` for `SHAMapItem`
and uses `boost::intrusive_ptr` instead, allowing the reference
count for an instance to live inside the instance itself. Coupled
with using a slab-based allocator to optimize memory allocation
for the most commonly sized buffers, the net result is significant
memory savings. In testing, the reduction in memory usage hovers
between 400MB and 650MB. Other scenarios might result in larger
savings.
The `Ledger` class contains two `SHAMap` instances: the state and
transaction maps. Prior to this commit, the maps were dynamically
allocated using `std::make_shared` despite the fact that they did
not require lifetime management separate from the lifetime of the
`Ledger` instance to which they belong.

The two `SHAMap` instances are now regular member variables, and a
number of other places where smart pointers and dynamic memory
allocation could be avoided by using stack-based alternatives have
also been changed.
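
(A simplified before/after sketch of the shape of this change; the class names below are placeholders, and the real Ledger and SHAMap are far richer.)

    #include <memory>

    struct SHAMap  // stand-in for illustration only
    {
        // ... tree state ...
    };

    // Before: two heap allocations (plus shared_ptr control blocks) for
    // maps whose lifetime already matched the owning ledger.
    class LedgerBefore
    {
        std::shared_ptr<SHAMap> stateMap_ = std::make_shared<SHAMap>();
        std::shared_ptr<SHAMap> txMap_ = std::make_shared<SHAMap>();
    };

    // After: the maps are plain members, constructed and destroyed with
    // the ledger, with no separate allocations.
    class LedgerAfter
    {
        SHAMap stateMap_;
        SHAMap txMap_;
    };
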
@intelliot dismissed scottschurr's stale review April 3, 2023 17:23

@scottschurr indicated he has no particular need to review this PR, and that it would be best to assign a different reviewer

@intelliot requested review from HowardHinnant and removed the review request for RichardAH, scottschurr, and drlongle April 3, 2023 17:40
@HowardHinnant (Contributor) left a comment

All of my comments are resolved.

This builds and passes the unit tests for me using the latest macOS tools.

@intelliot (Collaborator)

I merged develop and applied clang-format.patch. If acceptable to you, @nbougalis, this PR will be squashed when it is merged, so it's OK to leave the merge commit for the moment. Here is the commit message I suggest; @nbougalis, please confirm that this looks good (or suggest an alternative).

Optimize SHAMapItem with slabbed allocator: (#4218)

* Introduce support for a slab-based memory allocator:

When instantiating a large number of fixed-size objects on the heap,
the overhead that dynamic memory allocation APIs impose will quickly
become significant.

In some cases, allocating a large block of memory up front and using
a slab allocator to carve it into fixed-size units from which
requests for memory are serviced can significantly reduce memory
fragmentation and, potentially, improve overall performance.

The new `SlabAllocator<>` class exposes an API that is _similar_ to the
C++ concept of an `Allocator` but it is not meant to be a
general-purpose allocator.

It should not be used unless profiling and analysis of specific memory
allocation patterns indicates that the additional complexity introduced
will improve the performance of the system overall, and subsequent
profiling proves it.

A helper class, `SlabAllocatorSet<>` simplifies handling of variably
sized objects that benefit from slab allocations.

This commit incorporates improvements suggested by Greg Popovitch
(@greg7mdp).

* Optimize `SHAMapItem` and leverage new slab allocator:

The `SHAMapItem` class contains a variable-sized buffer that holds the
serialized data associated with a particular item inside a `SHAMap`.

Previously, the buffer for the serialized data was allocated separately.
The fact that most instances of `SHAMapItem` were wrapped in a
`std::shared_ptr` meant that an instantiation might result in up to
three separate memory allocations.

This commit switches away from `std::shared_ptr` for `SHAMapItem` and
uses `boost::intrusive_ptr` instead, allowing the reference count for an
instance to live inside the instance itself. Coupled with using a
slab-based allocator to optimize memory allocation for the most commonly
sized buffers, the net result is significant memory savings. In testing,
the reduction in memory usage hovers between 400MB and 650MB. Other
scenarios might result in larger savings.

In performance testing with NFTs, this commit reduces memory size by
about 15% sustained over long duration.

* Avoid using std::shared_ptr when not necessary:

The `Ledger` class contains two `SHAMap` instances: the state and
transaction maps. Previously, the maps were dynamically allocated using
`std::make_shared` despite the fact that they did not require lifetime
management separate from the lifetime of the `Ledger` instance to which
they belong.

The two `SHAMap` instances are now regular member variables. Some smart
pointers and dynamic memory allocations were avoided by using
stack-based alternatives.

@nbougalis (Contributor, Author) commented Apr 6, 2023

I think that the three commits are functionally separate, and each is self-contained. IMO, keeping them as-is is cleaner and more professional, and it makes it easier to review and, if needed, to run git bisect in the future.

@intelliot (Collaborator)

We had a comparable situation with #4192: the PR author recommended keeping the commits as-is, and so we did. Later, we learned there was some discontent with that, and that we should have squashed all the commits together.

I don't personally have any opinion either way, but if we're going to recommend merging this as three separate commits, then we should at least confirm that each commit stands on its own and passes unit tests & clang-format. It may make sense to open 2-3 new PRs for this purpose.

@HowardHinnant (Contributor)

What we decided in #4192 is that if reviewers feel that a commit should be merged, and the commit message does not begin with [FOLD], then the reviewer should recommend a change to the commit message. And vice-versa.

@intelliot (Collaborator)

blocked: let's get some agreement on whether this PR should end up as 3 commits, or 1.

If it's 3 commits, I think it would be best to open a couple of separate PRs for the individual parts, so that we can confirm they are mergeable on their own (they make sense and pass unit tests & clang-format).

intelliot pushed a commit that referenced this pull request Apr 10, 2023
When instantiating a large number of fixed-size objects on the heap,
the overhead that dynamic memory allocation APIs impose will quickly
become significant.

In some cases, allocating a large block of memory up front and using
a slab allocator to carve it into fixed-size units from which
requests for memory are serviced can significantly reduce memory
fragmentation and, potentially, improve overall performance.

This commit introduces a new `SlabAllocator<>` class that exposes an
API that is _similar_ to the C++ concept of an `Allocator` but it is
not meant to be a general-purpose allocator.

It should not be used unless profiling and analysis of specific memory
allocation patterns indicates that the additional complexity introduced
will improve the performance of the system overall, and subsequent
profiling proves it.

A helper class, `SlabAllocatorSet<>` simplifies handling of variably
sized objects that benefit from slab allocations.

This commit incorporates improvements suggested by Greg Popovitch
(@greg7mdp).

Commit 1 of 3 in #4218.
intelliot pushed a commit that referenced this pull request Apr 11, 2023
The `SHAMapItem` class contains a variable-sized buffer that
holds the serialized data associated with a particular item
inside a `SHAMap`.

Prior to this commit, the buffer for the serialized data was
allocated separately. That, coupled with the fact that most
instances of `SHAMapItem` were wrapped in a `std::shared_ptr`,
meant that a single instantiation might result in up to three
separate memory allocations.

This commit switches away from `std::shared_ptr` for `SHAMapItem`
and uses `boost::intrusive_ptr` instead, allowing the reference
count for an instance to live inside the instance itself. Coupled
with using a slab-based allocator to optimize memory allocation
for the most commonly sized buffers, the net result is significant
memory savings. In testing, the reduction in memory usage hovers
between 400MB and 650MB. Other scenarios might result in larger
savings.

In performance testing with NFTs, this commit reduces memory size by
about 15% sustained over long duration.

Commit 2 of 3 in #4218.
@intelliot (Collaborator)

@seelabs confirmed 3 commits for this is fine. I am rebasing the commits individually to ensure they pass CI on GitHub. If they pass, I will merge them no earlier than 2023-04-11 at 8:00 AM PT. If you have any objections, please comment here on this PR.

@intelliot (Collaborator)

Suggested commit message:

Avoid using std::shared_ptr when not necessary: (#4218)

The `Ledger` class contains two `SHAMap` instances: the state and
transaction maps. Previously, the maps were dynamically allocated using
`std::make_shared` despite the fact that they did not require lifetime
management separate from the lifetime of the `Ledger` instance to which
they belong.

The two `SHAMap` instances are now regular member variables. Some smart
pointers and dynamic memory allocations were avoided by using
stack-based alternatives.

Commit 3 of 3 in #4218.

@intelliot (Collaborator)

Will merge after GitHub checks pass, if I don't hear any objections.

@intelliot intelliot merged commit 066f91c into XRPLF:develop Apr 11, 2023
intelliot pushed a commit that referenced this pull request May 31, 2023
Misaligned load and store operations are supported by both Intel and ARM
CPUs. However, in C++, these operations are undefined behavior (UB).
Substituting these operations with a `memcpy` fixes this UB. The
compiled assembly code is equivalent to the original, so there is no
performance penalty to using `memcpy`.

For context: The unaligned load and store operations fixed here were
originally introduced in the slab allocator (#4218).
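
(A small sketch of the pattern that commit describes, not the actual code it touched.)

    #include <cstdint>
    #include <cstring>

    // Read a 64-bit value from a possibly misaligned buffer.
    std::uint64_t loadUnaligned(unsigned char const* p)
    {
        // Undefined behavior, even though it "works" on x86-64 and ARM:
        //   return *reinterpret_cast<std::uint64_t const*>(p);

        // Well-defined; compilers emit the same single load:
        std::uint64_t v;
        std::memcpy(&v, p, sizeof(v));
        return v;
    }
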
@intelliot (Collaborator) commented Jun 5, 2023

@manojsdoshi to try to corroborate the 15-22% memory use reduction

Released in 1.11.0.

@nbougalis deleted the shamapitemslab branch October 16, 2023 05:58
@intelliot added the Perf Test Desired (Optional) label and removed the Performance/Resource Improvement label Jan 25, 2024
Labels
  Perf Test Desired (Optional): RippleX Perf Team should look at this PR. The PR will not necessarily wait for testing to finish.
  Tech Debt: Non-urgent improvements.

7 participants