
Introduce a slab-based memory allocator and optimize SHAMapItem #4218

Merged (6 commits) on Apr 11, 2023

Conversation

@nbougalis (Contributor) commented Jun 29, 2022

Introduce support for a slab-based allocator:

When instantiating a large number of fixed-size objects on the heap, the overhead that dynamic memory allocation APIs impose will quickly become significant.

In some cases, allocating a large block of memory up front and using a slab allocator to carve it into fixed-size units from which requests for memory are serviced can significantly reduce memory fragmentation and, potentially, improve overall performance.

This commit introduces a new SlabAllocator<> class that hides this complexity and exposes an API similar to the C++ Allocator concept.

This class is not a general-purpose allocator and should not be used unless profiling and analysis of specific memory allocation patterns indicate that the additional complexity will improve overall system performance, and subsequent profiling confirms that it does.
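
(For illustration only, not part of this PR: a minimal sketch of the slab idea. The class name, sizes, and members below are hypothetical; the real SlabAllocator<> is thread-safe and handles alignment and growth, which this sketch does not.)

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Carve one large allocation into fixed-size chunks and hand them out
    // from an intrusive free list threaded through the unused chunks.
    class TinySlab
    {
        std::size_t const itemSize_;
        std::byte* block_;           // the single large allocation
        void* freeList_ = nullptr;   // singly-linked list of free chunks

    public:
        TinySlab(std::size_t itemSize, std::size_t itemCount)
            : itemSize_(itemSize < sizeof(void*) ? sizeof(void*) : itemSize)
            , block_(static_cast<std::byte*>(std::malloc(itemSize_ * itemCount)))
        {
            if (!block_)
                throw std::bad_alloc();

            // Thread every chunk onto the free list.
            for (std::size_t i = 0; i != itemCount; ++i)
            {
                void* chunk = block_ + i * itemSize_;
                *static_cast<void**>(chunk) = freeList_;
                freeList_ = chunk;
            }
        }

        ~TinySlab()
        {
            std::free(block_);
        }

        void* allocate()
        {
            if (!freeList_)
                return nullptr;  // slab exhausted; caller falls back to the heap
            void* chunk = freeList_;
            freeList_ = *static_cast<void**>(chunk);
            return chunk;
        }

        void deallocate(void* chunk)
        {
            *static_cast<void**>(chunk) = freeList_;
            freeList_ = chunk;
        }
    };

Because every allocation and deallocation is just a pointer swap on the free list, there is no per-object heap metadata and no fragmentation within the slab, which is where the savings described above come from.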

Optimize SHAMapItem and leverage new slab allocator:

The SHAMapItem class contains a variable-sized buffer that holds the serialized data associated with a particular item inside a SHAMap.

Prior to this commit, the buffer for the serialized data was allocated separately. That, coupled with the fact that most instances of SHAMapItem were wrapped in a std::shared_ptr, meant that a single instantiation might result in up to three separate memory allocations.

This commit switches away from std::shared_ptr for SHAMapItem and uses boost::intrusive_ptr instead, allowing the reference count for an instance to live inside the instance itself. Coupled with a slab-based allocator that optimizes memory allocation for the most commonly sized buffers, the net result is significant memory savings. In testing, the reduction in memory usage hovers between 400MB and 650MB. Other scenarios might result in larger savings.

I expect a reduction in overall memory usage of at least 15% on mainnet servers, but it depends on a number of factors. On my test bed, the reduction is closer to 22%. Performance should also improve.
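
(A rough illustration of why the switch to boost::intrusive_ptr saves an allocation. The Item class below is hypothetical; it is not the actual SHAMapItem layout.)

    #include <atomic>
    #include <boost/intrusive_ptr.hpp>

    // The reference count lives inside the object, so boost::intrusive_ptr
    // needs no separate control block the way std::shared_ptr does.
    class Item
    {
        mutable std::atomic<int> refcount_{0};

        friend void intrusive_ptr_add_ref(Item const* x)
        {
            x->refcount_.fetch_add(1, std::memory_order_relaxed);
        }

        friend void intrusive_ptr_release(Item const* x)
        {
            if (x->refcount_.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete x;
        }

    public:
        Item() = default;
    };

    // Usage: one allocation for the object (which can itself come from a
    // slab), and no separate control block:
    //   boost::intrusive_ptr<Item> p(new Item);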

High Level Overview of Change

Context of Change

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)
  • Documentation Updates
  • Release

@mtrippled (Collaborator) left a comment

In performance testing with NFTs, this commit reduces memory size by about 15%, sustained over a long duration.
LGTM 👍

Review comment on the snippet:

    assert(data.size() <= megabytes<std::size_t>(64));

    std::uint8_t* raw = [&data]() -> std::uint8_t* {
        if (data.size() < 128)

Collaborator: Should these be <= instead of <? If not, what's the extra byte in the slab for?

@scottschurr (Collaborator) left a comment

Just a couple of notes:

  1. The clang macOS library does not support std::aligned_alloc yet. As you noted, that can be patched by #include <stdlib.h> and using aligned_alloc. That patch is required in two files: SlabAllocator.h and SHAMapItem.h
  2. Once that patch is made, you can remove the #include <cstdlib> in SlabAllocator.h, at least with clang on macOS.
  3. Clang warns me that the std::move on line 101 of SHAMapSync_test.cpp is a pessimization. I removed that std::move, since I build with warnings as errors.
  4. Once those changes are made, immediately upon invoking the unit tests I get an assert:
Assertion failed: (p_ != nullptr), function SlabAllocator, file /Users/scott/Development/rippled/src/ripple/basics/SlabAllocator.h, line 126.
Abort trap: 6

That assert suggests using non-std aligned_alloc may not work as well as hoped.
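
(For context, roughly the kind of portability shim being discussed; the helper names below are hypothetical and the actual patch may differ. Note that aligned_alloc requires the requested size to be a multiple of the alignment, which is one plausible reason for a null return and the failed assert.)

    #include <cstddef>
    #include <stdlib.h>  // ::aligned_alloc / ::posix_memalign / ::free

    // Hypothetical helper: round the size up to a multiple of the
    // (power-of-two) alignment, and fall back to posix_memalign on Apple
    // platforms where aligned_alloc may be unavailable or stricter.
    inline void* alignedAlloc(std::size_t align, std::size_t size)
    {
        size = (size + align - 1) & ~(align - 1);
    #if defined(__APPLE__)
        void* p = nullptr;
        if (::posix_memalign(&p, align, size) != 0)
            return nullptr;
        return p;
    #else
        return ::aligned_alloc(align, size);
    #endif
    }

    inline void alignedFree(void* p)
    {
        ::free(p);
    }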

That's all I have for now. I'll look more once those comments are addressed.

@HowardHinnant (Contributor) left a comment

Not yet working on macOS, see comments.

@scottschurr (Collaborator)

I thought I saw a fairly recent push to this pull request. So I tried building it. The previously observed build failures on macOS are still present.

@greg7mdp (Contributor) commented Aug 9, 2022

@scottschurr, the following is Nik's branch with my additional changes, which should build: https://github.com/greg7mdp/rippled/tree/slab_alloc

@mtrippled (Collaborator)

@greg7mdp I think you should open a new PR with all of the changes, and @nbougalis should close this one. What do you think, @HowardHinnant & @scottschurr?

@HowardHinnant (Contributor)

I have a slight preference for keeping this one open so I can remember what I said 19 days ago. But I won't object if everyone wants to move to a new PR.

@greg7mdp (Contributor) commented Aug 9, 2022

@mtrippled I also think it would make sense to create a new PR. I suggested it a few days back, but @nbougalis said he would take care of it; he probably wants to review the branch first.
@scottschurr I just verified that this branch builds fine on macOS (with some unrelated warnings).

@scottschurr (Collaborator)

@greg7mdp, I'm fine either way: updating this pull request or creating a new one.

@scottschurr (Collaborator)

And, @greg7mdp, I tried building your branch locally. Yup, your version works way better on macOS than the previous version did. Thanks! I think I will wait until your version is in an official pull request, however, before I start a full-fledged review of the revised code.

@greg7mdp (Contributor)

Thanks, @scottschurr, for trying the mac version. I did build my branch on all three platforms (Linux, Windows, macOS).
@nbougalis, we should either update that PR or start a new one, so the review can proceed.

@nbougalis (Contributor, Author)

I've addressed the concerns and issues highlighted by reviewers. I have also incorporated some of @greg7mdp's ideas into SlabAllocator and refined the memory allocation engine.

I believe that this is ready for another review and, provided it's signed-off, a good candidate for merging.

I expect a reduction in overall memory usage of at least 15% on mainnet servers, but it depends on a number of factors. On my test bed, the reduction is closer to 22%. Performance should also improve.

Thanks.

When instantiating a large number of fixed-size objects on the heap,
the overhead that dynamic memory allocation APIs impose will quickly
become significant.

In some cases, allocating a large block of memory up front and using
a slab allocator to carve it into fixed-size units from which
requests for memory are serviced can significantly reduce memory
fragmentation and, potentially, improve overall performance.

This commit introduces a new `SlabAllocator<>` class that exposes an
API that is _similar_ to the C++ concept of an `Allocator` but it is
not meant to be a general-purpose allocator.

A helper class, `SlabAllocatorSet<>` simplifies handling of variably
sized objects that benefit from slab allocations.

This commit contains significant improvements over the original, all
of which were suggested by Greg Popovitch (@greg7mdp on GitHub).
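
(Illustrative sketch only of what "a set of slabs" might look like; the member functions itemSize(), owns(), allocate(), and deallocate() assumed on the slab type are not the actual SlabAllocator<> interface.)

    #include <cstddef>
    #include <new>
    #include <vector>

    // Route each request to the smallest slab whose chunk size fits it;
    // anything too large, or any request a slab cannot satisfy, falls
    // back to the regular heap.
    template <typename Slab>
    class TinySlabSet
    {
        std::vector<Slab> slabs_;  // assumed sorted by ascending itemSize()

    public:
        explicit TinySlabSet(std::vector<Slab> slabs)
            : slabs_(std::move(slabs))
        {
        }

        void* allocate(std::size_t n)
        {
            for (auto& s : slabs_)
                if (n <= s.itemSize())
                    if (void* p = s.allocate())
                        return p;
            return ::operator new(n);
        }

        void deallocate(void* p, std::size_t n)
        {
            (void)n;  // size is not needed once ownership is known
            for (auto& s : slabs_)
                if (s.owns(p))
                {
                    s.deallocate(p);
                    return;
                }
            ::operator delete(p);
        }
    };
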
The `SHAMapItem` class contains a variable-sized buffer that
holds the serialized data associated with a particular item
inside a `SHAMap`.

Prior to this commit, the buffer for the serialized data was
allocated separately. That, coupled with the fact that most
instances of `SHAMapItem` were wrapped in a `std::shared_ptr`,
meant that a single instantiation might result in up to three
separate memory allocations.

This commit switches away from `std::shared_ptr` for `SHAMapItem`
and uses `boost::intrusive_ptr` instead, allowing the reference
count for an instance to live inside the instance itself. Coupled
with using a slab-based allocator to optimize memory allocation
for the most commonly sized buffers, the net result is significant
memory savings. In testing, the reduction in memory usage hovers
between 400MB and 650MB. Other scenarios might result in larger
savings.
The `Ledger` class contains two `SHAMap` instances: the state and
transaction maps. Prior to this commit, the maps were dynamically
allocated using `std::make_shared` despite the fact that they did
not require lifetime management separate from the lifetime of the
`Ledger` instance to which they belong.

The two `SHAMap` instances are now regular member variables, and a
number of other places where smart pointers and dynamic memory
allocation could be avoided by using stack-based alternatives have
also been changed.
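
(A simplified before/after sketch of the shape of this change; the class names below are placeholders, and the real Ledger and SHAMap are far richer.)

    #include <memory>

    struct SHAMap  // stand-in for illustration only
    {
        // ... tree state ...
    };

    // Before: two heap allocations (plus shared_ptr control blocks) for
    // maps whose lifetime already matched the owning ledger.
    class LedgerBefore
    {
        std::shared_ptr<SHAMap> stateMap_ = std::make_shared<SHAMap>();
        std::shared_ptr<SHAMap> txMap_ = std::make_shared<SHAMap>();
    };

    // After: the maps are plain members, constructed and destroyed with
    // the ledger, with no separate allocations.
    class LedgerAfter
    {
        SHAMap stateMap_;
        SHAMap txMap_;
    };
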
@intelliot dismissed scottschurr's stale review April 3, 2023 17:23

@scottschurr indicated he has no particular need to review this PR, and that it would be best to assign a different reviewer

@intelliot requested review from HowardHinnant and removed the review request for RichardAH, scottschurr, and drlongle April 3, 2023 17:40
@HowardHinnant (Contributor) left a comment

All of my comments are resolved.

This builds and passes the unit tests for me using the latest macOS tools.

@intelliot (Collaborator)

I merged develop and applied clang-format.patch. If acceptable to you, @nbougalis, this PR will be squashed when it is merged, so it's OK to leave the merge commit for the moment. Here is the commit message I suggest; @nbougalis, please confirm that this looks good (or suggest an alternative).

Optimize SHAMapItem with slabbed allocator: (#4218)

* Introduce support for a slab-based memory allocator:

When instantiating a large number of fixed-size objects on the heap,
the overhead that dynamic memory allocation APIs impose will quickly
become significant.

In some cases, allocating a large block of memory up front and using
a slab allocator to carve it into fixed-size units from which
requests for memory are serviced can significantly reduce memory
fragmentation and, potentially, improve overall performance.

The new `SlabAllocator<>` class exposes an API that is _similar_ to the
C++ concept of an `Allocator` but it is not meant to be a
general-purpose allocator.

It should not be used unless profiling and analysis of specific memory
allocation patterns indicates that the additional complexity introduced
will improve the performance of the system overall, and subsequent
profiling proves it.

A helper class, `SlabAllocatorSet<>` simplifies handling of variably
sized objects that benefit from slab allocations.

This commit incorporates improvements suggested by Greg Popovitch
(@greg7mdp).

* Optimize `SHAMapItem` and leverage new slab allocator:

The `SHAMapItem` class contains a variable-sized buffer that holds the
serialized data associated with a particular item inside a `SHAMap`.

Previously, the buffer for the serialized data was allocated separately.
The fact that most instances of `SHAMapItem` were wrapped in a
`std::shared_ptr` meant that an instantiation might result in up to
three separate memory allocations.

This commit switches away from `std::shared_ptr` for `SHAMapItem` and
uses `boost::intrusive_ptr` instead, allowing the reference count for an
instance to live inside the instance itself. Coupled with using a
slab-based allocator to optimize memory allocation for the most commonly
sized buffers, the net result is significant memory savings. In testing,
the reduction in memory usage hovers between 400MB and 650MB. Other
scenarios might result in larger savings.

In performance testing with NFTs, this commit reduces memory size by
about 15% sustained over long duration.

* Avoid using std::shared_ptr when not necessary:

The `Ledger` class contains two `SHAMap` instances: the state and
transaction maps. Previously, the maps were dynamically allocated using
`std::make_shared` despite the fact that they did not require lifetime
management separate from the lifetime of the `Ledger` instance to which
they belong.

The two `SHAMap` instances are now regular member variables. Some smart
pointers and dynamic memory allocations were avoided by using
stack-based alternatives.

@nbougalis (Contributor, Author) commented Apr 6, 2023

I think that the three commits are functionally separate, and each is self-contained. IMO, keeping them as-is is cleaner and more professional, and it makes it easier to review and, if needed, to run git bisect in the future.

@intelliot (Collaborator)

We had a comparable situation with #4192: the PR author recommended keeping the commits as-is, and so we did. Later, we learned there was some discontent with that, and that we should have squashed all the commits together.

I don't personally have any opinion either way, but if we're going to recommend merging this as three separate commits, then we should at least confirm that each commit stands on its own and passes unit tests & clang-format. It may make sense to open 2-3 new PRs for this purpose.

@HowardHinnant (Contributor)

What we decided in #4192 is that if reviewers feel that a commit should be merged, and the commit message does not begin with [FOLD], then the reviewer should recommend a change to the commit message. And vice-versa.

@intelliot (Collaborator)

blocked: let's get some agreement on whether this PR should end up as 3 commits, or 1.

If it's 3 commits, I think it would be best to open a couple of separate PRs for the individual parts, so that we can confirm they are mergeable on their own (they make sense and pass unit tests & clang-format).

intelliot pushed a commit that referenced this pull request Apr 10, 2023
When instantiating a large number of fixed-size objects on the heap,
the overhead that dynamic memory allocation APIs impose will quickly
become significant.

In some cases, allocating a large block of memory up front and using
a slab allocator to carve it into fixed-size units from which
requests for memory are serviced can significantly reduce memory
fragmentation and, potentially, improve overall performance.

This commit introduces a new `SlabAllocator<>` class that exposes an
API that is _similar_ to the C++ concept of an `Allocator` but it is
not meant to be a general-purpose allocator.

It should not be used unless profiling and analysis of specific memory
allocation patterns indicates that the additional complexity introduced
will improve the performance of the system overall, and subsequent
profiling proves it.

A helper class, `SlabAllocatorSet<>` simplifies handling of variably
sized objects that benefit from slab allocations.

This commit incorporates improvements suggested by Greg Popovitch
(@greg7mdp).

Commit 1 of 3 in #4218.
intelliot pushed a commit that referenced this pull request Apr 11, 2023
The `SHAMapItem` class contains a variable-sized buffer that
holds the serialized data associated with a particular item
inside a `SHAMap`.

Prior to this commit, the buffer for the serialized data was
allocated separately. That, coupled with the fact that most
instances of `SHAMapItem` were wrapped in a `std::shared_ptr`,
meant that a single instantiation might result in up to three
separate memory allocations.

This commit switches away from `std::shared_ptr` for `SHAMapItem`
and uses `boost::intrusive_ptr` instead, allowing the reference
count for an instance to live inside the instance itself. Coupled
with using a slab-based allocator to optimize memory allocation
for the most commonly sized buffers, the net result is significant
memory savings. In testing, the reduction in memory usage hovers
between 400MB and 650MB. Other scenarios might result in larger
savings.

In performance testing with NFTs, this commit reduces memory size by
about 15% sustained over long duration.

Commit 2 of 3 in #4218.
@intelliot (Collaborator)

@seelabs confirmed 3 commits for this is fine. I am rebasing the commits individually to ensure they pass CI on GitHub. If they pass, I will merge them no earlier than 2023-04-11 at 8:00 AM PT. If you have any objections, please comment here on this PR.

@intelliot (Collaborator)

Suggested commit message:

Avoid using std::shared_ptr when not necessary: (#4218)

The `Ledger` class contains two `SHAMap` instances: the state and
transaction maps. Previously, the maps were dynamically allocated using
`std::make_shared` despite the fact that they did not require lifetime
management separate from the lifetime of the `Ledger` instance to which
they belong.

The two `SHAMap` instances are now regular member variables. Some smart
pointers and dynamic memory allocations were avoided by using
stack-based alternatives.

Commit 3 of 3 in #4218.

@intelliot (Collaborator)

Will merge after GitHub checks pass, if I don't hear any objections.

@intelliot intelliot merged commit 066f91c into XRPLF:develop Apr 11, 2023
intelliot pushed a commit that referenced this pull request May 31, 2023
Misaligned load and store operations are supported by both Intel and ARM
CPUs. However, in C++, these operations are undefined behavior (UB).
Substituting these operations with a `memcpy` fixes this UB. The
compiled assembly code is equivalent to the original, so there is no
performance penalty to using `memcpy`.

For context: The unaligned load and store operations fixed here were
originally introduced in the slab allocator (#4218).
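
(A small sketch of the pattern that commit describes, not the actual code it touched.)

    #include <cstdint>
    #include <cstring>

    // Read a 64-bit value from a possibly misaligned buffer.
    std::uint64_t loadUnaligned(unsigned char const* p)
    {
        // Undefined behavior, even though it "works" on x86-64 and ARM:
        //   return *reinterpret_cast<std::uint64_t const*>(p);

        // Well-defined; compilers emit the same single load:
        std::uint64_t v;
        std::memcpy(&v, p, sizeof(v));
        return v;
    }
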
@intelliot (Collaborator) commented Jun 5, 2023

@manojsdoshi to try to corroborate the 15-22% memory use reduction

Released in 1.11.0.

@nbougalis deleted the shamapitemslab branch October 16, 2023 05:58
@intelliot added the Perf Test Desired (Optional) label and removed the Performance/Resource Improvement label Jan 25, 2024
Labels
  Perf Test Desired (Optional): RippleX Perf Team should look at this PR. The PR will not necessarily wait for testing to finish.
  Tech Debt: Non-urgent improvements.

7 participants