Improve lifetime management of ledger objects (`SLE`s) to prevent runaway memory usage. AKA "Is it caching? It's always caching." #4822

ximinez · 2023-11-20T21:34:13Z

High Level Overview of Change

This PR, if merged, changes CachedView to only cache the key needed to find each ledger object (SLE) in the global CachedSLEs, instead of a strong reference to the SLE itself. Further, it releases a hard reference to the last Ledger used for path finding work as soon as there is no more work to be done. Additionally, a bunch of logging is added to monitor cache sizes before and after sweeping for expired objects. The logging is not strictly necessary for the solution, but it doesn't appear to hurt, and can be useful if there are similar issues in the future. Finally, several more objects, and some statistics for the CachedSLEs, are tracked for the get_counts admin command.
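The key-only caching idea can be sketched as follows. This is a minimal illustration, not the code in this PR: `Entry`, `Stored`, `SharedCache`, and `KeyOnlyView` are hypothetical stand-ins for the SLE, the ledger's backing store, CachedSLEs, and CachedView, and `sweep()` here drops every entry rather than only expired ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <string>

using Key = std::uint64_t;     // hypothetical ledger key
using Digest = std::uint64_t;  // hypothetical content digest

// Hypothetical stand-in for a ledger object (SLE).
struct Entry
{
    std::string payload;
};

// Hypothetical record in the backing store: digest plus raw content.
struct Stored
{
    Digest digest;
    std::string payload;
};

// Shared cache keyed by digest; plays the role of the global cache.
class SharedCache
{
    std::map<Digest, std::shared_ptr<Entry const>> map_;

public:
    // Return the cached object, or build it with `load` and cache it.
    std::shared_ptr<Entry const>
    fetch(
        Digest digest,
        std::function<std::shared_ptr<Entry const>()> const& load)
    {
        auto it = map_.find(digest);
        if (it != map_.end())
            return it->second;
        auto entry = load();
        if (entry)
            map_.emplace(digest, entry);
        return entry;
    }

    // Simplification: drop everything. A real sweep would drop only
    // entries that have expired.
    void
    sweep()
    {
        map_.clear();
    }

    std::size_t
    size() const
    {
        return map_.size();
    }
};

// The view remembers only key -> digest, never the object itself.
class KeyOnlyView
{
    SharedCache& shared_;
    std::map<Key, Stored> const& store_;
    std::map<Key, Digest> digests_;

public:
    KeyOnlyView(SharedCache& shared, std::map<Key, Stored> const& store)
        : shared_(shared), store_(store)
    {
    }

    std::shared_ptr<Entry const>
    read(Key key)
    {
        auto known = digests_.find(key);
        if (known == digests_.end())
        {
            auto stored = store_.find(key);
            if (stored == store_.end())
                return nullptr;
            known = digests_.emplace(key, stored->second.digest).first;
        }
        // The object always comes from the shared cache; if it was swept,
        // it is rebuilt from the backing store.
        return shared_.fetch(known->second, [&] {
            return std::make_shared<Entry const>(
                Entry{store_.at(key).payload});
        });
    }
};
```

Because the view holds no strong reference, sweeping the shared cache frees the object as soon as callers drop their own pointers, and a later `read` of the same key simply reloads it.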

Resolves #4224

Context of Change

Currently, the CachedView object holds strong references (std::shared_ptr) to objects loaded from the ledger. This prevents those objects from being fully released from the global cache (CachedSLEs, which is a TaggedCache), and their memory from being freed, until the owning Ledger is released.
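The effect of a strong reference can be demonstrated with the standard library alone. The sketch below is only an illustration of `std::shared_ptr` ownership; `Entry` and both function names are hypothetical and do not appear in rippled.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a ledger object (SLE); not rippled's real type.
struct Entry
{
    std::string payload;
};

// Returns true if the object is still alive after every other owner lets
// go, because a per-view cache holds a strong reference to it.
bool
survivesWhileViewHoldsStrongRef()
{
    auto object = std::make_shared<Entry>(Entry{"account root"});
    std::weak_ptr<Entry> observer = object;

    // The per-view cache keeps a strong reference.
    std::map<int, std::shared_ptr<Entry>> viewCache;
    viewCache[1] = object;

    // Every other owner releases the object.
    object.reset();

    // The view's strong reference keeps it alive.
    return !observer.expired();
}

// Returns true if the object is freed once the view holds only a key.
bool
freedWhenViewHoldsOnlyKey()
{
    auto object = std::make_shared<Entry>(Entry{"account root"});
    std::weak_ptr<Entry> observer = object;

    // The per-view cache keeps only a key; 42 is a placeholder value.
    std::map<int, int> viewCache;
    viewCache[1] = 42;

    object.reset();
    return observer.expired();
}
```

Both functions return `true`: in the first, the view pins the object; in the second, nothing does, so the memory is released immediately.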

This is not a significant problem during typical server operations, but can lead to memory exhaustion in some circumstances, including path finding. This looks like a memory leak, but isn't really, because the memory would be recovered eventually if rippled was able to continue.

It needs to be noted that this is not a guaranteed complete solution for all path finding memory issues. It's still possible that sufficiently large path_search configuration values, or a sufficiently large set of steps, accounts, and trust lines could still consume enough memory to cause problems.

Problems with memory usage during path finding have been an ongoing issue. See also:

I hope that this is the last word on this issue, but real world performance remains to be seen.

Type of Change

[x] Bug fix (non-breaking change which fixes an issue)
[x] Refactor (non-breaking change that only restructures code)

API Impact

[x] Public API: New feature (new methods and/or new fields)
- Potentially adds several new key-value pairs to the output of get_counts.
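One common way to produce per-type instance counts like these is a CRTP base class with a static counter. The following is a hedged sketch of that general technique, not rippled's actual `CountedObject`: `Counted`, `FakeSTAmount`, `FakeSTPath`, and `collectCounts` are all hypothetical names.

```cpp
#include <atomic>
#include <map>
#include <string>

// Hypothetical CRTP base: each Derived type gets its own live-instance
// counter.
template <class Derived>
class Counted
{
    static std::atomic<int>&
    counter()
    {
        static std::atomic<int> c{0};
        return c;
    }

public:
    Counted()
    {
        ++counter();
    }

    Counted(Counted const&)
    {
        ++counter();
    }

    Counted&
    operator=(Counted const&) = default;

    ~Counted()
    {
        --counter();
    }

    static int
    liveCount()
    {
        return counter().load();
    }
};

// Hypothetical counted types.
struct FakeSTAmount : Counted<FakeSTAmount>
{
};

struct FakeSTPath : Counted<FakeSTPath>
{
};

// Build a key-value report of live instances, omitting zero counts.
std::map<std::string, int>
collectCounts()
{
    std::map<std::string, int> out;
    if (auto n = FakeSTAmount::liveCount())
        out["FakeSTAmount"] = n;
    if (auto n = FakeSTPath::liveCount())
        out["FakeSTPath"] = n;
    return out;
}
```

Deriving another type from the base is enough to make it show up in the report, which is why counting more objects can add new key-value pairs to the output.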

Before / After

Before

Certain combinations of path finding requests could cause rippled to exhaust available memory and crash, or be killed by the operating system.

After

The same scenarios will not cause rippled to crash, or at worst will allow rippled to service many more requests and run a lot longer before crashing.

Test Plan

This is an optimization that shouldn't affect anything other than this particular edge case. Thus existing test cases should cover it.

Beta testers are encouraged to build this branch, put it on public facing servers with path finding configured, and report back on their results.

* Only store SLE digest in CachedView; get SLEs from CachedSLEs
* Also force release of last ledger used for path finding if there are no path finding requests to process
* Count more ST objects (derive from `CountedObject`)
* Track CachedView stats in CountedObjects
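The "force release of last ledger" item can be pictured with a small sketch. It assumes a hypothetical `PathRequestQueue` that services one request per call; the real logic lives in LedgerMaster.cpp and is not reproduced here.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a ledger.
struct Ledger
{
    int seq;
};

// Hypothetical queue of path finding requests.
class PathRequestQueue
{
    std::shared_ptr<Ledger const> lastLedger_;
    std::vector<int> pending_;  // placeholder request ids

public:
    void
    submit(int id)
    {
        pending_.push_back(id);
    }

    // Service one pending request against `ledger`. If no work remains
    // afterwards, drop the reference so the ledger can be freed.
    void
    processOne(std::shared_ptr<Ledger const> ledger)
    {
        lastLedger_ = std::move(ledger);
        if (!pending_.empty())
            pending_.pop_back();
        if (pending_.empty())
            lastLedger_.reset();
    }

    bool
    holdsLedger() const
    {
        return lastLedger_ != nullptr;
    }
};
```

With two requests queued, the queue still holds the ledger after the first call and releases it after the second, so an idle queue never pins a ledger in memory.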

…emory * upstream/develop: Set version to 2.0.0-rc3 docs(API-CHANGELOG): add extra bullet about DeliverMax (4784) Update API-CHANGELOG.md for release 2.0 (4828) Add Debian 12 Bookworm; ignore core-utils in almalinux (4836) Set version to 2.0.0-rc2 Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (4760) Fix 2.0 regression in tx method with binary output (4812) Update Linux smoketest distros (4813) Promote API version 2 to supported (4803) Support for the mold linker (4807) Proposed 2.0.0-rc2 (4818) Set version to 2.0.0-rc1

intelliot · 2023-11-29T19:42:09Z

Requires additional testing since this makes changes that aren't necessarily specific to pathfinding.
There's a smaller / more targeted change that could be made, which cleans out the pointers: just the changes in src/ripple/app/ledger/impl/LedgerMaster.cpp.
Would like node operators (supporting pathfinding) to run this change to confirm that it resolves the problem.


shortthefomo · 2023-12-04T21:48:55Z

Feedback so far:

1. Follow-up responses to the initial path_find request seem slower than what we have today. The production build seems to return results much more frequently.
2. Memory seems to behave for the 2-3 hours I let the node run on "normal" requests and some light pathing requests.
3. Asked a few people to hit a front end that adds a bunch of load on the box, and managed to bring it down. Here is the error log:

2023-Dec-04 21:38:43.756954454 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.757092579 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.758713948 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.759163464 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.829825677 UTC Resource:WRN Consumer entry 127.0.0.1 dropped with balance 15010 at or above drop threshold 15000
2023-Dec-04 21:38:43.830112890 UTC Resource:WRN Consumer entry 127.0.0.1 dropped with balance 15104 at or above drop threshold 15000
rippled: /root/.conan/data/boost/1.82.0/_/_/package/3a52104b31a9c302344831090b42e066cefe22fe/include/boost/beast/websocket/detail/soft_mutex.hpp:89: bool boost::beast::websocket::detail::soft_mutex::try_lock(const T*) [with T = boost::beast::websocket::stream<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy> >::close_op<boost::asio::executor_binder<ripple::BaseWSPeer<ripple::ServerHandler, ripple::PlainWSPeer<ripple::ServerHandler> >::close(const boost::beast::websocket::close_reason&)::<lambda(const error_code&)>, boost::asio::strand<boost::asio::executor> > >]: Assertion `id_ != T::id' failed.
Aborted

ximinez · 2023-12-04T23:08:40Z

> asked few a few people to hit a front end that adds a bunch of load on the box and managed to bring it down. Here is error log...

That looks like a problem at the websocket layer, which path finding shouldn't affect. Can you tell if the same issue occurs under heavy load without this change?

shortthefomo · 2023-12-05T00:34:53Z

> Can you tell if the same issue occurs under heavy load without this change?

(Read the request wrong) *edited

I will build develop tomorrow and check the latest without this change, but I run that interface off the 1.12.0 prod build, and it does not fail with this error under decent load.

shortthefomo · 2023-12-05T14:31:35Z

right did some more testing off the develop branch this morning @ximinez

can validate that the issue is present outside of this change as well.


ximinez · 2023-12-05T18:39:31Z

> I run that interface off the 1.12.0 prod build and it does not fail with this error under decent load.

When you say "prod build", do you mean the published packages? Those have asserts disabled, so even if the issue exists there, you wouldn't see it. Could you build https://github.com/XRPLF/rippled/tree/1.12.0 the same way you built develop, and see if the issue exists there, too?

intelliot · 2023-12-05T22:53:33Z

> right did some more testing off the develop branch this morning @ximinez
>
> can validate that the issue is present outside of this change as well.

Given the issue appears on develop (without #4822 - and therefore doesn't seem to be related to this PR), can you open a new issue so we can track/discuss your findings there? @lathanbritz

…emory * upstream/develop: Set version to 2.0.0-rc6 Revert "Apply transaction batches in periodic intervals (4504)" (4852) Revert "Add ProtocolStart and GracefulClose P2P protocol messages (3839)" (4850) docs(API-CHANGELOG): clarify changes for V2 (4773) fix typo: 'of' instead of 'on' (4821)


ximinez · 2024-01-05T00:26:04Z

@Bronek @thejohnfreeman I pushed a commit to change the "miss" label to be lower-case. I don't think it requires re-review, but I wanted to give you a heads up.

intelliot · 2024-01-08T22:37:00Z

@sophiax851 this may require perf signoff

mtrippled

LGTM 👍

intelliot · 2024-01-08T23:17:50Z

Internal tracker: RPFC-79


@lathanbritz

Prevent WebSocket connections from trying to close twice. The issue only occurs in debug builds (assertions are disabled in release builds, including published packages), and when the WebSocket connections are unprivileged. The assert (and WRN log) occurs when a client drives up the resource balance enough to be forcibly disconnected while there are still messages pending to be sent. Thanks to @lathanbritz for discovering this issue in #4822.
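A generic way to make a close operation idempotent is to guard it with an atomic flag. The sketch below shows that pattern only; it is an assumption for illustration, not the change made in #4848, and `Connection` with its members is hypothetical.

```cpp
#include <atomic>

// Hypothetical connection wrapper that allows only one close attempt.
class Connection
{
    std::atomic<bool> closing_{false};
    int closeCalls_ = 0;

public:
    // Returns true if this call started the close, false if a close was
    // already in progress.
    bool
    close()
    {
        // exchange() returns the previous value, so only the first caller
        // sees false and proceeds.
        if (closing_.exchange(true))
            return false;
        ++closeCalls_;  // stands in for the single underlying close call
        return true;
    }

    int
    closeCalls() const
    {
        return closeCalls_;
    }
};
```

A second call, such as a forced disconnect arriving while messages are still pending, becomes a no-op instead of re-entering the close path.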


intelliot · 2024-02-02T19:44:31Z

Perf SignedOff: Examined the tagged cache sweeping during lab testing and on a Mainnet node; didn't observe abnormal behavior. No performance regression observed either.

Released in 2.0.1.

sophiax851 · 2024-02-02T20:58:56Z

It's worth mentioning that this "Perf SignedOff" testing was conducted to ensure that the changes in this PR cause no performance regression in normal transaction processing. The PR was intended to address the root cause of the issue that I identified for #4224. Separate testing was conducted to verify #4224.


…away memory usage: (XRPLF#4822)

* Add logging for Application.cpp sweep()
* Improve lifetime management of ledger objects (`SLE`s)
* Only store SLE digest in CachedView; get SLEs from CachedSLEs
* Also force release of last ledger used for path finding if there are no path finding requests to process
* Count more ST objects (derive from `CountedObject`)
* Track CachedView stats in CountedObjects
* Rename the CachedView counters
* Fix the scope of the digest lookup lock

Before this patch, if you asked "is it caching?" It was always caching.

scottschurr and others added 2 commits November 20, 2023 15:30

[LOGGING] Add logging for Application.cpp sweep()

3e317d7

ximinez added Bug Performance/Resource Improvement labels Nov 20, 2023

ximinez requested review from thejohnfreeman, seelabs, HowardHinnant, mtrippled and scottschurr and removed request for seelabs, HowardHinnant and scottschurr November 20, 2023 21:34

intelliot modified the milestones: 2024 release, 2.0.1 Nov 27, 2023

ximinez mentioned this pull request Nov 29, 2023

Rippled locks up when client use pathing (all versions) #4224

Closed

Merge remote-tracking branch 'upstream/develop' into fix-cachedview-m…

7bfdc8f

…emory * upstream/develop: APIv2: show DeliverMax in submit, submit_multisigned (4827) APIv2: consistently return ledger_index as integer (4820)

Merge remote-tracking branch 'upstream/develop' into fix-cachedview-m…

6a8e838

…emory * upstream/develop: Set version to 2.0.0-rc5 Workarounds for gcc-13 compatibility (4817) Revert 4505, 4760 (4842) Set version to 2.0.0-rc4

ximinez mentioned this pull request Dec 7, 2023

Websocket should only call async_close once #4848

Merged

intelliot requested a review from Bronek January 2, 2024 18:56

Bronek requested changes Jan 2, 2024

src/ripple/ledger/impl/CachedView.cpp (outdated, resolved)

src/ripple/ledger/impl/CachedView.cpp (resolved)

[FOLD] Review feedback from @Bronek:

cb1356d

* Rename the CachedView counters
* Fix the scope of the digest lookup lock

Bronek approved these changes Jan 4, 2024


thejohnfreeman approved these changes Jan 4, 2024


ximinez added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Jan 5, 2024

[FOLD, TRIVIAL] Fix case of "miss" label

ff09266

intelliot added Perf Attn Needed Attention needed from RippleX Performance Team and removed Performance/Resource Improvement labels Jan 8, 2024

intelliot requested a review from sophiax851 January 8, 2024 22:36

intelliot assigned sophiax851 Jan 8, 2024

mtrippled approved these changes Jan 8, 2024


ximinez added 2 commits January 8, 2024 19:28

Merge remote-tracking branch 'upstream/develop' into fix-cachedview-m…

bf3f757

…emory * upstream/develop: Set version to 2.0.0-rc7

Merge remote-tracking branch 'upstream/develop' into fix-cachedview-m…

1d882ea

…emory * upstream/develop: Set version to 2.0.0

Merge remote-tracking branch 'upstream/develop' into fix-cachedview-m…

997626a

…emory * upstream/develop: WebSocket should only call async_close once (4848)

seelabs merged commit d9f90c8 into XRPLF:develop Jan 12, 2024
16 checks passed

ximinez deleted the fix-cachedview-memory branch January 12, 2024 17:43

This was referenced Jan 18, 2024

Proposed 2.0.1-b1 #4888

Merged

Proposed 2.0.1-rc1 #4895

Merged

intelliot added Perf SignedOff RippleX Performance Team has approved and removed Perf Attn Needed Attention needed from RippleX Performance Team labels Feb 2, 2024

dangell7 mentioned this pull request Mar 14, 2024

pathing fixes merged to rippled should be merged here as well (Version: 2024.1.25-release+738) Xahau/xahaud#292

Open
