Improve lifetime management of ledger objects (`SLE`s) to prevent runaway memory usage. AKA "Is it caching? It's always caching." #4822
* Only store SLE digest in CachedView; get SLEs from CachedSLEs
* Also force release of last ledger used for path finding if there are no path finding requests to process
* Count more ST objects (derive from `CountedObject`)
* Track CachedView stats in CountedObjects
…emory * upstream/develop: Set version to 2.0.0-rc3 docs(API-CHANGELOG): add extra bullet about DeliverMax (4784) Update API-CHANGELOG.md for release 2.0 (4828) Add Debian 12 Bookworm; ignore core-utils in almalinux (4836) Set version to 2.0.0-rc2 Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (4760) Fix 2.0 regression in tx method with binary output (4812) Update Linux smoketest distros (4813) Promote API version 2 to supported (4803) Support for the mold linker (4807) Proposed 2.0.0-rc2 (4818) Set version to 2.0.0-rc1
…emory * upstream/develop: APIv2: show DeliverMax in submit, submit_multisigned (4827) APIv2: consistently return ledger_index as integer (4820)
That looks like a problem at the websocket layer, which path finding shouldn't affect. Can you tell if the same issue occurs under heavy load without this change?
(Read the request wrong.) *edited* I will build develop tomorrow and check the latest without this change, but I run that interface off the 1.12.0 prod build and it does not fail with this error under decent load.
Right, I did some more testing off the develop branch this morning. @ximinez I can validate that the issue is present outside of this change as well.
…emory * upstream/develop: Set version to 2.0.0-rc5 Workarounds for gcc-13 compatibility (4817) Revert 4505, 4760 (4842) Set version to 2.0.0-rc4
When you say "prod build", do you mean the published packages? Those have asserts disabled, so even if the issue exists there, you wouldn't see it. Could you build https://github.com/XRPLF/rippled/tree/1.12.0 the same way you built
Given the issue appears on
…emory * upstream/develop: Set version to 2.0.0-rc6 Revert "Apply transaction batches in periodic intervals (4504)" (4852) Revert "Add ProtocolStart and GracefulClose P2P protocol messages (3839)" (4850) docs(API-CHANGELOG): clarify changes for V2 (4773) fix typo: 'of' instead of 'on' (4821)
* Rename the CachedView counters
* Fix the scope of the digest lookup lock
@Bronek @thejohnfreeman I pushed a commit to change the
@sophiax851 this may require perf signoff
LGTM 👍
Internal tracker: RPFC-79
…emory * upstream/develop: Set version to 2.0.0-rc7
…emory * upstream/develop: Set version to 2.0.0
Prevent WebSocket connections from trying to close twice. The issue only occurs in debug builds (assertions are disabled in release builds, including published packages), and when the WebSocket connections are unprivileged. The assert (and WRN log) occurs when a client drives up the resource balance enough to be forcibly disconnected while there are still messages pending to be sent. Thanks to @lathanbritz for discovering this issue in #4822.
…emory * upstream/develop: WebSocket should only call async_close once (4848)
Released in 2.0.1.
It's worth mentioning that this "Perf SignedOff" testing was conducted to ensure that there is no performance regression in normal transaction processing with the changes made in this PR, which was intended to address the root cause of the issue that I identified for #4224. Separate testing was conducted to verify #4224.
…away memory usage: (XRPLF#4822)
* Add logging for Application.cpp sweep()
* Improve lifetime management of ledger objects (`SLE`s)
* Only store SLE digest in CachedView; get SLEs from CachedSLEs
* Also force release of last ledger used for path finding if there are no path finding requests to process
* Count more ST objects (derive from `CountedObject`)
* Track CachedView stats in CountedObjects
* Rename the CachedView counters
* Fix the scope of the digest lookup lock
Before this patch, if you asked "is it caching?" It was always caching.
High Level Overview of Change
This PR, if merged, changes `CachedView` to only cache the key needed to find each ledger object (`SLE`) in the global `CachedSLEs`, instead of a strong reference to the `SLE` itself. Further, it releases a hard reference to the last `Ledger` used for path finding work as soon as there is no more work to be done. Additionally, a bunch of logging is added to monitor cache sizes before and after sweeping for expired objects. The logging is not strictly necessary for the solution, but it doesn't appear to hurt, and can be useful if there are similar issues in the future. Finally, several more objects, and some statistics for the `CachedSLEs`, are tracked for the `get_counts` admin command.
Resolves #4224
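The "release the last `Ledger` when idle" part of the overview can be pictured with a small sketch. This is not the code from this PR: `PathWorker`, `Request`, `update`, and the toy `Ledger` struct are hypothetical names invented for illustration. It only shows the general idea that a worker holding a `std::shared_ptr` to its last ledger should reset that pointer on a pass where it finds no pending requests.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not rippled's real classes.
struct Ledger
{
    int seq;
};

struct Request
{
    std::string id;
};

class PathWorker
{
    std::shared_ptr<Ledger const> lastLedger_;  // hard reference
    std::deque<Request> pending_;

public:
    void submit(Request r) { pending_.push_back(std::move(r)); }

    // Handle queued requests against `current`; return how many were handled.
    std::size_t update(std::shared_ptr<Ledger const> current)
    {
        if (pending_.empty())
        {
            // No work to do: drop the hard reference so the ledger
            // (and whatever it keeps alive) can be freed.
            lastLedger_.reset();
            return 0;
        }
        lastLedger_ = std::move(current);
        std::size_t const handled = pending_.size();
        pending_.clear();  // stand-in for the real path computation
        return handled;
    }
};

// Returns true if the ledger stays pinned while the worker holds it and is
// freed after an idle pass.
inline bool ledgerFreedWhenIdle()
{
    PathWorker worker;
    std::shared_ptr<Ledger const> ledger = std::make_shared<Ledger>(Ledger{7});
    std::weak_ptr<Ledger const> watcher = ledger;

    worker.submit(Request{"req-1"});
    worker.update(ledger);  // worker now holds a hard reference
    ledger.reset();         // caller lets go; worker still pins the ledger
    bool const pinned = !watcher.expired();

    worker.update(nullptr);  // idle pass: nothing queued, reference released
    return pinned && watcher.expired();
}
```

In this toy model `ledgerFreedWhenIdle()` returns `true`: the ledger outlives the caller's pointer only while the worker still holds it.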
Context of Change
Currently, the `CachedView` object holds and keeps strong references (`std::shared_ptr`) to objects being loaded from the ledger. This prevents those objects from being completely released from the global cache (`CachedSLEs`, which is a `TaggedCache`) and releasing their memory until the owning `Ledger` is released.
This is not a significant problem during typical server operations, but can lead to memory exhaustion in some circumstances, including path finding. This looks like a memory leak, but isn't really, because the memory would be recovered eventually if rippled was able to continue.
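To make the difference concrete, here is a minimal sketch of the two strategies described above. It is an illustration under stated assumptions, not rippled's implementation: `Entry`, `GlobalCache`, `StrongView`, `DigestView`, `digestOf`, and `loadFromStore` are all hypothetical names, and `sweep()` simply clears the cache instead of expiring entries by age the way a real cache would.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for illustration; none of these are rippled's types.
using Key = std::uint64_t;
using Digest = std::uint64_t;

struct Entry  // plays the role of an SLE
{
    std::string payload;
};

// Toy digest and backing-store lookup, both invented for this sketch.
inline Digest digestOf(Key k) { return k * 31 + 7; }

inline std::shared_ptr<Entry const> loadFromStore(Key k)
{
    return std::make_shared<Entry>(Entry{"object " + std::to_string(k)});
}

// Plays the role of the global cache: digest -> shared object.
class GlobalCache
{
    std::unordered_map<Digest, std::shared_ptr<Entry const>> map_;

public:
    std::shared_ptr<Entry const>
    fetch(Digest d, std::function<std::shared_ptr<Entry const>()> const& load)
    {
        auto it = map_.find(d);
        if (it != map_.end())
            return it->second;
        auto entry = load();
        map_.emplace(d, entry);
        return entry;
    }

    // Simplified sweep: forget everything the cache itself holds.
    void sweep() { map_.clear(); }
};

// Before: the view keeps a strong reference to every object it has read.
class StrongView
{
    GlobalCache& cache_;
    std::unordered_map<Key, std::shared_ptr<Entry const>> map_;

public:
    explicit StrongView(GlobalCache& c) : cache_(c) {}

    std::shared_ptr<Entry const> read(Key k)
    {
        auto it = map_.find(k);
        if (it != map_.end())
            return it->second;
        auto entry = cache_.fetch(digestOf(k), [k] { return loadFromStore(k); });
        map_.emplace(k, entry);
        return entry;
    }
};

// After: the view remembers only the digest and asks the cache each time.
class DigestView
{
    GlobalCache& cache_;
    std::unordered_map<Key, Digest> map_;

public:
    explicit DigestView(GlobalCache& c) : cache_(c) {}

    std::shared_ptr<Entry const> read(Key k)
    {
        auto it = map_.find(k);
        if (it == map_.end())
            it = map_.emplace(k, digestOf(k)).first;
        return cache_.fetch(it->second, [k] { return loadFromStore(k); });
    }
};

// True if an object read through the view is still alive after the cache is
// swept, while the view itself is still in scope.
template <class View>
bool survivesSweep()
{
    GlobalCache cache;
    View view(cache);
    std::weak_ptr<Entry const> watcher = view.read(42);
    cache.sweep();
    return !watcher.expired();
}
```

In this toy model `survivesSweep<StrongView>()` returns `true`, because the view still pins the object, while `survivesSweep<DigestView>()` returns `false`, because only the cache ever owned it. The cost of the digest-only design is an extra cache lookup, and possibly a reload, on later reads.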
It needs to be noted that this is not a guaranteed complete solution for all path finding memory issues. It's still possible that sufficiently large `path_search` configuration values, or a sufficiently large set of steps, accounts, and trust lines, could consume enough memory to cause problems.
Problems with memory usage during path finding have been an ongoing issue. See also:
`RippleState` #4080
I hope that this is the last word on this issue, but real world performance remains to be seen.
Type of Change
API Impact
`get_counts`.
Before / After
Before
Certain combinations of path finding requests could cause rippled to exhaust available memory and crash, or be killed by the operating system.
After
The same scenarios will not cause rippled to crash, or at worst will allow rippled to service many more requests and run a lot longer before crashing.
Test Plan
This is an optimization that shouldn't affect anything other than this particular edge case. Thus existing test cases should cover it.
Beta testers are encouraged to build this branch, put it on public facing servers with path finding configured, and report back on their results.