Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically pause copying ledger nodes during online_delete #4907

Draft
wants to merge 59 commits into
base: develop
Choose a base branch
from

Conversation

ximinez
Copy link
Collaborator

@ximinez ximinez commented Jan 31, 2024

High Level Overview of Change

Mitigates disk write contention while old ledgers are being deleted, and specifically while the full ledger is being copied to the "new" node store.

Context of Change

PR #4503 (reverted by #4882) attempted to improve rippled performance by writing batches to NuDB asynchronously. However, it had an unintended side effect that when online_delete writes the entire ledger to disk, it tends to cause the buffer to fill up, which results in blocking new ledgers from being persisted.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Performance (increase or change in throughput and/or latency)

Before / After

Cost: This change will cause online_delete to take significantly longer to copy the full ledger from the old node store to the new one, which is the last significant step in the process.
Benefit: rippled should be much less likely to desync, and put less load on the disk during online_delete.

online_delete is still a demanding process, so this won't be a panacea, but it should be a significant improvment. (Significant improvement to be measured.)

Performance details

  1. This is an improvement to existing functionality, which could be considered a bug fix.
  2. The change impacts node store writes. Specifically, it should reduce contention between online_delete and writing new ledgers.
  3. The impact should be measured in a couple of different ways.
    1. rippled should struggle less during online_delete to stay synced, and other functions.
    2. rippled should put less strain/demand on the disk during online_delete.
  4. This change affects concurrent processing, in the sense that multiple threads are writing to the node store, especially during online_delete.

Note that back_off_milliseconds is configurable, defaulting to 100. Node operators can de-prioritize online_delete operations more by increasing this value to whatever they are comfortable with.

* Gives other processes (notable ledger persistence during consensus)
  more time to complete their writes.
* The period is set as half the node store batch write limit size, so
  that there is plenty of room for other processes.
* Reuses the `back_off_milliseconds` configuration value, which
  is used for other database delays.
@ximinez ximinez added Bug Perf Attn Needed Attention needed from RippleX Performance Team labels Jan 31, 2024
@ximinez ximinez added this to the 2.1.0 (Mar 2024) milestone Jan 31, 2024
@ximinez
Copy link
Collaborator Author

ximinez commented Jan 31, 2024

Internal tracker: https://ripplelabs.atlassian.net/browse/RPFC-107

@codecov-commenter
Copy link

codecov-commenter commented Jan 31, 2024

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 77.9%. Comparing base (ea1fffe) to head (b6ed609).

Files with missing lines Patch % Lines
src/xrpld/app/misc/SHAMapStoreImp.cpp 25.0% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop   #4907   +/-   ##
=======================================
  Coverage     77.9%   77.9%           
=======================================
  Files          783     783           
  Lines        66677   66682    +5     
  Branches      8108    8106    -2     
=======================================
+ Hits         51921   51929    +8     
+ Misses       14756   14753    -3     
Files with missing lines Coverage Δ
src/xrpld/app/misc/SHAMapStoreImp.h 96.6% <100.0%> (ø)
src/xrpld/app/misc/SHAMapStoreImp.cpp 76.6% <25.0%> (+<0.1%) ⬆️

... and 5 files with indirect coverage changes

Impacted file tree graph

Copy link
Collaborator

@mtrippled mtrippled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

* upstream/develop:
  Set version to 2.1.0-rc1
  `fixInnerObjTemplate`: set inner object template (4906)
  feat: allow `port_grpc` to be specified in `[server]` stanza (4728)
  build: add headers needed in Conan package for libxrpl (4885)
  `fixNFTokenReserve`: ensure NFT tx fails when reserve is not met (4767)
  fix(libxrpl): change library names in Conan recipe (4831)
  test: check for success/failure of Windows CI unit tests (4871)
* upstream/develop:
  Set version to 2.1.0
  test: guarantee proper lifetime for temporary Rules object: (4917)
Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM!

* upstream/develop:
  fix compile error on gcc 13: (4932)
  Price Oracle (XLS-47d): (4789) (4789)
* upstream/develop:
  build: add STCurrency.h to xrpl_core to fix clio build (4939)
  feat: add user version of `feature` RPC (4781)
  Fast base58 codec: (4327)
  Remove default ctors from SecretKey and PublicKey: (4607)
* upstream/develop:
  Update remaining actions (4949)
  Upgrade to xxhash 0.8.2 as a Conan requirement, enable SIMD hashing (4893)
  Fix workflows (4948)
  fix: order book update variable swap: (4890)
  Embed patched recipe for RocksDB 6.29.5 (4947)
* upstream/develop:
  Install more public headers (4940)
* upstream/develop:
  test: Env unit test RPC errors return a unique result: (4877)
* upstream/develop:
  fixXChainRewardRounding: round reward shares down: (4933)
  Don't reach consensus as quickly if no other proposals seen: (4763)
  Write improved `forAllApiVersions` used in NetworkOPs (4833)
  Remove zaphod.alloy.ee hub from default server list: (4903)
  Enforce no duplicate slots from incoming connections: (4944)
  `fixEmptyDID`: fix amendment to handle empty DID edge case: (4950)
  perf: improve `account_tx` SQL query: (4955)
  Fix workflows (4951)
* upstream/develop:
  chore: change Github Action triggers for build/test jobs (4956)
* upstream/develop:
  Set version to 2.1.1
  fix: improper handling of large synthetic AMM offers:
* upstream/develop:
  chore: Default validator-keys-tool to master branch: (4943)
* upstream/develop:
  Set version to 2.2.0-b2
  fix Conan component reference typo
  Remove unused lambdas from MultiApiJson_test
* Reduce the copy backoff interval from 32k to 128
* Increase the backoff time from 100ms to 2s
* upstream/develop:
  fix: resolve database deadlock: (4989)
  test: verify the rounding behavior of equal-asset AMM deposits (4982)
  test: Add tests to raise coverage of AMM (4971)
  chore: Improve codecov coverage reporting (4977)
  test: Unit test for AMM offer overflow (4986)
  fix amendment to add `PreviousTxnID`/`PreviousTxnLgrSequence` (4751)
ximinez added 15 commits July 19, 2024 17:08
* upstream/develop:
  Add xrpld build option and Conan package test (5052)
* upstream/develop:
  Update BUILD.md after PR 5052 (5067)
* upstream/develop:
  chore: Add comments to SignerEntries.h (5059)
  chore: Rename two files from Directory* to Dir*: (5058)
* upstream/develop:
  Ensure levelization sorting is ASCII-order across platforms (5072)
  fix: Fix NuDB build error via Conan patch (5061)
  Disallow filtering account_objects by unsupported types (5056)
* upstream/develop:
  Update gcovr EXCLUDE (5084)
  Fix crash inside `OverlayImpl` loops over `ids_` (5071)
  Set version to 2.3.0-b2
  docs: Document the process for merging pull requests (5010)
  Remove unused constants from resource/Fees.h (4856)
  fix: change error for invalid `feature` param in `feature` RPC (5063)
  Set version to 2.2.1
  Use error codes throughout fast Base58 implementation
  Improve error handling in some RPC commands
* upstream/develop:
  Factor out Transactor::trapTransaction (5087)
  Remove shards (5066)
* upstream/develop:
  Address rare corruption of NFTokenPage linked list (4945)
* upstream/develop:
  chore: Fix documentation generation job: (5091)
  chore: libxrpl verification on CI (5028)
* upstream/develop:
  docs: Update options documentation (5083)
  refactor: Remove dead headers (5081)
  refactor: Remove reporting mode (5092)
* upstream/develop:
  Update Release Notes for 2.2.1 and 2.2.2
  Set version to 2.2.2
  Allow only 1 job queue slot for each validation ledger check
  Allow only 1 job queue slot for acquiring inbound ledger.
  Track latencies of certain code blocks, and log if they take too long
* upstream/develop:
  test: Retry RPC commands to try to fix MacOS CI jobs (5120)
* upstream/develop:
  Set version to 2.3.0-b4
  feat(SQLite): allow configurable database pragma values (5135)
  refactor: re-order PRAGMA statements (5140)
  fix(book_changes): add "validated" field and reduce RPC latency (5096)
  chore: fix typos in comments (5094)
  Set version to 2.2.3
  Update SQLite3 max_page_count to match current defaults (5114)
Merge remote-tracking branch 'upstream/develop' into shamapstore-backoff

* upstream/develop:
  Expand Error Message for rpcInternal (4959)
  docs: clean up API-CHANGELOG.md (5064)
* upstream/develop:
  Consolidate definitions of fields, objects, transactions, and features (5122)
  Ignore reformat when blaming
  Reformat code with clang-format-18
  Update pre-commit hook
  Update clang-format settings
  Update clang-format workflow
* upstream/develop:
  docs: Add protobuf dependencies to linux setup instructions (5156)
  fix: reject invalid markers in account_objects RPC calls (5046)
  Update RELEASENOTES.md (5154)
  Introduce MPT support (XLS-33d): (5143)
@ximinez ximinez marked this pull request as draft October 31, 2024 22:16
@ximinez
Copy link
Collaborator Author

ximinez commented Oct 31, 2024

Even though this has two approvals, I'm changing it back to draft because early perf tests did not show any change, and there may be a better / different solution.

* upstream/develop:
  Add hubs.xrpkuwait.com to bootstrap (5169)
* upstream/develop:
  Add AMMClawback Transaction (XLS-0073d) (5142)
* upstream/develop:
  Set version to 2.3.0-rc1
  Replace Uint192 with Hash192 in server_definitions response (5177)
  Fix potential deadlock (5124)
  Introduce Credentials support (XLS-70d): (5103)
  Fix token comparison in Payment (5172)
  Add fixAMMv1_2 amendment (5176)
* upstream/develop:
  fix: include `index` in `server_definitions` RPC (5190)
  Fix ledger_entry crash on invalid credentials request (5189)
* upstream/develop:
  Set version to 2.3.0
  refactor(AMMClawback): move tfClawTwoAssets check (5201)
  Add a new serialized type: STNumber (5121)
  fix: check for valid ammID field in amm_info RPC (5188)
* upstream/develop:
  test: Add more test cases for Base58 parser (5174)
  test: Check for some unlikely null dereferences in tests (5004)
  Add Antithesis intrumentation (5042)
* upstream/develop:
  refactor: clean up `LedgerEntry.cpp` (5199)
* upstream/develop:
  Enforce levelization in libxrpl with CMake (5111)
* upstream/develop:
  Set version to 2.4.0-b1
  fix: Add header for set_difference (5197)
  fix: allow overlapping types in `Expected` (5218)
  Add MPTIssue to STIssue (5200)
  Antithesis instrumentation improvements (5213)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Perf Attn Needed Attention needed from RippleX Performance Team
Projects
Status: 🏗 In progress
Development

Successfully merging this pull request may close these issues.

6 participants