Issues with online_delete configuration (Version: 1.5.0-rc3) #3321

Closed
ximinez opened this issue Mar 26, 2020 · 5 comments
Labels: Bug; Documentation (README changes, code comments, etc.); Good First Issue (great issue for a new contributor); Syncing issue (trouble getting or keeping a server synced with the network)

Comments

@ximinez
Collaborator

ximinez commented Mar 26, 2020

There are several issues with the configuration options for online_delete:

  1. There are several undocumented features for the [node_db] config section which can be used to tune online_delete performance. These need to be documented at least in rippled-example.cfg.
    • delete_batch - number of records to delete per query
    • backOff - milliseconds to sleep between deletes
    • age_threshold - maximum age of the latest validated ledger before the online_delete process abends.
  2. BUG: The age_threshold option is ignored. Instead, SHAMapStore::health() uses a constexpr value.
    • This needs to be fixed so the configured value is used. Since it's not functional now anyway, also change the name to age_threshold_seconds, and change the SHAMapStore::ageThreshold_ variable to a chrono::seconds.
  3. Because backOff is a millisecond value and is already functional, add a preferred back_off_milliseconds config option, and only document that one. Also change SHAMapStore::backOff_ to be a chrono::milliseconds. Leave backOff available for backward compatibility for anyone using the undocumented feature.
  4. (Edited to add:) Add some begin/end log messages around the execution of the DELETE SQL query at trace level to allow for more detailed analysis if desired.
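
For illustration only, here is a rough Python sketch of items 1, 3 and 4 together. It is not rippled's C++ implementation. The table name `ledgers`, the column `seq`, the logger name, and the 100 ms fallback default are all assumptions made up for the example. Python's logging module has no trace level, so the begin/end messages use debug.

```python
import logging
import sqlite3
import time
from datetime import timedelta

log = logging.getLogger("SHAMapStore")  # logger name is illustrative only


def back_off_from_config(section: dict) -> timedelta:
    """Prefer back_off_milliseconds; fall back to the legacy backOff key."""
    for key in ("back_off_milliseconds", "backOff"):
        if key in section:
            return timedelta(milliseconds=int(section[key]))
    return timedelta(milliseconds=100)  # assumed default, not rippled's


def delete_in_batches(conn: sqlite3.Connection, min_seq: int, last_rotated: int,
                      delete_batch: int, back_off: timedelta) -> int:
    """Delete rows with seq < last_rotated, advancing delete_batch at a time."""
    deleted = 0
    while min_seq < last_rotated:
        min_seq = min(min_seq + delete_batch, last_rotated)
        log.debug("start: DELETE FROM ledgers WHERE seq < %d", min_seq)
        cur = conn.execute("DELETE FROM ledgers WHERE seq < ?", (min_seq,))
        conn.commit()  # commit per batch so the lock is released between queries
        log.debug("end: DELETE FROM ledgers WHERE seq < %d (%d rows)",
                  min_seq, cur.rowcount)
        deleted += cur.rowcount
        if min_seq < last_rotated:
            time.sleep(back_off.total_seconds())  # back off between deletes
    return deleted
```

Holding the back-off as a duration type rather than a bare integer is the Python analogue of the chrono::milliseconds change proposed above: the unit travels with the value.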

Steps to Reproduce

  • Set age_threshold to a value other than the default of 60. Something really low, like "1", would be good for this test.
  • Run rippled until online_delete runs.
  • Observe that the process doesn't abend after the latest validated ledger (LVL) reaches the specified age.

Expected Result

  • The online_delete process abends after age_threshold seconds.

Actual Result

  • Process only abends if the LVL gets to be 60 seconds old.
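
To make the expected/actual difference concrete, here is a small Python model of the check. It is a sketch under assumptions, not the actual SHAMapStore::health() code: the function names are hypothetical, and whether the real comparison is strict (>) or inclusive (>=) is not stated in this issue.

```python
from datetime import timedelta

DEFAULT_AGE_THRESHOLD = timedelta(seconds=60)  # the default cited in this issue


def age_threshold_from_config(section: dict) -> timedelta:
    """Read age_threshold_seconds, falling back to the 60 second default."""
    raw = section.get("age_threshold_seconds")
    if raw is None:
        return DEFAULT_AGE_THRESHOLD
    return timedelta(seconds=int(raw))


def should_abort_reported(validated_age: timedelta, configured: timedelta) -> bool:
    """Models the reported bug: the configured value is ignored."""
    return validated_age > DEFAULT_AGE_THRESHOLD


def should_abort_expected(validated_age: timedelta, configured: timedelta) -> bool:
    """Models the expected behavior: the configured value is honored."""
    return validated_age > configured


if __name__ == "__main__":
    configured = age_threshold_from_config({"age_threshold_seconds": "1"})
    age = timedelta(seconds=5)
    print(should_abort_reported(age, configured), should_abort_expected(age, configured))
```

With a configured threshold of 1 second and a validated ledger that is 5 seconds old, the first function keeps going while the second aborts, which matches the repro steps.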

Environment

n/a

Supporting Files

n/a

@ximinez ximinez added the Bug, Documentation, Good First Issue, and Syncing issue labels Mar 26, 2020
@mtrippled
Collaborator

Also consider an optional parameter to set the SQLite database to not be synchronous. This should allow delete operations to return quickly, releasing the lock. The operating system would then delete the actual data asynchronously. This creates a risk of data problems if the server suddenly crashes, but it would likely reduce the symptoms. It should not be done on full history servers or on systems where keeping that data intact matters. I think these 2 fixes would go a long way towards making things better for Bitso the next time they run online delete.

Basically, run PRAGMA SYNCHRONOUS=OFF somewhere in the initialization code. The setting is per-session, so it has to be issued every time rippled starts up, after opening transaction.db.

@mtrippled
Collaborator

Also, journal_mode=MEMORY (equally risky for data corruption) looks like it would reduce IO usage, and complement turning synchronous off: https://www.sqlite.org/pragma.html#pragma_journal_mode
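
As a rough illustration of the two pragmas suggested in these comments, here is a sketch using Python's standard sqlite3 module rather than rippled's own database layer. The helper names are hypothetical and the file name is just an example path. Both settings apply to the connection, so they are reissued on every open.

```python
import os
import sqlite3
import tempfile


def open_with_relaxed_durability(path: str) -> sqlite3.Connection:
    """Open a database and trade crash safety for less synchronous IO."""
    conn = sqlite3.connect(path)
    # Do not wait for data to reach the disk before returning.
    conn.execute("PRAGMA synchronous = OFF")
    # Keep the rollback journal in RAM instead of in a file on disk.
    conn.execute("PRAGMA journal_mode = MEMORY")
    return conn


def current_settings(conn: sqlite3.Connection) -> tuple:
    """Return (synchronous, journal_mode) as reported by SQLite."""
    sync = conn.execute("PRAGMA synchronous").fetchone()[0]
    journal = conn.execute("PRAGMA journal_mode").fetchone()[0]
    return sync, journal


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_with_relaxed_durability(os.path.join(tmp, "example.db"))
        print(current_settings(conn))
        conn.close()
```

SQLite reports synchronous as an integer (0 means OFF) and journal_mode as a lowercase string, so reading the pragmas back is a cheap way to confirm they took effect on a given connection.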

@carlhua carlhua assigned carlhua and unassigned carlhua Mar 26, 2020
@ximinez ximinez self-assigned this Mar 27, 2020
@ximinez
Collaborator Author

ximinez commented Mar 27, 2020

The current plan is to add config support to tweak some of the SQLite PRAGMA options for our databases. Options to consider include:

@mtrippled
Collaborator

@ximinez One issue is that server_info doesn't currently reflect the age of the validated ledger. Instead, it reflects the last closed ledger. This is in contrast to internal rippled evaluations, such as the one used to abend online_delete, which use validated age. Because of this, it's not possible to tell from diagnostic output whether these evaluations are correct. So I created a patch to correct this going forward. Do you mind incorporating it into your set of fixes for this, please?
https://github.com/mtrippled/rippled/tree/age

@ximinez
Collaborator Author

ximinez commented May 28, 2020

Some more changes that are tangentially related, but close enough to be added to this issue. (Text imported from internal issue RIPD-1590.)

--
In rippled-example.cfg, the [node_db] clause has this comment preceding it:

This is primary persistent datastore for rippled. This includes transaction
metadata, account states, and ledger headers. Helpful information can be
found here: https://ripple.com/wiki/NodeBackEnd
delete old ledgers while maintaining at least 2000. Do not require an
external administrative command to initiate deletion.

Specifically, the https://ripple.com/wiki/NodeBackEnd link is out of date.

We should consider changing the default backend to NuDB. It is currently RocksDB.

We should consider using the smallest allowed online_delete as the default (256, currently 2000).

The [node_db] settings are among the most frequently changed during troubleshooting, so our documentation and defaults should be made as clear as possible. We should explain that the size of ledger history must be adjusted for both the size of the disk drive and system RAM (when using RocksDB). We should also make it clear that online_delete does not run automatically when advisory_delete is not 0.
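
The interaction between online_delete and advisory_delete can be summarized in a few lines. This is a Python sketch with a hypothetical helper, not rippled code. It assumes that a missing or zero online_delete means deletion is disabled, which this issue does not state. The minimum of 256 is the value mentioned above.

```python
MIN_ONLINE_DELETE = 256  # smallest allowed value, per the comment above


def deletion_mode(node_db: dict) -> str:
    """Classify a [node_db] section as 'disabled', 'automatic' or 'advisory'."""
    online_delete = int(node_db.get("online_delete", 0))
    if online_delete == 0:
        return "disabled"  # assumption: absent or 0 turns online deletion off
    if online_delete < MIN_ONLINE_DELETE:
        raise ValueError(f"online_delete must be at least {MIN_ONLINE_DELETE}")
    advisory_delete = int(node_db.get("advisory_delete", 0))
    # A nonzero advisory_delete means deletion waits for an administrative command.
    return "advisory" if advisory_delete != 0 else "automatic"


if __name__ == "__main__":
    print(deletion_mode({"online_delete": "2000", "advisory_delete": "1"}))
```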

ximinez added a commit to ximinez/rippled that referenced this issue Jun 2, 2020
* Document delete_batch, back_off_milliseconds, age_threshold_seconds.
* Convert those time values to chrono types.
* Fix bug that ignored age_threshold_seconds.
* Add a "recovery buffer" to the config that gives the node a chance to
  recover before aborting online delete.
* Add begin/end log messages around the SQL queries.
* Add a new configuration section: [sqlite] to allow tuning the sqlite
  database operations. Ignored on full/large history servers.
* Update documentation of [node_db] and [sqlite] in the
  rippled-example.cfg file.
* Resolves XRPLF#3321
manojsdoshi pushed a commit to manojsdoshi/rippled that referenced this issue Jun 24, 2020
manojsdoshi pushed a commit to manojsdoshi/rippled that referenced this issue Jun 25, 2020