Issues with online_delete configuration (Version: 1.5.0-rc3) #3321

ximinez · 2020-03-26T17:01:22Z

Several issues with the configuration options for `online_delete`

There are several undocumented features for the [node_db] config section which can be used to tune online_delete performance. These need to be documented at least in rippled-example.cfg.
- delete_batch - number of records to delete per query
- backOff - milliseconds to sleep between deletes
- age_threshold - maximum age of the latest validated ledger before the online_delete process abends.
BUG: The age_threshold option is ignored. Instead, SHAMapStore::health() uses a constexpr value.
- This needs to be fixed so the configured value is used. Since the option is not functional now anyway, also rename it to age_threshold_seconds, and change the SHAMapStore::ageThreshold_ variable to a chrono::seconds.

Because backOff is a millisecond value and is already functional, add a preferred back_off_milliseconds config option, and document only that one. Also change SHAMapStore::backOff_ to a chrono::milliseconds. Leave backOff available for backward compatibility for anyone using the undocumented feature.

(Edited to add:) Add begin/end log messages at trace level around the execution of the DELETE SQL query to allow more detailed analysis if desired.
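
A minimal C++ sketch of the proposed option handling, for illustration only. `Section`, `DeleteTuning`, `getUnsigned`, and `loadTuning` are hypothetical stand-ins rather than rippled's actual config classes. The delete_batch and back-off defaults are placeholders; only the 60-second age default comes from this report.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a parsed [node_db] section: key -> raw value.
using Section = std::map<std::string, std::string>;

// Tuning values held as chrono types so the unit is part of the type.
struct DeleteTuning
{
    std::uint32_t deleteBatch = 100;          // placeholder default (assumption)
    std::chrono::milliseconds backOff{100};   // placeholder default (assumption)
    std::chrono::seconds ageThreshold{60};    // default of 60 per this report
};

// Return the value stored under `key` as an unsigned integer, if present.
std::optional<unsigned long>
getUnsigned(Section const& s, std::string const& key)
{
    auto const it = s.find(key);
    if (it == s.end())
        return std::nullopt;
    return std::stoul(it->second);  // throws on non-numeric input
}

DeleteTuning
loadTuning(Section const& s)
{
    DeleteTuning t;
    if (auto v = getUnsigned(s, "delete_batch"))
        t.deleteBatch = static_cast<std::uint32_t>(*v);

    // Prefer the new, documented name; fall back to the legacy key.
    if (auto v = getUnsigned(s, "back_off_milliseconds"))
        t.backOff = std::chrono::milliseconds{*v};
    else if (auto legacy = getUnsigned(s, "backOff"))
        t.backOff = std::chrono::milliseconds{*legacy};

    if (auto v = getUnsigned(s, "age_threshold_seconds"))
        t.ageThreshold = std::chrono::seconds{*v};
    return t;
}
```

With this shape, a config that sets both backOff and back_off_milliseconds uses the new key, and a config that sets only backOff keeps working.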

Steps to Reproduce

Define age_threshold to a value different than the default 60. Something really low, like "1" would be good for this test.
Run rippled until online_delete runs.
Observe that the process doesn't abend after the LVL reaches the specified age.

Expected Result

The online_delete process abends after age_threshold seconds.

Actual Result

Process only abends if the LVL gets to be 60 seconds old.
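
The gap between the expected and actual results can be sketched in C++ as follows. This is not the real SHAMapStore::health() code: the `Health` enum, both function names, and the strict `>` comparison are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>

enum class Health { ok, unhealthy };

// Behavior described in the report: the limit is a compile-time constant,
// so whatever the operator configured is never consulted.
Health
healthIgnoringConfig(
    std::chrono::seconds validatedAge,
    std::chrono::seconds /* configuredThreshold, unused */)
{
    constexpr std::chrono::seconds limit{60};
    return validatedAge > limit ? Health::unhealthy : Health::ok;
}

// Expected behavior: compare against the configured threshold.
Health
healthUsingConfig(
    std::chrono::seconds validatedAge,
    std::chrono::seconds configuredThreshold)
{
    return validatedAge > configuredThreshold ? Health::unhealthy
                                               : Health::ok;
}
```

With a configured threshold of 1 second and a validated ledger that is 5 seconds old, the first function still reports `ok`, while the second reports `unhealthy`.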

Environment

n/a

Supporting Files

n/a

mtrippled · 2020-03-26T18:33:09Z

Also consider an optional parameter to set the SQLite database to not be synchronous. What this should do is allow delete operations to return quickly, releasing the lock. The actual data would be deleted by the operating system asynchronously. This creates a risk of data problems if the server suddenly crashes, but it would likely reduce their symptoms. It should not be done on full history servers or on systems where we care too much about keeping that data intact. I think these 2 fixes would go a long way towards making things better for bitso the next time they run online delete.

Basically, PRAGMA SYNCHRONOUS=OFF somewhere in the initialization code. It is per-session, so it has to be set every time rippled starts up, after opening the transaction.db.

mtrippled · 2020-03-26T20:42:59Z

Also, journal_mode=MEMORY (equally risky for data corruption) looks like it would reduce IO usage, and complement turning synchronous off: https://www.sqlite.org/pragma.html#pragma_journal_mode

ximinez · 2020-03-27T19:42:38Z

The current plan is to add config support to tweak some of the SQLite PRAGMA options for our databases. Options to consider include:

https://www.sqlite.org/pragma.html#pragma_journal_mode
https://www.sqlite.org/pragma.html#pragma_synchronous
https://www.sqlite.org/pragma.html#pragma_temp_store
https://www.sqlite.org/pragma.html#pragma_optimize
Consider an exception for nodes running full history. Those should be forced to emphasise data safety over other considerations. They don't use online_delete anyway.
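
One way to picture the config-driven PRAGMA idea with a full-history exception is the C++ sketch below. `SqliteTuning` and `pragmasFor` are hypothetical names, and skipping every override on full-history nodes is just one possible reading of "emphasise data safety". The sketch only builds the statement strings; a real implementation would also validate each value against the set SQLite accepts before executing anything.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical holder for operator-requested SQLite settings.
struct SqliteTuning
{
    std::string journalMode;  // e.g. "memory"
    std::string synchronous;  // e.g. "off"
    std::string tempStore;    // e.g. "memory"
};

// Build the PRAGMA statements to run after opening a database.
// Full-history nodes get no overrides, so SQLite keeps its own
// settings and data safety is not traded for speed.
std::vector<std::string>
pragmasFor(SqliteTuning const& t, bool fullHistory)
{
    if (fullHistory)
        return {};
    return {
        "PRAGMA journal_mode=" + t.journalMode + ";",
        "PRAGMA synchronous=" + t.synchronous + ";",
        "PRAGMA temp_store=" + t.tempStore + ";"};
}
```

Because these settings apply per connection, the returned statements would need to be executed each time a database is opened.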

mtrippled · 2020-03-29T00:47:10Z

@ximinez One issue is that server_info doesn't currently reflect the validated server age. Instead, it reflects the last closed ledger. This is in contrast to internal rippled evaluations, such as the one used to abend online_delete, which use validated age. Because of this, it's not possible to tell from the diagnostics output whether these evaluations are correct. So I created a patch to correct this going forward. Do you mind incorporating it into your set of fixes for this, please?
https://github.com/mtrippled/rippled/tree/age

ximinez · 2020-05-28T16:56:33Z

Some more changes that are tangentially related, but close enough to be added to this issue. (Text imported from internal issue RIPD-1590.)

--
In rippled-example.cfg, the [node_db] clause has this comment preceding it:

This is primary persistent datastore for rippled. This includes transaction
metadata, account states, and ledger headers. Helpful information can be
found here: https://ripple.com/wiki/NodeBackEnd
delete old ledgers while maintaining at least 2000. Do not require an
external administrative command to initiate deletion.

Specifically, the https://ripple.com/wiki/NodeBackEnd link is out of date.

We should consider changing the defaults to use nudb. It is currently RocksDB.

We should consider using the smallest allowed online_delete as the default (256, currently 2000).

The [node_db] settings are among the most frequently changed during troubleshooting, so our documentation and defaults should be made as clear as possible. We should explain that the size of ledger history must be adjusted for both the size of disk drive and system RAM (when using RocksDB). And make it clear that online_delete does not run automatically when advisory_delete is not 0.

* Document delete_batch, back_off_milliseconds, age_threshold_seconds.
* Convert those time values to chrono types.
* Fix bug that ignored age_threshold_seconds.
* Add a "recovery buffer" to the config that gives the node a chance to recover before aborting online delete.
* Add begin/end log messages around the SQL queries.
* Add a new configuration section: [sqlite] to allow tuning the sqlite database operations. Ignored on full/large history servers.
* Update documentation of [node_db] and [sqlite] in the rippled-example.cfg file.
* Resolves XRPLF#3321

ximinez added the Bug, Documentation, Good First Issue, and Syncing issue labels Mar 26, 2020

carlhua assigned carlhua and unassigned carlhua Mar 26, 2020

ximinez self-assigned this Mar 27, 2020

ximinez mentioned this issue May 12, 2020

Make server_info report consistent with internal #3397

Closed

ximinez mentioned this issue Jun 2, 2020

Improve online_delete configuration and DB tuning: #3429

Closed

manojsdoshi closed this as completed in 4702c8b Jun 30, 2020
