Improve online_delete configuration and DB tuning #3429

Closed
wants to merge 13 commits

Changes from 7 commits
163 changes: 137 additions & 26 deletions cfg/rippled-example.cfg
@@ -36,7 +36,7 @@
# For more information on where the rippled server instance searches for the
# file, visit:
#
# https://developers.ripple.com/commandline-usage.html#generic-options
# https://xrpl.org/commandline-usage.html#generic-options
#
# This file should be named rippled.cfg. This file is UTF-8 with DOS, UNIX,
# or Mac style end of lines. Blank lines and lines beginning with '#' are
@@ -869,18 +869,65 @@
#
# These keys are possible for any type of backend:
#
# earliest_seq The default is 32570 to match the XRP ledger
# network's earliest allowed sequence. Alternate
# networks may set this value. Minimum value of 1.
# If a [shard_db] section is defined, and this
# value is present either [node_db] or [shard_db],
# it must be defined with the same value in both
# sections.
#
# online_delete Minimum value of 256. Enable automatic purging
# of older ledger information. Maintain at least this
# number of ledger records online. Must be greater
# than or equal to ledger_history.
#
# advisory_delete 0 for disabled, 1 for enabled. If set, then
# require administrative RPC call "can_delete"
# to enable online deletion of ledger records.
# These keys modify the behavior of online_delete, and thus are only
# relevant if online_delete is defined and non-zero:
#
# earliest_seq The default is 32570 to match the XRP ledger
# network's earliest allowed sequence. Alternate
# networks may set this value. Minimum value of 1.
# advisory_delete 0 for disabled, 1 for enabled. If set, the
# administrative RPC call "can_delete" is required
# to enable online deletion of ledger records.
# Online deletion does not run automatically if
# non-zero and the last deletion was on a ledger
# greater than the current "can_delete" setting.
# Default is 0.
#
# delete_batch When automatically purging, SQLite database
# records are deleted in batches. This value
# controls the maximum size of each batch. Larger
# batches keep the databases locked for more time,
# which may cause other functions to fall behind,
# and thus cause the node to lose sync.
# Default is 100.
#
# back_off_milliseconds
# Number of milliseconds to wait between
# online_delete batches to allow other functions
# to catch up.
# Default is 100.
#
# age_threshold_seconds
# The online delete process will only run if the
# latest validated ledger is younger than this
# number of seconds.
# Default is 60.
#
# recovery_buffer_seconds
# The online delete process checks periodically
# that rippled is still in sync with the network,
# and that the validated ledger is less than
# 'age_threshold_seconds' old. By default, if it
# is not the online delete process aborts and
# tries again later. If 'recovery_buffer_seconds'
# is set and rippled is out of sync, but likely to
# recover quickly, then online delete will wait
# this number of seconds for rippled to get back
# into sync before it aborts.
# Set this value if the node is otherwise staying
# in sync, or recovering quickly, but the online
# delete process is unable to finish.
# Default is unset.
#
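As a rough sketch of how the purge-related keys above combine, a [node_db] stanza might look like the following. The values are illustrative only: type, path, online_delete, and advisory_delete mirror the example [node_db] stanza further down in this file; delete_batch, back_off_milliseconds, and age_threshold_seconds restate their documented defaults; recovery_buffer_seconds has no default, so 30 is an arbitrary choice.

```
[node_db]
type=NuDB
path=/var/lib/rippled/db/nudb
online_delete=512
advisory_delete=0
delete_batch=100
back_off_milliseconds=100
age_threshold_seconds=60
recovery_buffer_seconds=30
```

With settings like these, the server keeps at least 512 ledgers online, deletes older SQLite records in batches of up to 100, waits 100 milliseconds between batches, only starts a deletion pass while the validated ledger is under 60 seconds old, and gives a briefly out-of-sync node 30 seconds to recover before aborting the pass.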
# Notes:
# The 'node_db' entry configures the primary, persistent storage.
@@ -892,6 +939,12 @@
# [import_db] Settings for performing a one-time import (optional)
# [database_path] Path to the book-keeping databases.
#
# There are 4 or 5 bookkeeping SQLite databases that the server creates and

Collaborator:
Why is it "4 or 5"? Is one of them only added with certain configurations? Without more context, the phrase "4 or 5" feels uncertain and vague. Maybe "4 to 5" would carry the right implication.

Collaborator (author):
I think state.db is created if you're using online_delete, so that would be the 5th one. I'll make the change.

# maintains. If you omit this configuration setting, it will default to
# creating a directory called "db" located in the same place as your
# rippled.cfg file. Partial pathnames will be considered relative to
# the location of the rippled executable.

Collaborator:
nit: instead of "it will default to creating a directory", just say, "the server creates a directory." Similarly, instead of "will be considered relative" just say "are relative".

Collaborator (author):
Changed and reworded to use a less passive tone.

#
# [shard_db] Settings for the Shard Database (optional)
#
# Format (without spaces):
@@ -907,12 +960,68 @@
#
# max_size_gb Maximum disk space the database will utilize (in gigabytes)
#
# [sqlite] Tuning settings for the SQLite databases (optional)
#
# There are 4 bookkeeping SQLite database that the server creates and
# maintains. If you omit this configuration setting, it will default to
# creating a directory called "db" located in the same place as your
# rippled.cfg file. Partial pathnames will be considered relative to
# the location of the rippled executable.
# Format (without spaces):
# One or more lines of case-insensitive key / value pairs:
# <key> '=' <value>
# ...
#
# Example:
# safety_level=low
# journal_mode=off
#
# WARNING: These settings can have significant effects on data integrity,
# particularly in failure scenarios. It is strongly recommended that they
# be left at their defaults unless the server is having performance issues
# during normal operation or during automatic purging (online_delete)
# operations. A warning will be logged on startup if 'ledger_history'
# is configured to store more than 10,000,000 ledgers and any of these
# settings are less safe than the default. This is due to the inordinate
# amount of time and bandwidth it will take to safely rebuild a corrupted
# database from other peers.
#
# Optional keys:
#
# safety_level Valid values: high, low
# The default is "high", and tunes the SQLite
# databases in the most reliable mode. "low"
# is equivalent to
# journal_mode=memory
# synchronous=off
# temp_store=memory
# These settings trade speed and reduced I/O
# for a higher risk of data loss. See the
# individual settings below for more information.
#
# journal_mode Valid values: delete, truncate, persist, memory, wal, off
# The default is "wal", which uses a write-ahead
# log to implement database transactions.
# Alternately, "memory" saves disk I/O, but if
# rippled crashes during a transaction, the
# database is likely to be corrupted.
# See https://www.sqlite.org/pragma.html#pragma_journal_mode
# for more details about the available options.
#
# synchronous Valid values: off, normal, full, extra
# The default is "normal", which works well with
# the "wal" journal mode. Alternatively, "off"
# allows rippled to continue as soon as data is
# passed to the OS, which can significantly
# increase speed, but risks data corruption if
# the host computer crashes before writing that
# data to disk.
# See https://www.sqlite.org/pragma.html#pragma_synchronous
# for more details about the available options.
#
# temp_store Valid values: default, file, memory
# The default is "file", which will use files
# for temporary database tables and indices.
# Alternatively, "memory" may save I/O, but
# rippled does not currently use many, if any,
# of these temporary objects.
# See https://www.sqlite.org/pragma.html#pragma_temp_store
# for more details about the available options.
#
#
#
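As a concrete sketch of the [sqlite] stanza described above (illustrative only; with the defaults no stanza is needed at all), the shorthand form would be:

```
[sqlite]
safety_level=low
```

which, per the safety_level description above, is equivalent to spelling out the individual keys:

```
[sqlite]
journal_mode=memory
synchronous=off
temp_store=memory
```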
@@ -1212,24 +1321,25 @@ medium

# This is primary persistent datastore for rippled. This includes transaction
# metadata, account states, and ledger headers. Helpful information can be
# found here: https://ripple.com/wiki/NodeBackEnd
# delete old ledgers while maintaining at least 2000. Do not require an
# external administrative command to initiate deletion.
# found at https://xrpl.org/capacity-planning.html#node-db-type
# type=NuDB is recommended for non-validators with fast SSDs. Validators or

Collaborator:
I would use a none database backend on a validator. They don't need to concern themselves with historic data at all.

Collaborator (author):
The linked capacity planning page recommends RocksDB, but I can definitely see your point about using none (or memory). My biggest concern with using those options on the individual node level, is that it means it will take much longer to start and restart, because the node will have to download the entire ledger every time. On the global level, I'm thinking about a failsafe / worst case situation. For example, if every validator shuts down simultaneously 😱, they will need to have some data on disk so they can bootstrap restarting the network.

Collaborator:
Anecdotally, my full history node takes about half an hour to start with --load vs. a few minutes at most to get synced with --net.

Bootstrapping can be done safely with --ledgerfile (https://xrpl.org/commandline-usage.html#initial-ledger-options), and that's probably a better way forward anyway than trusting that data was stored to disk correctly in such a catastrophic scenario. The file itself could be hashed and communicated out-of-band in a lot of different ways, compared to a node database that's not even a full shard in most cases.

Contributor:
@MarkusTeufelberger: re: slow start times with --load: are you using SSDs?

Collaborator:
Of course. Over a dozen TB of 'em.

Contributor:
I think that --net should become the default now, unless a different command line option (for which I don't have a good name) is specified.

Collaborator (author):
@nbougalis I think making --net the default is out of scope for this PR. Do you agree?

# slow / spinning disks should use RocksDB.

Collaborator:
Consider adding a note like this:

Caution: Spinning disks are not recommended. They do not perform well enough to consistently remain synced to the network.

# online_delete=512 is recommended to delete old ledgers while maintaining at
# least 512.
# advisory_delete=0 allows the online delete process to run automatically
# when the node has approximately two times the "online_delete" value of
# ledgers. No external administrative command is required to initiate
# deletion.
[node_db]
type=RocksDB
path=/var/lib/rippled/db/rocksdb
open_files=2000
filter_bits=12
cache_mb=256
file_size_mb=8
file_size_mult=2
online_delete=2000
type=NuDB
path=/var/lib/rippled/db/nudb
online_delete=512

Collaborator:
If such a small value is the default, please consider also adding a shard_db section with a limit of a few dozen/hundred GiB by default.

Collaborator (author), @ximinez, Jun 3, 2020:
@MarkusTeufelberger Thanks for that feedback. Those default settings are definitely not yet set in stone. My understanding is that the suggestion for the smaller default was to make it easier for more people to participate without requiring huge resources. (The original ticket didn't have a lot of detail.) Of course, there's a balance to be found, because we already have pretty high hardware requirements, and we do want as many nodes as possible to contribute to history storage as a way to help the network.

I'm not sure if shards are ready to be set by default. I'll defer to @miguelportilla on that question.

advisory_delete=0

# This is the persistent datastore for shards. It is important for the health
# of the ripple network that rippled operators shard as much as practical.
# NuDB requires SSD storage. Helpful information can be found here
# https://ripple.com/build/history-sharding
# NuDB requires SSD storage. Helpful information can be found at
# https://xrpl.org/history-sharding.html
#[shard_db]
#path=/var/lib/rippled/db/shards/nudb
#max_size_gb=500
Expand All @@ -1248,7 +1358,8 @@ time.apple.com
time.nist.gov
pool.ntp.org

# To use the XRP test network (see https://ripple.com/build/xrp-test-net/),
# To use the XRP test network
# (see https://xrpl.org/connect-your-rippled-to-the-xrp-test-net.html),
# use the following [ips] section:
# [ips]
# r.altnet.rippletest.net 51235
4 changes: 2 additions & 2 deletions src/ripple/app/ledger/Ledger.cpp
@@ -228,14 +228,14 @@ Ledger::Ledger(
!txMap_->fetchRoot(SHAMapHash{info_.txHash}, nullptr))
{
loaded = false;
JLOG(j.warn()) << "Don't have TX root for ledger";
JLOG(j.warn()) << "Don't have TX root for ledger" << info_.seq;

Collaborator:
Nit: as long as we're changing these messages, we could make them clearer by removing the abbreviations "TX" (here) and "AS" (below, line 238). Instead, just say "transaction" and "state data".

}

if (info_.accountHash.isNonZero() &&
!stateMap_->fetchRoot(SHAMapHash{info_.accountHash}, nullptr))
{
loaded = false;
JLOG(j.warn()) << "Don't have AS root for ledger";
JLOG(j.warn()) << "Don't have AS root for ledger" << info_.seq;
}

txMap_->setImmutable();
4 changes: 4 additions & 0 deletions src/ripple/app/ledger/LedgerMaster.h
@@ -54,6 +54,10 @@ class Transaction;
class LedgerMaster : public Stoppable, public AbstractFetchPackContainer
{
public:
// Age for last validated ledger if the process has yet to validate.
static constexpr std::chrono::seconds NO_VALIDATED_LEDGER_AGE =
std::chrono::hours{24 * 14};

explicit LedgerMaster(
Application& app,
Stopwatch& stopwatch,
2 changes: 1 addition & 1 deletion src/ripple/app/ledger/impl/LedgerMaster.cpp
@@ -269,7 +269,7 @@ LedgerMaster::getValidatedLedgerAge()
if (valClose == 0s)
{
JLOG(m_journal.debug()) << "No validated ledger";
return weeks{2};
return NO_VALIDATED_LEDGER_AGE;
}

std::chrono::seconds ret = app_.timeKeeper().closeTime().time_since_epoch();
3 changes: 2 additions & 1 deletion src/ripple/app/main/Application.cpp
@@ -1022,7 +1022,7 @@ class ApplicationImp : public Application, public RootStoppable, public BasicApp

try
{
auto const setup = setup_DatabaseCon(*config_);
auto setup = setup_DatabaseCon(*config_);

// transaction database
mTxnDB = std::make_unique<DatabaseCon>(
@@ -1072,6 +1072,7 @@ class ApplicationImp : public Application, public RootStoppable, public BasicApp
mLedgerDB->setupCheckpointing(m_jobQueue.get(), logs());

// wallet database
setup.noPragma();
mWalletDB = std::make_unique<DatabaseCon>(
setup,
WalletDBName,
33 changes: 21 additions & 12 deletions src/ripple/app/main/DBInit.h
@@ -26,13 +26,23 @@ namespace ripple {

////////////////////////////////////////////////////////////////////////////////

// These pragmas are built at startup and applied to all database
// connections, unless otherwise noted.
inline constexpr char const* CommonDBPragmaJournal{"PRAGMA journal_mode=%s;"};
inline constexpr char const* CommonDBPragmaSync{"PRAGMA synchronous=%s;"};
inline constexpr char const* CommonDBPragmaTemp{"PRAGMA temp_store=%s;"};
// A warning will be logged if any lower-safety sqlite tuning settings
// are used and at least this much ledger history is configured. This
// includes full history nodes. This is because such a large amount of
// data will be more difficult to recover if a rare failure occurs,
// which are more likely with some of the other available tuning settings.
inline constexpr std::uint32_t SQLITE_TUNING_CUTOFF = 10'000'000;

// Ledger database holds ledgers and ledger confirmations
inline constexpr auto LgrDBName{"ledger.db"};

inline constexpr std::array<char const*, 3> LgrDBPragma{
{"PRAGMA synchronous=NORMAL;",
"PRAGMA journal_mode=WAL;",
"PRAGMA journal_size_limit=1582080;"}};
inline constexpr std::array<char const*, 1> LgrDBPragma{
{"PRAGMA journal_size_limit=1582080;"}};

inline constexpr std::array<char const*, 5> LgrDBInit{
{"BEGIN TRANSACTION;",
@@ -63,15 +73,14 @@ inline constexpr auto TxDBName{"transaction.db"};

inline constexpr
#if (ULONG_MAX > UINT_MAX) && !defined(NO_SQLITE_MMAP)
std::array<char const*, 6>
std::array<char const*, 4>
TxDBPragma
{
{
#else
std::array<char const*, 5> TxDBPragma {{
std::array<char const*, 3> TxDBPragma {{
#endif
"PRAGMA page_size=4096;", "PRAGMA synchronous=NORMAL;",
"PRAGMA journal_mode=WAL;", "PRAGMA journal_size_limit=1582080;",
"PRAGMA page_size=4096;", "PRAGMA journal_size_limit=1582080;",
"PRAGMA max_page_count=2147483646;",
#if (ULONG_MAX > UINT_MAX) && !defined(NO_SQLITE_MMAP)
"PRAGMA mmap_size=17179869184;"
@@ -115,10 +124,8 @@ inline constexpr std::array<char const*, 8> TxDBInit{
// Temporary database used with an incomplete shard that is being acquired
inline constexpr auto AcquireShardDBName{"acquire.db"};

inline constexpr std::array<char const*, 3> AcquireShardDBPragma{
{"PRAGMA synchronous=NORMAL;",
"PRAGMA journal_mode=WAL;",
"PRAGMA journal_size_limit=1582080;"}};
inline constexpr std::array<char const*, 1> AcquireShardDBPragma{
{"PRAGMA journal_size_limit=1582080;"}};

inline constexpr std::array<char const*, 1> AcquireShardDBInit{
{"CREATE TABLE IF NOT EXISTS Shard ( \
@@ -130,6 +137,7 @@ inline constexpr std::array<char const*, 1> AcquireShardDBInit{
////////////////////////////////////////////////////////////////////////////////

// Pragma for Ledger and Transaction databases with complete shards
// These override the CommonDBPragma values defined above.
inline constexpr std::array<char const*, 2> CompleteShardDBPragma{
{"PRAGMA synchronous=OFF;", "PRAGMA journal_mode=OFF;"}};

@@ -172,6 +180,7 @@ inline constexpr std::array<char const*, 6> WalletDBInit{

static constexpr auto stateDBName{"state.db"};

// These override the CommonDBPragma values defined above.
static constexpr std::array<char const*, 2> DownloaderDBPragma{
{"PRAGMA synchronous=FULL;", "PRAGMA journal_mode=DELETE;"}};

5 changes: 4 additions & 1 deletion src/ripple/app/main/Main.cpp
@@ -541,6 +541,7 @@
return -1;
}

dbSetup.noPragma();
auto txnDB = std::make_unique<DatabaseCon>(
dbSetup, TxDBName, TxDBPragma, TxDBInit);
auto& session = txnDB->getSession();
@@ -555,7 +556,9 @@
session << "PRAGMA temp_store_directory=\"" << tmpPath.string()

Collaborator:
The purpose of the temp_store_directory is to make sure there's enough space to perform a VACUUM, which essentially requires as much free space as the entire database being VACUUMed. The temp_store in memory will likely cause a system to run out of RAM. Imagine a transaction DB with 2TB+. (https://sqlite.org/tempfiles.html)

Collaborator (author):
Good catch! I'll change that.

<< "\";";
session << "VACUUM;";
session << "PRAGMA journal_mode=WAL;";

Collaborator:
The existing behavior is that the journal_mode=OFF during the VACUUM activity introduces risk of corruption. Instead, I think the behavior of VACUUM should reflect the new config options and defaults for the txdb. Namely, use dbSetup.usePragma() (not noPragma()) and set the configs, then add the "PRAGMA temp_store_directory=" line and then execute VACUUM. Basically, treat the vacuum with the same "safety_mode" as normal operation.

Collaborator (author):
Turns out that temp_store_directory is deprecated: https://www.sqlite.org/pragma.html#pragma_temp_store_directory.

It looks like you're right about the journal_mode: https://www.sqlite.org/lang_vacuum.html

The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction.

So I'll update that.

assert(dbSetup.globalPragma);
for (auto const& p : *dbSetup.globalPragma)
session << p;
session << "PRAGMA page_size;", soci::into(pageSize);

std::cout << "VACUUM finished. page_size: " << pageSize
26 changes: 17 additions & 9 deletions src/ripple/app/misc/NetworkOPs.cpp
@@ -2757,16 +2757,24 @@ NetworkOPsImp::getServerInfo(bool human, bool admin, bool counters)
if (std::abs(closeOffset.count()) >= 60)
l[jss::close_time_offset] = closeOffset.count();

auto lCloseTime = lpClosed->info().closeTime;
auto closeTime = app_.timeKeeper().closeTime();
if (lCloseTime <= closeTime)
constexpr std::chrono::seconds HIGH_AGE_THRESHOLD{1000000};
if (m_ledgerMaster.haveValidated())
{
using namespace std::chrono_literals;
auto age = closeTime - lCloseTime;
if (age < 1000000s)
l[jss::age] = Json::UInt(age.count());
else
l[jss::age] = 0;
auto const age = m_ledgerMaster.getValidatedLedgerAge();
l[jss::age] =
Json::UInt(age < HIGH_AGE_THRESHOLD ? age.count() : 0);
}
else
{
auto lCloseTime = lpClosed->info().closeTime;
auto closeTime = app_.timeKeeper().closeTime();
if (lCloseTime <= closeTime)
{
using namespace std::chrono_literals;
auto age = closeTime - lCloseTime;
l[jss::age] =
Json::UInt(age < HIGH_AGE_THRESHOLD ? age.count() : 0);
}
}
}
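
Distilled from the change above, the reported "age" field follows a simple rule: report the validated-ledger age in whole seconds, but report 0 once the age reaches HIGH_AGE_THRESHOLD (1,000,000 seconds). A minimal standalone sketch of that rule (the free function is illustrative, not the PR's API):

```cpp
#include <chrono>
#include <cstdint>

// Clamp rule used when filling the "age" field of server_info:
// ages at or above 1,000,000 seconds are reported as 0.
std::uint32_t
reportedLedgerAge(std::chrono::seconds age)
{
    constexpr std::chrono::seconds highAgeThreshold{1000000};
    return age < highAgeThreshold
        ? static_cast<std::uint32_t>(age.count())
        : 0;
}
```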
