-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve online_delete configuration and DB tuning: #3429
Conversation
evaluations for validated ledger age.
* Document delete_batch, back_off_milliseconds, age_threshold_seconds. * Convert those time values to chrono types. * Fix bug that ignored age_threshold_seconds. * Add a "recovery buffer" to the config that gives the node a chance to recover before aborting online delete. * Add begin/end log messages around the SQL queries. * Add a new configuration section: [sqlite] to allow tuning the sqlite database operations. Ignored on full/large history servers. * Update documentation of [node_db] and [sqlite] in the rippled-example.cfg file. * Resolves XRPLF#3321
online_delete=2000 | ||
type=NuDB | ||
path=/var/lib/rippled/db/nudb | ||
online_delete=512 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If such a small value is the default, please consider also adding a shard_db section with a limit of a few dozen/hundred GiB by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkusTeufelberger Thanks for that feedback. Those default settings are definitely not yet set in stone. My understanding is that the suggestion for the smaller default was to make it easier for more people to participate without requiring huge resources. (The original ticket didn't have a lot of detail.) Of course, there's a balance to be found, because we already have pretty high hardware requirements, and we do want as many nodes as possible to contribute to history storage as a way to help the network.
I'm not sure if shards are ready to be set by default. I'll defer to @miguelportilla on that question.
# delete old ledgers while maintaining at least 2000. Do not require an | ||
# external administrative command to initiate deletion. | ||
# found at https://xrpl.org/capacity-planning.html#node-db-type | ||
# type=NuDB is recommended for non-validators with fast SSDs. Validators or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use a none
database backend on a validator. They don't need to concern themselves with historic data at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The linked capacity planning page recommends RocksDB, but I can definitely see your point about using none
(or memory
). My biggest concern with using those options on the individual node level, is that it means it will take much longer to start and restart, because the node will have to download the entire ledger every time. On the global level, I'm thinking about a failsafe / worst case situation. For example, if every validator shuts down simultaneously 😱, they will need to have some data on disk so they can bootstrap restarting the network.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anecdotally, my full history node takes about half an hour to start with --load
vs. a few minutes at most to get synced with --net
.
Bootstrapping can be done safely with --ledgerfile
(https://xrpl.org/commandline-usage.html#initial-ledger-options) and that's probably better anyways a better way forward than trusting that data was stored to disk correctly in such a catastrophic scenario. The file itself could be hashed and communicated out-of-band in a lot of different ways compared to a node database that's not even a full shard in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarkusTeufelberger: re: slow start times with --load
: are you using SSDs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course. Over a dozen TB of 'em.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that --net
should become the default now, unless a different command line option (for which I don't have a good name) is specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nbougalis I think making --net
the default is out of scope for this PR. Do you agree?
(Include these items in the main commit when squashing.) * Update rippled URLs in documentation. * Full history and fetch depth use max() value instead of an arbitrary value of 1 billion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments. Still thinking about this.
src/ripple/core/DatabaseCon.h
Outdated
}; | ||
|
||
template <std::size_t N, std::size_t M> | ||
DatabaseCon( | ||
Setup const& setup, | ||
std::string const& DBName, | ||
bool useCommonPragma, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if I like this; I'm generally allergic to functions that accept a bool
and use it to do one or the other thing. Why can't we set the right values in setup
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be pretty simple to set the values in the Setup
object, now that you mention it. I might need to define a member function or two, just to make the internals easier, and to make it easier switch between the options for Setup
objects that are used to create multiple DatabaseCon
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS. I went down this path because I initially thought the other ctor override would need to use the "common pragma" but without a Setup
object being passed in. I later realized none of them do, but didn't backtrack to see if the bool
was still a good idea. So thanks for pointing this out!
src/ripple/core/DatabaseCon.h
Outdated
static std::unique_ptr<std::vector<std::string> const> CommonPragma; | ||
/// Shortcut used by the database connections that ignore the common | ||
/// pragma strings | ||
static const std::vector<std::string> NoCommonPragma; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
μNit: weird case for the name(s) here. Also confused as to why two separate things are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of other static class variables have capitalized names, but I don't necessarily have to keep doing it.
Edit: I looked again, and didn't find any non-const examples. So 🤷♂️ I'll make it more sensible.
src/ripple/app/main/DBInit.h
Outdated
// full history nodes. This is because such a large amount of data will | ||
// be more difficult to recover if a rare failure occurs, which are | ||
// more likely with some of the other available tuning settings. | ||
inline constexpr std::uint32_t SQLITE_TUNING_CUTOFF = 100'000'000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the rationale behind this, but I'm not sure if it's a good idea. I think that it's reasonable to warn in the configuration file and even on every startup about these options, but making the options entirely unavailable is not. Ironically, the existing logic that enforces this will prevent choosing options that are safer than the default even if the server operator really wants them.
I'll give this some more thought, but right now, it seems like a bad idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change it to warning log message(s) pretty easily, and update the config file.
* changed the sqlite ledger_history cutoff to log a warning instead of ignoring sqlite settings. updated documentation. * integrated the commonpragma back in to the setup struct and restored the interface of the databasecon ctors.
* involved moving some test logging subclasses around to be more easily accessible, and some minor refactoring of Env.
@@ -555,7 +556,9 @@ run(int argc, char** argv) | |||
session << "PRAGMA temp_store_directory=\"" << tmpPath.string() | |||
<< "\";"; | |||
session << "VACUUM;"; | |||
session << "PRAGMA journal_mode=WAL;"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing behavior is that the journal_mode=OFF during the VACUUM activity introduces risk of corruption. Instead, I think the behavior of VACUUM should reflect the new config options and defaults for the txdb. Namely, use dbSetup.usePragma() (not noPragma()) and set the configs, then add the "PRAGMA temp_store_directory=" line and then execute VACUUM. Basically, treat the vacuum with the same "safety_mode" as normal operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out that temp_store_directory
is deprecated: https://www.sqlite.org/pragma.html#pragma_temp_store_directory.
It looks like you're right about the journal_mode
: https://www.sqlite.org/lang_vacuum.html
The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction.
So I'll update that.
@@ -348,23 +359,14 @@ SHAMapStoreImp::run() | |||
|
|||
// will delete up to (not including) lastRotated | |||
if (validatedSeq >= lastRotated + deleteInterval_ && | |||
canDelete_ >= lastRotated - 1) | |||
canDelete_ >= lastRotated - 1 && !health()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The health() check is also performed a few lines later, during clearPrior(). Why introduce it here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Primarily, it's to prevent writing the log message if no actual work is going to get done. Compare to what it was doing before: log the "rotating" message, then do a health()
check, then clearPrior()
.
If any change should be made, I'd think it would be to take the first health()
check out of clearPrior()
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, removing the first health() from clearPrior() makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -378,14 +380,13 @@ SHAMapStoreImp::run() | |||
default:; | |||
} | |||
|
|||
JLOG(journal_.trace()) << "copying ledger " << validatedSeq; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May as well make this debug(), so we can see how long it takes to copy ledgers without setting the level to trace(). This is a very low-frequency message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense.
|
||
JLOG(journal_.trace()) << "freshening caches"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may as well be debug() for the same reason as above.
assert(false); | ||
return false; | ||
} | ||
JLOG(journal_.trace()) << "End: Look up lowest value of: " << minQuery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have this check? Does it catch a non-existent table or some other error that makes checking if m is unset insufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly don't remember for sure why I added the if (auto db = database.checkoutDb())
check... I think it was so that it would go out of scope and release the lock sooner, and then I added the else
because it "wouldn't hurt". It would make more sense to rewrite it as a simple local block.
else | ||
{ | ||
assert(false); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No caller of clearSql() checks the return value. This function can return void instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Will fix.
* Make `safety_level` and other tuning options mutually exclusive * Clean up VACUUM functionality and tuning to use settings. * Fix bugs in the config setup where it still ignored tuning settings if full history. Add tests to catch that mistake. * Fix some bugs in the unit tests, and make them more robustly check error messages.
* Change some log message levels. * clearSql no longer returns a bool that every caller ignores. * Restructure a few unnecessary validity checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments, but overall this looks good.
@@ -552,9 +549,6 @@ run(int argc, char** argv) | |||
std::cout << "VACUUM beginning. page_size: " << pageSize | |||
<< std::endl; | |||
|
|||
session << "PRAGMA journal_mode=OFF;"; | |||
session << "PRAGMA temp_store_directory=\"" << tmpPath.string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of the temp_store_directory is to make sure there's enough space to perform a VACUUM, which essentially requires as much as the entire database being VACUUMed. The temp_store in memory will likely cause a system to run out of RAM. Imagine transaction DB with 2TB+. (https://sqlite.org/tempfiles.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I'll change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked over the documentation changes. Definitely a big improvement over the previous version! I left a few optional suggestions to tune up the language in a few places, but 👍 with or without those changes.
cfg/rippled-example.cfg
Outdated
@@ -892,6 +939,12 @@ | |||
# [import_db] Settings for performing a one-time import (optional) | |||
# [database_path] Path to the book-keeping databases. | |||
# | |||
# There are 4 or 5 bookkeeping SQLite databases that the server creates and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it "4 or 5"? Is one of them only added with certain configurations? Without more context, the phrase "4 or 5" feels uncertain and vague. Maybe "4 to 5" would carry the right implication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think state.db
is created if you're using online_delete
, so that would be the 5th one. I'll make the change.
cfg/rippled-example.cfg
Outdated
# maintains. If you omit this configuration setting, it will default to | ||
# creating a directory called "db" located in the same place as your | ||
# rippled.cfg file. Partial pathnames will be considered relative to | ||
# the location of the rippled executable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: instead of "it will default to creating a directory", just say, "the server creates a directory." Similarly, instead of "will be considered relative" just say "are relative".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed and reworded to be less passive tone.
cfg/rippled-example.cfg
Outdated
# external administrative command to initiate deletion. | ||
# found at https://xrpl.org/capacity-planning.html#node-db-type | ||
# type=NuDB is recommended for non-validators with fast SSDs. Validators or | ||
# slow / spinning disks should use RocksDB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a note like this:
Caution: Spinning disks are not recommended. They do not perform well enough to consistently remain synced to the network.
src/ripple/app/ledger/Ledger.cpp
Outdated
@@ -228,14 +228,14 @@ Ledger::Ledger( | |||
!txMap_->fetchRoot(SHAMapHash{info_.txHash}, nullptr)) | |||
{ | |||
loaded = false; | |||
JLOG(j.warn()) << "Don't have TX root for ledger"; | |||
JLOG(j.warn()) << "Don't have TX root for ledger" << info_.seq; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: as long as we're changing these messages, we could make them clearer by removing the abbreviations "TX" (here) and "AS" (below, line 238). Instead, just say "transaction" and "state data".
* Rename retry_buffer_seconds to retry_wait_seconds. * Improve documentation and examples in example.cfg file. * Simplify the DatabaseCon::Setup pragma handling. * Other misc cleanups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I did leave one nit that I think we should resolve (the Base
parameter on the DatabaseCon
constructor). However since that's so minor I'll give a thumbs up now.
// Only the most trivial databases will fit in memory on typical | ||
// (recommended) software. Force temp files to be written to disk | ||
// regardless of the config settings. | ||
session << boost::format(CommonDBPragmaTemp) % "file"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original behavior needs to come back--namely, that the actual temp directory be specified. Otherwise, SQLIte will by default use the directory in which the database is.
Alternately, I think that getting rid of this feature altogether makes a lot of sense. If somebody really wants to VACUUM a database, it's easier to use the command line tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
👍
New changes 👍 from me. |
* Document delete_batch, back_off_milliseconds, age_threshold_seconds. * Convert those time values to chrono types. * Fix bug that ignored age_threshold_seconds. * Add a "recovery buffer" to the config that gives the node a chance to recover before aborting online delete. * Add begin/end log messages around the SQL queries. * Add a new configuration section: [sqlite] to allow tuning the sqlite database operations. Ignored on full/large history servers. * Update documentation of [node_db] and [sqlite] in the rippled-example.cfg file. Resolves #3321
recover before aborting online delete.
database operations. Ignored on full/large history servers.
rippled-example.cfg file.
I added the "recovery buffer" function that wasn't described in the original issue to scratch my own itch, because my local machine was losing sync for a fraction of a second while the ledger was being copied over. This resolved that issue. I defaulted it to not run, so it doesn't cause unexpected behavior.
Note this includes two commits, one from @mtrippled . Both need to be reviewed.
This change is