Migrate thread safety to RCLConsensus (RIPD-1389) #2106
Conversation
Some points I may consider addressing if reviewers think worthwhile:
On the ring buffer, I saw this this morning: https://github.com/martinmoene/ring-span-lite I haven't checked it out, and I don't know how good it is. Just thought I would mention it in case it is helpful.
Codecov Report

@@            Coverage Diff             @@
##           develop    #2106    +/-   ##
===========================================
+ Coverage    69.48%   69.55%   +0.07%
===========================================
  Files          685      689       +4
  Lines        50520    50565     +45
===========================================
+ Hits         35105    35173     +68
+ Misses       15415    15392     -23

Continue to review full report at Codecov.
Rebased on 0.70.0-b6
Tagging @wilsonianb for 16589d4
src/ripple/app/misc/ValidatorKeys.h
Outdated
//==============================================================================

#ifndef RIPPLE_APP_MISC_VALIDATOR_IDENTITY_H_INCLUDED
#define RIPPLE_APP_MISC_VALIDATOR_IDENTITY_H_INCLUDED
Why IDENTITY instead of KEYS?
Good catch, I had renamed the file a few times.
src/ripple/app/misc/ValidatorKeys.h
Outdated
class Config;

/** Validator keys and manifest as set in configuration file. Values will be
    emtpy if not configured as a validator or not configured with a manifest.
emtpy -> empty
{
    configInvalid_ = true;
    JLOG(j.fatal())
        << "Invalid entry in validator token configuration.";
We could modify this to resemble the message for an invalid seed below
"Invalid seed specified in [" SECTION_VALIDATOR_TOKEN "]"
// No config -> no key but valid
Config c;
ValidatorKeys k{c, j};
BEAST_EXPECT(k.publicKey.size() == 0);
You could also check secretKey to be extra safe. Same with a few other tests below.
Actually, I missed this when adding the [FOLD] commit, but secretKey has a static size. How would you recommend I test it?
Ah, I forgot about that. I'm not sure how you could test that.
src/test/app/ValidatorKeys_test.cpp
Outdated
}

{
    // Token takes precedence over seed
We actually don't allow the config to have both a token and a seed, but the check happens here:
https://github.com/bachase/rippled/blob/consensus-lock3/src/ripple/core/impl/Config.cpp#L361-L364
Do you think that should be moved into ValidatorKeys?
Good question, I clearly missed that since I added this test. From a code standpoint, it would be nice to move that check over, but I like the idea of validating as early as possible, which would be where the config currently does the work. What would you prefer?
I also like keeping that check with the other config validations in Config::loadFromString, so I'd vote for either leaving as is or checking again in ValidatorKeys (and marking configInvalid_).
@@ -40,6 +40,7 @@ namespace ripple {
class InboundTransactions;
class LocalTxs;
class LedgerMaster;
class ValidatorKeys;

/** Manges the generic consensus algorithm for use by the RCL.
Manges --> Manages
Rebased on 0.70.0, squashed a few commits, and addressed @wilsonianb's comments.
Mostly minor things
// Called when consensus operating mode changes
void onModeChange(ConsensuMode before, ConsensusMode after);
ConsensuMode is missing an s
// Encapsulates the result of consensus.
template <class Adaptor>
struct ConsensusResult
So... ConsensusResult, ConsensusMode, ConsensusCloseTimes... it's starting to be a bit much. Should we consider a separate consensus namespace for all this?
I'm all for it! There were some downvotes last time the topic came up. I could put them inside the Consensus class, but not all depend on the Adaptor template type, so it's convenient to specify them independently.
@@ -449,7 +466,7 @@ RCLConsensus::doAccept(
    JLOG(j_.info()) << "CNF buildLCL " << newLCLHash;

    // See if we can accept a ledger as fully-validated
-   ledgerMaster_.consensusBuilt(sharedLCL.ledger_, getJson(true));
+   ledgerMaster_.consensusBuilt(sharedLCL.ledger_, consensusJson);
I think we can do std::move(consensusJson) here for a potential micro-optimization.
// Since Consensus does not provide intrinsic thread-safety, this mutex
// guards all calls to consensus_. adaptor_ uses atomics internally
// to allow concurrent access of its data members that have getters.
mutable std::recursive_mutex mutex_;
Can we easily avoid std::recursive_mutex in favor of a non-recursive one? Recursive locking = code smell.
I'm fairly confident the current version does not require the recursive_mutex. In fact, that was the primary motivation for this refactor. But @scottschurr has done a good job convincing me that unless the code structurally prevents recursive calls while under lock, future refactors can break this assumption in non-obvious ways.
In particular, we may call into support code in NetworkOps or InboundLedgers while the mutex is under lock. Currently those won't call back into RCLConsensus, but it's tough to prevent that from changing in a clear way.
Since my name came up, I'll mention that it should be possible to add instrumentation that would fire an assert or LogicError if a mutex were about to be used recursively when it is not intended for such use. I haven't thought hard about the particular scenario in question, but here's an example where we know that recursive calls are possible, but we didn't want them to be recursive at a particular place: https://github.com/ripple/rippled/blob/develop/src/ripple/core/impl/DeadlineTimer.cpp#L36-L41
Another possibility would be that one or more of our (non-recursive) mutex library implementations may shriek in an appropriate fashion (:scream:) if called recursively. @HowardHinnant would be a good resource to ask about that. I know Howard has strong opinions about recursive_mutex.
It is generally best to avoid recursive_mutex if possible. However, I don't lose sleep over the use of recursive_mutex (I put it in the standard ;-) ). I can whip up a shrieking mutex if we decide that's what we want. :-)
I'm open to using a shrieking mutex 😱 if that is the preference.
@@ -707,7 +700,7 @@ void NetworkOPsImp::processHeartbeatTimer ()
In the above code-path we can call setMode(omCONNECTED) back to back: if we're in omDISCONNECTED (line 687) we set omCONNECTED, then in 698, if we're omCONNECTED, we again set omCONNECTED.
This may be harmless (except for the extra work), but it could cause some "flapping" of the network mode (disconnected -> connected -> syncing -> connected)
I agree this is confusing but would prefer modifying it as a separate change, as I'm less sure of the consequences.
On first pass, I'm not seeing how flapping would happen; I would think both calls to setMode(omCONNECTED) would generally end up setting the same mode, unless we are close to the 1 minute ledger age boundary.
src/ripple/consensus/Consensus.h
Outdated
static std::string
to_string(Mode m);
//Revoke our outstanding proposal, if any, and cease proposing
Micronit: leading space missing.
We enter the round proposing or observing. If we detect we are working
on the wrong prior ledger, we go to wrongLedger and attempt to acquire
the right one. Once we acquire the right one, we go to the switchedLedger
mode. If we again detect the wrong ledger before this round ends, we go
This reads weird:
Once we acquire the right one [...] If we again detect the wrong ledger [...] we go back [...] until we acquire the right one.
If we acquired the right one, how can we have the wrong one?
If we are really really slow, it's possible we receive new validations that change our choice of the right ledger. So:
1. Detect wrong ledger Id:12 and attempt to acquire it.
2. Time passes.
3. We acquire ledger Id:12.
4. Receive new validations for some other ledger Id:13 and attempt to acquire it.

I admit this is an outlandish case, but it is a possible sequence of events currently.
src/ripple/overlay/impl/PeerImp.cpp
Outdated
@@ -1094,7 +1094,7 @@ PeerImp::onMessage (std::shared_ptr <protocol::TMTransaction> const& m)
    flags |= SF_TRUSTED;
}

-   if (! app_.getOPs().getValidationPublicKey().size())
+   if (! app_.getValidationPublicKey().size())
I'm wondering if having bool Application::validating() const makes sense.
Checking the size of the public key is a bit like checking whether your car radio is working by connecting a multimeter to the antenna input, instead of turning the radio on and seeing if sound comes out of the speakers.
Agreed, I'll try to make that explicit.
👍 ValidatorKeys LGTM
Rebased on 0.70.1
Force-pushed 4f3961e to 556d6f3
Rebased on 0.80.0-b1
👍
@JoelKatz @wilsonianb can I get a second look at the latest commit 1f5917b? Thanks!
If there isn't one already, could you add an explicit test that consensus is reached with 4 of 5 in agreement?
// additional timerEntry call for the updated peer 0 and
// peer 1 positions to arrive. Once they do, now peers 2-5
// see complete agreement and declare consensus
if (p.id > 1)
    BEAST_EXPECT(
        p.prevRoundTime() > sim.peers[0].prevRoundTime());
}
else // peer 0 is not participating
Update this comment and the one on line 217 to include peer 1.
        p.prevRoundTime() == sim.peers[0].prevRoundTime());
}
BEAST_EXPECT(p.prevProposers() == sim.peers.size() - 1);
BEAST_EXPECT(p.prevRoundTime() == sim.peers[0].prevRoundTime());
When I run this test after reverting the agreement comparison from >= to >, the only check that fails is the prevRoundTime. Is that expected (is that the only thing here indicating that consensus was reached)?
Yes, that is the only change for now. The nodes still reach consensus, it just takes peers 1-4 longer.
As part of building out the framework, there is now a better model for writing an explicit checker that would verify the number of proposals each node sent. I'll target this spot for an upgrade when it gets released.
👍 Change to consensus check LGTM
Moves thread safety from generic Consensus to RCLConsensus and switches generic Consensus to an adaptor design.
In 0.80.0-b2
This PR includes changes that move consensus thread-safety logic from the generic implementation in Consensus into the RCL-adapted version RCLConsensus. This has the advantage of placing the concurrency responsibilities in one spot in the code and additionally improves the performance of the consensus simulation framework by eliminating unneeded concurrency overhead. The two primary changes are:

1. Switch from a CRTP design to an Adaptor design for generic Consensus. Previously, CRTP was used to allow the specializing class to call into the generic Consensus code while servicing some of its required callback members (e.g. onClose could call some other member of Consensus). The Adaptor design makes the communication from Consensus to the specializations in Adaptor more explicit by passing any dependencies as arguments to the callback functions.

2. Move the thread-safety logic from Consensus to RCLConsensus. I now use atomics to hold status-related information at the RCLConsensus level to avoid acquiring a mutex just to read a value. There is still room for improving the design, particularly when dispatching the onAccept call to the job queue, but I think this is still better. Although not necessary, I still use a recursive_mutex instead of a regular mutex in RCLConsensus to protect any future refactoring from breaking things.

This is the last planned substantial refactor of the consensus code before focusing fully on the simulation framework.
Doc preview: http://bachase.github.io/ripd1389/