-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implemented NegativeUNL #3380
Implemented NegativeUNL #3380
Conversation
Is there a more technical description available of this, ideally including some mathematics? |
I'm especially worried about the consensus mechanism starting to bleed into the on-ledger representation with this amendment. If servers are unreliable, a UNL registry should remove them immediately, I don't see why it would be so important to vote (some of) them out collectively after 256 ledgers and create spam objects on-ledger for everyone else. Also I don't see why this should be globally coordinated - if nodes just keep their private nUNL, the effects would be nearly the same with no need to do extra work in consensus on which widely trusted validators are globally seen as unreliable. For this reason I'd really like to see the mathematical models and proofs(!) behind this change. If the concept itself is seen as necessary for some reason, please at least make it off-ledger. Right now everything on-ledger is agnostic to the consensus mechanism used, this somewhat "violates" the layer separation between consensus mechanism and actual database/ledger/blockchain. |
Codecov Report
@@ Coverage Diff @@
## develop #3380 +/- ##
===========================================
+ Coverage 70.22% 70.41% +0.19%
===========================================
Files 683 685 +2
Lines 54621 54999 +378
===========================================
+ Hits 38360 38730 +370
- Misses 16261 16269 +8
Continue to review full report at Codecov.
|
@MarkusTeufelberger |
I'd still like my question answered why this has to be coordinated through a (voted + recorded) pseudo transaction instead of decided privately by each server instead (if necessary, only every flag ledger instead of every ledger to slow the reaction time down a bit) - or was this not even considered? The doc links to internal data that's not published (yet?), so it is a bit hard to know why David proposed this feature the way it is designed here. |
Hey @MarkusTeufelberger. First off, apologies that we didn't actually include the spec alongside the commit. It's been added with the latest commit.
Ack, that should have been caught; apologies. The document primarily discusses the rationale for "simplifying" the original proposal by limiting the number of changes that can be done at once. Here's the relevant content:
Now, you wrote:
This is certainly reasonable and arguably should be done by all responsible publishers. But no publisher can react in near real-time and the reality is that sometimes servers go down for more than just a couple of minutes and that's not necessarily an indication of unreliability. It's reasonable for a UNL publisher to have a policy that, e.g. an outage exceeding 24 hours triggers removal. But what to do during those 24 hours? This change aims to target that "window" of time between a validator experiencing an outage and the UNL publisher responding by removing that validtor. We've previously discussed using the ledger for storing such objects, most recently in a conversation involving storing of manifests. You had similar objections back then and I do agree that they are legitimate and certainly merit consideration. With that said, I'm not sure that I agree that this is an inappropriate usage of the ledger or that the consensus is "bleeding into the on-ledger representation." I'll follow up on some of your other questions tomorrow. 🖖 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick first read. Will follow-up with more thorough comments.
Thanks for the additional background! :-) |
@MarkusTeufelberger "why this has to be coordinated through a (voted + recorded) pseudo transaction instead of decided privately by each server instead" |
As long as UNLs mostly overlap, removing nodes wouldn't necessarily reduce that overlap. It can also increase it. Also network safety isn't reduced in any way (recorded+ coordinated, private + coordinated or private + uncoordinated implementation), only network liveness. I already suggested using flag ledgers as synchronization points, there's no need to synchronize with others though. A node could just update its nUNL without consulting others at every 256th ledger (which happens to be coordinated with others anyways, so all nUNLs network wide get updated at the same time - they would just look potentially different, e.g. in case some node has severe routing problems to parts of the network). I get that with an on-ledger implementation they would all be using the same one - but since 80% of nodes would even need to agree that these nodes are unreliable, this would only capture nodes that have (from a network POV) a catastrophic failure, like being offline or extremely overloaded. If a node on my UNL is "reliably unreliable" for me, it should be put on my nUNL - irrespective of if it is unreliable for someone else. Also I don't see why other nodes should be able to force my node to remove a (from my node's PoV) perfectly fine validating node from its UNL. Lastly, this leaks out UNL information from validators (as I understand it, a validator would only try to vote out nodes that it has on its private UNL). A possible attack(?) scenario would be to run a validator and switch it off for ~130 ledgers to see which other validators want to vote it out. That way I can know with certainty how widely trusted my validator is across the network, even if these votes never make it into a validated ledger (in case my validator doesn't yet have "UNL dominance", meaning it is on 80+% of UNLs of other "UNL dominant" nodes). |
I share your concerns. But without on-ledger negative UNL, if nodes disable validators privately, I cannot see if it's safe. It will be like simply lower the quorum from 80% to a lower percentage (with a delay due to flag ledgers). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not finished with the review, (I really just looked at the code in NegativeUNLVote
). Here's a patch with all the suggested changes for a reference: seelabs@d231943
If you do decide to accept some of the comments I thought it might be easier to grab them from that patch.
I also have some larger issues I need to think about, and I obviously need to look at the other code as well, but I wanted to leave the comments I had now instead of waiting to finish the whole thing.
I find this feature fascinating though am no expert on Byzantine Consensus so can't comment too much on the theory behind the implementation. A couple things to note though
|
How about the "global Synchronized Healthcheck Interval Transgressor list"? :-P |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff @pwang200 I left some comments and I have one major comment that I would like to ask you to consider. The implementation as it stands performs bulk of the calculation at a single synchronization point - which is at a flag ledger. This increases the node's workload at every 256 ledger. Depending on the amount of work, we can potentially see a jitter in the network due to a spike in work load. Although I doubt its "that much" work, we should generally try to avoid doing lots of work at any single one point. Instead, its good practice to distribute the work over a period of time "naturally". In this case, I would ask to do the actual accumulation of scores and other housekeeping over every ledger and only at flag ledger we would do a determination. I would imagine going forward, as the software evolves only more work would be done at flag ledger and its always good to not overload the amount of work needed at any one concentrated point.
Ah! I think going forward, we should solicit more community feedback during the design/spec writing phase! Would you be interested to review those design artifacts? I am thinking perhaps we can open issues ahead of time to describe what we are doing - or maybe an open repo for improvement proposals. Any thought on this? |
@movitto -- Yes, so far "it just seems like a good number". I acknowledge the need for analysis of the network. It has not been a priority yet. A targeted scenario of the negative UNL, and an important one, is like several validators go offline during a long weekend. In these kinds of cases, 50% seems a good number. Of course, 60% may seem good too. I will collect data and do some analysis. Though I am not sure if the extra complexity of voting on the water marks are needed. -- I feel the word "blacklist" might be misleading. A modification of the original UNL changes two things: who to trust, and who contribute to quorum number calculation and count towards full validation of ledgers. Negative UNL does not change who to trust. So I feel "Blacklist" is kind of too strong. The "global Synchronized Health check Interval Transgressor list" might be more accurate. :-) As @MarkusTeufelberger said. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Round 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick first pass to leave some comments. Good stuff @pwang200 .
src/ripple/app/ledger/Ledger.cpp
Outdated
@@ -282,6 +283,11 @@ Ledger::Ledger(Ledger const& prevLedger, NetClock::time_point closeTime) | |||
info_.closeTime = | |||
prevLedger.info_.closeTime + info_.closeTimeResolution; | |||
} | |||
|
|||
if (prevLedger.rules().enabled(featureNegativeUNL)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this is the best place to call updateNegativeUNL()
, with the update happening behind the scenes. Updating the negative unl seems similar to updating the skiplist. updateSkipList()
is called here:
https://github.com/ripple/rippled/blob/develop/src/ripple/app/ledger/impl/BuildLedger.cpp#L60
I think updateNegativeUNL()
should be called there too. When I use this constructor, I expect the constructed object to have the exact same state map as the previous ledger, with no differences. After that, I modify the state map based on new transations, fee voting, updating the skip list, etc. These modifications happen after the constructor. I think updateNegativeUNL()
is a similar modification that should happen after the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"When I use this constructor, I expect the constructed object to have the exact same state map as the previous ledger, with no differences." Is this agreed? @nbougalis @seelabs If so, we must document that.
updateNegativeUNL()
is private. I'd like to keep it private if possible. It must be called after the state map is copied and before applying the transactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should call it here: https://github.com/ripple/rippled/blob/97712107b71a8e2089d2e3fcef9ebf5362951110/src/ripple/app/ledger/impl/BuildLedger.cpp#L49
updating the negative unl is part of building the ledger, so I would call it in that function, before applying the transactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For correctness, the updateNegativeUNL()
function must be called after the state map is copied and before applying the transactions. So the place you suggested is correct, and the current code is also correct as well. In the current code, however, the function is private. I think that is a benefit, because it reduces the chance of misuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pwang200 @cjcobb23 I understand both points of view. This constructor is for "creating a ledger that follows this one". So updating the negative unl here makes sense. OTOH, we don't do that for the skip list, and they are both similar concepts, so we should treat them the same and update them in the same place. (Side note, I don't necessarily agree that after calling this constructor the state maps need to be the same; this isn't a copy constructor).
My vote is with C.J. The analogy with the skip list is good; we should treat them both the same. The easiest change is to move the neg unl update outside of this constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to move it into buildLedgerImpl
, but I agree that there's a bit of smell here. We should revisit this and see if we can cleanup.
* This was always a stub and enabled no functional changes
Corrects the public_key parameter name in the comment. See XRPLF/xrpl-dev-portal#854 for context.
*The tecUNFUNDED code is actively used when attempting to create payment channels; the messages incorrectly list it as deprecated. Meanwhile, the tecUNFUNDED_ADD code actually is an unused legacy code, dating back to when there was a WalletAdd transactor. *Engine result messages are not part of the binary format and are documented as subject to change without notice, so this should not require an amendment nor a new API version. *Mark terLAST, terFUNDS_SPENT deprecated
* Fixes XRPLF#3269 * Work on a version 2 of the XRP Network API has begun. The new API returns the code `notSynced` in place of `noClosed`, `noCurrent`, and `noNetwork`. And `invalidParams` is returned in place of `lgrIdxInvalid`. * The version 2 API can be specified by adding "api_version" = 2 to your json request. The default version remains 1 (if unspecified), except for the command line interface which tracks version 2. * It can be re-enabled by setting ApiMaximumSupportedVersion to 2.
This commit, if merged, adds support to allow multiple indepedent nodes to produce a binary identical shard for a given range of ledgers. The advantage is that servers can use content-addressable storage, and can more efficiently retrieve shards by downloading from multiple peers at once and then verifying the integrity of a shard by cross-checking its checksum with the checksum other servers report.
The Negative UNL is a protocol change to improve the network’s reliability in the face of failed validators without sacrificing network safety. The Negative UNL protocol uses the existing consensus algorithm for the network to agree on a list of validators that are believed to be unreliable at the moment and lists the validators on the ledgers. If a validator V is on that list, it would still participate in consensus in precisely the same way. But nodes with V in their UNL adjust the quorum and V’s validation message is not counted when verifying if a ledger is fully validated. When V becomes reliable, nodes use the consensus algorithm to vote it off the list.
This feature adds a new pseudo-transaction type and a Negative UNL component to the ledger data structure. Any tools or systems that rely on the format of this data will have to be updated.
This feature will need an amendment to activate.