Fix amendment majority flapping: use a more stable threshold for the number of votes required; when missing STValidation, use the last vote seen #4410
Conversation
I think that there's a subtle problem with this: namely, that the "effective vote" that this creates will cause convergence issues.
Consider, inter alia, that this vote isn't saved across restarts. Imagine that some validator X goes offline. All other servers store its previously recorded vote. Now, half of those servers restart. They forget about X's vote and on the next voting round the half that didn't restart think that X is voting YES and the other half think that X is voting NO. There's a similar problem if a validator is offline during the vote: not only does it not get to vote, but it doesn't know what the other validators voted, so it can't update its own table; now, on the next round, it might rely on a stale vote.
This is the problem I was describing in the other PR: there's no reasonable way to save "voting state" unless you do it on the ledger and you can't do it on the ledger because whose votes are you going to save? You can't save all the votes, because that opens up an attack vector, and you can't save those that are in the UNL, because there's no "the UNL" (i.e. my UNL can be different than yours).
As it is now, if a single validator fails to vote in a single round, you can flap. That's just one missed validation out of 256 ledgers; put differently, a validator with "uptime" of 99.6% (255/256) can still cause flapping. The hysteresis added by the other amendment means that you can afford to miss one, and as long as you're back in time for the next, all is well.
IMO, the flapping fix is only meant to address short-term issues, not long-term outages.
Just my $0.02.
Thanks for looking @nbougalis. I'm not following 100% what your concern is. I'll describe my understanding of the concern, then you can refine the description. I think your concern is twofold. There are two consequences of storing the vote in memory that I'm seeing raised:
If I got those concerns right, then let me see if I can address them.
Additionally, there are some advantages of the approach taken here that should not be overlooked. @mDuo13 noted:
That is precisely what is provided by the approach taken. The threshold is determined by the number of entries in the local node's UNL. That threshold only changes when the number of entries in the local UNL changes. That is an improvement on the current situation, where the threshold changes whenever any validator's `STValidation` goes missing.

In terms of the frequency of flapping, with this proposal flapping might occur when the validator is restarted. This occurs much less frequently than validators losing sync with the network. So flapping should be considerably reduced from where it is now as well as from the approach taken by #4364.

Additionally, the approach taken here adds no fields to the protocol. The approach in #4364 adds new fields.

I look forward to your further thoughts and corrections.
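For illustration, here is a minimal C++ sketch of the threshold rule just described, assuming the 80% majority fraction mentioned later in this thread. The function name and rounding are illustrative, not rippled's actual implementation:

```cpp
#include <algorithm>
#include <cstddef>

// Sketch only: with this PR, the majority threshold is derived from the
// size of the local UNL, so it moves only when the UNL itself changes.
std::size_t
majorityThreshold(std::size_t trustedValidatorCount)
{
    // ceil(80% of the trusted validator count), with a floor of one vote.
    return std::max<std::size_t>(1, (trustedValidatorCount * 4 + 4) / 5);
}
```

Before this PR, the denominator was the number of STValidations actually received before the flag ledger, so a single offline validator could lower the threshold and trigger flapping.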
I think I didn’t communicate the concern well enough. Assume a network with, say, 10 validators. There’s some amendment A which has 100% support. Now, three servers go offline (maybe because they restarted or they simply had connectivity issues); the important thing is that they missed a round of voting. Those servers have no way of knowing if the votes they still have cached are valid. Perhaps one of the validators on the UNL shifted their vote, for example. And you can’t assume that will be fixed in the next round: the validator(s) that shifted their vote may now be offline. The net result is that servers (not just validators!) doing the tally may actually be tallying different sets of ballots!

This isn’t necessarily a disaster, mind you; servers will just build different proposed ledgers, and validators will make different proposals, which will result in a desync, and then the normal resync will happen. And in many cases that is what will happen, although I can think of a few corner cases that deserve more careful consideration. For example, a validator desyncing on the voting ledger will issue a partial validation in the next round, and the implications of that must be thought through.

“But” you might say “it’s possible for a server to receive a vote too late to consider it, and how is that different?” It’s true, the scenario is similar but not identical, and the important thing to keep in mind is that in that case the server’s behavior is deterministic.

The other PR implements hysteresis by storing minimal state on the ledger (specifically, by storing which amendments would have lost majority in the past round but were granted a one round reprieve). Your solution of caching the votes can achieve the same thing (if you cache them only for one “election”), but it suffers from the problem described above: the caching isn’t consistent across servers.

My position continues to be that votes shouldn’t be sticky and that the intent wasn’t to prevent flapping for long-term outages, but only for short-term ones, which (in my mind, at least) last less than 255 ledgers. If the consensus is that the XRPL should have sticky votes, then we need a mechanism to allow a validator to make their votes sticky on ledger, with all the complications that introduces.
Thanks for the clarification @nbougalis. I feel like there's something that is obvious to you that I have not yet caught up on. So please bear with me.
Right. But if they don't cache the votes, they use a default. Currently that default is to act as though the votes that they didn't receive are missing. That reduces their voting threshold to zero. So they vote strictly according to their local amendment votes. We have observed empirically that validators change their votes very seldom, much more slowly than every flag ledger. If we were to go ahead with the caching performed by this pull request, there are two possible outcomes:
Right. But a validator that went offline may currently be tallying different sets of votes from a validator that did not go offline. This is unavoidable. Some validators are simply missing information. The code in this pull request does not make the situation any worse. What this pull request does do is, if information goes missing, make the validator's guess more likely to be correct. Since validators seldom change their amendment votes, assuming that their vote did not change is the best we can do.
That description is at odds with my understanding of the code in this pull request. There are no dice in this pull request. Behavior of the validator votes remains as deterministic as possible given the vagaries of communication and reboots. I will agree that the deterministic behavior is local to each validator. But I don't think there's anything new there. Each validator has always voted using the information that it has locally.
Maybe we're at the heart of the matter now. Yes, in this pull request the caching is not consistent across servers. But I'm not convinced that matters. The arrival of amendment votes is not guaranteed to be consistent across servers at all times. Heck, even the UNL is local to each separate server. We try to encourage the UNL to be the same, but it is not guaranteed to be the same. Even two servers that share a remote UNL file may be different for some duration when that file changes and the servers update at different times. My understanding is that each validator votes independently of other validators based on the best information that it has. Those votes are then shared. Sometimes validators disagree because they get different information. And the disagreement self-heals when communication is reestablished. That's still true with this pull request. But I also don't think the situation is worse with this pull request. In fact, I think the situation is improved because the local information is more likely to represent the intention of the missing validator(s) than before the local cache was added.

Here's a different way to think about it. As you say, “it’s possible for a server to receive a vote too late to consider it." But it's easy for that lateness to be non-uniform. The vote will arrive late for some servers but not for others. The consensus algorithm is designed to deal with non-uniformity of information. What this pull request does is improve the information each individual server has (locally) so each is more likely to vote correctly if some information goes temporarily missing or misses a deadline.
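To make the cache under discussion concrete, here is a rough sketch of the per-validator record this PR keeps in memory. The type and field names are hypothetical stand-ins for rippled's actual PublicKey and uint256 types:

```cpp
#include <chrono>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for rippled's PublicKey and uint256 types.
using ValidatorKey = std::string;
using AmendmentID = std::string;

// The last amendment vote seen from one trusted validator, together
// with the time it arrived (used for the staleness check).
struct RecordedVote
{
    std::set<AmendmentID> upVotes;  // amendments this validator voted for
    std::chrono::steady_clock::time_point received;
};

// One entry per validator on the local UNL. When no STValidation
// arrives from a validator before the flag ledger, the tally falls
// back to this record instead of treating the vote as missing.
std::map<ValidatorKey, RecordedVote> lastVoteSeen;
```

Crucially, this state is kept in memory only; as discussed above, it is lost on restart, which is the crux of the disagreement-window concern.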
I'm happy to make the change so the vote is only cached for one flag ledger. I'm not convinced that's necessarily an improvement, but I bow to your greater understanding of the ledger. Other than that, I'm also okay with closing this pull request. Just because I don't understand your objection(s) doesn't mean that I'm right. Further thoughts? Thanks for taking the time.
I think we're forgetting that an amendment officially gaining or losing majority is still governed by the consensus process. The pseudo-transaction (PT) that records the change in status must still be included in a ledger validated by a quorum of validators. Let's compare the before-and-after of some of the scenarios Nik laid out, with respect to this amendment:
Even when the caches are incorrect, at worst we'll get a discordant set of PTs, and it will be resolved according to consensus. As a bonus, it will tend to resolve to "no change" (i.e. no flapping), unless every validator restarts, in which case we're back to the situation we have today where they all assume the missing vote is a NO.
@thejohnfreeman @mDuo13 could you have a look at this? If you don't think this is worth merging at this time, then go ahead and comment saying so. We can close this PR to reduce noise as we try to keep PRs from getting too old. The same solution could be proposed again in the future (potentially even in a new PR). Also, @scottschurr, if you happen to have any suggestions for moving this PR forward (or if you think it can be closed), please share them.
@intelliot, thanks for asking. As far as I can see there is a stalemate on this pull request.
As it stands I, personally, will not approve #4364. I think this pull request (4410) is strictly an improvement over 4364. But, potentially, you could get other folks to approve 4364 and close 4410. Another possible outcome would be for someone to explain to me the problems in this pull request (4410) in a way that I can understand. If someone can explain the concerns in a way that I can understand then I can either tweak 4410 to fix the problem, or close it in deference to 4364.
I don't have strong opinions between #4410 and #4364, but hopefully I can help clarify the argument. It sounds to me like @nbougalis is concerned that #4410 will make it easier for nodes to disagree about whether or not some validator supports the amendment. I'll try to elaborate a bit on why this might be true in some cases. Sorry Nik if I'm misunderstanding your critique!

With #4410, suppose validator A votes yes on some amendment in their validation for voting ledger 256n-1, but then goes offline or loses sync at e.g. ledger 256n+200. Nodes that stay alive will continue to assume A voted yes in ledger 256(n+1)-1. But any node that restarts between ledgers 256n and 256(n+1) will perceive A as voting no. The main contrast with the current design or #4364 is that it doesn't matter where in this interval the node restarts. With the current design or #4364, it's guaranteed that if messages are delivered quickly, then all nodes that are alive for ledger 256(n+1)-1 will agree on A's vote, even if they restart anywhere in the interval 256n ~ 256(n+1)-2.

Since it's more likely that nodes will restart across a long span of time than it is that they will restart at a specific ledger, one could expect it's more likely for disagreement to occur with #4410. The interval within which restarts can cause disagreement becomes larger the longer we keep saved votes alive.
I can imagine no scenario where this change makes things worse and lots of scenarios where it makes it better. No amendment is needed, and one should not be used, as this doesn't change any behavior related to consensus or ledger contents. Some possible alternate designs that some might prefer:
I see no reason to oppose this change if it's not associated with an amendment. There are cases where it makes things better, and I can think of no cases where it makes things worse. It makes it easier to evolve the behavior to make things even better. It doesn't risk gaming the 80% threshold as proposals that change the threshold or add hysteresis might do. Though prone to sub-optimal behavior in some edge cases (like around validator restarts), it significantly reduces the sub-optimal behavior around very common cases.
Amendment flapping usually occurs when an amendment is on the verge of gaining majority, and a validator not in favor of the amendment goes offline. If fixAmendmentFlapping activates then two changes occur: 1. The number of validators in the UNL determines the required threshold for an amendment to gain majority. 2. The AmendmentTable keeps a record of the most recent Amendment vote received from each trusted validator in the UNL. If no validation arrives from a given validator, then the AmendmentTable assumes that the previously received vote has not changed. Fixes XRPLF#4350
@ChronusZ thank you for the clarification above. Will you be able to review this PR?
I think the approach in this PR will lead to voting results which are easier to explain in the face of failures like network disconnects and node restarts, as opposed to the approach presented in #4364. The reason I think so is that in this PR there is no extra information recorded in the ledger, and everything can be explained by information available locally on the nodes. I think explainability of voting is a very important property that we must not lose sight of, and for that reason I would vote for this PR and not for the other, even if this PR does not appear to be robust in the face of all hypothetical failures.
I noticed that we still count a validator's vote towards the quorum after the 24h time limit has passed, only as all "no"s. It's a simple mental model which I think is correct, given that if we wanted to drop that (or any other) validator from the quorum we could do that explicitly, as it's handled elsewhere.
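A minimal sketch of the 24-hour rule as described in this comment (the function name is hypothetical; the actual cutoff handling lives in the AmendmentTable):

```cpp
#include <chrono>

// Sketch only: a cached vote older than 24 hours still keeps the
// validator in the quorum denominator, but its amendment votes are
// then treated as all "no".
bool
voteStillFresh(
    std::chrono::steady_clock::time_point received,
    std::chrono::steady_clock::time_point now)
{
    using namespace std::chrono_literals;
    return (now - received) <= 24h;
}
```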
…ism (XRPLF#4410)" This reverts commit 2bb8de0. Manual testing showed some unexpected behavior. We're removing the commit since the release date is close and fixing the problem is risky. A fixed commit will be put into a later release.
Executed tests to verify this change and they work as expected. 👍
…LF#4410)

Amendment "flapping" (an amendment repeatedly gaining and losing majority) usually occurs when an amendment is on the verge of gaining majority, and a validator not in favor of the amendment goes offline or loses sync. This fix makes two changes:

1. The number of validators in the UNL determines the threshold required for an amendment to gain majority.
2. The AmendmentTable keeps a record of the most recent Amendment vote received from each trusted validator (and, with `trustChanged`, stays up-to-date when the set of trusted validators changes). If no validation arrives from a given validator, then the AmendmentTable assumes that the previously-received vote has not changed. In other words, when missing an `STValidation` from a remote validator, each server now uses the last vote seen.

There is a 24 hour timeout for recorded validator votes.

These changes do not require an amendment because they do not impact transaction processing, but only the threshold at which each individual validator decides to propose an EnableAmendment pseudo-transaction.

Fix XRPLF#4350
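The `trustChanged` hook mentioned in this commit message might look roughly like the following sketch. The signature and names are hypothetical, but the idea is simply to drop cached votes from validators that are no longer trusted:

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for rippled's PublicKey type.
using ValidatorKey = std::string;

// Minimal stand-in for the cached last-vote record sketched earlier.
struct RecordedVote
{
    std::set<std::string> upVotes;
};

// Sketch: when the set of trusted validators changes, forget cached
// votes from validators that are no longer on the local UNL so they
// cannot count toward future tallies.
void
trustChanged(
    std::set<ValidatorKey> const& nowTrusted,
    std::map<ValidatorKey, RecordedVote>& lastVoteSeen)
{
    for (auto it = lastVoteSeen.begin(); it != lastVoteSeen.end();)
    {
        if (nowTrusted.count(it->first) == 0)
            it = lastVoteSeen.erase(it);
        else
            ++it;
    }
}
```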
High Level Overview of Change
This pull request is being offered as an alternative approach to #4364. Both pull requests seek to limit amendment flapping. Only one of these pull requests should be merged.
This pull request specifically intends to address a comment by @mDuo13: #4364 (review)
Context of Change
Amendment flapping usually occurs when an amendment is on the verge of gaining majority, and a validator not in favor of the amendment goes offline.
This pull request addresses this problem by making three changes:
1. Prior to this pull request the `AmendmentTable` was unaware of which validators are trusted. With this change the `AmendmentTable` keeps a record of which validators are currently on the UNL of this particular server. That record must be kept up to date by the server.
2. The number of votes required for an amendment to attain majority is based on the number of trusted validators, not on the number of `STValidation`s received from trusted validators prior to the flag ledger.
3. The `AmendmentTable` now keeps a record of the last amendment votes seen from each of the validators on the current UNL. If, for some reason, the `AmendmentTable` does not receive an `STValidation` from a validator on the UNL prior to the flag ledger, then the `AmendmentTable` assumes that the missing validator(s) did not change their vote. It uses the last vote seen from that validator. (A sketch of this fallback follows this list.)
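Here is a minimal sketch of change 3's fallback, with hypothetical names (rippled's actual voting plumbing differs):

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for rippled's PublicKey and uint256 types.
using ValidatorKey = std::string;
using AmendmentID = std::string;

// For each trusted validator, prefer the STValidation received in this
// voting round; otherwise fall back to the last vote seen. A validator
// we have never heard from counts as voting "no" on everything.
std::set<AmendmentID>
effectiveVotes(
    ValidatorKey const& validator,
    std::map<ValidatorKey, std::set<AmendmentID>> const& thisRound,
    std::map<ValidatorKey, std::set<AmendmentID>> const& lastVoteSeen)
{
    if (auto it = thisRound.find(validator); it != thisRound.end())
        return it->second;
    if (auto it = lastVoteSeen.find(validator); it != lastVoteSeen.end())
        return it->second;
    return {};
}
```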
These changes address amendment flapping in the following ways:

1. While it's not uncommon for trusted validators to sporadically go offline, it's much less common for the UNL of a validator to change. With this change we should only see flapping when the UNL changes. Flapping around a UNL change is inevitable, since the voting participants change at that point.
2. The threshold for an amendment to attain majority is more stable. The threshold only changes when the size of the UNL changes. Previously a missing `STValidation` from a trusted validator would change the threshold.
3. If a trusted validator's `STValidation` is missing during a vote, the `AmendmentTable` assumes that the vote from that validator has not changed. This provides additional stability of the vote while reducing requirements on trusted validators to remain online all the time.

Type of Change
Testing
Unit tests have been added to exercise the new code. However, with a change like this, it would also be good to exercise the code on a small test network to ensure reasonable amendment voting behavior when the UNL changes on that network.