Light-node stops reporting finalized headers for certain chain-specs #239
Comments
Thanks for the report! I've noticed this myself, and in my previous investigation it seemed that we simply stop receiving GrandPa commit messages from the Substrate nodes. Judging by your logs, it is the same issue. This could in turn be caused either by a Substrate bug, or by a bug in smoldot that causes the GrandPa neighbor messages to not always be sent to the peers; I don't really know. Debugging this is a lot of effort, and given that I'm not very familiar with GrandPa myself, it's kind of complicated. In addition to this, if finality gossiping stops working for a few seconds for some reason (including reasonable ones like a connectivity issue), finality could actually get stuck for a reason I forgot. In order to prevent that, smoldot is supposed to query the justification of the blocks containing an epoch transition in case the finality gossiping is stuck. This is unfortunately not implemented yet. However, this only covers a niche situation, and given that this issue is easy to reproduce I think that this isn't the primary cause. |
Ah I missed that info. However, I can also easily reproduce the issue on Westend, which as far as I know doesn't have that. |
@skunert Would it be possible to also upload the logs of one of the two relay chain nodes? |
New run with requested log levels: |
It turns out that I need |
No problem, here we go 👍 |
I think that it's indeed smoldot that doesn't send its view update notifications to the full nodes. When the light client connects, it first opens a block announces substream, which is accepted, then tries to open a transactions substream and a GrandPa substream, both of which are denied. Ideally this denial shouldn't happen, but I understand why it does, knowing how the Substrate code works. |
I think the solution is to make smoldot try to open the substreams again after roughly 10 seconds. |
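The retry idea above could be sketched roughly as follows. This is a hypothetical illustration in Python, not smoldot's actual code: the `SubstreamRetry` class, its method names, and the fixed 10-second delay are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumption: a fixed delay standing in for the "10 seconds-ish" mentioned above.
RETRY_DELAY_SECS = 10.0


@dataclass
class SubstreamRetry:
    """Tracks which refused notification substreams should be re-opened, and when."""

    # Maps (peer_id, protocol) to the earliest time at which to try again.
    retry_at: dict = field(default_factory=dict)

    def on_refused(self, peer_id: str, protocol: str, now: float) -> None:
        # The remote denied the substream: schedule another attempt.
        self.retry_at[(peer_id, protocol)] = now + RETRY_DELAY_SECS

    def on_opened(self, peer_id: str, protocol: str) -> None:
        # The substream is open: there is nothing left to retry.
        self.retry_at.pop((peer_id, protocol), None)

    def due(self, now: float) -> list:
        # Substreams whose delay has elapsed, in a stable (sorted) order.
        return sorted(key for key, at in self.retry_at.items() if at <= now)
```

A caller would invoke `due(now)` periodically and attempt to open each returned substream again, calling `on_refused` or `on_opened` depending on the outcome.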
I believe that #240 fixes the problem, but I'd like to leave this issue open until a test is added. Thanks for the help. It's a very problematic bug, but I wouldn't have started working on it if it wasn't for this issue. |
Nice! I will test again once that is merged. |
Sadly it looks like the issue still exists, new logs with #240 applied (same problem as before): |
There is progress: the substream is now properly opened, and smoldot sends its neighbor packets. I think that at this point I should ping @andresilva for help 🙏 because I don't know enough to continue investigating. Here are some truncated extracts from Alice's logs:
One potential issue is that smoldot always says that it is at round 1. I've been told in the past that this was ok, but maybe some changes happened in Substrate? |
After looking more, what I think happens is:
Smoldot maybe has to query the justification of the block that transitions to set 1, I'm not sure. |
As part of grandpa's gossip protocol, peers exchange neighbor packets that announce which round the peer is observing, from which set id, and the best block it has seen finalized. Peers use this information to restrict who they gossip to: for example, if you announce to a peer that you are on set id 1, that peer will not gossip any messages related to set id 0 to you. The same logic applies to the

Another possibility here is a race condition between smoldot and the full nodes serving commits. Because full nodes observe the full grandpa protocol, they transition to set id 1 and immediately stop broadcasting anything related to set id 0. As a result, smoldot never gets the commit message that finalizes the last block in set id 0, which is what would make it transition to set id 1, and it stays stuck on set id 0 forever. This is probably more prevalent on local test networks, since there is no network latency and all of these transitions are "instant".

In substrate, whenever we see a block that changes the authorities, we also fall back to requesting a justification for this block through sync. On the happy path we just finalize this block through grandpa gossip (either a full round or commits), but otherwise we eventually get a justification from sync and then move on to the latest set.

To make this race condition less prevalent, I can try changing the substrate gossip logic to be a bit more lenient and allow gossiping the last commit from the previous set, but I think falling back to fetching justifications from sync for set-transition blocks might still be necessary. |
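To make the set-id filtering and the race concrete, here is a deliberately simplified sketch in Python. It is an assumption-laden model, not Substrate's gossip code: the `NeighborPacket` fields mirror the three pieces of information listed above, while `should_gossip` and its exact rule (sender and receiver must both be on the message's set) are hypothetical and ignore rounds entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeighborPacket:
    round: int             # round the peer says it is observing
    set_id: int            # authority set id the peer is on
    finalized_height: int  # best block the peer has seen finalized


def should_gossip(local_set_id: int, message_set_id: int, peer_view: NeighborPacket) -> bool:
    # Simplified rule (assumption): a node only gossips messages for the set
    # it is currently on, and only to peers that announced the same set.
    return message_set_id == local_set_id and message_set_id == peer_view.set_id


# A light client that is still on set 0 and has finalized up to block 19.
light_client = NeighborPacket(round=1, set_id=0, finalized_height=19)

# While the full node is on set 0, the set-0 commit reaches the light client.
before_transition = should_gossip(0, 0, light_client)

# Once the full node has moved to set 1, nothing is sent any more: the set-0
# commit is stale for the sender, and set-1 messages are filtered out for the peer.
stale_commit = should_gossip(1, 0, light_client)
new_set_message = should_gossip(1, 1, light_client)
```

Under this model the light client can only leave set 0 through some other channel, such as fetching the justification of the set-transition block.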
Yes I do! A neighbor packet is sent to every peer whenever a block has been finalized. The set id is the one corresponding to the block that has been finalized.
How does that work precisely? Do you start requesting justifications for block N after block N+3 has been received? Or do you immediately try? If peers say that they don't have a justification, I guess you try again after some time? |
When for example block 10 is authored and announced over the network, the nodes will start finalizing block 8. The problem is that, from smoldot's perspective, if I ask for a justification of block 8 immediately after learning about block 10, there's a high chance that there's no justification available yet. I could wait for a bit before querying the justification, but "a bit" is a bit vague. Another solution is to query the justification of block 8 only when we learn about block 11, as that would indicate for sure that finality is stuck. But that means that from smoldot's perspective block 8 is finalized several seconds after it has "actually" been finalized. There's this floating period between block 10 and block 11 where smoldot has no way to know the precise moment when block 8 is finalized, except by repeatedly querying justifications. However, repeatedly querying justifications would lead to smoldot being banned from the full nodes. |
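The "wait for block 11" option can be written down as a small predicate. This is only a sketch of the heuristic described in this comment: the function name and the `lag` parameter (2 in the example, since block 8 is finalized once block 10 is announced) are assumptions for illustration.

```python
def should_query_justification(target: int, best_seen: int, lag: int = 2) -> bool:
    """Return True once finality for `target` looks stuck.

    Assumption: block `target` normally gets finalized when block
    `target + lag` is announced, so seeing any later block while `target`
    is still unfinalized is taken as the signal to query a justification.
    """
    return best_seen > target + lag
```

With the numbers from the comment, `should_query_justification(8, 10)` is false (finalization of block 8 may simply be in progress) and `should_query_justification(8, 11)` is true. The cost of this choice is the delay described above: block 8 is reported as finalized later than it actually was.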
Then this shouldn't be the issue, I think your usage of neighbor packets is correct.
On substrate we start asking for justifications through sync as soon as we import a block that triggers an authority set change. On polkadot this is one block every 4h in the worst case (session ends), and every 24h in the best case (no session keys were rotated and we "forcefully" start a new set on new staking era). We only keep one in-flight request, rotate the peers we ask, and do a cooldown of 10s before retrying the same request to someone else. The reputation system in substrate won't immediately ban you if you make duplicate requests just for the justification (treated as "small requests"), but it will still decrease your reputation slightly. I think another way you could implement this is based on neighbor packets to trigger the requests, i.e. if you are on set id 0 and some peer sent you a neighbor packet announcing that they are on set id 1 then that peer should be able to provide you the justification for the last block in set id 0 (otherwise that peer wouldn't have been able to transition). |
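A rough Python sketch of the request policy described above (one in-flight request, rotating peers, a 10-second cooldown before retrying), plus the neighbor-packet trigger suggested at the end. The class, function names, and structure are hypothetical; this is not the Substrate or smoldot implementation.

```python
from collections import deque

COOLDOWN_SECS = 10.0  # cooldown before retrying with another peer, as described above


class JustificationRequester:
    """Keeps at most one justification request in flight and rotates peers."""

    def __init__(self, peers):
        self.peers = deque(peers)
        self.in_flight = None    # (peer, block) while a request is pending
        self.next_allowed = 0.0  # earliest time a new request may start

    def poll(self, block, now):
        # Return the peer to ask for `block`'s justification, or None.
        if self.in_flight is not None or now < self.next_allowed or not self.peers:
            return None
        peer = self.peers[0]
        self.peers.rotate(-1)  # the next attempt goes to a different peer
        self.in_flight = (peer, block)
        return peer

    def on_failure(self, now):
        # The peer had no justification: wait out the cooldown before retrying.
        self.in_flight = None
        self.next_allowed = now + COOLDOWN_SECS

    def on_success(self):
        self.in_flight = None


def peer_can_provide_transition(local_set_id: int, peer_set_id: int) -> bool:
    # A peer announcing a later set must already hold the justification for
    # the last block of our current set, so it is a good peer to ask.
    return peer_set_id > local_set_id
```

For example, with peers `["a", "b"]`, the first `poll` returns `"a"`; after `on_failure(now)`, no request is issued until 10 seconds have passed, and the next one goes to `"b"`.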
After discussion on Element, the conclusion is that Substrate should start sending neighbor packets to light clients when their set id changes. This way, the light client is able to know whether it is up to date with finality, and can query a justification in return. |
We will do appropriate changes in KAGOME as well, to enable sending neighbor messages to light clients |
Thanks for the changes in Substrate and KAGOME. When it comes to smoldot, the things that remain to be done are:
The second point requires some API changes, as right now when the sync state machine sends a request, it is implicitly a query for a header only. |
Waiting for a Substrate version to be released to work on this. |
I think (but not completely sure) that Polkadot v0.9.40 contains the neighbor packets change. |
It is contained in v0.9.40 👍 |
This should now be fixed after #441, but it would be great if you could confirm! |
Please reopen if it's still not working. |
I came back to this today and ran my local test setup, can confirm that this is fixed! |
I attached a smoldot light node to some local, natively running polkadot nodes.
I listened for finalized headers via chain_subscribeFinalizedHeads. The first 20 headers are reported, then no further notifications arrive.
On restart, a few notifications are sent out until we reach 30/40/50, then nothing again.
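For readers who want to reproduce the observation, this is a minimal sketch of the JSON-RPC side in Python. The request shape follows the legacy Substrate JSON-RPC method named above; the helper names are hypothetical, and the transport (how the strings are sent to and received from the light client) is deliberately left out. It assumes notifications carry the header under `params.result` with a hex-encoded `number` field.

```python
import json


def subscribe_finalized_heads_request(request_id: int) -> str:
    # JSON-RPC 2.0 request that starts the finalized-heads subscription.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "chain_subscribeFinalizedHeads",
        "params": [],
    })


def finalized_number(notification: str) -> int:
    # Extract the block number from a subscription notification.
    # Assumption: the header is under params.result and "number" is hex.
    header = json.loads(notification)["params"]["result"]
    return int(header["number"], 16)
```

Logging `finalized_number` for each notification makes the stall visible: the numbers stop advancing at the values reported here.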
Shallow Investigation
rococo-local chainspec
polkadot-local and others seem to work fine, connecting to live polkadot also works fine
"session_length_in_blocks": 10 in its chainspec
Reproduction
zombienet --provider native spawn <path_to_toml>
https://github.com/paritytech/cumulus/blob/master/zombienet/examples/small_network.toml
Full logs (removed only cranelift spam):
light_client.log