Light-node stops reporting finalized headers for certain chain-specs #239

Closed
skunert opened this issue Feb 28, 2023 · 26 comments · Fixed by #441

@skunert
Contributor

skunert commented Feb 28, 2023

I attached a smoldot light node to some local, natively running polkadot nodes.
I listened for finalized headers via chain_subscribeFinalizedHeads.
The first 20 headers are reported, then no further notifications arrive.
On restart, a few more notifications are sent out until we reach block 30/40/50, then nothing again.

Shallow Investigation

  • I found that this only happens when I use the rococo-local chainspec
  • polkadot-local and others seem to work fine; connecting to live Polkadot also works fine
  • Maybe related to session changes? rococo-local has "session_length_in_blocks": 10 in its chainspec

Reproduction

[2023-02-28T10:11:23Z DEBUG network] Chain(rococo_local_testnet) <= SetLocalGrandpaState(set_id: 0, commit_finalized_height: 20)
[2023-02-28T10:11:23Z DEBUG runtime-rococo_local_testnet] Worker <= InputFinalized(hash=0x5151…18d8, best=0x423c…d850)
[2023-02-28T10:11:23Z DEBUG runtime-rococo_local_testnet] Worker => OutputFinalized(hash=0x5151…18d8, best=0x423c…d850)
[2023-02-28T10:11:23Z DEBUG tx-service-rococo_local_testnet] Finalized(block=0x5151…18d8, body-transactions={})

Full logs (only the cranelift spam removed):
light_client.log

@tomaka
Contributor

tomaka commented Feb 28, 2023

Thanks for the report!

I've noticed this myself, and in my previous investigation it seemed that we simply stop receiving GrandPa commit messages from the Substrate nodes. Looking at your logs, this seems to be the same issue.

This could in turn be caused either by a Substrate bug or by a bug in smoldot that causes the GrandPa neighbor messages to not always be sent to peers; I don't really know. Debugging this is a lot of effort, and given that I'm not very familiar with GrandPa myself, it's kind of complicated.

In addition to this, if finality gossiping stops working for a few seconds for some reason (including reasonable reasons like a connectivity issue), finality could actually get stuck for a reason I forgot. In order to prevent that, smoldot is supposed to query the justification of the blocks containing an epoch transition in case the finality gossiping is stuck. This is unfortunately not implemented yet. However, this only covers a niche situation, and given that this issue is easy to reproduce I think that this isn't the primary cause.
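For illustration, here is a minimal sketch of that fallback under my own assumptions (the names and the 15-second threshold are hypothetical, not smoldot's actual code): if no finality notification has arrived for a while and an unfinalized epoch-transition block is known, request its justification from a peer.

```rust
// Hypothetical sketch of the justification fallback described above;
// not smoldot's actual code.
use std::time::{Duration, Instant};

/// How long finality gossip may stay silent before we suspect it is stuck
/// (arbitrary value for the sketch).
const STALE_AFTER: Duration = Duration::from_secs(15);

struct FinalityWatchdog {
    /// When we last received a finalized-block notification.
    last_finalized_update: Instant,
}

impl FinalityWatchdog {
    /// Returns the hash of the epoch-transition block whose justification
    /// should be requested, if gossip looks stuck and such a block is pending.
    fn maybe_request(&self, pending_epoch_transition: Option<[u8; 32]>) -> Option<[u8; 32]> {
        if self.last_finalized_update.elapsed() >= STALE_AFTER {
            pending_epoch_transition
        } else {
            None
        }
    }
}

fn main() {
    let watchdog = FinalityWatchdog { last_finalized_update: Instant::now() };
    // Gossip is not stale yet, so no request is issued:
    assert!(watchdog.maybe_request(Some([0u8; 32])).is_none());
}
```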

@tomaka
Contributor

tomaka commented Feb 28, 2023

> Maybe related to session changes? rococo-local has "session_length_in_blocks": 10 in its chainspec

Ah I missed that info.

However, I can also easily reproduce the issue on Westend, which as far as I know doesn't have that.

@tomaka
Contributor

tomaka commented Feb 28, 2023

@skunert Would it be possible to also upload the logs of one of the two relay chain nodes? sync=debug, sub-libp2p=debug, and grandpa=debug would be useful.

@skunert
Contributor Author

skunert commented Feb 28, 2023

New run with requested log levels:
alice.log
bob.log
smoldot.log

@tomaka
Contributor

tomaka commented Feb 28, 2023

It turns out that I need grandpa=trace, sorry 😬
sub-libp2p=trace (instead of debug) could also be useful, in case something is revealed.

@skunert
Contributor Author

skunert commented Feb 28, 2023

No problem, here we go 👍
alice-trace.log
bob-trace.log
smoldot.log

@tomaka
Contributor

tomaka commented Feb 28, 2023

I think that it's indeed smoldot that doesn't send its view update notifications to the full nodes.
Grepping Notification(12D3KooWBTwTFNvgPeEFA3VAtyGcXnDunfW3Q1GtJDXVUL5FpqNK in the logs gives no result (while it does for another PeerId), meaning that we never receive a single notification from smoldot.

When the light client connects, it first opens a block announces substream, which is accepted, then tries to open a transactions substream and a GrandPa substream, both of which are denied. This denial shouldn't happen in absolute terms, but knowing how the Substrate code works, I understand why it happens.
Since the substream has been denied, smoldot never sends notifications.
In the case of Substrate, it would try opening the substream again in the future, something that smoldot doesn't do.

@tomaka
Contributor

tomaka commented Feb 28, 2023

I think the solution is to make smoldot try to open the substreams again after 10 seconds or so.
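A minimal sketch of what that could look like, with hypothetical names (this is not smoldot's actual API): instead of giving up when the remote denies the substream, record a deadline and re-attempt the open once it has passed.

```rust
// Hypothetical sketch of a substream re-open policy; not smoldot's actual API.
use std::time::{Duration, Instant};

/// Roughly the "10 seconds or so" delay suggested above.
const RETRY_DELAY: Duration = Duration::from_secs(10);

enum SubstreamState {
    Open,
    /// The remote denied the substream; retry once the deadline has passed.
    Denied { retry_at: Instant },
}

fn on_substream_denied(state: &mut SubstreamState) {
    *state = SubstreamState::Denied { retry_at: Instant::now() + RETRY_DELAY };
}

/// Called periodically by the event loop; returns `true` if a new open
/// attempt should be made now.
fn should_retry(state: &SubstreamState) -> bool {
    matches!(state, SubstreamState::Denied { retry_at } if Instant::now() >= *retry_at)
}

fn main() {
    let mut state = SubstreamState::Open;
    on_substream_denied(&mut state);
    assert!(!should_retry(&state)); // too early: the delay hasn't elapsed yet
}
```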

@tomaka
Contributor

tomaka commented Feb 28, 2023

I believe that #240 fixes the problem, but I'd like to leave this issue open until a test is added.

Thanks for the help. It's a very problematic bug, but I wouldn't have started working on it if it wasn't for this issue.

@skunert
Contributor Author

skunert commented Feb 28, 2023

> I believe that #240 fixes the problem, but I'd like to leave this issue open until a test is added.
>
> Thanks for the help. It's a very problematic bug, but I wouldn't have started working on it if it wasn't for this issue.

Nice! I will test again once that is merged.

@skunert
Contributor Author

skunert commented Feb 28, 2023

Sadly, it looks like the issue still exists. New logs with #240 applied (same problem as before):
alice.log
bob.log
smoldot.log

@tomaka
Contributor

tomaka commented Feb 28, 2023

> Sadly, it looks like the issue still exists. New logs with #240 applied (same problem as before): alice.log bob.log smoldot.log

There is progress: the substream is now properly opened, and smoldot sends its neighbor packets.
However, Substrate doesn't send back any commit message, as I think it's supposed to do.

I think that at this point I should ping @andresilva for help 🙏 because I don't know enough to continue investigating

Here are some truncated extracts from Alice's logs:

# Everything works fine, smoldot is connected and has 12D3KooWAhTYXNB6CThQfLB33WSLhBHA93m8bQ21eVfRRujX7aNT as PeerId

# A commit is ready
2023-02-28 17:48:52.102 TRACE tokio-runtime-worker grandpa: Got commit for round_number 32: target_number: 19, target_hash: 0xa97eb7968bbd09f4d1884094cd8fd3482efd80e9cfeb41d55eb65bcd23456953    

# The commit message is sent to smoldot
2023-02-28 17:48:52.102 TRACE tokio-runtime-worker sub-libp2p: External API => Notification(PeerId("12D3KooWAhTYXNB6CThQfLB33WSLhBHA93m8bQ21eVfRRujX7aNT"), OnHeap("/d0009b025ac5d4aa1c624d547c93ccd84a0ba0f3e65a134fa17ccaed0700f72c/grandpa/1"), 319 bytes)    

# Smoldot then replies with a neighbor packet
2023-02-28 17:48:52.102 TRACE tokio-runtime-worker sub-libp2p: Handler(ConnectionId(4)) => Notification(12D3KooWAhTYXNB6CThQfLB33WSLhBHA93m8bQ21eVfRRujX7aNT, SetId(2), 22 bytes)    
2023-02-28 17:48:52.102 TRACE tokio-runtime-worker grandpa: Peer 12D3KooWAhTYXNB6CThQfLB33WSLhBHA93m8bQ21eVfRRujX7aNT updated view. Now at Round(1), SetId(0)    

# A bit later, the same sequence of events happens, and everything works fine:
2023-02-28 17:48:58.633 TRACE tokio-runtime-worker grandpa: Got commit for round_number 34: target_number: 20, target_hash: 0xc6123f7cf49ba415021f88f0e49ae21de3b97855c539642615e3084609e685d5    
2023-02-28 17:48:58.633 TRACE tokio-runtime-worker sub-libp2p: External API => Notification(PeerId("12D3KooWAhTYXNB6CThQfLB33WSLhBHA93m8bQ21eVfRRujX7aNT"), OnHeap("/d0009b025ac5d4aa1c624d547c93ccd84a0ba0f3e65a134fa17ccaed0700f72c/grandpa/1"), 319 bytes)    
2023-02-28 17:48:58.634 TRACE tokio-runtime-worker sub-libp2p: Handler(ConnectionId(4)) => Notification(12D3KooWAhTYXNB6CThQfLB33WSLhBHA93m8bQ21eVfRRujX7aNT, SetId(2), 22 bytes)    
2023-02-28 17:48:58.634 TRACE tokio-runtime-worker grandpa: Peer 12D3KooWAhTYXNB6CThQfLB33WSLhBHA93m8bQ21eVfRRujX7aNT updated view. Now at Round(1), SetId(0)    

# Then later, another commit is ready (note that this line is indeed printed twice for some reason):
2023-02-28 17:49:06.667 TRACE tokio-runtime-worker grandpa: Got commit for round_number 1: target_number: 21, target_hash: 0x4ca3bed1a335ebf8022f8cd8914d353f55010736f28befbe374f7fce789444c0    
2023-02-28 17:49:06.667 TRACE tokio-runtime-worker grandpa: Got commit for round_number 1: target_number: 21, target_hash: 0x4ca3bed1a335ebf8022f8cd8914d353f55010736f28befbe374f7fce789444c0    

# We can see that a message is sent to some peers (only 22 bytes though?!), but nothing to smoldot
2023-02-28 17:49:06.667 TRACE tokio-runtime-worker sub-libp2p: External API => Notification(PeerId("12D3KooWRkZhiRhsqmrQ28rt73K7V3aCBpqKrLGSXmZ99PTcTZby"), OnHeap("/d0009b025ac5d4aa1c624d547c93ccd84a0ba0f3e65a134fa17ccaed0700f72c/grandpa/1"), 22 bytes)    
2023-02-28 17:49:06.667 TRACE tokio-runtime-worker sub-libp2p: External API => Notification(PeerId("12D3KooWLcwySCZ2kPRC7pKTM1tNyghm2fxVuSeYfVeM9oWqQNK7"), OnHeap("/d0009b025ac5d4aa1c624d547c93ccd84a0ba0f3e65a134fa17ccaed0700f72c/grandpa/1"), 22 bytes)    

# Later, six more commits are ready (the "Got commit" line), but none of them are sent to smoldot, which is the problem

One potential issue is that smoldot always says that it is at round 1. I've been told in the past that this was ok, but maybe some changes happened in Substrate?

@tomaka
Contributor

tomaka commented Feb 28, 2023

After looking more, what I think happens is:

  • A few times, smoldot and Bob send neighbor packets saying that they're at set 0. Everything works fine.
  • Then, Alice sends some votes/precommits (or whatever makes GrandPa work) to Bob.
  • Bob, since it holds all the votes/precommits locally, is able to determine that the block now has enough votes to be finalized, and finalizes it.
  • Bob sends a neighbor packet saying that it is now at set 1.
  • Later, Alice has a commit ready, but only sends it to Bob, because Bob is "correctly" at set 1 while smoldot is still at set 0.

Maybe smoldot has to query the justification of the block that transitions to set 1; I'm not sure.
Since I don't have a really good grasp of how all of this works, I don't know which party behaves correctly and which doesn't.

@andresilva

andresilva commented Mar 1, 2023

As part of grandpa's gossip protocol, the peers exchange neighbor packets that inform each other of: which round the peer is observing, which set id it is on, and the best block it has seen finalized. Using this information, the peers will restrict who they gossip to, e.g. if you announce to a peer that you are on set id 1, then the peer will not gossip any messages related to set id 0 to you. The same logic applies to the commit_finalized_height field that is part of the neighbor packet, i.e. if you claim that you've seen block 10 get finalized, then the peer won't send you a commit that finalizes block 9, for example.

For light clients it is OK to always keep the round on the neighbor message at 1, since you're not interested in getting round vote data anyway. It is also fine to keep commit_finalized_height stuck at some block (potentially even 0), although this means that you might end up receiving useless commit data. But as you transition to different sets, you need to send a neighbor packet updating the peers about the set id you are now tracking.

I have looked at the smoldot code and it seems that neighbor packets are sent when a substream is opened and after new blocks are synced (through https://github.com/smol-dot/smoldot/blob/main/lib/src/network/service/notifications.rs#L662), but it wasn't obvious to me whether it gets called if you transition to a new set as part of processing a commit message (i.e. not as part of syncing blocks). In the logs that @skunert posted you can see that on block 21 the grandpa voters transition to a new set, and this is the point when smoldot stops getting commits.
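For illustration, a minimal sketch of the neighbor packet described above, using `parity-scale-codec` (the field names follow the description here; the real types live in Substrate's GRANDPA crate, and the two one-byte wrapper prefixes are my assumption about the wire format). If that layout is right, the encoding comes to 22 bytes, which would explain the 22-byte notifications visible in the logs earlier in this thread.

```rust
// Illustrative reconstruction of the GRANDPA neighbor packet; the real
// definitions live in Substrate, and the wrapper bytes below are assumptions.
use parity_scale_codec::Encode; // crate: parity-scale-codec, "derive" feature

#[derive(Encode)]
struct NeighborPacket {
    /// Round the sender is observing (light clients may keep this at 1).
    round: u64,
    /// Authority set id the sender is currently tracking.
    set_id: u64,
    /// Highest block the sender has seen finalized via a commit.
    commit_finalized_height: u32,
}

fn main() {
    let packet = NeighborPacket { round: 1, set_id: 0, commit_finalized_height: 20 };
    // Assumed wire format: one byte for the gossip-message kind ("neighbor")
    // and one byte for the packet version, followed by the SCALE-encoded fields.
    let mut bytes = vec![2u8, 1u8];
    packet.encode_to(&mut bytes);
    assert_eq!(bytes.len(), 22); // 2 + 8 + 8 + 4
}
```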

Another possibility here is a race condition between smoldot and the rest of the full nodes serving commits: as full nodes observe the full grandpa protocol they transition to set id 1 and therefore immediately stop broadcasting anything related to set id 0. As such, smoldot never gets the commit message to finalize the last block in set id 0 that would make it transition to set id 1, and it gets stuck on set id 0 forever. This is probably more prevalent on local test networks, since there is no network latency and all of these transitions will be "instant".

In substrate, whenever we see a block that changes the authorities we also fall back to requesting a justification for this block through sync; on the happy path we will just finalize this block through grandpa gossip (either a full round or commits), but otherwise we will eventually get a justification from sync and then move on to the latest set. To make this race condition less prevalent I can try changing the substrate gossip logic to be a bit more lenient and allow gossiping the last commit from the previous set, but I think falling back to fetching justifications from sync for set-transition blocks might still be necessary.

@tomaka
Contributor

tomaka commented Mar 1, 2023

> it wasn't obvious to me whether it gets called if you transition to a new set as part of processing a commit message (i.e. not as part of syncing blocks)

Yes, it does! A neighbor packet is sent to every peer whenever a block has been finalized. The set id is the one corresponding to the block that has been finalized.

> In substrate, whenever we see a block that changes the authorities we also fall back to requesting a justification for this block through sync; on the happy path we will just finalize this block through grandpa gossip (either a full round or commits), but otherwise we will eventually get a justification from sync and then move on to the latest set.

How does that work precisely? Do you start requesting justifications for block N after block N+3 has been received? Or do you immediately try? If peers say that they don't have a justification, I guess you try again after some time?

@tomaka
Contributor

tomaka commented Mar 1, 2023

When for example block 10 is authored and announced over the network, the nodes will start finalizing block 8. The problem is that, from smoldot's perspective, if I ask for a justification of block 8 immediately after learning about block 10, there's a high chance that there's no justification available yet. I could wait for a bit before querying the justification, but "a bit" is a bit vague.

Another solution is to query the justification of block 8 only when we learn about block 11, as that would indicate for sure that finality is stuck. But that means that from smoldot's perspective block 8 is finalized several seconds after it has "actually" been finalized.

There's this floating period between block 10 and block 11 where smoldot has no way to know the precise moment when block 8 is finalized, except by repeatedly querying justifications. However, repeatedly querying justifications would lead to smoldot being banned from the full nodes.

@andresilva

andresilva commented Mar 1, 2023

> Yes, it does! A neighbor packet is sent to every peer whenever a block has been finalized. The set id is the one corresponding to the block that has been finalized.

Then this shouldn't be the issue, I think your usage of neighbor packets is correct.

> How does that work precisely? Do you start requesting justifications for block N after block N+3 has been received? Or do you immediately try? If peers say that they don't have a justification, I guess you try again after some time?

On substrate we start asking for justifications through sync as soon as we import a block that triggers an authority set change. On polkadot this is one block every 4h in the worst case (a session ends), and every 24h in the best case (no session keys were rotated and we "forcefully" start a new set on a new staking era). We only keep one in-flight request, rotate the peers we ask, and apply a cooldown of 10s before retrying the same request with someone else. The reputation system in substrate won't immediately ban you if you make duplicate requests just for the justification (they are treated as "small requests"), but it will still decrease your reputation slightly.

I think another way you could implement this is to use neighbor packets to trigger the requests: if you are on set id 0 and some peer sends you a neighbor packet announcing that they are on set id 1, then that peer should be able to provide you with the justification for the last block in set id 0 (otherwise that peer wouldn't have been able to transition).
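A rough sketch of that idea, with hypothetical names and types (not smoldot's or Substrate's actual API): when a peer's neighbor packet advertises a set id ahead of ours, emit a justification request towards that peer, keeping a single in-flight request and a 10-second cooldown as described above.

```rust
// Hypothetical sketch of using neighbor packets to trigger justification
// requests; names and types are illustrative only.
use std::time::{Duration, Instant};

const COOLDOWN: Duration = Duration::from_secs(10);

type PeerId = String; // stand-in for a real libp2p peer id

struct JustificationRequest {
    /// Peer that advertised the newer set id; it should hold the justification
    /// for the last block of our current set, otherwise it couldn't have
    /// transitioned itself.
    peer: PeerId,
    /// Set we still need the final justification for; mapping this to the
    /// concrete set-transition block is chain-specific and elided here.
    set_id: u64,
}

struct FinalityTracker {
    local_set_id: u64,
    /// At most one justification request in flight at any time.
    in_flight: bool,
    last_attempt: Option<Instant>,
}

impl FinalityTracker {
    /// Called whenever a neighbor packet arrives from `peer`.
    fn on_neighbor_packet(&mut self, peer: PeerId, peer_set_id: u64) -> Option<JustificationRequest> {
        let cooled_down = self.last_attempt.map_or(true, |t| t.elapsed() >= COOLDOWN);
        if peer_set_id > self.local_set_id && !self.in_flight && cooled_down {
            self.in_flight = true;
            self.last_attempt = Some(Instant::now());
            Some(JustificationRequest { peer, set_id: self.local_set_id })
        } else {
            None
        }
    }
}

fn main() {
    let mut tracker = FinalityTracker { local_set_id: 0, in_flight: false, last_attempt: None };
    // A peer announces set id 1 while we are still on set id 0:
    assert!(tracker.on_neighbor_packet("alice".to_owned(), 1).is_some());
}
```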

@tomaka
Contributor

tomaka commented Mar 1, 2023

After discussion on Element, the conclusion is that Substrate should start sending neighbor packets to light clients when their set id changes. This way, the light client is able to know whether it is up to date with finality, and can query a justification in return.

@kamilsa

kamilsa commented Mar 6, 2023

We will make the appropriate changes in KAGOME as well, to enable sending neighbor messages to light clients.

@tomaka
Contributor

tomaka commented Mar 10, 2023

Thanks for the changes in Substrate and KAGOME

When it comes to smoldot, the things that remain to do are:

  • Report the neighbor packets to the sync state machine.
  • Make the sync state machine send justification requests when relevant.

The second point requires some API changes, as right now when the sync state machine sends a request, it is implicitly a query for a header only.
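A hypothetical shape for that API change (illustrative only, not smoldot's actual types): make the kind of request explicit, so the sync state machine can ask for a justification alongside the header.

```rust
// Hypothetical sketch of widening the sync state machine's request type;
// illustrative only, not smoldot's actual API.
struct BlockRequest {
    block_hash: [u8; 32],
    kind: BlockRequestKind,
}

enum BlockRequestKind {
    /// The current implicit behavior: fetch the header only.
    HeaderOnly,
    /// New variant: also fetch the GRANDPA justification, used e.g. when a
    /// neighbor packet reveals that the local set id lags behind a peer's.
    HeaderAndJustification,
}
```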

tomaka added the "blocked" label (progress on this issue requires something beyond our control) on Mar 14, 2023
@tomaka
Contributor

tomaka commented Mar 14, 2023

Waiting for a Substrate version to be released to work on this.

@tomaka
Contributor

tomaka commented Mar 30, 2023

I think (but not completely sure) that Polkadot v0.9.40 contains the neighbor packets change.

tomaka removed the "blocked" label on Mar 30, 2023
@skunert
Contributor Author

skunert commented Mar 30, 2023

It is contained in v0.9.40 👍

@tomaka
Contributor

tomaka commented Apr 19, 2023

This should now be fixed by #441, but it would be great if you could confirm!
If it still doesn't work, I'd need the debug logs to figure out what's going on.

@tomaka
Contributor

tomaka commented May 9, 2023

Please reopen if it's still not working.

tomaka closed this as completed on May 9, 2023
@skunert
Contributor Author

skunert commented May 11, 2023

I came back to this today and ran my local test setup. I can confirm that this is fixed!
