Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor WebRTC code #51

Merged
merged 10 commits into from
Apr 17, 2024
Merged

Refactor WebRTC code #51

merged 10 commits into from
Apr 17, 2024

Conversation

altonen
Copy link
Contributor

@altonen altonen commented Mar 1, 2024

This PR refactors the WebRTC code into something that actually works. The WebRtcConnection code is split into OpeningWebRtcConnection and WebRtcConnection. The former only deals with the Noise handshake whereas the latter implements a typical connection event loop, dealing with inbound/outbound substreams and data exchange. The PR also implements a new substream type for WebRTC which is implemented using tokio::sync::mpsc::{Sender, Receiver} since str0m doesn't expose a substream type. The multistream-select code used by WebRTC is refactored and its test coverage is increased.

This PR also updates str0m to the latest version but during testing it was discovered that there are issues with ChannelId/SCTP stream id reuse and those have to be reported to upstream and fixed. They're are not blockers for merging this PR though because the fixes should be contained to str0m and for litep2p it would only be a version bump if the fixes are accepted.

The WebRTC implementation (with pending fixes to str0m) has been tested with smoldot and the code works. IPFS Ping, Kademlia and Identify all work, opening and negotiating notification protocols work and block requests work. There are, however, two issues that I was able to find:

  1. The first "notification" smoldot reads from the substream is not actually a notification but the substream handshake which results in "protocol-error":
[network] protocol-error; peer_id=12D3KooWNGCW63xu9tGh1WxWEazHs6UgMdTjrLSos89NrVSCh9bz, error=BadBlockAnnounce(DecodeBlockAnnounceError(Verify))
[network] protocol-error; peer_id=12D3KooWNGCW63xu9tGh1WxWEazHs6UgMdTjrLSos89NrVSCh9bz, error=BadGrandpaNotification(DecodeGrandpaNotificationError(Verify))

It looks like smoldot is still able to receive block announcements and GRANDPA notifications so I'm not entirely what the issue here is.

  1. There is another issues with justifications but I don't see how it could be related to WebRTC
[sync-service-westend2] finality-proof-verify-error; error=FinalityVerify(UnknownTargetBlock { block_number: 19772928, block_hash: [22, 163, 175, 94, 114, 75, 142, 174, 83, 146, 11, 251, 184, 119, 212, 76, 34, 85, 144, 47, 83, 251, 83, 86, 152, 141, 204, 104, 253, 236, 21, 48] }), sender=12D3KooWNGCW63xu9tGh1WxWEazHs6UgMdTjrLSos89NrVSCh9bz
[sync-service-westend2] Error while verifying justification: Justification targets a block (#19772928) that isn't in the chain.

This happens on the first try, gossip is closed, then it's opened again and it's able to start warp sync.

altonen and others added 7 commits February 21, 2024 14:41
Introduce `OpeningWebRtcConnection` which only deals with handling the
initial Noise handsdhake which starts every WebRTC connection.

When `str0m` receives an inbound connection from a new peer,
an `OpeningWebRtcConnection` object is created which tracks the Noise
handshake state of the connection. This object doesn't have its own
asynchronous event loop and is polled by the `WebRtcTransport` object.

Once the remote `PeerId` has been received in the second Noise message,
`TransportManager` will be notified of the new connection which it
validates and either accepts or reject. If the connection is rejected,
all connection state is discarded.

If the connection is accepted, `OpeningWebRtcConnection` sends the third
and final Noise handshake message, fully opening the connection.
After that, the object is destroyed and a `WebRtcConnection` object
is crated using the `Rtc` object that was held by
`OpeningWebRtcConnection`. This connection object has its own
asynchronous event loop like connection loops of other transports.

Currently the `WebRtcConnection` is just a dummy connection which polls
the connection and negotiates protocols for inbound substreams but does
nothing else. Proper connection support will be implemented in future
commits.
Introduce `encode_multistream_message()` utility function for crafting
`multistream-select` messages, correctly detect invalid messages and
improve test coverage of the function.
Add support for fallback protocols and increase test coverage
The new event loop of `WebRtcConnection` only deals with opening,
accepting and negotiating substreams.
@altonen altonen requested a review from dmitry-markin March 1, 2024 11:23
@altonen
Copy link
Contributor Author

altonen commented Mar 1, 2024

@dmitry-markin

Since Versi was busy with Kusama tests, I spoke with Andre and we decided that I should look into WebRTC in the meantime, even though it was technically your task. This is a big PR so I'm not expecting a comprehensive review but it'll (kind of) make it work so it should be relatively straightforward to fix any further issues.

@dmitry-markin
Copy link
Collaborator

This PR also updates str0m to the latest version but during testing it was discovered that there are issues with ChannelId/SCTP stream id reuse and those have to be reported to upstream and fixed. They're are not blockers for merging this PR though because the fixes should be contained to str0m and for litep2p it would only be a version bump if the fixes are accepted.

What this means for us? I.e., is this causing any issues for substrate/lightp2p, or we can happily live without these issues fixed in str0m?

@altonen
Copy link
Contributor Author

altonen commented Mar 2, 2024

It means that outbound substreams don't work very well which is a blocker insofar as releasing WebRTC goes. Code needed to fix those issues are quite trivial so I'll open a pull request to str0m tomorrow/next week and hopefully they'll accept it.

Copy link
Collaborator

@dmitry-markin dmitry-markin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not a thorough review as this is all new stuff for me, and I'm far from understanding all the details. Having said that, looks good.

Comment on lines +273 to +278
let message = encode_multistream_message(
std::iter::once(protocol.clone())
.chain(fallback_names.clone())
.filter_map(|protocol| Protocol::try_from(protocol.as_ref()).ok())
.map(|protocol| Message::Protocol(protocol)),
)?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably don't understand something, but as per multistream-select spec, protocols are tried one-by-one. Why are we encoding all protocols (including fallback protocols) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to check this but this works with smoldot, both for listener and dialer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It looks like an incorrect way of handling the handshake and litep2p would reject this message as well but smoldot both sends and accepts Message::Protocols. I'm kind of tempted to keep it as is since it works but I'll check if smoldot would work with Message::Protocol

src/transport/webrtc/config.rs Show resolved Hide resolved
dmitry-markin added a commit that referenced this pull request Apr 16, 2024
This reverts commit 0ca33be.

It turned out to be a bad idea to apply major code style changes while
having unmerged PRs. Specifically, in
#51 I hit the all file long
merge conflicts where it's impossible to manually track important
changes and merge them. Applying the new code style to the PR itself
also didn't help — git failed to correctly handle the changed parts and
this led to even more merge conflicts.

So, the plan is to revert the code style changes, apply the pending PRs,
and reintroduce the code style changes.
@dmitry-markin dmitry-markin force-pushed the altonen-refactor-webrtc branch from 50c9ed2 to 8f189ab Compare April 16, 2024 10:57
@dmitry-markin
Copy link
Collaborator

@lexnv #74 made resolving conflicts with master trivial, and this is now ready for review.

@@ -0,0 +1,80 @@
// Copyright 2023 litep2p developers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: At some point I think we'd need to change this to paritytech

// Copyright 2023-2024 Parity Technologies (UK) Ltd.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can gradually append this line to modified files. I think we should also keep the original copyright that basically refers to Aaro, as litep2p started as his side project.

tests/webrtc.rs Outdated Show resolved Hide resolved
tests/webrtc.rs Outdated Show resolved Hide resolved
let config = Litep2pConfigBuilder::new()
.with_keypair(Keypair::generate())
.with_webrtc(Config {
listen_addresses: vec!["/ip4/192.168.1.170/udp/8888/webrtc-direct".parse().unwrap()],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block-announces protocol uses e143f23803ac50e8f6f8e62695d1ce9e4e1d68aa36c1cd2cfd15340213f3423e which is the genesis of Westend, however I cannot connect to this addr.
Is there another way to test this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The listen address should be updated to 127.0.0.1. Or does the WebRTC spec impose the limitations of connecting to loopback addresses?

Then, I guess, for this test to succeed another Westend node with litep2p backend is needed configured to dial this listen address.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking into account the noise prologue comment, it looks like smoldot is needed as a second node for this test to succeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was used to see how litep2p works with smoldot. It doesn't really test anything which is why it was disabled. How it is you run the test RUST_LOG=info cargo test webrtc_test -- --ignored, copy the printed local address (127.0.0.1 didn't work for some reason) and go to https://tomaka.github.io/test-browser-node/index.html

WebRTC testing branch is still active (commit), if you want to test WebRTC with Polkadot. If you only test smoldot with litep2p it will terminate the connection pretty soon because the syncing handshake is incorrect.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @altonen for the explanation!

};

#[tokio::test]
#[ignore]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test is ignored because the IP address is not reachable anymore

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably is intended to be run manually with a companion node.

Vec::with_capacity(PREFIX.len() + local_fingerprint.len() + remote_fingerprint.len());
prologue.extend_from_slice(PREFIX);
prologue.extend_from_slice(&remote_fingerprint);
prologue.extend_from_slice(&local_fingerprint);
Copy link
Collaborator

@lexnv lexnv Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this is <prefix><fingerprint A><fingerprint B>, depending on whatever this is the handshake responder or the initiator.
(extracted from: https://github.com/libp2p/specs/blob/master/webrtc/webrtc-direct.md#connection-security)

Do we need to handle the case where we are the responder? Ie <prefix><local><remote> instead of

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is it's based on the role of smoldot in the WebRTC noise handshake. It looks like we don't support litep2p to litep2p WebRTC connections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR only implemented WebRTC Direct (browser <-> server), not server <-> server.

*self.state.lock() = State::SendClosed;
}

if flags & 2 == Flag::ResetStream as i32 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dq: We don't care about Flag::FinAck because we are closing the connection either way? (https://github.com/libp2p/specs/blob/master/webrtc/README.md#shared-concepts)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we apply binary &, even though flags seem to be exclusive and not a bitmask. Also, this way FinAck is handled by conditions flags & 1 == StopSending and flags & 2 == ResetStream, so this is probably the reason we don't have a separate condition for FinAck...

Copy link
Collaborator

@dmitry-markin dmitry-markin Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would merge as is and refactor this in a follow-up PR.

@@ -36,7 +36,7 @@ simple-dns = "0.5.3"
smallvec = "1.10.0"
snow = { version = "0.9.3", features = ["ring-resolver"], default-features = false }
socket2 = { version = "0.5.5", features = ["all"] }
str0m = "0.2.0"
str0m = "0.4.1"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
str0m = "0.4.1"
str0m = "0.5.0"

The str0m blocker has been fixed (algesten/str0m@7ed5143) and is included in the 0.5.0 release. I think we should be using the latest version since it includes a few other fixes https://github.com/algesten/str0m/blob/main/CHANGELOG.md#050

Copy link
Collaborator

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on my limited view of the WebRTC 👍

There are a few breaking changes when updating to str0m 0.5, we could probably handle those in a later PR, since some of the changes are not entirely trivial

Co-authored-by: Alexandru Vasile <[email protected]>
@dmitry-markin dmitry-markin merged commit 337538f into master Apr 17, 2024
7 checks passed
@lexnv lexnv mentioned this pull request May 24, 2024
lexnv added a commit that referenced this pull request May 24, 2024
## [0.5.0] - 2023-05-24

This is a small patch release that makes the `FindNode` command a bit
more robst:

- The `FindNode` command now retains the K (replication factor) best
results.
- The `FindNode` command has been updated to handle errors and
unexpected states without panicking.

### Changed

- kad: Refactor FindNode query, keep K best results and add tests
([#114](#114))

## [0.4.0] - 2023-05-23

This release introduces breaking changes to the litep2p crate, primarily
affecting the `kad` module. Key updates include:

- The `GetRecord` command now exposes all peer records, not just the
latest one.
- A new `RecordType` has been introduced to clearly distinguish between
locally stored records and those discovered from the network.

Significant refactoring has been done to enhance the efficiency and
accuracy of the `kad` module. The updates are as follows:

- The `GetRecord` command now exposes all peer records.
- The `GetRecord` command has been updated to handle errors and
unexpected states without panicking.

Additionally, we've improved code coverage in the `kad` module by adding
more tests.

### Added

- Add release checklist
([#115](#115))
- Re-export `multihash` & `multiaddr` types
([#79](#79))
- kad: Expose all peer records of `GET_VALUE` query
([#96](#96))

### Changed

- multistream_select: Remove unneeded changelog.md
([#116](#116))
- kad: Refactor `GetRecord` query and add tests
([#97](#97))
- kad/store: Set memory-store on an incoming record for PutRecordTo
([#88](#88))
- multistream: Dialer deny multiple /multistream/1.0.0 headers
([#61](#61))
- kad: Limit MemoryStore entries
([#78](#78))
- Refactor WebRTC code
([#51](#51))
- Revert "Bring `rustfmt.toml` in sync with polkadot-sdk (#71)"
([#74](#74))
- cargo: Update str0m from 0.4.1 to 0.5.1
([#95](#95))

### Fixed

- Fix clippy  ([#83](#83))
- crypto: Don't panic on unsupported key types
([#84](#84))

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants