Race condition while waiting on all listen interfaces #3305

stormshield-pj50 · 2023-01-03T09:31:36Z

Summary

We have some code based on the dcutr example that starts a first event loop and waits on all listen interfaces for one second. Our code can act both as a client (listener + dialer) and as a server (listener only).

We are experiencing a race condition while waiting on all listen interfaces: both NewListenAddr and incoming ConnectionEstablished events can be received if, for instance, another peer dials this peer at the same time. One possible fix is to explicitly handle the ConnectionEstablished event, but we could still miss other events (SwarmEvent::Behaviour events, ...).

Expected behaviour

On startup, receive all listen interfaces through NewListenAddr events first, and only then receive other events.

Actual behaviour

Listen-interface events are interleaved with other kinds of events.

Possible Solution

Provide a swarm API that can filter in events. Filtered out events are kept in the swarm for later usage.

Version

libp2p version (version number, commit, or branch):
0.50.0

Would you like to work on this bug ?

Maybe, depending on the proposed solution and its complexity.

mxinden · 2023-01-04T16:22:59Z

Can you expand on why you can not handle the other events right away?

> Provide a swarm API that can filter in events. Filtered out events are kept in the swarm for later usage.

I am not familiar enough with your use-case. Off the top of my head, this seems like a hack that you can easily implement on top of Swarm by buffering the events. Is that correct?

stormshield-pj50 · 2023-01-05T13:35:04Z

> Can you expand on why you can not handle the other events right away?

We'd like to handle those events straight away; however, we have based our code on dcutr and, like dcutr, we first wait to listen on all interfaces, then we dial other peers. Do you know why dcutr is written this way?

That said, we have removed the initial wait-for-listen loop from our code and it seems to work, but the question remains.

> > Provide a swarm API that can filter in events. Filtered out events are kept in the swarm for later usage.

> I am not familiar enough with your use-case. Off the top of my head, this seems like a hack that you can easily implement on top of Swarm by buffering the events. Is that correct?

Yes, for sure, it can be implemented by buffering the events, but that seems a bit ugly.
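For illustration, the buffering idea discussed above could look roughly like the following. This is only a sketch: `Event` is a stand-in enum rather than libp2p's `SwarmEvent`, the event source is a plain iterator rather than a `Swarm`, and `BufferedEvents`, `next_matching` and `next_any` are hypothetical names.

```rust
use std::collections::VecDeque;

/// Stand-in for libp2p's `SwarmEvent` (assumption: only the shape matters here).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    NewListenAddr(String),
    ConnectionEstablished(String),
    Behaviour(String),
}

/// Wraps an event source and keeps skipped events for later instead of dropping them.
struct BufferedEvents<I: Iterator<Item = Event>> {
    source: I,
    deferred: VecDeque<Event>,
}

impl<I: Iterator<Item = Event>> BufferedEvents<I> {
    fn new(source: I) -> Self {
        Self { source, deferred: VecDeque::new() }
    }

    /// Returns the next event accepted by `wanted`.
    /// Previously deferred events are checked first; non-matching fresh
    /// events are buffered in arrival order.
    fn next_matching(&mut self, wanted: impl Fn(&Event) -> bool) -> Option<Event> {
        if let Some(pos) = self.deferred.iter().position(|e| wanted(e)) {
            return self.deferred.remove(pos);
        }
        while let Some(event) = self.source.next() {
            if wanted(&event) {
                return Some(event);
            }
            self.deferred.push_back(event);
        }
        None
    }

    /// Returns deferred events first, then fresh events from the source.
    fn next_any(&mut self) -> Option<Event> {
        self.deferred.pop_front().or_else(|| self.source.next())
    }
}
```

A start-up phase would call `next_matching` with a NewListenAddr filter, and the main loop would then switch to `next_any`, so a ConnectionEstablished event that arrived early is still delivered afterwards.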

rkuhn · 2023-01-09T16:40:01Z

Another approach could be to send all swarm events to multiple recipients: one is the long-running loop that services all swarm requests, another is a stream consumer that just filters for NewListenAddr events and does whatever you need to do with those.
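A rough sketch of this fan-out, using only standard-library channels. The `Event` enum is a stand-in for libp2p's `SwarmEvent`, and `FanOut`, `subscribe`, `publish` and `listen_addrs` are hypothetical names; in a real application the loop that drives the swarm would call `publish` for every event it receives.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Stand-in for libp2p's `SwarmEvent` (assumption: only the shape matters here).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    NewListenAddr(String),
    ConnectionEstablished(String),
    Behaviour(String),
}

/// Sends a copy of every event to each subscriber.
struct FanOut {
    subscribers: Vec<Sender<Event>>,
}

impl FanOut {
    fn new() -> Self {
        Self { subscribers: Vec::new() }
    }

    /// Registers a new recipient and returns its receiving end.
    fn subscribe(&mut self) -> Receiver<Event> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Broadcasts one event and forgets subscribers whose receiver was dropped.
    fn publish(&mut self, event: Event) {
        self.subscribers.retain(|tx| tx.send(event.clone()).is_ok());
    }
}

/// Consumer side: drain what has arrived so far and keep only listen addresses.
fn listen_addrs(rx: &Receiver<Event>) -> Vec<String> {
    rx.try_iter()
        .filter_map(|event| match event {
            Event::NewListenAddr(addr) => Some(addr),
            _ => None,
        })
        .collect()
}
```

With this shape, the long-running loop sees every event, while the address consumer can ignore everything except NewListenAddr without affecting anyone else.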

One potentially nice feature to be added to libp2p might be to emit an event once the listeners on all initially detected interfaces have been bound (or failed to do so). I’m a bit on the fence, though, whether a more sustainable design is enforced by only emitting addresses as they appear — since that can happen at any later time anyway. @stormshield-pj50 this is the crux of the matter: you typically will get NewListenAddr later, after other events have been emitted, so when would you decide to say “now the listening phase is complete”?

thomaseizinger · 2023-01-11T22:22:39Z

> > Can you expand on why you can not handle the other events right away?

> We'd like to handle those events straight away; however, we have based our code on dcutr and, like dcutr, we first wait to listen on all interfaces, then we dial other peers. Do you know why dcutr is written this way?

It is a standalone example that doesn't necessarily represent the way you'd structure a production application.

stormshield-pj50 · 2023-01-12T08:26:21Z

> > > Can you expand on why you can not handle the other events right away?

> > We'd like to handle those events straight away; however, we have based our code on dcutr and, like dcutr, we first wait to listen on all interfaces, then we dial other peers. Do you know why dcutr is written this way?

> It is a standalone example that doesn't necessarily represent the way you'd structure a production application.

I understand this is not a production application, but how can you dial and make port reuse work if you haven't first waited to listen on all interfaces? How would you do that in a production application? Can you provide example code?

rkuhn · 2023-01-12T08:52:51Z

@stormshield-pj50 Could you be more precise? I ask because “all interfaces” is not well-defined: interfaces may appear or disappear at any time, and they may or may not be enumerated within a given timeout. The situation would be completely different if your application listens on specific addresses, in which case you know all addresses from the start.

(I don’t know much about dcutr, so take my comments as generic Swarm statements.)

stormshield-pj50 · 2023-01-12T09:10:10Z

> @stormshield-pj50 Could you be more precise? I ask because “all interfaces” is not well-defined: interfaces may appear or disappear at any time, and they may or may not be enumerated within a given timeout. The situation would be completely different if your application listens on specific addresses, in which case you know all addresses from the start.

> (I don’t know much about dcutr, so take my comments as generic Swarm statements.)

Yes, I mean "all interfaces" as in dcutr, i.e. listening on 0.0.0.0, so I don't know all addresses from the start.

rkuhn · 2023-01-12T13:11:49Z

That’s what I said: binding to 0.0.0.0 has well-defined semantics on the socket level, but doesn’t tell the socket what its reachable addresses are. Since libp2p needs to know the actual addresses, it binds to some addresses that are discovered — one by one, over time — through RT_NETLINK (or similar). This is a dynamic process, there is no “final answer” to what “all addresses” means. So you’ll have to be more specific than that.

What you could do is use if-watch yourself, discover a set of addresses you’re happy with, and then let the swarm listen exactly on those. That way your code knows in advance via which addresses the swarm will very likely be reachable (modulo bind or listen syscall errors).
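As a sketch of that last suggestion: once the application has chosen its own set of IP addresses (for example from if-watch), it can build the explicit listen addresses up front and track when each one has been confirmed or has failed. Everything below is an assumption for illustration: `tcp_multiaddr` and `ListenPhase` are hypothetical names, addresses are plain strings in multiaddr text form rather than libp2p's `Multiaddr` type, loopback is skipped only as an example of a selection rule, and a fixed non-zero port is assumed so that the reported address equals the requested one.

```rust
use std::collections::BTreeSet;
use std::net::IpAddr;

/// Formats an IP and TCP port in multiaddr text form, e.g. `/ip4/192.168.1.10/tcp/4001`.
fn tcp_multiaddr(ip: IpAddr, port: u16) -> String {
    match ip {
        IpAddr::V4(v4) => format!("/ip4/{v4}/tcp/{port}"),
        IpAddr::V6(v6) => format!("/ip6/{v6}/tcp/{port}"),
    }
}

/// Tracks which explicitly requested listen addresses are still unresolved.
struct ListenPhase {
    pending: BTreeSet<String>,
    ready: Vec<String>,
}

impl ListenPhase {
    /// `discovered` is the set of addresses the application found itself.
    fn new(discovered: &[IpAddr], port: u16) -> Self {
        let pending = discovered
            .iter()
            .filter(|ip| !ip.is_loopback())
            .map(|ip| tcp_multiaddr(*ip, port))
            .collect();
        Self { pending, ready: Vec::new() }
    }

    /// Addresses to pass to the swarm's listen call, one by one.
    fn to_listen_on(&self) -> Vec<String> {
        self.pending.iter().cloned().collect()
    }

    /// Call when the swarm reports a new listen address.
    fn on_listening(&mut self, addr: &str) {
        if self.pending.remove(addr) {
            self.ready.push(addr.to_string());
        }
    }

    /// Call when binding or listening on an address failed.
    fn on_failed(&mut self, addr: &str) {
        self.pending.remove(addr);
    }

    /// True once every requested address has either succeeded or failed.
    fn is_complete(&self) -> bool {
        self.pending.is_empty()
    }
}
```

Because the set of addresses is fixed by the application, "the listening phase is complete" has a precise meaning here, which it does not have when listening on 0.0.0.0.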

thomaseizinger · 2023-01-13T01:24:28Z

> > > > Can you expand on why you can not handle the other events right away?

> > > We'd like to handle those events straight away; however, we have based our code on dcutr and, like dcutr, we first wait to listen on all interfaces, then we dial other peers. Do you know why dcutr is written this way?

> > It is a standalone example that doesn't necessarily represent the way you'd structure a production application.

> I understand this is not a production application, but how can you dial and make port reuse work if you haven't first waited to listen on all interfaces? How would you do that in a production application? Can you provide example code?

Are you starting your applications at the same time? Normally the order of events should be:

1. Start the relay.
2. The relay either has its public address configured or discovers it through other peers via identify.
3. Once discovered, the relay can offer its relay services to other peers by advertising its address somewhere, be it through a DHT, rendezvous or some other service.
4. Client A can discover the relay and use it for listening.
5. Client A can advertise its relayed listen address to other peers.
6. Client B discovers the relayed listen address and attempts a hole punch via dcutr.

If I understand your setup correctly, the "race condition" you are experiencing is because Client A magically knows the relay's address ahead of time. If it were to learn it at runtime, the NewListenAddr event must have happened on the relay side and thus the dial must strictly only happen after it (and should succeed). One exception to this is that the relay uses a static public address or a DNS address. In that case, the entire "advertise relay services" part can be skipped altogether and the dial either succeeds because the relay is up or the dial fails because the relay is down.

In both cases, I think the race condition should not occur.

Bottom line: The example takes a short-cut by having the client learn the relay's address "out-of-band" via a commandline parameter. This design is prone to race conditions so that is the bit you should likely get rid of.

Hope that helps :)

thomaseizinger · 2023-05-08T09:57:34Z

@stormshield-pj50 I am closing this as resolved. Please let me know if there is still an issue you think needs addressing! :)

thomaseizinger mentioned this issue on Jan 11, 2023 in "Shrink Swarm API and empower NetworkBehaviours? #3314" (closed).

thomaseizinger closed this as not planned on May 8, 2023.
