Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition while waiting on all listen interfaces #3305

Closed
stormshield-pj50 opened this issue Jan 3, 2023 · 10 comments
Closed

Race condition while waiting on all listen interfaces #3305

stormshield-pj50 opened this issue Jan 3, 2023 · 10 comments

Comments

@stormshield-pj50
Copy link
Contributor

Summary

We have some code based on the dcutr example that starts a first event loop and wait on all listen interfaces for one second. Our code can act both as a client (listener + dialer) and a server (listener only).

We are experimenting a race condition while waiting on all listen interfaces: both NewListenAddr and incoming ConnectionEstablished events can be received if for instance an other peer dial the peer at the same time. One possible fix is to explicitly handle the ConnectionEstablished event but we could also miss other events (SwarmEvent::Behaviour events, ...).

Expected behaviour

When starting get all listen interfaces through NewListenAddr events, and then receive other events.

Actual behaviour

Listen interfaces events are mixed with other kind of events.

Possible Solution

Provide a swarm API that can filter in events. Filtered out events are kept in the swarm for later usage.

Version

  • libp2p version (version number, commit, or branch):
    0.50.0

Would you like to work on this bug ?

Maybe, depending on the proposed solution and its complexity.

@mxinden
Copy link
Member

mxinden commented Jan 4, 2023

Can you expand on why you can not handle the other events right away?

Provide a swarm API that can filter in events. Filtered out events are kept in the swarm for later usage.

I am not familiar enough with your use-case. Off the top of my head, this seems like a hack that you can easily implement on top of Swarm by buffering the events. Is that correct?

@stormshield-pj50
Copy link
Contributor Author

Can you expand on why you can not handle the other events right away?

We'd like to handle those events straight away, however we have based our code on dcutr, and as dcutr we first wait to listen on all interfaces, then we dial other peers. Do you known why dcutr is written this way ?

That is to say we have removed the first wait listen loop in our code and it seems to be ok, but the interrogation still remains.

Provide a swarm API that can filter in events. Filtered out events are kept in the swarm for later usage.

I am not familiar enough with your use-case. Off the top of my head, this seems like a hack that you can easily implement on top of Swarm by buffering the events. Is that correct?

Yes for sure, it can be implemented by buffering the events but seems to be a bit ugly.

@rkuhn
Copy link
Contributor

rkuhn commented Jan 9, 2023

Another approach could be to send all swarm events to multiple recipients: one is the long-running loop that services all swarm requests, another is a stream consumer that just filters for NewListenAddr events and does whatever you need to do with those.

One potentially nice feature to be added to libp2p might be to emit an event once the listeners on all initially detected interfaces have been bound (or failed to do so). I’m a bit on the fence, though, whether a more sustainable design is enforced by only emitting addresses as they appear — since that can happen at any later time anyway. @stormshield-pj50 this is the crux of the matter: you typically will get NewListenAddr later, after other events have been emitted, so when would you decide to say “now the listening phase is complete”?

@thomaseizinger
Copy link
Contributor

Can you expand on why you can not handle the other events right away?

We'd like to handle those events straight away, however we have based our code on dcutr, and as dcutr we first wait to listen on all interfaces, then we dial other peers. Do you known why dcutr is written this way ?

It is a standalone example that doesn't necessarily represent the way you'd structure a production application.

@stormshield-pj50
Copy link
Contributor Author

Can you expand on why you can not handle the other events right away?

We'd like to handle those events straight away, however we have based our code on dcutr, and as dcutr we first wait to listen on all interfaces, then we dial other peers. Do you known why dcutr is written this way ?

It is a standalone example that doesn't necessarily represent the way you'd structure a production application.

I understand this is not a production application, but how can you dial and make port reuse work if you haven't waited first to listen on all interfaces ? How would you do that in a production application ? Can you provide an example code ?

@rkuhn
Copy link
Contributor

rkuhn commented Jan 12, 2023

@stormshield-pj50 Could you be more precise? I ask because “all interfaces” is not well-defined: interfaces may appear or disappear at any time, and they may or may not be enumerated within a given timeout. The situation would be completely different if your application listens on specific addresses, in which case you know all addresses from the start.

(I don’t know much about dcutr, so take my comments as generic Swarm statements.)

@stormshield-pj50
Copy link
Contributor Author

@stormshield-pj50 Could you be more precise? I ask because “all interfaces” is not well-defined: interfaces may appear or disappear at any time, and they may or may not be enumerated within a given timeout. The situation would be completely different if your application listens on specific addresses, in which case you know all addresses from the start.

(I don’t know much about dcutr, so take my comments as generic Swarm statements.)

Yes I mean "all interfaces" like in dcutr by listening on 0.0.0.0, so I don't know all addresses from the start.

@rkuhn
Copy link
Contributor

rkuhn commented Jan 12, 2023

That’s what I said: binding to 0.0.0.0 has well-defined semantics on the socket level, but doesn’t tell the socket what its reachable addresses are. Since libp2p needs to know the actual addresses, it binds to some addresses that are discovered — one by one, over time — through RT_NETLINK (or similar). This is a dynamic process, there is no “final answer” to what “all addresses” means. So you’ll have to be more specific than that.

What you could do is use if-watch yourself, discover a set of addresses you’re happy with, and then let the swarm listen exactly on those. That way your code knows in advance via which addresses the swarm will very likely be reachable (modulo bind or listen syscall errors).

@thomaseizinger
Copy link
Contributor

thomaseizinger commented Jan 13, 2023

Can you expand on why you can not handle the other events right away?

We'd like to handle those events straight away, however we have based our code on dcutr, and as dcutr we first wait to listen on all interfaces, then we dial other peers. Do you known why dcutr is written this way ?

It is a standalone example that doesn't necessarily represent the way you'd structure a production application.

I understand this is not a production application, but how can you dial and make port reuse work if you haven't waited first to listen on all interfaces ? How would you do that in a production application ? Can you provide an example code ?

Are you starting your applications at the same time? Normally the order of events should be:

  • Start the relay
  • Relay either has its public address configured or discovers it through other peers via identify
  • Once discovered, relay can offer its relay services to other peers by advertising its address somewhere, be it through a DHT, rendezvous or some other service
  • Client A can discover the relay and use it for listening
  • Client A can advertise its relayed listen address to other peers
  • Client B discovers relayed listen address and attempts a hole punch via dctur

If I understand your setup correctly, the "race condition" you are experiencing is because Client A magically knows the relay's address ahead of time. If it were to learn it at runtime, the NewListenAddr event must have happened on the relay side and thus the dial must strictly only happen after it (and should succeed). One exception to this is that the relay uses a static public address or a DNS address. In that case, the entire "advertise relay services" part can be skipped altogether and the dial either succeeds because the relay is up or the dial fails because the relay is down.

In both cases, I think the race condition should not occur.

Bottom line: The example takes a short-cut by having the client learn the relay's address "out-of-band" via a commandline parameter. This design is prone to race conditions so that is the bit you should likely get rid of.

Hope that helps :)

@thomaseizinger
Copy link
Contributor

@stormshield-pj50 I am closing this as resolved. Please let me know if there is still an issue you think needs addressing! :)

@thomaseizinger thomaseizinger closed this as not planned Won't fix, can't repro, duplicate, stale May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants