
feat(transport): Metrics oriented connection upgrade #4331

Open
dhuseby opened this issue Aug 15, 2023 · 8 comments
@dhuseby (Contributor) commented Aug 15, 2023

Description

I am looking for a way to collect aggregate metrics for each stream/substream at the transport level. I think the best way to do this is a connection upgrade that passes buffers through unchanged while also sending tuples of (timestamp, direction (in or out), buffer size) over a user-provided bounded channel. This would allow upgrading arbitrary (or all) streams/substreams, with each upgraded connection sending its measurements over a bounded mpsc channel to logic that then calculates the aggregate metrics over time.
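To make this concrete, here is a minimal sketch of such a pass-through wrapper written against the `futures` I/O traits. The `Metered`, `Direction` and `Measurement` names are made up for illustration; this is not an existing rust-libp2p API:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Instant;

use futures::channel::mpsc;
use futures::io::{AsyncRead, AsyncWrite};

/// Direction of a measured buffer (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
pub enum Direction {
    Inbound,
    Outbound,
}

/// One measurement: when, which direction, how many bytes.
pub type Measurement = (Instant, Direction, usize);

/// Pass-through wrapper that forwards all bytes unchanged and reports
/// (timestamp, direction, size) tuples over a bounded channel.
pub struct Metered<S> {
    inner: S,
    tx: mpsc::Sender<Measurement>,
}

impl<S> Metered<S> {
    pub fn new(inner: S, tx: mpsc::Sender<Measurement>) -> Self {
        Self { inner, tx }
    }

    fn report(&mut self, direction: Direction, n: usize) {
        // Never block I/O on the metrics consumer: drop the sample if the
        // bounded channel is currently full.
        let _ = self.tx.try_send((Instant::now(), direction, n));
    }
}

impl<S: AsyncRead + Unpin> AsyncRead for Metered<S> {
    fn poll_read(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut [u8],
    ) -> Poll<std::io::Result<usize>> {
        match Pin::new(&mut self.inner).poll_read(cx, buf) {
            Poll::Ready(Ok(n)) => {
                self.report(Direction::Inbound, n);
                Poll::Ready(Ok(n))
            }
            other => other,
        }
    }
}

impl<S: AsyncWrite + Unpin> AsyncWrite for Metered<S> {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<std::io::Result<usize>> {
        match Pin::new(&mut self.inner).poll_write(cx, buf) {
            Poll::Ready(Ok(n)) => {
                self.report(Direction::Outbound, n);
                Poll::Ready(Ok(n))
            }
            other => other,
        }
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        Pin::new(&mut self.inner).poll_close(cx)
    }
}
```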

Motivation

Currently it is not straightforward (or in some cases possible) to collect metrics for some or all network activity at the transport level, below any protocols. These metrics are useful for analytics and debugging, as well as for nifty "stats for geeks" displays in rust-libp2p applications.

Requirements

  1. Upgrade a connection so that the in/out bytes passing through it are measured.
  2. Report the timestamped and annotated measurements over an mpsc channel if one has been supplied.
  3. Examples for upgrading a single connection and for upgrading all connections, along with an example behavior that consumes the measurements and emits behavior events carrying aggregate metrics based on its configuration.
  4. Optionally, the example network behavior speaks a protocol that allows the aggregate metrics to be sent over the network to another peer.

Open questions

  1. Is a connection upgrade the correct/best way to do this?
  2. Is the consumption of the measurements and the generation of aggregate metrics best done in a network behavior?

Are you planning to do it yourself in a pull request?

Maybe

@thomaseizinger (Contributor)

Does https://docs.rs/libp2p/latest/libp2p/trait.TransportExt.html#method.with_bandwidth_logging help your use case?
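For reference, a rough sketch of what that method exposes, assuming the libp2p 0.52-era API where `with_bandwidth_logging()` returns the wrapped transport together with an `Arc<BandwidthSinks>` holding cumulative byte totals:

```rust
use std::sync::Arc;

use libp2p::bandwidth::BandwidthSinks;

// Sketch: `with_bandwidth_logging()` (called while building the transport)
// hands back an `Arc<BandwidthSinks>`. These are cumulative totals only,
// not a per-event stream.
fn log_totals(sinks: &Arc<BandwidthSinks>) {
    println!(
        "total inbound: {} bytes, total outbound: {} bytes",
        sinks.total_inbound(),
        sinks.total_outbound()
    );
}
```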

Perhaps the metrics module should offer a collector that exports these metrics in prometheus form?

One of the tricky things here is that we can't scope stuff like bandwidth to peers or protocols because that would lead to a possible cardinality explosion in Prometheus.

@mxinden (Member) commented Aug 17, 2023

Duplicate of #3262.

@thomaseizinger closed this as not planned (duplicate) Aug 17, 2023
@dhuseby (Contributor, Author) commented Aug 17, 2023

#3180 and #3262 only give totals. The main idea for #4331 was to emit a stream of tuples containing a timestamp, direction, and buffer size for every connection. The idea is that a "connection manager" receiving the stream of events could do instantaneous bandwidth calculations on a per-stream and aggregate basis, as well as maintain totals. This is similar to the idea of a "connection manager" responding to ping behavior messages to implement connection policies:

//! It is up to the user to implement a health-check / connection management policy based on the ping protocol.
//!
//! For example:
//!
//! - Disconnect from peers with an RTT > 200ms
//! - Disconnect from peers which don't support the ping protocol
//! - Disconnect from peers upon the first ping failure
//!
//! Users should inspect emitted [`Event`]s and call APIs on [`Swarm`]:
//!
//! - [`Swarm::close_connection`](libp2p_swarm::Swarm::close_connection) to close a specific connection
//! - [`Swarm::disconnect_peer_id`](libp2p_swarm::Swarm::disconnect_peer_id) to close all connections to a peer
//!
//! [`Swarm`]: libp2p_swarm::Swarm
//! [`Transport`]: libp2p_core::Transport
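To sketch what that "connection manager" side could look like, here is a hypothetical sliding-window consumer of such measurement tuples (the types reuse the illustrative names from the description above; this is not an existing API):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Hypothetical types matching the proposed measurement tuples.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Direction {
    Inbound,
    Outbound,
}
pub type Measurement = (Instant, Direction, usize);

/// Sliding-window bandwidth estimator for one stream or one connection.
pub struct Bandwidth {
    window: Duration,
    samples: VecDeque<Measurement>,
}

impl Bandwidth {
    pub fn new(window: Duration) -> Self {
        Self { window, samples: VecDeque::new() }
    }

    /// Record a measurement received over the mpsc channel.
    pub fn record(&mut self, m: Measurement) {
        self.samples.push_back(m);
    }

    /// Instantaneous bytes/second in the given direction over the window.
    pub fn rate(&mut self, direction: Direction) -> f64 {
        let cutoff = Instant::now() - self.window;
        // Drop samples that fell out of the window.
        while matches!(self.samples.front(), Some((t, _, _)) if *t < cutoff) {
            self.samples.pop_front();
        }
        let bytes: usize = self
            .samples
            .iter()
            .filter(|(_, d, _)| *d == direction)
            .map(|(_, _, n)| n)
            .sum();
        bytes as f64 / self.window.as_secs_f64()
    }
}
```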

@dhuseby reopened this Aug 17, 2023
@thomaseizinger (Contributor)

Thanks for clarifying! I'd still consider this a duplicate of #3262, which could be changed to report these stats per connection. The comment on #3262 also applies here: the connection being emitted does not yet have an identifier at the time it is created. The identifier is only assigned within Swarm and cannot be accessed from within the Transport implementation.

@dhuseby (Contributor, Author) commented Aug 23, 2023

So with #3180 giving total aggregate bytes at the transport level, it would be interesting to also have the same thing on a per-stream basis, mostly because I want to benchmark a single stream at the stream level while also benchmarking at the transport level, to measure the loss caused by our stack.

Also, the design of the transport metrics makes me wonder whether the Arc + atomic counter approach is the fastest way to get the information out of the transport.
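For comparison, a sketch of that shared-counter approach: the I/O hot path bumps relaxed atomics and readers snapshot them whenever they like. It is cheap per byte, but it only yields running totals rather than the timestamped event stream proposed here (types are hypothetical):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Shared counters updated from the I/O path and read from elsewhere.
#[derive(Default)]
pub struct Counters {
    inbound: AtomicU64,
    outbound: AtomicU64,
}

impl Counters {
    pub fn add_inbound(&self, n: u64) {
        self.inbound.fetch_add(n, Ordering::Relaxed);
    }
    pub fn add_outbound(&self, n: u64) {
        self.outbound.fetch_add(n, Ordering::Relaxed);
    }
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.inbound.load(Ordering::Relaxed),
            self.outbound.load(Ordering::Relaxed),
        )
    }
}

fn main() {
    let counters = Arc::new(Counters::default());
    counters.add_inbound(1500);
    counters.add_outbound(512);
    let (inbound, outbound) = counters.snapshot();
    println!("in: {inbound} B, out: {outbound} B");
}
```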

@thomaseizinger (Contributor)

> So with #3180 giving total aggregate bytes at the transport level, it would be interesting to also have the same thing on a per-stream basis.

That would be great! Streams don't have a public identity (yet) so I wouldn't know how to implement this.

I think all muxers have some kind of internal ID, so perhaps we can reuse that, or we introduce our own tracking similar to what we do for listeners and connections.

The latter would have the advantage that we could scope the stream ID to the connection and do something like 1/1 (stream 1 on connection 1).
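A tiny sketch of what such a connection-scoped stream identifier could look like (hypothetical types, not an existing rust-libp2p API):

```rust
use std::fmt;

// Hypothetical identifiers for illustration; rust-libp2p's own `ConnectionId`
// lives in libp2p-swarm, and streams have no public identifier today.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ConnectionId(u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct StreamId {
    connection: ConnectionId,
    stream: u64,
}

impl fmt::Display for StreamId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // e.g. "1/1" = stream 1 on connection 1
        write!(f, "{}/{}", self.connection.0, self.stream)
    }
}
```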

@thomaseizinger (Contributor)

Note that we now have metrics per transport protocol: #4727

Adding more metrics to also track traffic per application protocol is more difficult. One reason is that we shouldn't just use the protocol as a Prometheus label: the number of protocols is technically unbounded and could thus lead to a cardinality explosion. Hence, to be safe, we'd have to limit at compile time which protocols we add as labels.

At that point, I am no longer sure how useful that is.
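For illustration, one hypothetical way such a compile-time limit could look: only protocols from a fixed allow-list become their own label value, and everything else is folded into "other" to keep cardinality bounded:

```rust
// Hypothetical helper: map a negotiated protocol string to a bounded set of
// label values so the metric's cardinality stays fixed.
fn protocol_label(protocol: &str) -> &'static str {
    match protocol {
        "/ipfs/ping/1.0.0" => "ping",
        "/ipfs/id/1.0.0" => "identify",
        "/ipfs/kad/1.0.0" => "kademlia",
        _ => "other",
    }
}

fn main() {
    assert_eq!(protocol_label("/ipfs/ping/1.0.0"), "ping");
    assert_eq!(protocol_label("/some/app/proto/1.2.3"), "other");
}
```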

@mxinden (Member) commented Nov 24, 2023

> The number of protocols is technically unbounded and could thus lead to a cardinality explosion.

I argue that the number of locally supported protocols is likely bounded. Thus, whenever a protocol is negotiated (i.e. supported by the local node) it is fine to use it as a Prometheus label.

Either way, the problem described in #3262 (comment) still stands, i.e. how to get hold of the protocol name in the first place.
