feat(transport): Metrics oriented connection upgrade #4331

dhuseby · 2023-08-15T19:22:14Z

Description

I am looking for a way to collect aggregate metrics for each stream/substream at the transport level. I think the best way to do this is to create a connection upgrade that just passes buffers through while also sending tuples of timestamp, direction (e.g. in or out), and buffer size over a user-provided bounded channel. I think this would allow upgrading of arbitrary/all streams/sub-streams with each upgraded connection sending their measurements over a bounded mpsc channel to logic that then calculates the aggregate metrics over time.

Motivation

Currently it is not simple/possible to collect metrics for all/some of the network activity at the transport level--below any protocols. These metrics are useful for analytics and debugging as well as nifty "stats for geeks" displays in rust-libp2p applications.

Requirements

Upgrade a connection to add measuring the in/out bytes passing through it.
Report the timestamped and annotated measurements over an mpsc channel if one has been supplied.
Examples for upgrading single connections and all connections along with an example behavior that consumes the metrics and emits behavior events with aggregate metrics based on the config.
Maybe the example network behavior speaks a protocol that allows for the aggregate metrics to be sent over the network to another peer.

Open questions

Is a connection upgrade the correct/best way to do this?
Is the consumption of the measurements and the generation of aggregate metrics best done in a network behavior?

Are you planning to do it yourself in a pull request?

Maybe

thomaseizinger · 2023-08-16T09:16:09Z

Does https://docs.rs/libp2p/latest/libp2p/trait.TransportExt.html#method.with_bandwidth_logging help your usecase?

Perhaps the metrics module should offer a collector that exports these metrics in prometheus form?

One of the tricky things here is that we can't scope stuff like bandwidth to peers or protocols because that would lead to a possible cardinality explosion in prometheus.

mxinden · 2023-08-17T09:26:08Z

Duplicate of #3262.

dhuseby · 2023-08-17T23:54:26Z

#3180 and #3262 only give totals. The main idea for #4331 was to emit a stream of tuples containing a timestamp, direction and buffer size on every connection. The idea being that a "connection manager" receiving the stream of events could do instantaneous bandwidth calculations on a per-stream and aggregate basis as well as maintain totals. This is similar to the idea of a "connection manager" responding to ping behavior messages to implement connection policies:

rust-libp2p/protocols/ping/src/lib.rs

Lines 32 to 46 in cbdbaa8

    
           //! It is up to the user to implement a health-check / connection management policy based on the ping protocol. 
        
           //! 
        
           //! For example: 
        
           //! 
        
           //! - Disconnect from peers with an RTT > 200ms 
        
           //! - Disconnect from peers which don't support the ping protocol 
        
           //! - Disconnect from peers upon the first ping failure 
        
           //! 
        
           //! Users should inspect emitted [`Event`]s and call APIs on [`Swarm`]: 
        
           //! 
        
           //! - [`Swarm::close_connection`](libp2p_swarm::Swarm::close_connection) to close a specific connection 
        
           //! - [`Swarm::disconnect_peer_id`](libp2p_swarm::Swarm::disconnect_peer_id) to close all connections to a peer 
        
           //! 
        
           //! [`Swarm`]: libp2p_swarm::Swarm 
        
           //! [`Transport`]: libp2p_core::Transport

thomaseizinger · 2023-08-18T09:11:12Z

Thanks for clarifying! I'd still see this as a duplicate of #3262 by changing #3262 into reporting these stats per connection. The comment on #3262 also applies to this issue here because the connection being emitted does not yet have an identifier at the time it is created. The identifier is only assigned within Swarm and cannot be accessed from within the Transport implementation.

dhuseby · 2023-08-23T00:47:39Z

So with #3180 giving total aggregate bytes at the transport level, it would be interesting to also have the same thing on a per-stream basis. Mostly because I want to do some performance benchmarking of a single stream at the stream level while also benchmarking at the transport level to measure the loss caused by our stack.

Also the design for the transport metrics makes me wonder if the Arc + atomic counter is the fastest way to get the information out of the transport.

thomaseizinger · 2023-08-23T05:22:56Z

So with #3180 giving total aggregate bytes at the transport level, it would be interesting to also have the same thing on a per-stream basis.

That would be great! Streams don't have a public identity (yet) so I wouldn't know how to implement this.

I think all muxers have some kind of internal ID so perhaps we can reuse that or we introduce our own tracking similar to listeners and connections.

The latter would have the advantage that we could scope the stream ID to the connection and do something like 1/1 (stream 1 on connection 1).

thomaseizinger · 2023-11-23T21:53:39Z

Note that we now have metrics per transport protocol: #4727

Adding more metrics to also track it per protocol is more difficult. One reason for that is that we shouldn't just use the protocol as a prometheus label. The number of protocols is technically unbounded and could thus lead to a cardinality explosion. Hence, to be safe, we'd have to limit at compile-time, which protocol we add as a label.

At that point, I am no longer sure how useful that is ..

mxinden · 2023-11-24T09:12:09Z

The number of protocols is technically unbounded and could thus lead to a cardinality explosion.

I argue that the number of locally supported protocols is likely bounded. Thus, whenever a protocol is negotiated (i.e. supported by the local node) it is fine to use it as a Prometheus label.

Either way, the problem described in #3262 (comment) still stands, i.e. how to get a hand on the protocol name in the first place.

dhuseby added the difficulty:hard label Aug 15, 2023

thomaseizinger closed this as not planned Won't fix, can't repro, duplicate, stale Aug 17, 2023

dhuseby reopened this Aug 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(transport): Metrics oriented connection upgrade #4331

feat(transport): Metrics oriented connection upgrade #4331

dhuseby commented Aug 15, 2023

thomaseizinger commented Aug 16, 2023

mxinden commented Aug 17, 2023

dhuseby commented Aug 17, 2023 •

edited

Loading

thomaseizinger commented Aug 18, 2023

dhuseby commented Aug 23, 2023

thomaseizinger commented Aug 23, 2023

thomaseizinger commented Nov 23, 2023

mxinden commented Nov 24, 2023

feat(transport): Metrics oriented connection upgrade #4331

feat(transport): Metrics oriented connection upgrade #4331

Comments

dhuseby commented Aug 15, 2023

Description

Motivation

Requirements

Open questions

Are you planning to do it yourself in a pull request?

thomaseizinger commented Aug 16, 2023

mxinden commented Aug 17, 2023

dhuseby commented Aug 17, 2023 • edited Loading

thomaseizinger commented Aug 18, 2023

dhuseby commented Aug 23, 2023

thomaseizinger commented Aug 23, 2023

thomaseizinger commented Nov 23, 2023

mxinden commented Nov 24, 2023

dhuseby commented Aug 17, 2023 •

edited

Loading