bitswap: peer prom tracker #413
For #209. Needs examples.
Needs tests (I think the client's tracer is buggy and does not record outbound messages).
Codecov Report
@@            Coverage Diff             @@
##             main     #413      +/-   ##
==========================================
- Coverage   49.61%   49.41%   -0.21%
==========================================
  Files         248      249       +1
  Lines       29838    29945     +107
==========================================
- Hits        14805    14798       -7
- Misses      13615    13726     +111
- Partials     1418     1421       +3
}
}()

peerIdLabel := []string{"peer-id"}
High-cardinality labels in Prometheus are considered an antipattern. With 20K peers, this will create 20K time series, which may cause problems (performance, billing) when Grafana tries to visualize them.
To understand why high cardinality is a problem, see:
- https://stackoverflow.com/questions/46373442/how-dangerous-are-high-cardinality-labels-in-prometheus
- https://grafana.com/blog/2022/10/20/how-to-manage-high-cardinality-metrics-in-prometheus-and-kubernetes/
IMO this PR can't land in boxo in this form, as it creates a footgun for users of this library.
There needs to be either a hard limit on the number of peers tracked, or an explicit opt-in via a constructor option or ENV variable.
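To make the hard-limit idea concrete, here is a minimal sketch using only the standard library. It is not code from this PR: `peerLabelLimiter`, `newPeerLabelLimiter`, and the `"other"` overflow bucket are hypothetical names. The idea is that only the first `limit` distinct peers get their own label value, and every later peer collapses into one shared bucket, so the number of time series stays bounded.

```go
package main

import (
	"fmt"
	"sync"
)

// overflowLabel is a hypothetical bucket name shared by all peers beyond the cap.
const overflowLabel = "other"

// peerLabelLimiter (hypothetical) hands out per-peer label values for at most
// limit distinct peers; all other peers share overflowLabel.
type peerLabelLimiter struct {
	mu    sync.Mutex
	limit int
	known map[string]struct{}
}

func newPeerLabelLimiter(limit int) *peerLabelLimiter {
	return &peerLabelLimiter{limit: limit, known: make(map[string]struct{})}
}

// Label returns peerID if that peer is already tracked or there is still room
// to track it, and overflowLabel otherwise.
func (l *peerLabelLimiter) Label(peerID string) string {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.known[peerID]; ok {
		return peerID
	}
	if len(l.known) < l.limit {
		l.known[peerID] = struct{}{}
		return peerID
	}
	return overflowLabel
}

func main() {
	l := newPeerLabelLimiter(2)
	for _, p := range []string{"peerA", "peerB", "peerC", "peerA"} {
		fmt.Println(p, "->", l.Label(p))
	}
}
```

With a limit of 0, every peer maps to the overflow bucket, so a zero default would also act as the explicit opt-in: per-peer series appear only when the caller raises the limit.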
💭 I think if we wanted to have metrics similar to this, we could measure P95 globally without running into the cardinality problem.
To do so, one would define Objectives in SummaryOpts to be P50, P75, P95, etc., and calculate messages-received, messages-sent, bytes-received, etc. across all peers rather than per peer. This way we get a useful P95 metric with a known error margin, without exploding the number of time series.
This was communicated as not needed, and we don't want to run it in production in Kubo due to the cardinality issue.
Fixes #209