-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose metrics via Prometheus #1422
Conversation
There were the following issues with your Pull Request
Guidelines are available to help. Your feedback on GitCop is welcome on this issue This message was auto-generated by https://gitcop.com |
I haven't really covered how other use cases mentioned in #1418 could be solved, just wanted to finally get this out here for feedback, and will add more info later tonight. |
@@ -12,6 +12,7 @@ import ( | |||
_ "github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/codahale/metrics/runtime" | |||
ma "github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-multiaddr" | |||
"github.com/ipfs/go-ipfs/Godeps/_workspace/src/github.com/jbenet/go-multiaddr-net" | |||
prom "github.com/prometheus/client_golang/prometheus" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
want to run make vendor
to make sure imports are re-written. if godeps yells at you, see #1385 (comment)
couple comments above, but this LGTM. and, prometheus comes well recommended. (and i love the name). maybe, actually, could put the metrics under
so that we could add other metrics there if we needed to |
There were the following issues with your Pull Request
Guidelines are available to help. Your feedback on GitCop is welcome on this issue This message was auto-generated by https://gitcop.com |
I agree about the distinct namespace for "internal" handlers. |
@lgierth oh uh: https://travis-ci.org/ipfs/go-ipfs/jobs/68493979#L194 this means the metrics clashed when the tests tried to run. looks like we need to add some uniqueness to the metrics names. This could be done either by prefixing with a ptr, or a monotonically increasing number (global for the proc :/) |
It's using prometheus.MustRegisterOrGet() now, which gracefully skips over duplicate metric collectors -- build's green now |
@@ -82,6 +90,8 @@ func NewSwarm(ctx context.Context, listenAddrs []ma.Multiaddr, | |||
s.cg.SetTeardown(s.teardown) | |||
s.SetConnHandler(nil) // make sure to setup our own conn handler. | |||
|
|||
prom.MustRegisterOrGet(peersTotal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh now i understand. sorry, i missed this earlier. this is getting initialized on init
. I believe this needs to be on the swarm itself (i.e. a gauge per swarm). in many circumstances we create multiple swarms per process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or if not stored on the swarm, then create a new one per swarm? point it, it should be independent gauges per swarm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's for another PR, imo. Right now this basically counts "total number of peers over all swarms", which is a good simple metric to get started with.
If the swarm itself is an interesting dimension of the peers count, we can replace the Counter with a CounterVec, which allows supplying labels (i.e. dimensions) with each Inc()/Dec() call. More info on labels.
CounterVec is a Collector that bundles a set of Counters that all share the same Desc, but have different values for their variable labels. This is used if you want to count the same thing partitioned by various dimensions (e.g. number of HTTP requests, partitioned by response code and method). Create instances with NewCounterVec.
peersTotal.GetMetricWithLabelValues(Labels{"swarm": "foo"}).Inc(1)
Labels should only be used with dimensions that are more or less bounded. For example, a unix process ID is a bad label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"total number of peers over all swarms", which is a good simple metric to get started with.
What would this metric be used for?
If the swarm itself is an interesting dimension of the peers count,
It's not just an interesting dimension. not conditioning on the swarm is incorrect.
not sure you see what i mean: there's no relationship implied between two different swarms in the same process (they will most likely belong to different nodes, but not necessarily), and it will double count any connections, between swarms in the same process, and from the multiple swarms to the same peer. (This happens all the time in tests, for example). so, in an in-proc network of 5 nodes, you'd see "25 Number of connected peers" which is an incorrect statement, because there's "25 individual connections, across only 5 peers".
I don't want to merge things to the surface area (interface) that are either (partially) incorrect or that we'll just have to change right away. anything we merge is stuff that someone might start using. I'd probably start using this metric just because it's there, but it would be misleading.
If want to merge just the vendoring of the lib with a test metric, sure let's do it, but maybe go for something else, that isn't a singleton?
Labels should only be used with dimensions that are more or less bounded. For example, a unix process ID is a bad label.
What do you mean by bounded in this case? (cause unix pids are bounded, but not in the way you mean)
but i think it's incorrect. see #1422 (comment) |
License: MIT Signed-off-by: Lars Gierth <[email protected]>
Updated, it now adds the local PeerID as a dimension (label in prometheus lingo). With prometheus' automatic labels included, this ends up as:
The instance will be something like
I meant something that doesn't end up with a huge cardinality after a short amount of time. PeerIDs are constant on the nodes that we want to monitor. I can imagine use cases with throw-away identities, but these cases will just have to live with the fact that this particular metric might not perform that well. |
} | ||
|
||
func (n *ps2netNotifee) Disconnected(c *ps.Conn) { | ||
n.not.Disconnected(n.net, inet.Conn((*Conn)(c))) | ||
if metric, err := peersTotalGauge(n.net.LocalPeer()); err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonder, do we want a log.Debug
if err != nil ? shrug maybe not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// An error is returned if the number and names of the Labels are inconsistent
// with those of the VariableLabels in Desc.
I can swap it for the other func that panics 🤘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we shouldnt ship things with inconsistent labels. im on the fence (a debug log may be ok), but dont care much either way. i think @whyrusleeping would say panic.
ok sounds good! @lgierth i think this LGTM and is ready for merge. ready? |
Not yet, this currently increments/decrements once per notifee, which is wrong -- I'll put the increment/decrement into a distinct notifee, which is anyhow the cleanest way. (Also just noticed that Travis is red). |
i think it's an intermittent travis failure. we've still to get these to run perfectly right. (it's extra paranoid, which helps, but we really should fix these to be able to just get safe greens) |
License: MIT Signed-off-by: Lars Gierth <[email protected]>
License: MIT Signed-off-by: Lars Gierth <[email protected]>
Okay, it's good now: 8b164f9 -- the count is reliably the same as The Travis failure looks unrelated (mock and mock_notif). |
@lgierth LGTM! 👍 |
Expose metrics via Prometheus This commit was moved from ipfs/kubo@74a331d
I've been approaching #1418 from an ops perspective, where prometheus looks very promising. (Disclaimer: it comes from my old employer SoundCloud, where we had plenty of pain with Graphite, Ganglia, Statsd, and all kinds of custom metrics libraries and tools.)
This PR is massive because of Godeps (did I do this right?), but the actual integration is uninvasive, complete, and incremental.
Prometheus exposes metrics via a net/http handler, which is then aggregated by the prometheus daemon, which serves graphs and timeseries, and can back proper dashboards like Graphana.
Currently, this handler is registered with the API (http://127.0.0.1:5001/metrics) and could be used by command line tools, in order to get metrics snapshots, e.g. "number of peers right now". These metrics can be tagged with "labels" to add more dimensions, e.g. "http responses with status code 502", "peers connected via cjdns", "metrics from pluto.i.ipfs.io" etc.