
util: expose otel metrics through grpc and prometheus #4460

Merged · 1 commit · Dec 12, 2023

Conversation

@jsternberg (Collaborator) commented Dec 4, 2023

This exposes the otel metrics through the environment variables for the
otel collector, or through the prometheus `/metrics` endpoint on the
debug address.

This also adds metrics for the grpc endpoints from the opentelemetry
libraries.

For the opentelemetry collector, the metric temporality is delta, which
reduces the amount of data that goes over the wire overall. Prometheus
metrics are cumulative because the Prometheus format requires it.

@jsternberg jsternberg force-pushed the otel-metrics branch 3 times, most recently from f0f15ef to c233e37 Compare December 5, 2023 18:39
go.mod Outdated
Comment on lines 82 to 91
```diff
 go.opentelemetry.io/otel/sdk v1.21.0
 go.opentelemetry.io/otel/trace v1.21.0
 go.opentelemetry.io/proto/otlp v1.0.0
 golang.org/x/crypto v0.14.0
 golang.org/x/net v0.17.0
 golang.org/x/sync v0.3.0
-golang.org/x/sys v0.13.0
+golang.org/x/sys v0.14.0
 golang.org/x/time v0.3.0
-google.golang.org/genproto/googleapis/rpc v0.0.0-20230711160842-782d3b101e98
-google.golang.org/grpc v1.58.3
+google.golang.org/genproto/googleapis/rpc v0.0.0-20230822172742-b8732ec3820d
+google.golang.org/grpc v1.59.0
```
Member:

Is it necessary to bump otel deps for this feature? We tend to infer these updates through containerd, see #4318 (review)

@jsternberg (Collaborator, Author) replied:

We don't need to. I'm attempting to get this working with v1.19.0 since that was the first version where the metrics were considered stable. So that should be fine.

@jsternberg jsternberg force-pushed the otel-metrics branch 2 times, most recently from 8586a33 to a54d044 Compare December 5, 2023 19:33
@jsternberg jsternberg marked this pull request as ready for review December 5, 2023 20:13
@jsternberg jsternberg force-pushed the otel-metrics branch 2 times, most recently from f025d9c to c52719b Compare December 8, 2023 15:59
@jsternberg (Collaborator, Author) commented:

I have a PR almost ready for some cache metrics, but I also found that opentelemetry has grpc metrics built in, so I included them here so that this PR actually uses the metrics.

@crazy-max (Member) left a comment:

Just some questions but overall LGTM

@@ -28,6 +29,8 @@ func setupDebugHandlers(addr string) error {
bklog.G(req.Context()).Debugf("triggered GC from debug endpoint")
}))

m.Handle("/metrics", promhttp.Handler())
Member:

I was wondering why not use `/debug/metrics` as the route, but this is part of the debug handler, so it looks good to me.

@jsternberg (Collaborator, Author) replied:

It's mostly because `/metrics` is the typical endpoint for prometheus. I don't think it matters though, so I can always move it to `/debug/metrics`.

The main reason it's in the debug handler is that buildkit doesn't expose an HTTP server otherwise. At the same time, this pathway is only relevant when buildkit runs in isolation. If buildkit is embedded, such as in moby, we'll piggyback on its own `/metrics` endpoint and be included there.

Comment on lines 148 to 150
// Register the prometheus reader if there was no error.
// Don't fail if there was an error.
// TODO(jsternberg): Maybe log the error here?
Member:

Yes, I think we should log the error using `github.com/moby/buildkit/util/bklog`.

@jsternberg (Collaborator, Author) replied:

Forgot that I put this TODO there. Good catch. I'll log the error as a warning.

@jsternberg (Collaborator, Author):

Done.

}

sdktp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(sp), sdktrace.WithResource(res))
closers = append(closers, sdktp.Shutdown)
if r, err := prometheus.New(); err == nil {
Member:
IIUC, prometheus is a pull-based metrics exporter, right? If we fail to register it and a scraper tries to pull, what happens?

@jsternberg (Collaborator, Author) replied:

I believe it'll just include the prometheus defaults; it just won't include any of our own metrics.

@@ -43,3 +60,36 @@ func otlpExporter() (sdktrace.SpanExporter, error) {

return otlptrace.New(context.Background(), c)
}

func otlpMetricExporter() (sdkmetric.Exporter, error) {
set := os.Getenv("OTEL_METRICS_EXPORTER") == "otlp" || os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") != "" || os.Getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT") != ""
Member:

Not for this PR, but I think we should start writing dedicated "observability" docs. At the moment we just have:

buildkit/README.md

Lines 765 to 777 in c2baaa9

## OpenTelemetry support
BuildKit supports [OpenTelemetry](https://opentelemetry.io/) for buildkitd gRPC
API and buildctl commands. To capture the trace to
[Jaeger](https://github.com/jaegertracing/jaeger), set `JAEGER_TRACE`
environment variable to the collection address.
```bash
docker run -d -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest
export JAEGER_TRACE=0.0.0.0:6831
# restart buildkitd and buildctl so they know JAEGER_TRACE
# any buildctl command should be traced to http://127.0.0.1:16686/
```

And https://docs.docker.com/build/building/opentelemetry/

@jsternberg (Collaborator, Author) replied:

Yeah, I think that would probably help.

@tonistiigi (Member) left a comment:

overall SGTM

util/tracing/detect/otlp.go (outdated, resolved)
Signed-off-by: Jonathan A. Sternberg <[email protected]>
@tonistiigi tonistiigi merged commit f84cfe3 into moby:master Dec 12, 2023
59 of 60 checks passed
@jsternberg jsternberg deleted the otel-metrics branch December 12, 2023 22:07
pl4nty added a commit to pl4nty/homelab that referenced this pull request Dec 24, 2023
moby/buildkit#4460 tries and fails to send metrics if OTEL_METRICS_EXPORTER is 'none'

Signed-off-by: Tom Plant <[email protected]>
3 participants