Metrics: Consider introducing dedicated measure instrument for timings #464
Comments
From the spec SIG meeting today: sounds like this one could be suitable for the v0.5 milestone.
This issue is being highlighted in a comparison of Micrometer and OpenTelemetry.
I continue to think this is a good idea. It provides a very concrete instrument for what is likely the most common kind of metric recording in instrumentation.
@jmacd You mentioned a few design points in the Gitter channel. Should they be updated here?
I wrote:
I think we want (1) and to tell users of (2) they might consider using a Span instead.
If all we want is (1), couldn't we support an additional data type (i.e. int64, double, & duration) rather than creating a new instrument?
What's the status of this? With respect to the comment above, I think (2) is still valid because not all spans are sampled. It's not possible to determine the latency and throughput of a method without timers being present for all executions.
My two cents: Recording time values is not a first-class concept in OpenTelemetry. No component supports this by default, so users need to measure time on their own and record it using a
My experience with measuring elapsed time is the following:
Please also notice that having a
Why is measuring elapsed time hard?
Since there is no clock abstraction currently in OTel, it is easy for users to calculate time values incorrectly (e.g. using wall time instead of monotonic time), so depending on how you get the current time, you might run into various issues.
Granularity
The method that you use may return the current time in milliseconds, but the granularity of the time value can depend on the underlying operating system and may be larger than the base unit. For example, many operating systems measure time in units of tens of milliseconds. Let's say your OS measures time in chunks of 100ms. In this scenario there is no way to measure anything more precisely than that. So if an operation takes somewhere between 20 and 80ms, your measurement will always be either 0 or 100ms (and the granularity can be worse, e.g. 1s).
Time can change (non-monotonic)
The method that you use to get the current time might consider:
If you want to have some fun, read this: https://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time. There might be other issues too, but my point is: measuring elapsed time is hard (which is why languages and language SDKs have support for measuring elapsed time). Let me show two OpenTelemetry examples that demonstrate some of the problems above:
Why is testing a timer hard?
Because of the issues above and because of the uncertainty of the duration of operations, it can be hard to verify that your instrumentation is correct. Having a clock abstraction makes this easy, since you can mock/fake the clock so that it returns what you want.
Why could the time unit be a problem?
Some metrics backends have a preference for a base time unit (e.g. Prometheus expects times in seconds). This preference is a concern for exporting, but recording should be possible in any time unit. Without having a first-class concept of a
C++ also has
Per our previous SIG meeting, here's a Java example: https://github.com/jonatan-ivanov/otel-spec-gh-464
Please check out the code and the "tests". Here's an example that uses the timer and fakes the clock:

```java
Timer timer = meter.createTimer("exampleTimer", "just an example");
Timer.Sample sample = timer.start(fakeClock);
fakeClock.add(Duration.ofSeconds(1));
long recordedDuration = timer.stop(sample);
assertThat(recordedDuration).isEqualTo(Duration.ofSeconds(1).toNanos());
```

This is just one possible arrangement; there are others (e.g. the timer methods and how the clock is injected could differ, a histogram is not used in the example, etc.).
Here's a possible implementation in Go:

```go
type TimerInstrument struct {
	syncInstrument
}

type Timer struct {
	mu     sync.Mutex
	instr  TimerInstrument
	start  time.Time
	labels []attribute.KeyValue
}

func (t TimerInstrument) Start(ctx context.Context, labels ...attribute.KeyValue) *Timer {
	return &Timer{instr: t, start: time.Now(), labels: labels}
}

func (t TimerInstrument) Record(ctx context.Context, f func(context.Context) error, labels ...attribute.KeyValue) error {
	timer := t.Start(ctx, labels...)
	err := f(ctx)
	err2 := timer.Stop(ctx)
	if err == nil && err2 != nil {
		return fmt.Errorf("error stopping timer: %w", err2)
	} else if err != nil && err2 == nil {
		return fmt.Errorf("error executing timed function: %w", err)
	} else if err != nil && err2 != nil {
		return fmt.Errorf("error executing timed function (%s) and error stopping timer (%s)", err, err2)
	}
	return nil
}

func (t *Timer) Stop(ctx context.Context, labels ...attribute.KeyValue) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.start.IsZero() {
		return fmt.Errorf("timer not active, unable to stop")
	}
	dur := time.Since(t.start)
	t.instr.directRecord(ctx, number.NewInt64Number(int64(dur)), append(t.labels, labels...))
	t.start = time.Time{}
	return nil
}
```

It seems largely similar to what @victlu and @jonatan-ivanov have proposed, though it would have the user call I think it will be important to be able to provide attributes both when starting and stopping a timer. I've included a
Here is a Go prototype as well. The thing I like about this design is that you can write a one-liner to record a duration. https://github.com/lightstep/opentelemetry-prometheus-sidecar/blob/main/telemetry/timer.go
For reference, here's our current implementation of request metrics in Java.
At first I was quite sad not to have a Timer API which I would have just put into the Context directly - but since I ended up needing to propagate labels from start to end of the request, adding a
Edit: But if there were a similar class for the counter too, where add and sub don't operate on the meter but happen based on an intermediate object, then labels wouldn't have to be propagated / grokked, and that would be nice too. I think this issue may be not only about timings but about any metric that spans a request's beginning and end, which includes active request counting as well.
Hello! I'm working on the next generation of the MicroProfile Metrics spec. We're intending to allow for any metrics API to be used as the implementation of our next API. One important part of that API will be (and has been in past releases of our spec) a Timer metric. So it would be great if OTel could find a solution for this issue so that OTel metrics could be used by implementers of the next version of the MicroProfile Metrics spec.
Hi @donbourne! We are currently capturing "timing" metrics as histograms, e.g. http.server.duration. It's likely that a future "timer" instrument would just be a (very helpful) abstraction on top of histogram. Since you would be providing your own Timer abstraction in MicroProfile Metrics, would it work for you to bridge that directly to an OTel histogram for now?
Agreeing with @trask -- to me the reason for a dedicated timing instrument is to abstract away details about handling clocks and/or to have a more stopwatch-like API. If you already have a timing measurement, then just use a histogram with the correct units.
Thanks @trask and @jmacd, knowing that the plan for OTel is to continue to use a histogram for Timers is useful. I realize that different metrics APIs will use different ways of representing a Timer, so it would be difficult for us to try to standardize on the exact output expected for timers. That said, using a histogram is similar in spirit to what we've done in the past for Timers (and what Micrometer can do when using a Prometheus Meter Registry). So while I agree that having a timing instrument seems very useful from a syntax perspective, I get your point that an implementer of MicroProfile Metrics could make do with a histogram (particularly if a future timing instrument will be implemented on top of a histogram).
Related to open-telemetry/oteps#129
Hey all! Interesting discussion. I'm migrating the organization I work in from multiple metric strategies (Metrics.Net and app-metrics to InfluxDB) to a single industry-standard solution, which I'm happy to embrace. This conversation caught my eye because OTel doesn't define a "timer", and here you are discussing a possible implementation. I'm afraid that defining a
How would such an instrument help me to get mean, min, max and quantiles?
Histograms are designed for scenarios where you want mean, min, max and quantiles. The difference between a Histogram and other types of in-memory aggregations (like Summary/Counter) is that Histogram pushes as much data as reasonable (according to bucket boundaries) so the Quantile computation can happen server side. Every histogram has these components:
While there are many ways of calculating quantiles, histograms have become very popular because they handle time-alignment issues at query time better than the raw "gauge of quantile" behavior we had before. Hope that helps!
Wouldn't defining a Histogram lead to the same confusion with developers that are not familiar with the spec? :) To me the biggest advantage of having a Timer is the ability to measure elapsed time. This seems to be surprisingly hard, because most users do not get it right even though the solution is surprisingly easy on most platforms (see above or the article I linked above, though time is hard in general). There are more advantages, like testing and handling time units too; see my comment above.
For reference: The Prometheus Java client library has a
This text was considered for the 0.3 release, but held back. Consider this for 0.4.
See the comment: #430 (comment)