Stable metrics sdk #412

tsloughter · 2022-07-19T13:05:36Z

Opening as a draft because I'm not 100% on the direction of this but I think it is close enough to start getting feedback.

There will be a number of tasks after this PR is merged, like supporting floats in the aggregations, filtering attributes in views and adding the async/observable metrics, in addition to many clean up tasks.

Important things to review in this PR are design decisions. One being that each measurement is a message. I do not like, and don't think we can keep, the message going to the otel_meter_server because that also handles other messages and should remain responsive for those. I will likely create a process, with the possibility in the future to be a pool of processes assuming aggregations remain safe to act concurrently.

The Reader process isn't properly handled yet either because it is monitored by the otel_meter_server and not linked but the ets tables it uses are owned by the server. Either the reader needs to be able to run on its own or it should be linked. I'd prefer the former.

Also getting an Instruments no longer requires a call to the server process. Initially I thought it best if it did so that it could be matched to views at that point and not on each measurement but this is likely not useful, and certainly not useful compared to not having to make a call to get an Instrument. Note that future view matching could include attributes in the criteria, so then setting up matches based only on the Instrument isn't even an option.

An issue is that views can be added at any time and after an instrument is matched once (also an issue if we were to support criteria that matches on attributes) it won't be matched again, meaning views added after the first time an instrument is used will never be matched.

Note that filtering attributes is different from using them as criteria. Filtering still needs to be added but needs to be there before this moves out of experimental.

Not sure if the matchspec in checkpoint is efficient but I think checkpoint should be working against all metrics at once instead of one at a time, so it likely will be removed anyway.

codecov · 2022-07-19T13:09:43Z

Codecov Report

Merging #412 (352cfad) into main (6990ff3) will decrease coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
- Coverage   73.80%   73.76%   -0.05%     
==========================================
  Files          53       53              
  Lines        1680     1681       +1     
==========================================
  Hits         1240     1240              
- Misses        440      441       +1

Flag	Coverage Δ
api	`68.99% <ø> (-0.12%)`	⬇️
elixir	`18.83% <ø> (-0.04%)`	⬇️
erlang	`75.23% <ø> (ø)`
exporter	`73.59% <ø> (ø)`
sdk	`78.95% <ø> (ø)`
zipkin	`53.84% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
apps/opentelemetry_api/lib/open_telemetry/span.ex	`25.00% <0.00%> (-1.32%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

tsloughter · 2022-07-19T20:56:59Z

Eh, marking as ready for review. I don't think anything major needs to be changed before merging since it is all in experimental. So I'd like to get reviews and merge this so I can be doing smaller future PRs that are easier to review.

ferd

Started with a review of the API, I'll try to do the internals a bit later this week.

apps/opentelemetry_api_experimental/src/otel_counter.erl

apps/opentelemetry_api_experimental/src/otel_histogram.erl

ferd · 2022-08-04T12:18:05Z

apps/opentelemetry_api_experimental/include/otel_metrics.hrl

+                     kind        :: otel_instrument:kind(),
+                     value_type  :: otel_instrument:value_type(),
+                     unit        :: otel_instrument:unit() | undefined,
+                     callback    :: otel_instrument:callback() | undefined}).


Is this record meant to be publicly available outside the app? It is given its own type in otel_instrument so I'm assuming the record ought to be private (eg src/otel_metrics_rec.hrl which could include this file to keep the macro definitions public)

Issue is it is needed in the SDK app. So I'm not sure what is best to do since I'd rather not go the route of putting it all by a module for accessing each field.

Oh well we could cheat a bit I guess. Have the content is wrapped in a sort of:

-ifdef(OTEL_INTERNAL_MACRO). -record(instrument, ...) -endif.

and then you could include the same file but running a -define(OTEL_INTERNAL_MACRO, true) prior to inclusion gives you visibility.

I don't know if that's overkill though.

Yea, might be overkill. But might be worth it if we don't think of anything else.

For now though since this is in experimental I think we should move forward as is and can freely make the record private after this as long as its before it goes into the stable SDK.

apps/opentelemetry_api_experimental/src/otel_meter_provider.erl

apps/opentelemetry_api_experimental/src/otel_observable_counter.erl

apps/opentelemetry_experimental/src/otel_metrics.hrl

ferd · 2022-08-05T11:49:29Z

apps/opentelemetry_experimental/src/otel_aggregation_histogram_explicit.erl

+                                    min=infinity, %% works because any atom is > any integer
+                                    max=neg_infinity, %% requires an explicit check


Just so I'm understanding right, those min/max values are the default until any data is encountered, right? So it's normal for min to be bigger than max? I think semantically undefined would make most sense because it wouldn't be a legitimate in-band value (though given Erlang doesn't support infinite values in floats this is sort of a moot point)

Right, these are the defaults until any data is encountered.

Going with undefined probably makes the most sense. I had hopes that there would be something I'd find that could be used for max that would always be less than any number, but since there isn't... wait, just realized that even though Erlang has no number size limit the Otel protocol does, so technically we could use -2^64 here. hmm

The reason I wanted a term that I could compare with < is so it could be done in a matchspec -- and it means only one comparison needed, not one to check if its undefined and then to check if its <.

apps/opentelemetry_experimental/src/otel_aggregation_histogram_explicit.erl

ferd · 2022-08-05T12:26:07Z

apps/opentelemetry_experimental/src/otel_meter_server.erl

+    %% unless this is changed this needs to be switched to a link
+    %% also need to be able to recover who the readers are when the meter server restarts
+    %% discovery by looking at a supervision tree could be a good idea as it would allow for added
+    %% readers to be returned to the server after a crash


supervision tree would work though it can fail to restore fine if the server's failure lines up with a reader's failure; another idea is to have a parent-owned table recording which readers should exist, so in case of a restart the new server can restore state to where it should be?

If does rely on having stable predictable names though, and being able to map these to either processes, or supervisor entries.

apps/opentelemetry_experimental/src/otel_metric_exporter_pid.erl

apps/opentelemetry_experimental/src/otel_metric_reader.erl

apps/opentelemetry_experimental/src/otel_view.erl

ferd · 2022-08-08T21:03:09Z

apps/opentelemetry_experimental/src/otel_aggregation_histogram_explicit.erl

+                                    bucket_counts=zero_buckets(length(Boundaries)),
+                                    record_min_max=RecordMinMax,
+                                    min=infinity, %% works because any atom is > any integer
+                                    max=?MIN_DOUBLE,


The fact that the data model marks both the min and max value as optional does seem to mean we have to be careful not to use the MIN value unless you want it to be impossible to distinguish between a legit MIN and a placeholder (https://github.com/open-telemetry/opentelemetry-proto/blob/c5c8b28012583fda55b0cb16f73a820722171d49/opentelemetry/proto/metrics/v1/metrics.proto#L496-L501)

Is this logic referred to the collector or the exporter at this point?

We don't need to distinguish because this record is only created when a measurement is taken. If minmax are to be recorded it will be set at that time.

Which reminds me, explicit_histogram_checkpoint should only be number().

The only time this may be an issue is a delta temporality, which will reset to the default when taking the checkpoint. That isn't supported yet.

apps/opentelemetry_experimental/src/otel_metrics.hrl

Co-authored-by: Fred Hebert <[email protected]>

ferd

Approved on my end, though I'd like to see what @GregMefford has to say about it

tsloughter · 2022-08-11T00:18:15Z

Going to go ahead and merge. But it would be great if @GregMefford gets some time.

If @GregMefford and/or @bryannaegele are able to make the SIG meeting tomorrow we could also do a code walk through at that time.

Future PRs will be smaller.

github-actions bot added the language-erlang label Jul 19, 2022

tsloughter force-pushed the stable-metrics-sdk branch from 3c9c024 to b7328db Compare July 19, 2022 13:17

tsloughter marked this pull request as ready for review July 19, 2022 20:53

tsloughter requested a review from a team July 19, 2022 20:53

github-actions bot added the scope-ci label Jul 19, 2022

tsloughter force-pushed the stable-metrics-sdk branch from b16baec to c131250 Compare July 19, 2022 22:15

ferd reviewed Aug 4, 2022

View reviewed changes

GregMefford self-requested a review August 4, 2022 16:08

ferd reviewed Aug 5, 2022

View reviewed changes

tsloughter requested a review from ferd August 7, 2022 23:50

tsloughter force-pushed the stable-metrics-sdk branch from 12231e7 to 1baebe3 Compare August 8, 2022 18:16

ferd reviewed Aug 8, 2022

View reviewed changes

apps/opentelemetry_experimental/src/otel_metrics.hrl Outdated Show resolved Hide resolved

tsloughter force-pushed the stable-metrics-sdk branch from 1baebe3 to da6a480 Compare August 8, 2022 21:16

Tristan Sloughter added 14 commits August 9, 2022 17:09

start on new metrics API

f2f172e

initial metrics API rewrite for stable spec

d99df9c

add gauge support

c1a053f

inefficient explicit bounds histogram

97a7926

move view aggregations to only ets table

1a92e31

use per-reader metric storage

0362ea5

move to single active_metric record for value in ets table

0a8c95f

support options to view aggregations

3b9b4b9

metrics: use per-reader ETS tables

e673217

add test for per-reader default aggregations

696289d

use previous instrument matches to find aggregations for measurements

7ad0cbb

rm create_ prefix to make clear it has no side effect to get instrument

fc64849

ignore metric values of differnet type than the instrument

f920cd3

update metrics console exporter to support new reader metrics datapoints

19e39ef

tsloughter and others added 12 commits August 9, 2022 17:10

Update apps/opentelemetry_experimental/src/otel_metric_reader.erl

534ff0a

Co-authored-by: Fred Hebert <[email protected]>

add logging on bad instrument use

4459ed3

pr cleanup: use async timer cancel and add value_or function

4ff6d33

metrics pid exporter: tag wrapping metric message, default: otel_metric

d0ce967

improve type specs in metrics code

3632353

add note about improving explicit histogram aggregate function

d6f419d

use -2^63 for the initial min value in explicit histogram

d0820fe

bump rebar3 to 3.19.0 in CI

c922c01

fix reader configuration type spec

4cf7e03

bump elixir to 1.12.3 in CI

8087b9b

remove unneeded config in github actions

a9093da

force update of rebar3 by mix

0ba607d

tsloughter force-pushed the stable-metrics-sdk branch from 379b27d to 0ba607d Compare August 9, 2022 23:14

clear cache in failing mix cover step

255823f

tsloughter force-pushed the stable-metrics-sdk branch 6 times, most recently from 0153159 to 4a29466 Compare August 10, 2022 12:19

ferd reviewed Aug 10, 2022

View reviewed changes

ferd approved these changes Aug 10, 2022

View reviewed changes

Tristan Sloughter added 3 commits August 10, 2022 10:35

mix install locally available rebar3

edaa404

bump otp 23 version in CI to 23.3.4.16

2297077

only run dialyzer on latest supported OTP

6b8885b

tsloughter force-pushed the stable-metrics-sdk branch from 0d8865e to 6b8885b Compare August 10, 2022 16:35

use rebar3 3.17.0 in elixir tests

352cfad

tsloughter merged commit 2a700f8 into open-telemetry:main Aug 11, 2022

tsloughter deleted the stable-metrics-sdk branch August 11, 2022 00:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stable metrics sdk #412

Stable metrics sdk #412

tsloughter commented Jul 19, 2022

codecov bot commented Jul 19, 2022 •

edited

Loading

tsloughter commented Jul 19, 2022

ferd left a comment

ferd Aug 4, 2022

tsloughter Aug 4, 2022

ferd Aug 4, 2022

ferd Aug 4, 2022

tsloughter Aug 7, 2022

ferd Aug 5, 2022

tsloughter Aug 5, 2022

ferd Aug 5, 2022 •

edited

Loading

ferd Aug 8, 2022

tsloughter Aug 8, 2022

ferd left a comment

tsloughter commented Aug 11, 2022

		min=infinity, %% works because any atom is > any integer
		max=neg_infinity, %% requires an explicit check

Stable metrics sdk #412

Stable metrics sdk #412

Conversation

tsloughter commented Jul 19, 2022

codecov bot commented Jul 19, 2022 • edited Loading

Codecov Report

tsloughter commented Jul 19, 2022

ferd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ferd Aug 5, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ferd left a comment

Choose a reason for hiding this comment

tsloughter commented Aug 11, 2022

codecov bot commented Jul 19, 2022 •

edited

Loading

ferd Aug 5, 2022 •

edited

Loading