Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stable metrics sdk #412

Merged
merged 44 commits into from
Aug 11, 2022
Merged

Conversation

tsloughter
Copy link
Member

Opening as a draft because I'm not 100% on the direction of this but I think it is close enough to start getting feedback.

There will be a number of tasks after this PR is merged, like supporting floats in the aggregations, filtering attributes in views and adding the async/observable metrics, in addition to many clean up tasks.

Important things to review in this PR are design decisions. One being that each measurement is a message. I do not like, and don't think we can keep, the message going to the otel_meter_server because that also handles other messages and should remain responsive for those. I will likely create a process, with the possibility in the future to be a pool of processes assuming aggregations remain safe to act concurrently.

The Reader process isn't properly handled yet either because it is monitored by the otel_meter_server and not linked but the ets tables it uses are owned by the server. Either the reader needs to be able to run on its own or it should be linked. I'd prefer the former.

Also getting an Instruments no longer requires a call to the server process. Initially I thought it best if it did so that it could be matched to views at that point and not on each measurement but this is likely not useful, and certainly not useful compared to not having to make a call to get an Instrument. Note that future view matching could include attributes in the criteria, so then setting up matches based only on the Instrument isn't even an option.

An issue is that views can be added at any time and after an instrument is matched once (also an issue if we were to support criteria that matches on attributes) it won't be matched again, meaning views added after the first time an instrument is used will never be matched.

Note that filtering attributes is different from using them as criteria. Filtering still needs to be added but needs to be there before this moves out of experimental.

Not sure if the matchspec in checkpoint is efficient but I think checkpoint should be working against all metrics at once instead of one at a time, so it likely will be removed anyway.

@codecov
Copy link

codecov bot commented Jul 19, 2022

Codecov Report

Merging #412 (352cfad) into main (6990ff3) will decrease coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
- Coverage   73.80%   73.76%   -0.05%     
==========================================
  Files          53       53              
  Lines        1680     1681       +1     
==========================================
  Hits         1240     1240              
- Misses        440      441       +1     
Flag Coverage Δ
api 68.99% <ø> (-0.12%) ⬇️
elixir 18.83% <ø> (-0.04%) ⬇️
erlang 75.23% <ø> (ø)
exporter 73.59% <ø> (ø)
sdk 78.95% <ø> (ø)
zipkin 53.84% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
apps/opentelemetry_api/lib/open_telemetry/span.ex 25.00% <0.00%> (-1.32%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@tsloughter tsloughter force-pushed the stable-metrics-sdk branch from 3c9c024 to b7328db Compare July 19, 2022 13:17
@tsloughter tsloughter marked this pull request as ready for review July 19, 2022 20:53
@tsloughter tsloughter requested a review from a team July 19, 2022 20:53
@tsloughter
Copy link
Member Author

Eh, marking as ready for review. I don't think anything major needs to be changed before merging since it is all in experimental. So I'd like to get reviews and merge this so I can be doing smaller future PRs that are easier to review.

Copy link
Member

@ferd ferd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started with a review of the API, I'll try to do the internals a bit later this week.

apps/opentelemetry_api_experimental/src/otel_counter.erl Outdated Show resolved Hide resolved
apps/opentelemetry_api_experimental/src/otel_histogram.erl Outdated Show resolved Hide resolved
kind :: otel_instrument:kind(),
value_type :: otel_instrument:value_type(),
unit :: otel_instrument:unit() | undefined,
callback :: otel_instrument:callback() | undefined}).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this record meant to be publicly available outside the app? It is given its own type in otel_instrument so I'm assuming the record ought to be private (eg src/otel_metrics_rec.hrl which could include this file to keep the macro definitions public)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue is it is needed in the SDK app. So I'm not sure what is best to do since I'd rather not go the route of putting it all by a module for accessing each field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh well we could cheat a bit I guess. Have the content is wrapped in a sort of:

-ifdef(OTEL_INTERNAL_MACRO).
-record(instrument,  ...)
-endif.

and then you could include the same file but running a -define(OTEL_INTERNAL_MACRO, true) prior to inclusion gives you visibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if that's overkill though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, might be overkill. But might be worth it if we don't think of anything else.

For now though since this is in experimental I think we should move forward as is and can freely make the record private after this as long as its before it goes into the stable SDK.

@GregMefford GregMefford self-requested a review August 4, 2022 16:08
apps/opentelemetry_experimental/src/otel_metrics.hrl Outdated Show resolved Hide resolved
apps/opentelemetry_experimental/src/otel_metrics.hrl Outdated Show resolved Hide resolved
Comment on lines 41 to 42
min=infinity, %% works because any atom is > any integer
max=neg_infinity, %% requires an explicit check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I'm understanding right, those min/max values are the default until any data is encountered, right? So it's normal for min to be bigger than max? I think semantically undefined would make most sense because it wouldn't be a legitimate in-band value (though given Erlang doesn't support infinite values in floats this is sort of a moot point)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, these are the defaults until any data is encountered.

Going with undefined probably makes the most sense. I had hopes that there would be something I'd find that could be used for max that would always be less than any number, but since there isn't... wait, just realized that even though Erlang has no number size limit the Otel protocol does, so technically we could use -2^64 here. hmm

The reason I wanted a term that I could compare with < is so it could be done in a matchspec -- and it means only one comparison needed, not one to check if its undefined and then to check if its <.

%% unless this is changed this needs to be switched to a link
%% also need to be able to recover who the readers are when the meter server restarts
%% discovery by looking at a supervision tree could be a good idea as it would allow for added
%% readers to be returned to the server after a crash
Copy link
Member

@ferd ferd Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supervision tree would work though it can fail to restore fine if the server's failure lines up with a reader's failure; another idea is to have a parent-owned table recording which readers should exist, so in case of a restart the new server can restore state to where it should be?

If does rely on having stable predictable names though, and being able to map these to either processes, or supervisor entries.

apps/opentelemetry_experimental/src/otel_metric_reader.erl Outdated Show resolved Hide resolved
apps/opentelemetry_experimental/src/otel_metric_reader.erl Outdated Show resolved Hide resolved
apps/opentelemetry_experimental/src/otel_view.erl Outdated Show resolved Hide resolved
@tsloughter tsloughter requested a review from ferd August 7, 2022 23:50
bucket_counts=zero_buckets(length(Boundaries)),
record_min_max=RecordMinMax,
min=infinity, %% works because any atom is > any integer
max=?MIN_DOUBLE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that the data model marks both the min and max value as optional does seem to mean we have to be careful not to use the MIN value unless you want it to be impossible to distinguish between a legit MIN and a placeholder (https://github.com/open-telemetry/opentelemetry-proto/blob/c5c8b28012583fda55b0cb16f73a820722171d49/opentelemetry/proto/metrics/v1/metrics.proto#L496-L501)

Is this logic referred to the collector or the exporter at this point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to distinguish because this record is only created when a measurement is taken. If minmax are to be recorded it will be set at that time.

Which reminds me, explicit_histogram_checkpoint should only be number().

The only time this may be an issue is a delta temporality, which will reset to the default when taking the checkpoint. That isn't supported yet.

@tsloughter tsloughter force-pushed the stable-metrics-sdk branch 6 times, most recently from 0153159 to 4a29466 Compare August 10, 2022 12:19
Copy link
Member

@ferd ferd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved on my end, though I'd like to see what @GregMefford has to say about it

@tsloughter
Copy link
Member Author

Going to go ahead and merge. But it would be great if @GregMefford gets some time.

If @GregMefford and/or @bryannaegele are able to make the SIG meeting tomorrow we could also do a code walk through at that time.

Future PRs will be smaller.

@tsloughter tsloughter merged commit 2a700f8 into open-telemetry:main Aug 11, 2022
@tsloughter tsloughter deleted the stable-metrics-sdk branch August 11, 2022 00:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants