-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop histogram aggregation, default to explicit bucket histogram #2429
Conversation
To make sure I understand correctly:
Alternatively, users can always explicitly specify that they either want an explicit histogram with buckets that they themselves set up or an exponential histogram The default behavior (explicit bucket histogram with buckets as specified in the spec) will never change, this way the SDK will never produce incompatible data when updating SDK versions. All histograms, if not explicitly set to something else, will always be fixed bucket ones, and look exactly like defined in the spec today. Did I misunderstand something here? If we merge this change, we can never specify the ExponentialHistogram as the default for |
Yes that is what I'm suggesting.
Correct. The assertion is that SDKS shouldn't ever change the default behavior because doing so may be a breaking change for users. For example, backends might rely on explicit bucket histogram and not support exponential histogram, or some user experience might rely on the default bucket boundaries of explicit bucket histogram. In the java SIG we've decided that this type of change would be unacceptable and have gotten rid of the "best available" |
Here is what we should consider: We want to have "Hint API" (e.g. #1753 which was abandoned to scope down the initial spec release) in the Metrics API specification. And it is very likely that the instrumentation authors would want to provide the hint saying "by default this histogram should use base-2 exponential buckets", and we want to the SDK to respect the hint - unless the application owner explicitly specified "I WANT to use explicit buckets". And I think there is a way to avoid breaking change:
|
Adding the Hint API might still break some scenarios. Consider the following:
Do we think this is okay, since it happens during a package-upgrade process? It would be a change of output of the system without any explicit code change (except bumping package versions). Is this considered a breaking change, even if the code still compiles and runs? TL;DR: As far as I can tell, the fundamental question to address here is: Do we want to keep flexibility to change the default behavior in later releases? If the answer is No, then we will have to be very careful how we add Hints, or be very clear to make sure folks double check their instrumentation and add explicit defaults where necessary when the Hint API comes around, so their expected outputs don't change. |
@pirgeo Similar questions came up in making recommendations about handling duplicate instrument conflicts in #2317. The data model has this section that defines duplicates conflicts to occur for a single Resource, which to me allows point kind to change across runs of the same application through View reconfiguration. If we excluded Resource from that statement, it would prevent you from ever changing an aggregation without being considered a conflict. I think the fact that we have a Views API means we support the idea of changing Aggregation across process starts, but we want to ensure the defaults do not change unexpectedly, hence I support this PR. |
This seems okay to me. As user updated the version of the library, he can be affected by any changes in that library - not just limited to addition of Hints.. To be clear, user is not bumping the package version of the OTel SDK - if all they did was bump SDK version and behavior changes, thats a different story. |
@jmacd I agree that we do not want the defaults to change unexpectedly. I am trying to point out (and make sure I understood that correctly) that by merging this change we will bar ourselves forever from using any HistogramAggregation except the one specified today as the default without calling it a breaking change. Once this change is in, the default is an explicit bucket histogram as defined today, forever. If I want something else, I can just add a view, and specify the preferred aggreation. As soon as I specified something explicitly, the default doesn't apply anymore (which totally makes sense to me). This also means that when the Hint API is added, the default behavior cannot change - if I don't specify anything for my Histogram, it should always export the default bucket boundaries.
I think this might be the case for the version bump that introduces Hints: Take the following scenario (and I am not sure if its a valid one):
--> no code change, just a bump in SDK version and the output behavior changes. This of course may only ever happen once, when Hints are added to the SDK, so I am not sure how valid this example is, but that's why I am asking whether this is considered a breaking change. It feels to me like the Hint would allow us to have "conditional defaults", but if we set the default in stone here, I am not sure we can use those "conditional defaults" any more. @reyang do you see any problems with this? |
Here is my take: Scenario 1
I consider this as a breaking change in the instrumentation library, and the instrumentation library author should either:
OR
Scenario 2
I think this is in the gray area and personally I don't treat it as a breaking change in the SDK. Here is my analogy - if the customer is using an old SDK which does not understand certain environment variable, and later upgraded to a new SDK which starts to understand that environment variable. If that specific environment variable has been existing (even before the old SDK was used), and the upgrade triggered side effect (e.g. the environment variable is saying "disable sampling"), do we call that "breaking change"? |
@reyang I don't think it's possible to use hints to control histogram vs exponential. The exported data for both is completely different So whether to use histogram vs exponential is part of the contract between the app and backend, instrumentation could never pick appropriately between those. Hints could provide buckets for use the the explicit bucket histogram because that doesn't change the exported data format itself though - if the user had configured exponential histograms, then the hint would be ignored appropriately. Perhaps the solution will be for metric exporters to define their preferred histogram aggregation, which is used by the SDK, similar to the temporality preference. |
After reading the PR and the discussions here, I feel we might have a hidden problem:
The PR makes it clear that the default should be BUT: With the start of discussions on the "Hint API" I feel that it defeats the purpose of the changes proposed here.
Point two (emphasis mine), IMHO contradicts what's proposed here. If we agree this PR makes sense, to "shield" from possible SDKs changing implementations from one version to the next, then why it's ok to allow Hints changing the behavior after? Even worse, it's widening the scope of introducing breaking changes to not only the SDK now, but to any other instrumentation library in the future. Consider a case where I have a repo with 30+ microservices, all configured with the default as of now. By following point 2 above, I, as the application owner will have to change all my 30 services, just to stay on the current behavior. Thinking on the end-user experience of OTel, I don't think I would be happy if this was happening to me. The Hint Api might still be useful, for example if I don't know much about histograms and if the instrumentation library offers me a better alternative it's great. But IMHO it should be my decision still to do it, because as @jack-berg mentioned, it can break many things. If I get logs telling me, "hey you might want to use exponential histograms or xyz buckets to get better metrics", I can plan this ahead and work on it later. |
This is something we will need to explore/discuss once we started working on Hint API. From the API perspective both explicit bucket and exponential bucket histogram might share the same instrument type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to focus on @anuraaga's remarks above:
@reyang I don't think it's possible to use hints to control histogram vs exponential. The exported data for both is completely different
The "completely different" remark is doubtful, because (a) there is a non-lossy conversion from the exponential histogram data point to an explicit bucket histogram data point, and (b) a simple reconfiguration of explicit boundaries also makes for "completely different" data the way it is commonly used. Changing explicit buckets is likely to be a problem, in other words, yet this is something we want to be configurable.
I think we should expect "Compliant" consumers of OTLP to accept exponential histogram data; consumers are free to convert the data and downsample into explicit bucket histogram format as needed.
Speaking as a vendor, we will support both formats some time after the SDK specification lands.
So whether to use histogram vs exponential is part of the contract between the app and backend, instrumentation could never pick appropriately between those. Hints could provide buckets for use the the explicit bucket histogram because that doesn't change the exported data format itself though - if the user had configured exponential histograms, then the hint would be ignored appropriately.
We shouldn't debate what Hints are since they aren't formally proposed in any specification document.
Perhaps the solution will be for metric exporters to define their preferred histogram aggregation, which is used by the SDK, similar to the temporality preference.
Exporters already define the default Aggregation for each instrument type (from #2404). The spec for MetricReader reads The default output aggregation (optional), a function of instrument kind. If not configured, the default aggregation SHOULD be used.
I support your premise that, like Temporality, the choice of Aggregation is something the exporter should have control over, however this appears to be in a "limiting" fashion.
The SDK/Prometheus exporter today would supply the explicit bucket Histogram aggregator as the default when constructing the corresponding MetricReader.
The SDK/OTLP exporter will need (like Temporality) an option to choose the desired default histogram aggregation. This debate is about whether the default choice can later change for the OTLP exporter (particularly the SDK auto-configured one), and if so how. I agree with @reyang that this will require a major version number increase for OTel SDKs to accomplish, but I don't expect this will happen until a Collector's Prometheus Remote Write exporter has a way to write exponential histogram data points.
FWIW when it comes to "Hints" I see them as the programmatic equivalent of entering Views; however, I think @anuraaga is correct that changing the type of histogram aggregation is probably an Exporter-wide decision, like Temporality, and I (personally) would not expect a user to mix the two styles of histogram in a single session.
This leads me to suggest a different fix: Where the View clause currently includes an aggregation
, this is not actually a configurable parameter. Instead, the View aggregation
is just a section heading for the configuration that follows. This means you can't change the underlying aggregator type chosen for Histogram instruments through Views (or Hints), only provide configuration for relevant aggregator choices that will take effect when the aggregation
configured by the MetricReader matches.
+1 💯 |
I can't imagine we could make this change in a way that is backwards compatible with an initial stable release. This does seem like a good way to think about how to interpret aggregation options from a future hint API.
I agree with this. Perhaps there's a way to retain the flexibility to change the histogram aggregation for OTLP export while giving users a way to opt in to consistency, and always using explicit bucket histograms for prometheus export. What if we:
|
This doesn't seem appropriate if a version of the spec is released with a data model that includes stable explicit bucket histograms but experimental exponential histograms. We'd expect a system that targets a stable version of the spec to be compliant, and any additive changes after that without a major version change could never be expected for compliance. My reading of the spec is that it's acceptable for a backend to only read the explicit histogram field of an incoming OTLP message. But it is admittedly not clear how to reconcile the data model doc and protocol doc when deciding that, perhaps experimental in data model but stable in OTLP means we do actually require backends to support exponential histograms to be compliant. It's not my first reaction though.
I still disagree with this sort of dynamic approach since it requires the user to know what that best histogram is actually set to so they can confirm their backend reads that field from the OTLP message, at which point they could have just specified the histogram type itself. Unless we really double down on the statement that all OTLP receiving backends MUST read exponential histograms, and are OK with users having confusing problems if they run into a backend that does not comply with this. It would be good to document this, it probably means that backends only look at the proto to determine requirements, not the data model, and are compliant if they support all fields in the first stable release of a proto file. This is probably fine but I can't find it mentioned explicitly right now. |
Doesn't this lead to the same situation where, if users don't explicitly state that they want More generally, and after thinking about it for a little while longer I think I agree with @anuraaga. Are there scenarios where we want the histogram aggregation for a metric to change over time? The two different histograms are intended for somewhat different use-cases: The exponential one when I need a specific error bound, and the explicit one when I want to specify buckets. I think I have been convinced that I don't want them changing, and definitely not without my explicitly stating it. The more I think about it, the more I believe that the actual type of histogram has to be decided depending on its use-case, and there is no catch-all. Which is probably why we have two histogram types and this wording in the first place. Of course we will have to provide a default, but I am not sure what the right default is. In the interest of compatibility, the currently specced out explicit histogram would make sense, however, its layout will probably fail for a lot of cases (e.g. recording double values between 0 and 1). Are there users who want all of their histograms to be exponential? My guess is yes, so we should give an option to set it explicitly. If we change it in the spec, we are potentially breaking or at least changing the data format for folks with backends that don't support exponential histograms. |
I don't know the backstory here - the inconsistency between the data model and the protos is strange.
Yes it would. It relies on the assumption that OTLP receivers should work with either explicit or exponential histograms, and thus the ambiguity of which histogram is used isn't relevant. The idea of opting into a stable histogram was an attempt to be able to retain some sort of ability to use exponentials as the default, since I think exponential histograms will ultimately prove superior to explicit bucket histograms in their ability to represent a very different data distributions without configuration. With that said, the points that @pirgeo and @anuraaga make about a predictable histogram also resonate with me. While I think the ideal situation would be if we had exponential histograms as a stable part of the initial release and used as the default, I suppose a reasonable compromise would be to ensure that once exponential histograms are available, that there's an easy way to opt into them as the default. |
We spoke at the 4/5 Specification SIG and there appeared to be general consensus that retaining the "best possible histogram aggregation" whose definition is subject to change is a bad experience for users and something we should not have. The hesitation around removing "best possible histogram aggregation" stems from the belief that once available, exponential histograms will likely be a better default histogram option. We discussed that it would be valuable to add an option to make exponential histogram the default aggregation for the OTLP exporter. I've opened #2471 to capture this sentiment, which hopefully unblocks this PR. |
The discussion in today's Spec SIG leads to the following course of action:
|
@jmacd can you remove your change request? |
@reyang I like that you keep trying 👍😂 |
Resolves #2382.
Now that metrics is marked as stable, we're stuck with the imprecise
histogram
aggregation which gives SDKs the latitude to change implementations from one version to the next, potentially breaking backend integrations.I suggest we remove the histogram aggregation, and default histogram instruments to explicit bucket histogram.
Later once exponential histogram aggregation is specified, we can add back the histogram aggregation with its current "best available" semantics. However, we should have users opt-in to this uncertainty rather than making it the default.