Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distribution metric to semantic-core and ffwd-reporter #79

Merged
merged 13 commits into from
Sep 9, 2020
Merged

Add distribution metric to semantic-core and ffwd-reporter #79

merged 13 commits into from
Sep 9, 2020

Conversation

ao2017
Copy link
Contributor

@ao2017 ao2017 commented Aug 21, 2020

PR raison d'être

We are adding distribution support to resolve 2 problems.
1. Accurate Aggregated Histogram Data.
Currently percentile data provided by heroic are computed locally.
It is practically impossible to aggregate that data.
With this PR we are introducing a mechanism to record data and send data sketches
to heroic. A sketch of a dataset is a small data structure that let you approximate certain characteristics of the original
dataset. We use sketches to compute rank based statistics such as percentile. Sketches are mergeable and can be used
to compute any percentile on the entire data distribution.

2. Support sophisticated data point value.
We currently have many libraries that emit Open-census metrics. We cannot export Open-census metric to heroic because heroic supports only double data point value. With the introduction of Distribution we will be able to support sophisticated data point value such as Open-census metric distribution.

What's new ?
This PR introduces Distribution. Unlike most metric supported by SemanticMetric this metric is not a DropWizard metric.
However, Distribution implements DropWizard Metric and Counting interface. This is done so we can manage Distribution metric through Semantic Registry which is an extension of DropWizard Registry.

This PR also introduces SemanticMetricDistribution. SemanticMetricDistribution is an implementation of Distribution backed by Tdigest. Check the links below for the rational of using Tdigest.

This PR also refactors FastForwardReporter to add distribution support.

All efforts were made to ensure compatibility. This changes should be an Opt-in for Heroic Users. They shouldn't have to make any change to existing code unless they want to use distribution. At least that is the goal.

Related Issues and PR's
spotify/heroic#476
spotify/ffwd-client-java#13
#78
spotify/heroic#673
spotify/heroic#677

@ao2017 ao2017 requested a review from malish8632 August 21, 2020 13:13
@jsferrei jsferrei removed their request for review August 21, 2020 13:17
@ao2017 ao2017 requested review from sjoeboo and sming and removed request for spkrka August 21, 2020 13:18
@ao2017 ao2017 requested a review from smstone August 21, 2020 13:30
@sming
Copy link

sming commented Aug 29, 2020

This change add new interfaces and extend existing one. All efforts were made to ensure compatibility. Users shouldn't have to make any change to existing code unless they want to use distribution. At least that is the goal.

Hey @ao2017 - could you provide a Use Case-like description please? With links to the actual US/Issue and supporting links e.g. T-digest, codehale.

Also, I'm quite confused about SemanticMetricRegistryListener / MetricRegistryListener / SemanticMetricRegistryListenerV2. The first is in the README but the latter two are not. Could you help me understand what's the plan with these 3 interfaces? thanks.

/**
* A no-op implementation of {@link MetricRegistryListener}.
*/
abstract class Base implements MetricRegistryListener {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call this NoOpMetricRegistryListener, or something else meaningful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That 's the original name Let me see if changing the name won't brake anything.

* @param name the meter's name
* @param derivingMeter the meter
*/
void onDerivingMeterAdded(String name, DerivingMeter derivingMeter);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is a distribution a Deriving Meter? I can't quite tell. Here's the definition of a deriving meter:

 * A meter that takes the derivative of a value that is expected to be almost monotonically
 * increasing. A typical use case is to get the rate of change of a counter of the total number of
 * occurrences of something.
 * <p>
 * Implementations will ignore updates that are a decrease of the counter value. The rationale is
 * that the counter is expected to be monotonically increasing between infrequent resets (such as
 * when a process has been restarted). Thus, negative values should only happen on restart, and it
 * should be safe to discard those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it is not. Distribution captures the recorded data distribution. It is used to compute percentile in a distributed environment. Can one use distribution to compute rate ? I doubt it won't be accurate since distribution is just capturing an approximation not the actual count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do distributions relate to histograms? https://github.com/infusionsoft/yammer-metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Histogram.java

Do we need to have both? Is the newly added Distribution a superset of histogram?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So distribution is a name for the type of Histogram that computes rank related stat using actual data distribution.
Dropwizard histogram doesn't support that feature. For more info, take a look at this issue spotify/heroic#476

Do we need to have both ?
At the moment we think yes. Some users may have a preference for local histogram. And I think it is the standard in the industry to support both at least DataDog and OpenTelemetry support both types.

@@ -107,6 +108,29 @@ public void onTimerRemoved(String name) {
tryRemove(name);
}


@Override
public void onDerivingMeterAdded(String name, DerivingMeter derivingMeter) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about onHistogramRemoved() and onHistogramAdded() above. Are they a different "kind" of histogram or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand this question :).

}

/**
* Returns the current count.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not terribly useful :) Either remove or explain what's being counted, cheers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, something is better than guessing :) . I will update.

private final AtomicReference<TDigest> distRef;

protected DistributionImpl() {
this.distRef = new AtomicReference<>(TDigest.createDigest(COMPRESSION_DEFAULT_LEVEL));
Copy link

@sming sming Aug 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the rationale behind using an AtomicReference? With AtomicRefernce, the only guarantee is that this reference will be updated/set in a thread-safe manner and AFAICT we don't update it...?

Also, AFAICT, multiple threads could still overwrite this.distRef here, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's ok, As long as each thread has its own copy we are good.


@Override
public void record(double val) {
distRef.get().add(val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TDigest.add() does not appear to be threadsafe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am looking into this. I checked couple of Dropwizard metrics there are threadsafe so I will make this thread safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could either

  • add synchronized around it and hope it's enough.
  • Use ThreadLocal and then merge all the thread local entries somehow inside getValueAndFlush
  • Push to a queue and have another thread popping them off and adding to the TDigest object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the implementation I added yesterday. I thought about adding threadLocal then change mind. ThreadLocal may introduce memory leak if getValueAndFlush isn't called often.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that should work now I think (but it's hard to know if there will be much contention, some multithreaded JMH benchmarks might be useful).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sure will do. Please take a look I need a +1

Copy link
Member

@mattnworb mattnworb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(@malish8632 asked me to take a look since I've worked with these classes before)

The main question I have is - what is a Distribution? I think the initial PR summary is missing the motivation behind the addition or what it means to achieve (except adding new classes).

Is it necessary to have a separate interface and implementation for Distribution and DistributionImpl, when the other core types (Timer, Meter, Histogram) don't follow this pattern? Those classes implement interfaces, but those are more generic interfaces that have multiple implementations (i.e. Timer and Meter are implementations of Metered) which allows for abstraction in some places (and code reuse). I don't see the similar need for Distribution unless you plan to add more implementations of it later on, in which case DistributionImpl could use a more descriptive name.

A few suggestions/questions below - I don't think you need to go to the trouble of declaring V2 interfaces or new packages, it'd be simpler to update the version of the library to indicate changes to the API of SemanticMetricRegistry and friends.

* under the License.
*/

package com.spotify.metrics.core.codahale.metrics.ext;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add the new Distribution type to this package, instead of com.spotify.metrics.core? There are interfaces in core that refer to this type, so I think it would be natural to have it live in com.spotify.metrics.core as well

If you want to add it in a subpackage, I'd still recommend against this specific name, its not like any classes live in or com.spotify.metrics.core.codahale com.spotify.metrics.core.codahale.metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattnworb I don't see how SemanticMetricRegistry.addListener(SemanticMetricRegistryListener). is broken with this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with Matt - could we use com.spotify.metrics.core instead?

@@ -396,4 +338,101 @@ public void stop() {
public void close() {
stop();
}

public static final class Builder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd probably be better to make separate PRs for changes like reordering the structure of the class or whitespace changes, as it makes the git history harder to follow if its combined with this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this comment - pls separate this into diff PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

  1. Remove SemanticMetricRegistryListenerV2
  2. Added Default methods to SemanticMetricRegistryListener
  3. We understand that there is a chance for SemanticMetricRegistry to call a no op methods.
     This can lead to confusion hopefully users of this lib will look at the source code or documentation.
Copy link
Contributor

@malish8632 malish8632 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better. Some changes should help simplify this PR and clarify type packages.

Copy link
Member

@mattnworb mattnworb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for resolving all of my earlier comments, and I find the PR summary very helpful now - thanks for both of those.

I have a few small questions below

* is reset and a new recording starts.
* @return
*/
ByteBuffer getValueAndFlush();
Copy link
Member

@mattnworb mattnworb Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still curious about the idea of exposing the "value" of the Distribution as a ByteBuffer. It seems like it would couple this Distribution metric to the way in which the t-digest library serializes the underlying com.tdunning.math.stats.TDigest into bytes.

For instance, if there was an alternate implementation of Distribution in the future, it would have to return the values in the same byte format, since FastForwardReporter has the expectation that the results of getValueAndFlush() must look a certain way.

Is there any other data type that could be used? Or at least could the format of the bytes be documented here?

Copy link
Contributor Author

@ao2017 ao2017 Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use Distribution as the return type then the serializer will have to know something about Distribution implementation. For instance, FFW reporter will have an implementation of this method for every Distribution type.
https://github.com/spotify/semantic-metrics/pull/79/files/4ae53fe81b12e35376529ab79184798e982e7ade#diff-455c8027b591558e13ad0a4d245b2518R265. That will not align with the current paradigm of not exposing the detail of the metric implementation to the serializer.

Documentation for bytes format is handled by FFW. There is a version number, we will use that number upstream to determine how to handle the bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Return something immutable like ByteString
  2. Instead of ByteString, return something more semantically meaningful like a DistributionSnapshot wrapper

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I am not sure about DistributionSnapshot, that will expose Distribution implementation to ffwReporter as noted above.
  2. Let's merge this branch. Bytebuffer is transformed into ByteString before serialization in ffw-client. I will change the return type later.

Copy link
Contributor

@malish8632 malish8632 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mattnworb mattnworb dismissed their stale review September 4, 2020 20:43

my concerns were addressed

private void reportDistribution(final com.spotify.ffwd.v1.Metric metric,
final Distribution distribution) {
ByteBuffer byteBuffer = distribution.getValueAndFlush();
Value value = Value.distributionValue(byteBuffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed this PR: spotify/ffwd-client-java#10
Can we update the ffwd API to use ByteString instead of ByteBuffer?
I think it should be safe to change that if no one is using the distribution types yet (since they are not implemented yet)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the entire PR you will see that ByteBuffer is transformed into ByteString before serialization. Moving byteString up in the stack is something we can do in the next PR.

@@ -0,0 +1,88 @@
/*
* Copyright (c) 2016 Spotify AB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update year

@ao2017 ao2017 merged commit 6244f29 into spotify:master Sep 9, 2020
Copy link

@sming sming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments @ao2017 .

* Just get an instance through SemanticMetricBuilder and record data.
*
* <p> {@link Distribution} is a good choice if you care about percentile accuracy in
* a distributed environment and you want to rely on P99 to set SLO.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent comment!

* From P99.9 to P99.999 the error rate is slightly higher than 2%.
*
*/
public final class SemanticMetricDistribution implements Distribution {

private static final int COMPRESSION_DEFAULT_LEVEL = 100;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might we want to change this level? Should it be configuration-driven perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the time being it is better to keep every aspect of Tdigest private. Power users have the ability to extend this class and use a different compression level. The typical use case 100 is ok.

@@ -93,6 +93,16 @@ public void onDerivingMeterAdded(MetricId name, DerivingMeter derivingMeter) {
@Override
public void onDerivingMeterRemoved(MetricId name) {
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not a fan of this class name. Is it convention?

private TagExtractor tagExtractor;
private ScheduledExecutorService executorService;

private Set<Percentile> histogramPercentiles =
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these particular numbers? Can we comment or const them please.

public Builder histogramQuantiles(double... quantiles) {
histogramPercentiles = new HashSet<>();
for (double q : quantiles) {
histogramPercentiles.add(new Percentile(q));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're not validating that the double is in range?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants