Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[metrics-generator] filter out spans based on policy #2274

Merged
merged 44 commits into from
May 2, 2023

Conversation

zalegrala
Copy link
Contributor

@zalegrala zalegrala commented Mar 29, 2023

What this PR does:

Here we implement an approach to filtering out spans based on a policy, loosely based around the OTEL collector filterspan config format.

Which issue(s) this PR fixes:
Fixes #1482

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

cmd/tempo/app/config.go Outdated Show resolved Hide resolved
@knylander-grafana
Copy link
Contributor

Thank you for adding documentation!

Copy link
Contributor

@knylander-grafana knylander-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs look good! Thank you for adding them.

@zalegrala zalegrala force-pushed the metricsGeneratorFiltering branch from bf1a64f to dcdd76f Compare April 6, 2023 20:53
@zalegrala zalegrala marked this pull request as ready for review April 11, 2023 16:48
@zalegrala zalegrala force-pushed the metricsGeneratorFiltering branch 2 times, most recently from bca1054 to 126f390 Compare April 12, 2023 15:56
cmd/tempo/app/config.go Outdated Show resolved Hide resolved
modules/generator/overrides.go Outdated Show resolved Hide resolved
spanMetricsCallsTotal registry.Counter
spanMetricsDurationSeconds registry.Histogram
spanMetricsSizeTotal registry.Counter
spanMetricsFilterDropsTotal registry.Counter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason we're pushing this and not just recording it as a normal prometheus metric in tempo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, good call out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke too soon. For a generator instance, we don't have the ID available currently, so to create a metric where we include the tenant label isn't feasible without a bit of refactor. This currently pushes the metric to the remote endpoint and would increase the series count, so perhaps this isn't something we want to do. Though, if we host the metric on the generator itself, then we'd have access to know which tenants would be filtering which spans, but likely this wouldn't have the desired value for folks who don't have access to those metrics. I.e. if a cloud user has a filter, they wouldn't be able to see how many spans are being rejected by their filter, which seems like the primary utility. I'm a little torn on even including this, but it does seem like we want some indication that the spans are being filtered out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i really think we shouldn't be pushing this using remote write. the metrics we push are the generated ones. this describes the operational state of tempo which would just publish normally.

if a cloud user has a filter, they wouldn't be able to see how many spans are being rejected by their filter, which seems like the primary utility. I'm a little torn on even including this, but it does seem like we want some indication that the spans are being filtered out.

we can push this back through billing and expose it to the end user. i think we may just need to push the tenant id down into the instance. alternatively you can push a counter metric into the instance that already has the tenant id configured

modules/generator/processor/spanmetrics/spanmetrics.go Outdated Show resolved Hide resolved
modules/generator/processor/spanmetrics/spanmetrics.go Outdated Show resolved Hide resolved
}

for _, policy := range p.filterPolicies {
if policy.Include != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could move the policy.include check into policyMatch() which would clean this up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what policyMatch() should do in the case of a nil policy. I agree it would be a little cleaner, but include is the inverse of exclude here, so for example returning a true value from policyMatch() isn't as clean. Perhaps I can make exclude and include behave the same here, and then make the suggested adjustment.

modules/generator/processor/spanmetrics/spanmetrics.go Outdated Show resolved Hide resolved
modules/generator/processor/spanmetrics/spanmetrics.go Outdated Show resolved Hide resolved
Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one broader question that just occurred to me: should all of this logic apply to both span metrics and service graph metrics? should we move this up a level?

@zalegrala
Copy link
Contributor Author

Thanks for the review @joe-elliott. If we move this up a level, do you suppose the processors should have an independent filter config, or share one? Sharing one would be simpler and likely more performant, but I wonder if folks would want to tune them independently.

@zalegrala
Copy link
Contributor Author

In lieu of an specific idea about how widely to apply the filtering, I'm going to restructure the span filtering into pkg/spanfilter and implement it in the spanmetrics processor. I think this will give is the ability to re-use the code elsewhere if we want to pick it up later and still give us the short term benefit of filtering on spans in the spanmetrics currently.

@zalegrala zalegrala force-pushed the metricsGeneratorFiltering branch from b2df1e4 to 6e15687 Compare April 14, 2023 15:22
@zalegrala zalegrala requested a review from joe-elliott April 18, 2023 15:12
@zalegrala zalegrala force-pushed the metricsGeneratorFiltering branch from ebced35 to 7fb6e5f Compare April 20, 2023 15:27
Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really liking the refactor to pull out the filters. Some thoughts but this is close

spanMetricsCallsTotal registry.Counter
spanMetricsDurationSeconds registry.Histogram
spanMetricsSizeTotal registry.Counter
spanMetricsFilterDropsTotal registry.Counter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i really think we shouldn't be pushing this using remote write. the metrics we push are the generated ones. this describes the operational state of tempo which would just publish normally.

if a cloud user has a filter, they wouldn't be able to see how many spans are being rejected by their filter, which seems like the primary utility. I'm a little torn on even including this, but it does seem like we want some indication that the spans are being filtered out.

we can push this back through billing and expose it to the end user. i think we may just need to push the tenant id down into the instance. alternatively you can push a counter metric into the instance that already has the tenant id configured

modules/generator/processor/spanmetrics/spanmetrics.go Outdated Show resolved Hide resolved
}
matches++
case traceql.IntrinsicKind:
if !stringMatch(policy.MatchType, span.GetKind().String(), pa.Value.(string)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway we can do int matches here and on status? would be much faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably yes, but it might be a little clumsy, since we'd need to take this as an int in the config I think. There is a special yaml struct tag we could use iirc to parse a string as an int, or perhaps some custom unmarshaling. For now I'm inclined to leave it and come back to it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can work this. I'll take a closer look next week.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a commit for this. It made some parts less readable, but seems okay to me. How's that look?

@zalegrala zalegrala force-pushed the metricsGeneratorFiltering branch from 5214024 to 95fd22e Compare April 24, 2023 18:07
zalegrala and others added 24 commits May 2, 2023 17:52
Signed-off-by: Zach Leslie <[email protected]>
Signed-off-by: Zach Leslie <[email protected]>
Signed-off-by: Zach Leslie <[email protected]>
@zalegrala zalegrala force-pushed the metricsGeneratorFiltering branch from 1cec33b to 79151bb Compare May 2, 2023 17:53
@zalegrala zalegrala merged commit 14848fd into grafana:main May 2, 2023
@zalegrala zalegrala deleted the metricsGeneratorFiltering branch May 2, 2023 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metrics-generator should filter out spans based upon attributes
4 participants