[metrics-generator] filter out spans based on policy #2274

zalegrala · 2023-03-29T18:27:21Z

What this PR does:

Here we implement an approach to filtering out spans based on a policy, loosely based around the OTEL collector filterspan config format.

Which issue(s) this PR fixes:
Fixes #1482

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

cmd/tempo/app/config.go

modules/generator/processor/spanmetrics/spanmetrics.go

modules/generator/processor/spanmetrics/config.go

knylander-grafana · 2023-03-30T17:24:42Z

Thank you for adding documentation!

docs/sources/tempo/metrics-generator/span_metrics.md

knylander-grafana

Docs look good! Thank you for adding them.

cmd/tempo/app/config.go

modules/generator/overrides.go

joe-elliott · 2023-04-12T19:55:13Z

modules/generator/processor/spanmetrics/spanmetrics.go

+	spanMetricsCallsTotal       registry.Counter
+	spanMetricsDurationSeconds  registry.Histogram
+	spanMetricsSizeTotal        registry.Counter
+	spanMetricsFilterDropsTotal registry.Counter


any particular reason we're pushing this and not just recording it as a normal prometheus metric in tempo?

Nope, good call out.

I spoke too soon. A generator instance doesn't currently have the tenant ID available, so creating a metric that includes the tenant label isn't feasible without a bit of refactoring. This currently pushes the metric to the remote endpoint and would increase the series count, so perhaps this isn't something we want to do. Though, if we host the metric on the generator itself, then we'd be able to see which tenants are filtering which spans, but this likely wouldn't have the desired value for folks who don't have access to those metrics. I.e. if a cloud user has a filter, they wouldn't be able to see how many spans are being rejected by their filter, which seems like the primary utility. I'm a little torn on even including this, but it does seem like we want some indication that the spans are being filtered out.

yeah, i really think we shouldn't be pushing this using remote write. the metrics we push are the generated ones. this describes the operational state of tempo which would just publish normally.

> if a cloud user has a filter, they wouldn't be able to see how many spans are being rejected by their filter, which seems like the primary utility. I'm a little torn on even including this, but it does seem like we want some indication that the spans are being filtered out.

we can push this back through billing and expose it to the end user. i think we may just need to push the tenant id down into the instance. alternatively you can push a counter metric into the instance that already has the tenant id configured
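To make the "push the tenant id down into the instance" option concrete, here is a minimal sketch using only the standard library. Every name in it (`tenantCounter`, `instance`, `newInstance`, `recordFilteredSpan`) is hypothetical and not taken from the Tempo codebase; `tenantCounter` merely stands in for a labeled operational counter that the process would expose itself rather than send over remote write.

```go
package main

import (
	"fmt"
	"sync"
)

// tenantCounter is a hypothetical stand-in for a labeled operational
// counter that the process exposes itself (not sent via remote write).
type tenantCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newTenantCounter() *tenantCounter {
	return &tenantCounter{counts: make(map[string]uint64)}
}

// Inc adds one to the counter for the given tenant label.
func (c *tenantCounter) Inc(tenant string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[tenant]++
}

// Value returns the current count for the given tenant label.
func (c *tenantCounter) Value(tenant string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[tenant]
}

// instance is a hypothetical per-tenant processor instance. The tenant ID
// is handed in at construction so the instance can label what it records.
type instance struct {
	tenantID string
	dropped  *tenantCounter
}

func newInstance(tenantID string, dropped *tenantCounter) *instance {
	return &instance{tenantID: tenantID, dropped: dropped}
}

// recordFilteredSpan counts one span rejected by this tenant's filter.
func (i *instance) recordFilteredSpan() {
	i.dropped.Inc(i.tenantID)
}

func main() {
	dropped := newTenantCounter()
	a := newInstance("tenant-a", dropped)
	b := newInstance("tenant-b", dropped)

	a.recordFilteredSpan()
	a.recordFilteredSpan()
	b.recordFilteredSpan()

	fmt.Println(dropped.Value("tenant-a"), dropped.Value("tenant-b"))
}
```

The alternative mentioned above, handing the instance a counter that already has the tenant label bound, would remove the `tenantID` field from `instance` and move the labeling to whoever constructs it.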

modules/generator/processor/spanmetrics/spanmetrics.go

joe-elliott · 2023-04-12T20:03:35Z

modules/generator/processor/spanmetrics/spanmetrics.go

+	}
+
+	for _, policy := range p.filterPolicies {
+		if policy.Include != nil {


nit: could move the policy.Include check into policyMatch() which would clean this up.

I'm not sure what policyMatch() should do in the case of a nil policy. I agree it would be a little cleaner, but include is the inverse of exclude here, so for example returning a true value from policyMatch() isn't as clean. Perhaps I can make exclude and include behave the same here, and then make the suggested adjustment.
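The asymmetry discussed here can be sketched as follows. This is not the PR's code: `Span`, `PolicyMatch`, `FilterPolicy`, `allows`, and `keep` are simplified, hypothetical types and helpers, and the rule that a span must pass every policy is an assumption made for the example. The point is that an unset include should let everything through, while an unset exclude should reject nothing, so the two nil cases cannot share one return value.

```go
package main

import "fmt"

// Span is a simplified, hypothetical stand-in for a trace span.
type Span struct {
	Kind string
	Name string
}

// PolicyMatch is a hypothetical set of conditions. Empty fields match anything.
type PolicyMatch struct {
	Kind string
	Name string
}

// FilterPolicy pairs an optional include with an optional exclude.
// A nil pointer means that side of the policy is not configured.
type FilterPolicy struct {
	Include *PolicyMatch
	Exclude *PolicyMatch
}

// matches reports whether the span satisfies every configured condition.
// It must only be called on a non-nil PolicyMatch.
func (m *PolicyMatch) matches(s Span) bool {
	if m.Kind != "" && m.Kind != s.Kind {
		return false
	}
	if m.Name != "" && m.Name != s.Name {
		return false
	}
	return true
}

// allows hides the nil handling in one place: a missing include admits
// every span, and a missing exclude rejects none.
func (p FilterPolicy) allows(s Span) bool {
	included := p.Include == nil || p.Include.matches(s)
	excluded := p.Exclude != nil && p.Exclude.matches(s)
	return included && !excluded
}

// keep returns true only if every policy allows the span (an assumption
// of this sketch, not a statement about the PR's semantics).
func keep(policies []FilterPolicy, s Span) bool {
	for _, p := range policies {
		if !p.allows(s) {
			return false
		}
	}
	return true
}

func main() {
	policies := []FilterPolicy{
		{Include: &PolicyMatch{Kind: "SPAN_KIND_SERVER"}},
		{Exclude: &PolicyMatch{Name: "healthcheck"}},
	}
	fmt.Println(keep(policies, Span{Kind: "SPAN_KIND_SERVER", Name: "GET /"}))
	fmt.Println(keep(policies, Span{Kind: "SPAN_KIND_SERVER", Name: "healthcheck"}))
}
```

With the nil handling folded into `allows`, the loop body no longer needs a separate branch per side of the policy.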

modules/generator/processor/spanmetrics/spanmetrics.go

joe-elliott

one broader question that just occurred to me: should all of this logic apply to both span metrics and service graph metrics? should we move this up a level?

zalegrala · 2023-04-13T15:28:34Z

Thanks for the review @joe-elliott. If we move this up a level, do you suppose the processors should have an independent filter config, or share one? Sharing one would be simpler and likely more performant, but I wonder if folks would want to tune them independently.

zalegrala · 2023-04-13T18:13:03Z

In lieu of a specific idea about how widely to apply the filtering, I'm going to restructure the span filtering into pkg/spanfilter and implement it in the spanmetrics processor. I think this will give us the ability to re-use the code elsewhere if we want to pick it up later, and still give us the short-term benefit of filtering spans in the spanmetrics processor now.

joe-elliott

Really liking the refactor to pull out the filters. Some thoughts but this is close

docs/sources/tempo/metrics-generator/span_metrics.md

joe-elliott · 2023-04-21T13:55:36Z


joe-elliott · 2023-04-21T14:18:21Z

pkg/spanfilter/spanfilter.go

+			}
+			matches++
+		case traceql.IntrinsicKind:
+			if !stringMatch(policy.MatchType, span.GetKind().String(), pa.Value.(string)) {


any way we can do int matches here and on status? would be much faster

Probably yes, but it might be a little clumsy, since we'd need to take this as an int in the config I think. There is a special yaml struct tag we could use iirc to parse a string as an int, or perhaps some custom unmarshaling. For now I'm inclined to leave it and come back to it later.

I think I can work this. I'll take a closer look next week.

I've added a commit for this. It made some parts less readable, but seems okay to me. How's that look?
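One way to get integer comparisons while keeping a string in the config is to resolve the name once when the filter is built. The sketch below is an illustration of that idea, not the commit referenced above: `spanKindValue`, `kindMatcher`, and `newKindMatcher` are hypothetical names, and the map is a local stand-in for the name-to-value table of the OpenTelemetry span kind enum.

```go
package main

import "fmt"

// spanKindValue is a local stand-in for the OTLP span kind enum,
// mapping the configured name to its numeric value.
var spanKindValue = map[string]int32{
	"SPAN_KIND_UNSPECIFIED": 0,
	"SPAN_KIND_INTERNAL":    1,
	"SPAN_KIND_SERVER":      2,
	"SPAN_KIND_CLIENT":      3,
	"SPAN_KIND_PRODUCER":    4,
	"SPAN_KIND_CONSUMER":    5,
}

// kindMatcher holds the already-resolved integer, so the per-span check
// is a single integer comparison with no string conversion.
type kindMatcher struct {
	want int32
}

// newKindMatcher resolves the configured string once, at construction,
// and rejects names that are not in the enum.
func newKindMatcher(configured string) (kindMatcher, error) {
	v, ok := spanKindValue[configured]
	if !ok {
		return kindMatcher{}, fmt.Errorf("unknown span kind %q", configured)
	}
	return kindMatcher{want: v}, nil
}

// matches compares the span's numeric kind against the resolved value.
func (m kindMatcher) matches(kind int32) bool {
	return kind == m.want
}

func main() {
	m, err := newKindMatcher("SPAN_KIND_SERVER")
	if err != nil {
		panic(err)
	}
	fmt.Println(m.matches(2), m.matches(3))
}
```

Resolving at construction also surfaces a misspelled kind as a configuration error instead of a filter that silently never matches. The same approach would apply to status codes.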

…nt overrides Signed-off-by: Zach Leslie <[email protected]>


Co-authored-by: Joe Elliott <[email protected]>



ie-pham reviewed Mar 30, 2023

modules/generator/processor/spanmetrics/spanmetrics.go

modules/generator/processor/spanmetrics/config.go