[Traceql Metrics] PR 1 - Engine #3251
Conversation
@@ -23,7 +33,8 @@ type typedExpression interface {
}

type RootExpr struct {
-	Pipeline Pipeline
+	Pipeline        Pipeline
+	MetricsPipeline metricsFirstStageElement
Having the metrics pipeline split from the spanset filter pipeline isn't ideal, but it's the best option for now. We send the eval callback of the spanset pipeline into the storage SecondPass, while the metrics pipeline has a different signature (its output is time series). Long-term this split needs to be closed so that we end up with a single pipeline, but the current form will carry the functionality a long way. It will work until we add cross-time-series arithmetic like:
({a} | rate()) /
({b} | rate())
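To make the signature mismatch concrete, here is a minimal Go sketch of why the two stages can't share one interface yet. All names (Spanset, TimeSeries, countStage) are illustrative stand-ins, not Tempo's real types: the spanset stage is a spanset-to-spanset transform whose callback can be handed to storage, while the metrics stage accumulates observations and emits series.

```go
package main

import "fmt"

// Stand-in types; the real TraceQL engine types are much richer.
type Spanset struct{ Spans []int }
type TimeSeries map[string][]float64

// A spanset filter stage maps spansets to spansets, so its evaluate
// callback can be passed straight into the storage layer's second pass.
type pipelineElement interface {
	evaluate([]*Spanset) ([]*Spanset, error)
}

// A metrics first stage instead observes spansets and emits time
// series at the end, which is why it cannot satisfy the interface above.
type metricsFirstStageElement interface {
	observe(*Spanset)
	result() TimeSeries
}

// countStage is a toy metrics stage: it totals spans seen.
type countStage struct{ total float64 }

func (c *countStage) observe(ss *Spanset) { c.total += float64(len(ss.Spans)) }
func (c *countStage) result() TimeSeries  { return TimeSeries{"{}": {c.total}} }

func main() {
	var m metricsFirstStageElement = &countStage{}
	m.observe(&Spanset{Spans: []int{1, 2, 3}})
	m.observe(&Spanset{Spans: []int{4}})
	fmt.Println(m.result()["{}"][0]) // 4
}
```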
return e.metricsPipeline.result(), nil
}

// SpanDeduper2 is EXTREMELY LAZY. It attempts to dedupe spans for metrics
This will be an area where we need to continue experimenting. For now I've chosen the approach that I think offers the best trade-offs. We can easily increase correctness at the cost of performance.
Currently it dedupes using a hash of trace ID and span start time. The downside is that this is obviously not 100% unique, but the performance is great because the query doesn't need to fetch any additional columns; we already load them for backend queries (the trace ID will be used for sharding). Deduping on trace ID/span ID is the obviously 100%-correct method and where we started, but the span ID column is too hefty and significantly reduces performance. A middle-ground option would be the nested-set span ID (an integer), which is also unique but lighter-weight than the span ID. However, it's only populated for complete traces, so we'd need some fallback for partial traces (back to using start time?), and there's no good way to access it directly; we decided not to expose it through the Fetch layer.
Additionally, the data structure used is important too. The current implementation is a bunch of maps of 32-bit hashes. It has good performance but an increased chance of collision vs 64-bit hashes (or storing trace ID + span ID in the maps directly). We use the last byte of the trace ID to perform a single layer of 1-byte sharding, which reduces the pressure on any single map.
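A minimal sketch of the sharded structure described above, assuming FNV-32a as the hash (Tempo's actual hash choice and type names may differ): spans are keyed by a 32-bit hash of trace ID + start time, and the last byte of the trace ID selects one of 256 maps.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// spanDeduper is an illustrative stand-in, not Tempo's SpanDeduper2.
// 256 shards keyed by the last byte of the trace ID spread the load
// so no single map takes all the pressure.
type spanDeduper struct {
	shards [256]map[uint32]struct{}
}

func newSpanDeduper() *spanDeduper {
	d := &spanDeduper{}
	for i := range d.shards {
		d.shards[i] = make(map[uint32]struct{})
	}
	return d
}

// Skip reports whether this (traceID, startTime) pair was already seen.
// Two distinct spans can collide on the 32-bit hash, trading a small
// chance of under-counting for never fetching the span ID column.
func (d *spanDeduper) Skip(traceID []byte, startTime uint64) bool {
	h := fnv.New32a()
	h.Write(traceID)
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], startTime)
	h.Write(buf[:])
	v := h.Sum32()

	shard := d.shards[traceID[len(traceID)-1]]
	if _, ok := shard[v]; ok {
		return true
	}
	shard[v] = struct{}{}
	return false
}

func main() {
	d := newSpanDeduper()
	tid := []byte{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(d.Skip(tid, 100)) // false: first sighting
	fmt.Println(d.Skip(tid, 100)) // true: duplicate
	fmt.Println(d.Skip(tid, 200)) // false: different start time
}
```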
This looks like a good start to me.
// has no chance for collisions (whereas a hash32 has a non-zero chance of
// collisions). However it means we have to arbitrarily set an upper limit on
// the maximum number of values.
type FastValues [maxGroupBys]Static |
This is used for by (foo, bar), right? Is the order of Static values in FastValues relevant?
Correct, there is a direct match between the group-bys and the values in the array. by(foo,bar) will have values [ <f>, <b>, nil, nil, nil ]. The surrounding GroupingAggregator has knowledge of both and combines them to create the full time series at the end: { foo="<f>", bar="<b>" }
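The payoff of the fixed-size array can be sketched in a few lines of Go. maxGroupBys and Static here are illustrative stand-ins (real TraceQL Statics are typed values, not strings): because the array length is fixed at compile time, FastValues is comparable and can be used directly as a map key with no hashing or allocation.

```go
package main

import "fmt"

// Illustrative constants/types; Tempo's definitions are richer.
const maxGroupBys = 5

type Static string // real Statics carry typed values, not just strings

// FastValues is a comparable fixed-size grouping key.
type FastValues [maxGroupBys]Static

func main() {
	series := map[FastValues]int{}

	// by(foo,bar): position 0 holds foo's value, position 1 holds
	// bar's, and the remaining slots stay at their zero value (nil
	// in the real engine).
	key := FastValues{"<f>", "<b>"}
	series[key]++
	series[key]++

	fmt.Println(series[FastValues{"<f>", "<b>"}]) // 2
}
```

Using the array itself as the map key is what makes this "fast": a slice of Statics would need an explicit hash before every lookup.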
What this PR does:
This is the first step in supporting TraceQL metrics queries such as {status=error} | rate() by (resource.service.name). It adds support to the language, parser, and engine for querying the two aggregates rate() and count_over_time(), and includes optional grouping by(a,b,c).
Features specifically not included:
- step from the query. This means rate() works, but not rate(5m). This is vastly simpler and the best way to get started; the step interval is easily set in the Grafana UI.
Some less-related updates:
- Changed the select statement to take a list of attributes/intrinsics and not generic FieldExpressions. Queries like select(1+.foo) would parse and "run", but not return anything. I believe this was an oversight, so I fixed it by making the language parsing more strict. The main driver is that I wanted to reuse the attributeList spec for both select(a,b,c) and rate/count by(a,b,c).
Notes
This is one entry in a set of chained PRs. The full body of work has been split into separate buckets to make reviews and updates more manageable.
Which issue(s) this PR fixes:
n/a
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]