Count over time Traceql query doubles the actual span count randomly #4257

imjibi · 2024-10-31T16:31:47Z

Describe the bug
When counting spans with count_over_time, the reported count is intermittently doubled. The query used was {} | count_over_time() by (name)

To Reproduce
Steps to reproduce the behavior:

Use https://github.com/grafana/docker-otel-lgtm/tree/main and make sure it runs Tempo version 2.6.0.
Change the backend to Azure Blob Storage and fill in the required details. Mine were as follows:

  storage:
    trace:
      backend: azure
      azure:
        container_name: 'containername'
        storage_account_name: 'storageaccount'
        storage_account_key: 'key'

Start generating traffic using the curl command that can be found here. Send the traffic without a delay (I skipped the delay to replicate our production scenario). I triggered the traffic from the command line as follows:

while true
do
curl -s http://localhost:8081/rolldice
done

After waiting for almost an hour, go to Grafana, choose the Tempo data source, and run {name="roll"} | count_over_time() by (name). Make sure the step is set to 1m and the time range is the last 1 hour. This is important: the issue happens only when the range is the last 1 hour; if you choose a one-hour window in the past, the query works fine. The result is a time series with a large spike at a random point (if you don't see a spike at first, try hitting Run Query every minute). Looking at the value, you can see that it is exactly twice the actual value, and a minute later the doubling moves to the next point in time.
The following snapshots show the buggy behaviour.
At 16:26 the count is 4.87K, which is the actual value.

A moment later the count changes to 9.74K, which is exactly double the actual value.

The doubling then slides continuously through the window.
The actual count at 16:27 is 4.73K.

A couple of minutes later it is 9.45K.
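
For anyone who wants to check this mechanically rather than by eye, here is a minimal Go sketch that compares two snapshots of the same series and reports the buckets whose value roughly doubled. The function name doubledBuckets, the map-of-timestamps input, and the tolerance parameter are all assumptions made for illustration; they are not part of Tempo or Grafana. The tolerance is there because the values shown in Grafana are rounded (9.45K is not exactly twice 4.73K).

```go
package main

import (
	"math"
	"sort"
)

// doubledBuckets is a hypothetical helper: it returns the timestamps whose
// value in `after` is about twice the value in `before`.
// tol is a relative tolerance on the ratio, to absorb display rounding.
func doubledBuckets(before, after map[string]float64, tol float64) []string {
	var out []string
	for ts, b := range before {
		a, ok := after[ts]
		if !ok || b == 0 {
			continue // bucket missing in the second snapshot, or nothing to compare
		}
		if math.Abs(a/b-2) <= tol {
			out = append(out, ts)
		}
	}
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out
}
```

Calling it with the first run of the query as before, a later run as after, and a tolerance of about 0.01 should flag the 16:26 and 16:27 buckets described above.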

Expected behavior
The count shouldn't be doubled when looking at the span count for the last 1 hour.
Environment:

Infrastructure: k8s, docker-compose
Deployment tool: helm, docker


joe-elliott · 2024-11-06T18:16:59Z

Unfortunately I have not had time to dig into this yet, but my belief is that spans are occasionally being double counted on the border between the metrics generators and the backend.

The frontend calculates a cut off here:

tempo/modules/frontend/metrics_query_range_sharder.go

Line 114 in 3449ef6

cutoff = time.Now().Add(-s.cfg.QueryBackendAfter)

and attempts to cleanly divide what is requested from the generators and backend. Perhaps if the timing of the query is aligned in a specific way we actually double count.

mdisibio · 2024-11-13T14:09:15Z

I was able to reproduce this and will take a look.

mdisibio self-assigned this Nov 13, 2024

mdisibio mentioned this issue Nov 14, 2024 in TraceQL metrics time range fixes #4325 (Merged)

mdisibio closed this as completed in #4325 Nov 15, 2024
