Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count over time Traceql query doubles the actual span count randomly #4257

Closed
imjibi opened this issue Oct 31, 2024 · 2 comments · Fixed by #4325
Closed

Count over time Traceql query doubles the actual span count randomly #4257

imjibi opened this issue Oct 31, 2024 · 2 comments · Fixed by #4325
Assignees

Comments

@imjibi
Copy link

imjibi commented Oct 31, 2024

Describe the bug
When trying to count the number of spans using count_over_time the actual count gets doubled randomly. Query used was {} | count_over_time() by (name)

To Reproduce
Steps to reproduce the behavior:

  1. Use https://github.com/grafana/docker-otel-lgtm/tree/main and make sure it uses Tempo version 2.6.0
  2. Change backend to Azure blob storage and fill in required details. Mine was as follows,
  storage:
    trace:
      backend: azure
      azure:
        container_name: 'containername'
        storage_account_name: 'storageaccount'
        storage_account_key: 'key'
  1. Start generating traffic using curl command that can be found here. Make sure to send traffic without a delay (wanted to replicate our production scenario hence chose to skip delay) and I triggered traffic from command line as follows
while true
do
curl -s http://localhost:8081/rolldice
done
  1. After waiting for almost an hour, go to Grafana and choose Tempo data source and run {name="roll"} | count_over_time() by (name) . Please make sure step is set to 1m and time range is last 1 hour - this is really important as issue happens only when range is set to last 1 hour, if you choose an hour window in the past query works just fine. It will show a time series with a huge spike at a random point (if you don't see any spike at once just try hitting Run Query every minute ). Looking the value you could see that its exactly twice the actual value and the next minute the doubling behaviour moves to next point in time.
  2. The following snap shot show the buggy behaviour
    At 16:26 you could see that the count is 4.87K which is the actual value
    Image
    After a moment later the count is changed to 9.74K which exactly the double of the actual value.
    Image
    And then it continuously slides through the window.
    Actual Count at 16:27 is 4.73K
    Image
    Couple of minutes later it is 9.45K
    Image

Expected behavior
The count should n't get doubled when looking at the last 1 hour span count
Environment:

  • Infrastructure: k8s, docker-compose
  • Deployment tool: helm, docker
@joe-elliott
Copy link
Member

Unfortunately I have not had time to dig into this yet, but my belief is that spans are occasionally being double counted on the border between the metrics generators and the backend.

The frontend calculates a cut off here:

cutoff = time.Now().Add(-s.cfg.QueryBackendAfter)

and attempts to cleanly divide what is requested from the generators and backend. Perhaps if the timing of the query is aligned in a specific way we actually double count.

@mdisibio mdisibio self-assigned this Nov 13, 2024
@mdisibio
Copy link
Contributor

I was able to reproduce this and will take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants