Regression in the 5.9.8 Release #304
Comments
Hi @rajatvig, thanks for raising the issue. There was one change to the Prometheus backend (#288) which may explain the issue you're seeing - all metrics now have a Can you share the exact PromQL query?
I did see that PR merged but wasn't able to tie it back to the issue we are seeing. We are not yet running clustered agents. The full PromQL we use is
That gives us a count of running and scheduled jobs, which helps us determine how many agents we need to run. While the
I see, interesting. The metric being stuck could be related to #296, which removed a well-intended but heavy-handed gauge reset. Is the metric stuck for all queues, or a particular queue? Is it stuck for queues that were deleted?
It was stuck for queues that were deleted, i.e. no builds were running.
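To illustrate the mechanism hypothesised in this exchange, here is a minimal sketch of how a labelled gauge can keep a stale value once a queue stops being reported. It is not the collector's actual code: `GaugeVec`, `export`, and the queue names are hypothetical stand-ins, and the sketch only assumes that each polling cycle sets values for the queues currently returned by the API.

```python
class GaugeVec:
    """Tiny stand-in for a gauge with a `queue` label (hypothetical, not a real client)."""

    def __init__(self):
        self.values = {}

    def set(self, queue, value):
        self.values[queue] = value

    def reset(self):
        # Drops every labelled series, including ones that are still valid.
        self.values.clear()


def export(gauge, api_counts, reset_first):
    """Write one polling cycle of per-queue running-job counts into the gauge."""
    if reset_first:
        gauge.reset()
    for queue, running in api_counts.items():
        gauge.set(queue, running)
    return dict(gauge.values)


# Without a reset, a queue that disappears from the API keeps its last value.
no_reset = GaugeVec()
export(no_reset, {"default": 2, "deploy": 1}, reset_first=False)
stale = export(no_reset, {"default": 0}, reset_first=False)

# With a reset before each cycle, the vanished queue is dropped.
with_reset = GaugeVec()
export(with_reset, {"default": 2, "deploy": 1}, reset_first=True)
fresh = export(with_reset, {"default": 0}, reset_first=True)
```

In this sketch `stale` still carries the old count for the `deploy` queue, while `fresh` only contains the queue that was reported in the latest cycle.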
I just gave 5.9.9 a try and am still seeing similar behaviour. I set up 2 jobs on the
Issue Details
After upgrading from 5.9.4 to 5.9.8, we noticed that the metrics for running builds are not updated after the builds complete. This changes our scaling behaviour, because the metric calculation we use sums running and scheduled builds for a queue to decide whether enough agents are running. The metric that stays at the same value is buildkite_queues_running_jobs_count.
Setup
We are running unclustered agents and running the agent metrics binary to export metrics to Prometheus.
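The scaling calculation described in the issue details can be sketched roughly as follows. This is an illustration only: `agents_needed`, `scaling_delta`, and the `jobs_per_agent` parameter are hypothetical names, and the real calculation is done in PromQL, which is not reproduced here.

```python
import math


def agents_needed(running, scheduled, jobs_per_agent=1):
    """Agents required to cover running plus scheduled jobs for one queue.

    jobs_per_agent is an assumed knob; the thread does not say how many
    jobs each agent handles.
    """
    total_jobs = running + scheduled
    return math.ceil(total_jobs / jobs_per_agent)


def scaling_delta(running, scheduled, current_agents, jobs_per_agent=1):
    """Positive result means scale up, negative means scale down."""
    return agents_needed(running, scheduled, jobs_per_agent) - current_agents


# A running-jobs gauge stuck at 2 after the builds finish keeps 2 agents alive,
# even though the true demand is zero.
stuck = scaling_delta(running=2, scheduled=0, current_agents=2)
actual = scaling_delta(running=0, scheduled=0, current_agents=2)
```

With a stuck running count the delta stays at zero, so no scale-down happens; with the correct count the delta is negative and the idle agents would be removed.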