The querier/query-frontend relationship uses code originally developed for Cortex many years ago, and it is likely showing its age. For larger queries Tempo will often create tens of thousands of jobs which are piped from the query-frontend to the queriers one at a time. Currently, I believe, there is a bottleneck in delivering these jobs to the queriers at scale.
Code
In the querier you can control the number of jobs it will work on in parallel using max_concurrent_queries. For every concurrent query the querier starts a new goroutine and opens a gRPC connection to the query-frontend.
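To make the shape of that loop concrete, here is a minimal sketch of the querier side. The type and function names (Job, Result, ProcessStream, FrontendClient, runQuerierWorkers) are illustrative stand-ins, not the actual gRPC-generated Tempo/Cortex API; the point is one goroutine per concurrent query slot, each holding its own stream and handling one job at a time.

```go
package queriersketch

import "context"

// Illustrative stand-ins for the gRPC-generated types.
type Job struct{ ID string }
type Result struct{ JobID string }

// ProcessStream stands in for the bidirectional stream the querier opens.
type ProcessStream interface {
	Recv() (*Job, error)
	Send(*Result) error
}

type FrontendClient interface {
	Process(ctx context.Context) (ProcessStream, error)
}

func executeJob(ctx context.Context, j *Job) *Result {
	// Run the actual search/fetch work for one job (elided).
	return &Result{JobID: j.ID}
}

// runQuerierWorkers starts one goroutine per concurrent query slot
// (max_concurrent_queries). Each goroutine holds its own stream to the
// query-frontend and handles exactly one job at a time.
func runQuerierWorkers(ctx context.Context, client FrontendClient, maxConcurrent int) {
	for i := 0; i < maxConcurrent; i++ {
		go func() {
			stream, err := client.Process(ctx)
			if err != nil {
				return
			}
			for {
				job, err := stream.Recv() // block until the frontend pushes one job
				if err != nil {
					return
				}
				if err := stream.Send(executeJob(ctx, job)); err != nil {
					return
				}
			}
		}()
	}
}
```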
On the query-frontend side a goroutine is started for every Process call above, and they all block on a call to GetNextRequestForQuerier. The end result is often 10k+ goroutines in the query-frontend waiting on one mutex to deliver one job at a time downstream to a querier.
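A rough sketch of that single shared queue, to show where the contention comes from. This is deliberately simplified (the real Cortex-derived queue also deals with per-tenant queues and querier IDs); the relevant part is that every job hand-off takes the same mutex:

```go
package frontendsketch

import "sync"

type Job struct{ ID string }

// requestQueue is a deliberately simplified version of the frontend's queue:
// one slice of pending jobs guarded by one mutex.
type requestQueue struct {
	mtx     sync.Mutex
	cond    *sync.Cond
	pending []*Job
}

func newRequestQueue() *requestQueue {
	q := &requestQueue{}
	q.cond = sync.NewCond(&q.mtx)
	return q
}

// enqueue is called once per job as the frontend shards an incoming query.
func (q *requestQueue) enqueue(j *Job) {
	q.mtx.Lock()
	q.pending = append(q.pending, j)
	q.mtx.Unlock()
	q.cond.Signal()
}

// getNextRequestForQuerier hands out exactly one job per call. One goroutine
// per querier stream blocks here, so they all serialize on q.mtx.
func (q *requestQueue) getNextRequestForQuerier() *Job {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	for len(q.pending) == 0 {
		q.cond.Wait()
	}
	j := q.pending[0]
	q.pending = q.pending[1:]
	return j
}
```

With tens of thousands of jobs and hundreds of connected streams, that one lock becomes the serialization point the metrics below hint at.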
Metrics
This is a graph of requests/second serviced by queriers during a longer traceql query. Notice how the least active querier starts 20-30 seconds later than the first queriers, and how slow the ramp-up is over the course of the query:
It should be noted that CPU or network saturation could also cause an effect like this.
Possible Solutions
It's possible that just reducing contention on the mutex linked above would improve querier performance. Perhaps we can find a way to efficiently shard that queue and spread the load across N mutexes, as in the sketch below.
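As a sketch of what that sharding could look like (purely an assumption about the idea, continuing the illustrative requestQueue above; not Tempo's implementation):

```go
// shardedQueue splits the pending jobs across N independent requestQueues,
// each with its own mutex, instead of one global lock.
type shardedQueue struct {
	shards []*requestQueue
}

func newShardedQueue(n int) *shardedQueue {
	s := &shardedQueue{shards: make([]*requestQueue, n)}
	for i := range s.shards {
		s.shards[i] = newRequestQueue()
	}
	return s
}

// enqueue spreads jobs across shards; jobIndex is the job's position in the
// query fan-out, so consecutive jobs land on different shards.
func (s *shardedQueue) enqueue(j *Job, jobIndex int) {
	s.shards[jobIndex%len(s.shards)].enqueue(j)
}

// getNextRequestForQuerier pins each querier connection to a shard, so only
// roughly 1/N of the waiting goroutines contend on any single mutex.
func (s *shardedQueue) getNextRequestForQuerier(querierIndex int) *Job {
	return s.shards[querierIndex%len(s.shards)].getNextRequestForQuerier()
}
```

Pinning queriers to shards keeps contention low but risks starvation if one shard drains while others still have work, so a real design would likely need work stealing or round-robin dequeue attempts across shards.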
Rewrite the relationship between these two components. Perhaps, upon connection, the querier could pass the number of jobs it is willing to take, and the query-frontend could deliver a batch of jobs at once that the querier would respond to one at a time.
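A sketch of the frontend half of that batching idea, again continuing the illustrative requestQueue above (the notion of the querier advertising its free slots is an assumption about the redesign, not an existing API):

```go
// dequeueBatch hands out up to max jobs under a single lock acquisition,
// where max would come from the number of free slots the querier advertised
// when it connected, instead of paying for the lock once per job.
func (q *requestQueue) dequeueBatch(max int) []*Job {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	for len(q.pending) == 0 {
		q.cond.Wait()
	}
	n := max
	if n > len(q.pending) {
		n = len(q.pending)
	}
	batch := q.pending[:n:n]
	q.pending = q.pending[n:]
	return batch
}
```

The cost of the shared lock is then amortized over up to max jobs per stream round trip instead of being paid once per job delivered.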
We are currently seeing a fair amount of querier imbalance and slow spin up for larger queries. Unlocking this bottleneck would likely have a large positive impact on performance.
joe-elliott changed the title from "Improve Query Frontend -> Querier Job Throughput" to "[Search Perf] Improve Query Frontend -> Querier Job Throughput" on May 12, 2023.
A number of PRs that improve this situation have been merged. Closing this issue, as any future improvements would require a dedicated redesign of the relationship between the queriers and the frontend and should be tracked in their own issue.