Frontend batching #2677

Merged: 22 commits, Jul 20, 2023
Conversation

@joe-elliott (Member) commented Jul 19, 2023

What this PR does:
Batches jobs in the requests sent from the query-frontend queue to the queriers. Previously, the frontend sent each job one at a time in an individual HTTP request. This PR adds a configurable parameter that allows the frontend to send more than one job in a single request.
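
For intuition, the core change can be pictured as a dequeue loop that drains up to a configured number of jobs before making one request, instead of one request per job. The sketch below is only an illustration of that idea; the names (job, forwardJobs, sendBatch, maxBatchSize) are hypothetical and are not the actual Tempo identifiers.

package frontend

import "context"

// job stands in for a single query-frontend job; the real type lives in Tempo.
type job struct {
	// request payload, response channel, etc.
}

// forwardJobs drains up to maxBatchSize jobs from the queue and hands them to
// sendBatch as one request, instead of issuing one request per job.
func forwardJobs(ctx context.Context, queue <-chan *job, maxBatchSize int, sendBatch func([]*job) error) error {
	for {
		batch := make([]*job, 0, maxBatchSize)

		// Block for the first job so an empty batch is never sent.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case j := <-queue:
			batch = append(batch, j)
		}

		// Opportunistically top up the batch without waiting.
	fill:
		for len(batch) < maxBatchSize {
			select {
			case j := <-queue:
				batch = append(batch, j)
			default:
				break fill
			}
		}

		if err := sendBatch(batch); err != nil {
			return err
		}
	}
}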

Other changes:

  • Docs, of course! Including an update to the search performance tuning doc with more current information.
  • Adds a new histogram metric, tempo_query_frontend_actual_batch_size, to track the actual size of the batches being farmed out to the queriers (a sketch of how such a metric might be wired up follows this list).
  • Better testing of the queues and the frontend worker.
  • Added the ability for the querier to signal to the frontend which features it supports, enabling seamless rollouts.
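
As a hedged sketch of that histogram: the metric name comes from this PR, but the bucket layout, variable names, and registration style below are illustrative only, not necessarily what the PR ships.

package frontend

import "github.com/prometheus/client_golang/prometheus"

// actualBatchSize tracks how many jobs were packed into each request sent to
// a querier. The buckets below are a guess for illustration purposes.
var actualBatchSize = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "tempo",
	Subsystem: "query_frontend",
	Name:      "actual_batch_size",
	Help:      "Number of jobs sent to a querier in a single batched request.",
	Buckets:   prometheus.LinearBuckets(1, 1, 10), // 1, 2, ..., 10 jobs
})

func init() {
	prometheus.MustRegister(actualBatchSize)
}

// observeBatch would be called once per batched request sent to a querier.
func observeBatch(jobCount int) {
	actualBatchSize.Observe(float64(jobCount))
}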

Performance testing
The goal of the setup was to create a cluster that could execute all 36k jobs created by the test query simultaneously. This way, job throughput from frontend -> querier could be tested more directly.

  • 80 queriers
  • 500 jobs per querier
  • Total cluster capacity 40k jobs
  • No reliance on serverless

Results

batch size    overall query latency     p99 job time in queue
1             8.5s                      4.9s
2             7.6s                      2.4s
5             6.7s                      1s
10            9s                        4.4s
1*            9.6s                      9s

*current image

The overall latency of queries where total jobs > total cluster capacity was not reduced as impressively, but this is a good step in the right direction.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Joe Elliott <[email protected]>
@mdisibio (Contributor) left a comment:

This looks great, and I like the way it is controlled by querier features. A few small questions, but none are blocking, so I will go ahead and approve.

Review comments on:

  • modules/frontend/queue/queue.go
  • modules/frontend/v1/request_batch.go
  • modules/querier/worker/frontend_processor.go
  • modules/frontend/v1/frontend.go
@zalegrala (Contributor) left a comment:

This looks pretty good to me; a nice improvement. It will be interesting to see the results on the dashboard. I had a question about the context handling, but it's not blocking.

// then error out this upstream request _and_ stream.
case err := <-errs:
	req.err <- err
	err = reportResponseUpstream(reqBatch, errs, resps)
Contributor comment:

Do we have a context to pass? Wondering if it might simplify the context handling below.

@joe-elliott (Member, Author):

If the streaming gRPC server connection itself drops or the context is cancelled, then .Send() returns an error and this case is hit:
https://github.com/grafana/tempo/pull/2677/files#diff-0914703aed52090bd72851004df203444207d9d48677c10860b0459afef1a0b9R311

If the request is cancelled upstream then this case is hit:
https://github.com/grafana/tempo/pull/2677/files#diff-0914703aed52090bd72851004df203444207d9d48677c10860b0459afef1a0b9R304

If the requests are cancelled downstream, then we get an HTTP response and this case is hit:
https://github.com/grafana/tempo/pull/2677/files#diff-0914703aed52090bd72851004df203444207d9d48677c10860b0459afef1a0b9R304

I think everything is covered.
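
Putting those three paths together, the handling can be pictured roughly as a select over the request context, the stream-send error channel, and the response channel. This is a simplified sketch under assumed names (awaitOutcome, response, finish, fail); it is not the code in this PR, just an illustration of the cases described above.

package frontend

import "context"

// response is a placeholder for the HTTP status and body returned by a querier.
type response struct{}

// awaitOutcome illustrates the three outcomes described above for a single
// upstream request: upstream cancellation, a failed stream Send, or a
// response (including downstream cancellations surfaced as HTTP responses).
func awaitOutcome(reqCtx context.Context, errs <-chan error, resps <-chan *response, finish func(*response), fail func(error)) {
	select {
	case <-reqCtx.Done():
		// The request was cancelled upstream.
		fail(reqCtx.Err())
	case err := <-errs:
		// The streaming gRPC connection dropped or its context was cancelled,
		// so .Send() returned an error: fail the request and the stream.
		fail(err)
	case resp := <-resps:
		// Normal completion, or a downstream cancellation that came back as
		// an HTTP response from the querier.
		finish(resp)
	}
}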
