Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry middleware in query-frontend #814

Merged
merged 2 commits into from
Jul 14, 2021
Merged

Conversation

yvrhdn
Copy link
Member

@yvrhdn yvrhdn commented Jul 14, 2021

What this PR does:
Requests from the query-frontend to the querier might fail when a querier is starting or stopping. Simply retrying this request usually solves this.

This PR adds a retry middleware which will relaunch requests (up to 5 times by default) if the request failed (http 500 or error).

Which issue(s) this PR fixes:
Related to #761

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Requests from the query-frontend to the querier might fail when a querier is starting or stopping. Simply retrying this request usually solves this.
@yvrhdn
Copy link
Member Author

yvrhdn commented Jul 14, 2021

This PR also fixes context propagation in the other middleware functions.

Before and after:

Screenshot 2021-07-14 at 18 02 30

Screenshot 2021-07-14 at 18 03 17

Copy link
Member

@mapno mapno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good.

My concern is if this change could lead to hammering down a slow or overloaded querier, making the issue worse. It might be worth considering using a backoff for the retries or other mitigation alternatives.

@yvrhdn
Copy link
Member Author

yvrhdn commented Jul 14, 2021

My concern is if this change could lead to hammering down a slow or overloaded querier, making the issue worse. It might be worth considering using a backoff for the retries or other mitigation alternatives.

In Cortex the querier pulls work from the queue in query-frontend. The amount of queries that are processed concurrently by a querier is limited by MaxConcurrentQueries.
Retrying a request will put this request on the queue again so another querier can pick this up.

BTW, this PR is pretty much the same code as Cortex uses in their queryrange processor which shards requests by day.
https://github.com/cortexproject/cortex/blob/master/pkg/querier/queryrange/retry.go

Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit. nice addition.

}

span.LogFields(ot_log.String("msg", "error processing request"), ot_log.Int("try", triesLeft), ot_log.Error(err))
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continue?

does this do anything?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it continues the for loop 😛

I've removed it, I refactored my loop a couple of times for clarity.

@mapno
Copy link
Member

mapno commented Jul 14, 2021

My concern is if this change could lead to hammering down a slow or overloaded querier, making the issue worse. It might be worth considering using a backoff for the retries or other mitigation alternatives.

In Cortex the querier pulls work from the queue in query-frontend. The amount of queries that are processed concurrently by a querier is limited by MaxConcurrentQueries.
Retrying a request will put this request on the queue again so another querier can pick this up.

BTW, this PR is pretty much the same code as Cortex uses in their queryrange processor which shards requests by day.
https://github.com/cortexproject/cortex/blob/master/pkg/querier/queryrange/retry.go

Thanks for the explanation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants