querier: Discover and use topology for fun and profit #3576
Comments
Hello 👋 Looks like there was no activity on this issue for the last two months.
Thanks, @zecke, for the detailed explanation. Changes such as this require a proposal. We'd be happy to review proposed improvements, so: help wanted, and PRs are welcome.
If I understood this correctly, then this could be solved by having something like promxy's server groups. But for that, we would need the ability to specify …
Closing for now as promised, let us know if you need this to be reopened! 🤗
Is your proposal related to a problem?
I very much like the idea that a client can decide to sacrifice correctness to gain availability, and that the query component is stateless. Unfortunately, the current implementation provides neither correctness nor availability.
Here is a brief summary of the problems I noticed while exploring Thanos.
Threats to correctness:
A storegw is forgotten after a single failing Info query, after which partial_response=0 queries succeed even though data is missing.
The --store-strict option assumes a store has no labels and covers no time range by default (store panics with "slice bounds out of range" #3057). Restarting a querier while a store has issues (one of the store's QoD because of bad chunks) will result in partial_response=0 succeeding even though the result is not correct.
Stale stores (e.g. failing to synchronize with the bucket for a long time, longer than the overlap) lead to gaps. Today the store logs SyncBlocks failures but takes no action on them (e.g. sending a warning, going unhealthy, or exposing a counter and gauge for the last success).
These correctness issues make using the Thanos query frontend slightly more dangerous (e.g. with large TTLs on cached results).
Threats to availability:
A single slow store will make partial_response=1 queries fail. This is partially mitigated by --store.reply-timeout, but that brings us to the next item.
A per-query timeout smaller than --store.reply-timeout will lead to query timeouts: store/proxy.go waits for all streams to finish or time out, which may happen after the actual query has already timed out. If the network or bucket has some kind of issue, I would be happy to get any (cached) result rather than no data at all.
Imagine a simple Prometheus HA setup (N+2 replicas, multiple storegws covering the timeline) in which a single store/sidecar/Prometheus has issues (restart, crash, misconfigured IAM): Thanos will either fail (partial_response=0) or warn prominently (partial_response=1) about the query. This defeats the purpose of an HA setup, which should tolerate single-instance failures.
Describe the solution you'd like
I would like to reach the point where partial_response=0 queries are (more) correct but can tolerate a single machine failure (because we have N+2 instances running), and where partial_response=1 honors the query timeout and gives genuinely partial responses (e.g. only the cached ones).
Taking the topology into account in store/proxy.go could also yield a latency (or load) improvement.
Describe alternatives you've considered
I have looked at Cortex's query component, but I am worried about the frontend and backend both having to sync the object storage (and agree on the view), and the complexities that come from that (e.g. slow restarts of a query frontend, needing to manage disk space).