querier: Discover and use topology for fun and profit #3576

Closed

zecke opened this issue Dec 12, 2020 · 5 comments

Comments

@zecke (Contributor) commented Dec 12, 2020

Is your proposal related to a problem?

I very much like the idea that a client can decide to sacrifice correctness to gain availability, and that the query component is stateless. Unfortunately, the current implementation provides neither correctness nor availability.

A brief summary of problems I noticed while exploring Thanos.

Threats to correctness:

  • A store gateway is forgotten after a single failing Info call, and a subsequent query with partial_response=0 then succeeds even though that store's data is missing.

  • The store-strict option assumes by default that a store has no labels and covers no time range (store panics with "slice bounds out of range" #3057). Restarting a querier while a store has issues (one of the store's QoD because of bad chunks) results in partial_response=0 succeeding even though the result is not correct.

  • Stale stores (e.g. failing to synchronize with the bucket for longer than the overlap window) lead to gaps. Today the store logs SyncBlocks failures but does not act on them (send a warning, go unhealthy, expose a counter and a gauge for the last success; see the sketch after this list).

  • These correctness issues make using the Thanos query frontend slightly more dangerous (e.g. with large TTLs on cached results).
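
To make the "counter and gauge for the last success" idea above concrete, here is a minimal Go sketch of a gauge that only advances on a successful sync (a failure counter would be analogous). The metric name and the syncBlocks wiring are assumptions for illustration, not the actual Thanos store gateway code:

```go
// Minimal sketch: record the timestamp of the last successful SyncBlocks run
// so an alert can fire when a store gateway goes stale, instead of the failure
// only being logged.
package main

import (
	"context"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var lastSuccessfulSync = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "storegw_last_successful_sync_timestamp_seconds", // hypothetical metric name
	Help: "Unix timestamp of the last successful block sync.",
})

// syncLoop wraps a (hypothetical) syncBlocks function and only advances the
// gauge on success, so staleness becomes visible to alerting.
func syncLoop(ctx context.Context, syncBlocks func(context.Context) error) {
	ticker := time.NewTicker(3 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := syncBlocks(ctx); err != nil {
				log.Printf("SyncBlocks failed: %v", err) // today this is roughly where it stops
				continue
			}
			lastSuccessfulSync.SetToCurrentTime()
		}
	}
}

func main() {
	prometheus.MustRegister(lastSuccessfulSync)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// syncBlocks would be the real bucket sync of a store gateway; stubbed here.
	syncLoop(ctx, func(context.Context) error { return nil })
}
```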

Threats to availability:

  • A single slow store makes even partial_response=1 queries fail. This is partially mitigated by --store.response-timeout, but that brings us to the next item.

  • A per-query timeout smaller than --store.response-timeout leads to query timeouts: store/proxy.go waits for all streams to finish or time out, which can happen after the query itself has already timed out. If the network/bucket has some kind of issue, I would rather get any (cached) result than no data at all (see the sketch after this list).

  • Imagine a simple Prometheus HA setup (N+2 replicas, multiple store gateways covering the timeline) where a single store/sidecar/Prometheus has issues (restart, crash, misconfigured IAM): Thanos will either fail the query (partial_response=0) or warn prominently about it (partial_response=1). This seems to defeat the purpose of an HA setup, which should be able to tolerate single-instance failures.
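
One possible shape for the timeout issue above is to cap the per-store wait by the query's remaining deadline, so waiting on a slow StoreAPI stream can never outlive the query itself. A minimal Go sketch, assuming a hypothetical storeTimeout helper rather than anything that exists in Thanos:

```go
// Sketch: bound the per-store response timeout by the query's own deadline.
package main

import (
	"context"
	"fmt"
	"time"
)

// storeTimeout returns the timeout to use for a single store call: the
// configured --store.response-timeout, capped by whatever is left of the
// query's deadline.
func storeTimeout(ctx context.Context, responseTimeout time.Duration) time.Duration {
	if deadline, ok := ctx.Deadline(); ok {
		if remaining := time.Until(deadline); remaining < responseTimeout {
			return remaining
		}
	}
	return responseTimeout
}

func main() {
	// Query timeout of 2s, configured store response timeout of 10s:
	// the store call is capped at ~2s, not 10s.
	queryCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	perStore := storeTimeout(queryCtx, 10*time.Second)
	fmt.Printf("per-store timeout capped to %v\n", perStore.Round(time.Second))
}
```

With something like this, the proxy could return whatever (possibly cached) data arrived before the query deadline instead of timing out the whole query.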


Describe the solution you'd like

I would like to reach the point where partial_response=0 queries are (more) correct yet can tolerate a single machine failure (because we have N+2 instances running), and where partial_response=1 honors the query timeout and gives genuine partial responses (e.g. only the cached ones).

Taking the topology into account in store/proxy.go could also be a latency (or load) improvement.
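
As an illustration of what "taking the topology into account" could mean, here is a hypothetical Go sketch where StoreAPI endpoints are grouped into replica groups covering the same data, and a partial_response=0 query is treated as complete as long as every group has at least one healthy responder. The types and names are invented for this sketch and are not an existing Thanos API:

```go
// Sketch: tolerate single-instance failures inside a replica group without
// failing or degrading the whole query.
package main

import "fmt"

type ReplicaGroup struct {
	Name      string
	Endpoints []string
}

// coveredWithoutGaps reports whether every replica group still has at least
// one endpoint that answered, i.e. a single failure inside an N+2 group does
// not have to fail a partial_response=0 query.
func coveredWithoutGaps(groups []ReplicaGroup, responded map[string]bool) bool {
	for _, g := range groups {
		ok := false
		for _, ep := range g.Endpoints {
			if responded[ep] {
				ok = true
				break
			}
		}
		if !ok {
			return false // a whole group is down: data for its range really is missing
		}
	}
	return true
}

func main() {
	groups := []ReplicaGroup{
		{Name: "prometheus-ha", Endpoints: []string{"sidecar-0:10901", "sidecar-1:10901"}},
		{Name: "storegw", Endpoints: []string{"store-0:10901", "store-1:10901"}},
	}
	responded := map[string]bool{"sidecar-1:10901": true, "store-0:10901": true}
	fmt.Println("query can be treated as complete:", coveredWithoutGaps(groups, responded))
}
```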


Describe alternatives you've considered

I have looked at the Cortex query component, but I am worried about the frontend and backend both having to sync the object storage (and agree on its view) and the complexities that come with that (e.g. slow restarts of a query frontend, having to manage disk space).



stale bot commented Feb 12, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Feb 12, 2021

@kakkoyun (Member) commented:

Thanks, @zecke, for the detailed explanation.

Changes such as this require a proposal. We'd be happy to review proposed improvements, so help wanted and PRs are welcome.

@GiedriusS (Member) commented:

If I understood this correctly, it could be solved by having something like promxy's server groups. But for that, we should first have the ability to specify StoreAPI nodes via config files, IMHO.
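
For illustration only, a speculative Go sketch of what such a config file (promxy-style server groups of StoreAPI nodes) could look like; the format and field names are made up and do not exist in Thanos:

```go
// Sketch of loading a hypothetical server-group config for StoreAPI nodes.
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

type ServerGroup struct {
	Name      string   `yaml:"name"`
	Endpoints []string `yaml:"endpoints"`
	// RequiredResponses could encode how many members must answer before the
	// group counts as covered (e.g. 1 for an HA pair).
	RequiredResponses int `yaml:"required_responses"`
}

type StoreConfig struct {
	Groups []ServerGroup `yaml:"groups"`
}

const example = `
groups:
  - name: prometheus-ha
    endpoints: ["sidecar-0:10901", "sidecar-1:10901"]
    required_responses: 1
  - name: storegw
    endpoints: ["store-0:10901", "store-1:10901"]
    required_responses: 1
`

func main() {
	var cfg StoreConfig
	if err := yaml.Unmarshal([]byte(example), &cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d server groups\n", len(cfg.Groups))
}
```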

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Jun 2, 2021

stale bot commented Jun 16, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale bot closed this as completed Jun 16, 2021