querier: Discover and use topology for fun and profit #3576

Closed

zecke opened this issue Dec 12, 2020 · 5 comments

Comments

@zecke (Contributor) commented Dec 12, 2020

Is your proposal related to a problem?

I very much like the idea that a client can decide to sacrifice correctness to gain availability, and that the query component is stateless. Unfortunately, the current implementation provides neither correctness nor availability.

A brief summary of problems I noticed while exploring Thanos.

Threats to correctness:

  • A store gateway is forgotten after a single failing Info call, and a subsequent query with partial_response=0 then succeeds even though that store's data is missing.

  • The store-strict option assumes by default that a store has no labels and covers no time range (store panics with "slice bounds out of range" #3057). Restarting a querier while a store has issues (one of the store's QoD because of bad chunks) results in partial_response=0 succeeding even though the result is not correct.

  • Stale stores (e.g. failing to synchronize with the bucket for longer than the overlap window) lead to gaps. Today the store logs SyncBlocks failures but does not act on them (send a warning, go unhealthy, expose a counter and a gauge for the last success; see the sketch after this list).

  • These correctness issues make using the Thanos query frontend slightly more dangerous (e.g. with large TTLs on cached results).
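
To make the "counter and gauge for the last success" idea above concrete, here is a minimal Go sketch of a gauge that only advances on a successful sync (a failure counter would be analogous). The metric name and the syncBlocks wiring are assumptions for illustration, not the actual Thanos store gateway code:

```go
// Minimal sketch: record the timestamp of the last successful SyncBlocks run
// so an alert can fire when a store gateway goes stale, instead of the failure
// only being logged.
package main

import (
	"context"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var lastSuccessfulSync = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "storegw_last_successful_sync_timestamp_seconds", // hypothetical metric name
	Help: "Unix timestamp of the last successful block sync.",
})

// syncLoop wraps a (hypothetical) syncBlocks function and only advances the
// gauge on success, so staleness becomes visible to alerting.
func syncLoop(ctx context.Context, syncBlocks func(context.Context) error) {
	ticker := time.NewTicker(3 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := syncBlocks(ctx); err != nil {
				log.Printf("SyncBlocks failed: %v", err) // today this is roughly where it stops
				continue
			}
			lastSuccessfulSync.SetToCurrentTime()
		}
	}
}

func main() {
	prometheus.MustRegister(lastSuccessfulSync)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// syncBlocks would be the real bucket sync of a store gateway; stubbed here.
	syncLoop(ctx, func(context.Context) error { return nil })
}
```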

Threats to availability:

  • A single slow store makes even partial_response=1 queries fail. This is partially mitigated by --store.response-timeout, but that brings us to the next item.

  • A per-query timeout smaller than --store.response-timeout leads to query timeouts: store/proxy.go waits for all streams to finish or time out, which can happen after the query itself has already timed out. If the network/bucket has some kind of issue, I would rather get any (cached) result than no data at all (see the sketch after this list).

  • Imagine a simple Prometheus HA setup (N+2 replicas, multiple store gateways covering the timeline) where a single store/sidecar/Prometheus has issues (restart, crash, misconfigured IAM): Thanos will either fail the query (partial_response=0) or warn prominently about it (partial_response=1). This seems to defeat the purpose of an HA setup, which should be able to tolerate single-instance failures.
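
One possible shape for the timeout issue above is to cap the per-store wait by the query's remaining deadline, so waiting on a slow StoreAPI stream can never outlive the query itself. A minimal Go sketch, assuming a hypothetical storeTimeout helper rather than anything that exists in Thanos:

```go
// Sketch: bound the per-store response timeout by the query's own deadline.
package main

import (
	"context"
	"fmt"
	"time"
)

// storeTimeout returns the timeout to use for a single store call: the
// configured --store.response-timeout, capped by whatever is left of the
// query's deadline.
func storeTimeout(ctx context.Context, responseTimeout time.Duration) time.Duration {
	if deadline, ok := ctx.Deadline(); ok {
		if remaining := time.Until(deadline); remaining < responseTimeout {
			return remaining
		}
	}
	return responseTimeout
}

func main() {
	// Query timeout of 2s, configured store response timeout of 10s:
	// the store call is capped at ~2s, not 10s.
	queryCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	perStore := storeTimeout(queryCtx, 10*time.Second)
	fmt.Printf("per-store timeout capped to %v\n", perStore.Round(time.Second))
}
```

With something like this, the proxy could return whatever (possibly cached) data arrived before the query deadline instead of timing out the whole query.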


Describe the solution you'd like

I would like to reach the point where partial_response=0 queries are (more) correct yet can tolerate a single machine failure (because we have N+2 instances running), and where partial_response=1 honors the query timeout and gives genuine partial responses (e.g. only the cached ones).

Taking the topology into account in store/proxy.go could also be a latency (or load) improvement.
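
As an illustration of what "taking the topology into account" could mean, here is a hypothetical Go sketch where StoreAPI endpoints are grouped into replica groups covering the same data, and a partial_response=0 query is treated as complete as long as every group has at least one healthy responder. The types and names are invented for this sketch and are not an existing Thanos API:

```go
// Sketch: tolerate single-instance failures inside a replica group without
// failing or degrading the whole query.
package main

import "fmt"

type ReplicaGroup struct {
	Name      string
	Endpoints []string
}

// coveredWithoutGaps reports whether every replica group still has at least
// one endpoint that answered, i.e. a single failure inside an N+2 group does
// not have to fail a partial_response=0 query.
func coveredWithoutGaps(groups []ReplicaGroup, responded map[string]bool) bool {
	for _, g := range groups {
		ok := false
		for _, ep := range g.Endpoints {
			if responded[ep] {
				ok = true
				break
			}
		}
		if !ok {
			return false // a whole group is down: data for its range really is missing
		}
	}
	return true
}

func main() {
	groups := []ReplicaGroup{
		{Name: "prometheus-ha", Endpoints: []string{"sidecar-0:10901", "sidecar-1:10901"}},
		{Name: "storegw", Endpoints: []string{"store-0:10901", "store-1:10901"}},
	}
	responded := map[string]bool{"sidecar-1:10901": true, "store-0:10901": true}
	fmt.Println("query can be treated as complete:", coveredWithoutGaps(groups, responded))
}
```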


Describe alternatives you've considered

I have looked at the Cortex query component, but I am worried about the frontend and backend both having to sync the object storage (and agree on its view) and the complexities that come with that (e.g. slow restarts of a query frontend, having to manage disk space).



stale bot commented Feb 12, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Feb 12, 2021

@kakkoyun (Member) commented:

Thanks, @zecke, for the detailed explanation.

Changes such as this require a proposal. We'd be happy to review proposed improvements, so help wanted and PRs are welcome.

@GiedriusS (Member) commented:

If I understood this correctly, it could be solved by having something like promxy's server groups. But for that, we should first have the ability to specify StoreAPI nodes via config files, IMHO.
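
For illustration only, a speculative Go sketch of what such a config file (promxy-style server groups of StoreAPI nodes) could look like; the format and field names are made up and do not exist in Thanos:

```go
// Sketch of loading a hypothetical server-group config for StoreAPI nodes.
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

type ServerGroup struct {
	Name      string   `yaml:"name"`
	Endpoints []string `yaml:"endpoints"`
	// RequiredResponses could encode how many members must answer before the
	// group counts as covered (e.g. 1 for an HA pair).
	RequiredResponses int `yaml:"required_responses"`
}

type StoreConfig struct {
	Groups []ServerGroup `yaml:"groups"`
}

const example = `
groups:
  - name: prometheus-ha
    endpoints: ["sidecar-0:10901", "sidecar-1:10901"]
    required_responses: 1
  - name: storegw
    endpoints: ["store-0:10901", "store-1:10901"]
    required_responses: 1
`

func main() {
	var cfg StoreConfig
	if err := yaml.Unmarshal([]byte(example), &cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d server groups\n", len(cfg.Groups))
}
```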

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Jun 2, 2021

stale bot commented Jun 16, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale bot closed this as completed Jun 16, 2021