
sidecar: Allow Thanos backup when local compaction is enabled #206

Open
V3ckt0r opened this issue Feb 12, 2018 · 34 comments

@V3ckt0r
Contributor

V3ckt0r commented Feb 12, 2018

Hey @Bplotka @fabxc
Still pretty new to the Thanos code base. Going through it, one thing I've noticed about the backup behaviour is that it seems to only upload the initial two-hourly blocks. In my instance I've stood up Thanos against an existing Prometheus server. My data dir looks as follows:

data]# ls -ltr
total 108
drwxr-xr-x 3 root    root       4096 Jan 28 19:00 01C4Z28176WR17K7PH37K7FG9V
drwxr-xr-x 3 root    root       4096 Jan 29 13:00 01C5101JDEX1TK8CMSC4NQK8KP
drwxr-xr-x 3 root    root       4096 Jan 30 07:00 01C52XV3T2ETVSG93K73HVP6D1
drwxr-xr-x 3 root    root       4096 Jan 31 01:00 01C54VMN5R94ZM7N7F08J20DDA
drwxr-xr-x 3 root    root       4096 Jan 31 19:00 01C56SE59DGWBV587GG9M2W99W
drwxr-xr-x 3 root    root       4096 Feb  1 13:00 01C58Q7Q5G4R0DFY5HDGD3XC9Y
drwxr-xr-x 3 root    root       4096 Feb  2 07:00 01C5AN1885H7B9J1W11DXB137A
drwxr-xr-x 3 root    root       4096 Feb  3 01:00 01C5CJTST135PPXDYDWTKEPYTD
drwxr-xr-x 3 root    root       4096 Feb  3 19:00 01C5EGMASFVT63NC0QZTJSABFJ
drwxr-xr-x 3 root    root       4096 Feb  4 13:00 01C5GEDW21PTYQ8WNKYRCAVQJX
drwxr-xr-x 3 root    root       4096 Feb  5 07:00 01C5JC7DGMKPJVQD2DZH0JT7QJ
drwxr-xr-x 3 root    root       4096 Feb  6 01:00 01C5MA0ZC22NSD0H7S8G207JFF
-rw------- 1 root    root          6 Feb  6 12:29 lock
drwxr-xr-x 3 root    root       4096 Feb  6 19:00 01C5P7TGEMP97FSN779S5E5AYH
drwxr-xr-x 3 root    root       4096 Feb  7 13:00 01C5R5M1ANW6RYY81S91VC0F75
drwxr-xr-x 3 root    root       4096 Feb  8 07:00 01C5T3DJX67W411JZ9FP745B2Q
drwxr-xr-x 3 root    root       4096 Feb  9 01:00 01C5W173WNJA1F01TVJJPT5B93
drwxr-xr-x 3 root    root       4096 Feb  9 19:00 01C5XZ0MQ5BSY2W5K9KKAGF5N2
drwxr-xr-x 3 root    root       4096 Feb 10 13:00 01C5ZWT693CHSK8KEKBW04SSDX
drwxr-xr-x 3 root    root       4096 Feb 11 07:00 01C61TKQF3YXY1XT01JN2X0A3W
drwxr-xr-x 3 root    root       4096 Feb 12 01:00 01C63RD8T9Y2Z2C7G98YX9RBXV
drwxr-xr-x 3 root    root       4096 Feb 12 07:00 01C64D0D1Y2B9P3QHHH9XCF8NV
drwxr-xr-x 3 root    root       4096 Feb 12 09:00 01C64KW317054P6TNH8DCNGRP9
drwxr-xr-x 3 root    root       4096 Feb 12 11:00 01C64TQT974JFMQV24CH9060XW
drwxrwxrwx 2 root    root       4096 Feb 12 11:56 wal

When standing up Thanos I see:

./thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /opt/prometheus/promv2/data/ --s3.bucket=thanos --s3.endpoint=xxxxxxxx --s3.access-key=xxxxxxx --s3.secret-key=xxxxxx
level=info ts=2018-02-12T12:25:49.329654785Z caller=sidecar.go:293 msg="starting sidecar" peer=01C64ZMYG5728WQFXTVCD0F70V
level=info ts=2018-02-12T12:25:49.652116167Z caller=shipper.go:179 msg="upload new block" id=01C64KW317054P6TNH8DCNGRP9
level=info ts=2018-02-12T12:25:51.747570129Z caller=shipper.go:179 msg="upload new block" id=01C64TQT974JFMQV24CH9060XW

Only the last two blocks are uploaded. From the flags for thanos sidecar I don't see a mechanism for specifying a period for backdating. Perhaps I am doing something wrong? Is this intentional for some reason (compute/performance)? Or am I simply filing a feature request here?

Thanks.

@fabxc
Collaborator

fabxc commented Feb 12, 2018

Hey, thanks for trying out Thanos.

You are doing nothing wrong. Early on we hardcoded the sidecar to only upload blocks of compaction level 0, i.e. those that never got compacted.

With the garbage collection behavior the compactor has nowadays, it should be safe to also upload historic data and potentially double-upload some data without lasting consequences. We just haven't gotten around to changing the behavior yet.

In the meantime, you could just manually upload those blocks to the bucket of course.

@bwplotka
Member

bwplotka commented Feb 12, 2018

Yea, we could now safely drop the rule of uploading only blocks with compaction level 0. However, I am just curious: is there any use case or reason why anyone would want to do compaction at the local, Prometheus level instead of at the bucket level?

By default we recommend setting

--storage.tsdb.max-block-duration=...
--storage.tsdb.min-block-duration=...

to the same value to turn off local compaction entirely.
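
For illustration, a Prometheus invocation with local compaction effectively disabled might look like the sketch below. The 2h value and the config file path are examples only; the data path is the one used earlier in this thread.

    # Example only: disable local compaction by setting min == max block duration.
    prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/opt/prometheus/promv2/data \
      --storage.tsdb.min-block-duration=2h \
      --storage.tsdb.max-block-duration=2h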

If one decides to actually do local compaction, I can see an (unlikely, but possible) race condition: if the sidecar is unable to upload for some time for various reasons, Prometheus may have enough time to compact some blocks and delete the level-0 blocks. In that case nothing would be uploaded.

More importantly, our rule of uploading only level 0 makes it more difficult to use the thanos sidecar on already running Prometheus instances.

I think we should just upload all levels by default.

@V3ckt0r
Contributor Author

V3ckt0r commented Feb 12, 2018

Cool, cheers guys.

@Bplotka, regarding your question about local TSDB compaction: I'm guessing those who are not using an object store would still want local compaction, to save disk. I've heard of teams out there with a year+ worth of time series.

Thanks for flipping PR #207. I've tried testing this out in my fork, but I am not seeing the desired uploads.

@bwplotka
Member

bwplotka commented Feb 12, 2018

@V3ckt0r let's move this discussion to the #207 PR then.

I think you cannot see these uploads because Thanos marked them as "uploaded" in thanos.shipper.json (because they were compaction level 2+). The easiest way is to manually remove the blocks with compaction level 2+ which are NOT actually uploaded from the thanos.shipper.json uploaded list.
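
For reference, the shipper state file is a small JSON document roughly along these lines (structure shown from memory; field names may differ between versions). Removing a block's ULID from the uploaded list makes the shipper consider that block again. The ULIDs shown are the two blocks from the upload log earlier in this thread.

    $ cat /opt/prometheus/promv2/data/thanos.shipper.json
    {
      "version": 1,
      "uploaded": [
        "01C64KW317054P6TNH8DCNGRP9",
        "01C64TQT974JFMQV24CH9060XW"
      ]
    }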

I know that marking them as uploaded when they were not is a bit weird; maybe we should have named it "processed" initially.

@bwplotka
Member

Seems to work for @V3ckt0r

@bwplotka
Member

hm just rethinking... @V3ckt0r

@Bplotka, regarding your question about local TSDB compaction: I'm guessing those who are not using an object store would still want local compaction, to save disk. I've heard of teams out there with a year+ worth of time series.

In that case they can simply choose not to upload anything, and this issue will never occur (:

@bwplotka bwplotka reopened this Feb 13, 2018
@bwplotka
Member

OK, a new approach has emerged:
In v2.2.1 a min-block-duration compaction delay was added to Prometheus. This will help in 99% of cases to avoid conflicts between the Thanos hardlink-before-upload step and Prometheus local compaction, if enabled. To be even more sure we should use the Snapshot API, but that will require a special flag (admin endpoint) for Prometheus.

On every upload attempt Thanos should:

  1. Check whether all sources of compacted blocks are in object storage. Upload all compacted blocks that include missing sources. Obviously also upload all non-compacted blocks which are not in the object store.
  2. If the admin endpoint is enabled, trigger a snapshot to make sure no compaction is in progress at the same time (see the example below).
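
For illustration, triggering a snapshot through the Prometheus admin API looks roughly like this (the admin API must be enabled explicitly; the URL matches the sidecar example earlier in the thread):

    # Prometheus must be started with the TSDB admin API enabled:
    #   prometheus --web.enable-admin-api ...
    # An operator (or the sidecar) can then request a snapshot:
    curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
    # On success the response names a directory under <tsdb.path>/snapshots/
    # containing hardlinked, immutable copies of the current blocks.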

@fabxc Do we want to require the admin endpoint if the user configures local compaction, and otherwise just have the sidecar error out? I think so.
The question is what to do for versions prior to v2.2.1.

@bwplotka
Member

As @TimSimmons mentioned, we should also add more info to the docs on how to configure Prometheus for the best experience.

@bwplotka bwplotka changed the title thanos sidecar backup behaviour sidecar: Thanos backup behaviour when local compaction is enabled Mar 20, 2018
@mattbostock
Contributor

mattbostock commented Apr 13, 2018

I'm looking at how to integrate the sidecar with our existing Prometheus instances and I'm wondering whether the sidecar should try to only ship the most compacted blocks, within the limits of the data retention window.

The advantages:

  • Reduces the load on the global compactor in Thanos, thus reducing requests/bandwidth on the object storage.
  • Prometheus retains the performance gains achieved from local compaction.
  • Reduced configuration required in Prometheus to work with Thanos.

The disadvantages:

  • It's difficult to avoid race conditions between the sidecar and Prometheus' compaction. As mentioned above, we'd probably need to interact with an API in Prometheus to signal/coordinate between the sidecar and Prometheus.
  • We'd have to be careful to allow sufficient time to ship the data to object storage (including time for retries/failures) before the data retention horizon is reached. Instead of allowing the maximum amount of time to attempt to ship data to the object store, we'd be minimising that time window.
  • Uploads to the object store would be less frequent, but the uploads would be larger (rather than more frequent, smaller requests, which could be of concern for some on-premise installations in terms of disk/network throughput).
  • Data not stored in object storage will be retrieved from Prometheus via the sidecar, and since the data is older it may no longer be in-memory. Thus the retrieval latency may be higher, depending on the performance of the object store and round-trip time from the query instance to the sidecar.

@bwplotka
Member

Sorry for the delay, @mattbostock! Regarding the mentioned benefits:

Reduces the load on the global compactor in Thanos, thus reducing requests/bandwidth on the object storage.

True, but not sure if req/band of object store is actually an issue.

Prometheus retains the performance gains achieved from local compaction.

Is that really needed if you keep the scraper small (24h retention)?

Reduced configuration required in Prometheus to work with Thanos.

Don't get that, what would be simplified?

I am afraid all the disadvantages you mentioned are true, and they outweigh the benefits. There are a couple more problems:

  • We recently had problems with compaction being broken; it would be a lot harder to debug issues if the source of the problem could be not only the compactor but also the sidecars.
  • @jacksontj mentioned an interesting issue: fetching large series is not really efficient for Prometheus (JSON Marshaling of large responses is excessively expensive prometheus/prometheus#3601), so making the sidecar retention longer is not really beneficial.
  • The global compactor would still be required to downsample data and probably to compact data into longer blocks (2w).
  • We would need to enforce some kind of compaction levels, otherwise users would be able to shoot themselves in the foot by changing to non-standard levels (2.5h -> 8.5 days, etc.) that conflict with the global compactor.

@mattbostock
Contributor

mattbostock commented Apr 23, 2018

True, but not sure if req/band of object store is actually an issue.

Agree. #294 should help to determine the exact usage in access logs.

Is that really needed if you keep the scraper small (24h retention)?

Reducing Prometheus' data retention means that you're reducing the window for recovery during a disaster recovery scenario.

For example, if you're using an on-premise object store that has a catastrophic failure (e.g. datacentre goes up in flames) and Prometheus' data retention is 30 days, that gives you more time to configure a new object store in a new datacentre and configure Thanos to send the data to the new object store.

The same could apply to cloud storage if/when a provider has a significantly long outage.

There are mitigations for this (the most obvious being to not run a single object store in a single datacentre), but most of these are more complex and more costly than retaining data for longer in Prometheus (at least for on-premise installations). I'm not suggesting one over the other, but highlighting that the retention period is a factor to consider.

Reduced configuration required in Prometheus to work with Thanos.

Don't get that, what would be simplified?

Thanos currently requires that compaction is disabled in Prometheus, which means setting a command-line flag for Prometheus.

We would need to enforce some kind of compaction levels, otherwise users would be able to shoot themselves in the foot by changing to non-standard levels (2.5h -> 8.5 days, etc.) that conflict with the global compactor.

Good point, this is a significant downside.

@bwplotka
Member

bwplotka commented Apr 23, 2018

Cool. I see the disaster recovery goal for some users, but I'm not sure the Prometheus scraper should be treated as a backup solution for your on-premise/cloud/magic object storage. There must be dedicated tools for that.

The main use case I can see for local compaction is when a user wants to migrate to Thanos and upload all the old blocks already compacted by their vanilla Prometheus servers with long retention.

The easiest way would be to just attach a sidecar to the existing Prometheus server that is smart enough to detect which "sources" are missing in the object store. This way, we can allow local compaction for longer local storage if you wish, and upload the sources that are not uploaded yet. This is what I proposed here: #206 (comment)

There are some pain points here that need to be solved, though.

  • What if the object store has sources A and C but not B, and Prometheus has already compacted A, B and C together locally? We should upload ABC and deal with that on the compactor side (vertical compaction!).
  • Local compaction & upload at the same time could be racy -> we can leverage the snapshot API to avoid that.

If you really want to keep longer retention for Prometheus long term -> nothing blocks you from doing that with the above logic, except one more thing mentioned here: #283
and here: #82

Maybe it would be reasonable to add some arbitrary min-time flag to the sidecar to expose only fresh metrics?
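
If such a flag existed, usage could look roughly like the sketch below. The --min-time flag here is hypothetical at the time of this comment; only --prometheus.url and --tsdb.path are taken from the real sidecar example earlier in the thread.

    # Hypothetical sketch only: --min-time did not exist when this was written.
    ./thanos sidecar \
      --prometheus.url http://localhost:9090 \
      --tsdb.path /opt/prometheus/promv2/data/ \
      --min-time=-2d   # hypothetical: expose only the last 2 days of local blocks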

@bwplotka bwplotka changed the title sidecar: Thanos backup behaviour when local compaction is enabled sidecar: Allow Thanos backup when local compaction is enabled May 3, 2018
@bwplotka
Member

bwplotka commented May 25, 2018

Ok, migration goal moved to another ticket: #348

This ticket stays as the ability to run Prometheus with local compaction + sidecar for a long period and use it as is. However, this does not make that much sense, since longer retention is not good for this reason: #82

This blocker makes me think that this (long retention + local compaction for long-term usage) might not be in our scope.

@asbjxrn

asbjxrn commented Jul 11, 2018

A bit late perhaps, but without knowing the internals of the tsdb I think the suggestion from @mattbostock makes sense.

Compaction would be the limiting factor in how much data can be stored in a bucket. Not only does it have to download and upload all the data that gets stored in the bucket multiple times; unlike the sidecars, which upload in parallel, the compactor is supposed to run as a singleton. This can be solved by splitting the data into more buckets, but that means more complicated setups where users have to manage which servers go to which bucket.

To me it seems like most issues with uploading compacted blocks also apply to uploading raw blocks.
If all blocks are compacted by Prometheus instead of Thanos, wouldn't issues like #82 and #283 be reduced/simpler?
Performance issue prometheus/prometheus#3601 relates to the query returning gigabytes of data; is there any reason to think Thanos Store would handle that better?
Compaction levels are already enforced (no compaction), so enforcing them isn't something new.

Is global compaction still needed? The sidecars could downsample the data before uploading; I don't know about 2w blocks.

It's also worth considering that compacting before uploading would solve some issues too, like #377 and #271 (long time-range queries across all Prometheus sources while the compactor is running are very unreliable, as some blocks are almost bound to have been removed).

@bwplotka
Member

bwplotka commented Jul 13, 2018

Yea, local-only compaction would solve some issues, nice idea, but unfortunately:

Is global compaction still needed? The sidecars could downsample the data before uploading; I don't know about 2w blocks.

Yes.

Reasons:

  • We (and most likely lots of other users) have ingestion at a level where even 9d retention is too long, not to mention the additional downsampling procedure that takes lots of memory (though it might be optimized). That being said, not all compaction levels would be able to accumulate enough data to be compacted (like 2w).
  • The compactor is a singleton not without reason. You really want only one component in the system performing "delete" operations (this does not apply if your idea means the sidecar does all the work and uploads only fully compacted blocks).
  • It is really useful to have scrapers a lot lighter than a typical Prometheus with 30d retention and compaction enabled. It reduces cost and maintenance: you no longer need a 500GB or 1TB persistent SSD with additional backup logic for these (!). Having Prometheus be just a scraper with a ~1d buffer makes life easier.

I totally see that we want to allow local compaction with Thanos for some use cases, but we need to invest some time to solve it, and I think WITH the global compactor ):

@bwplotka
Member

BTW:

Performance issue prometheus/prometheus#3601 relates to the query returning gigabytes of data; is there any reason to think Thanos Store would handle that better?

  1. Definitely not worse than Prometheus itself -> if it is improved for Prometheus, Thanos will gain as well.
  2. Yes - it will handle it better. We have downsampling. (:

@asbjxrn

asbjxrn commented Jul 13, 2018

We (and most likely lots of other users) have ingestion at a level where even 9d retention is too long, not to mention the additional downsampling procedure that takes lots of memory (though it might be optimized). That being said, not all compaction levels would be able to accumulate enough data to be compacted (like 2w).

Yup, one drawback is of course that it means Prometheus needs longer retention times, which may not work for all deployments.

FWIW, I am experimenting with this by adding a flag to the sidecar that specifies what compaction level the sidecar uploads. I plan to upload only compaction level 5, where a block has about a week's worth of data.

I then won't run the global compactor, but will run "thanos downsample" or try to patch the sidecar to downsample before upload (which would have the drawback of only one level of downsampling...). I understand this is not the direction the project wants to go for several reasons, but I think it's a worthwhile experiment that could provide some useful data, as the alternative for us is to split the data into several buckets so compaction can keep up (and I really like the simplicity of Thanos and want to keep the deployment simple too :)

@asbjxrn

asbjxrn commented Sep 5, 2018

Just an FYI.
We had issues with compaction performance: starting with a 1.5 week backlog, it took almost 6 weeks for compaction to catch up. This was partly due to crashes and restarts caused by occasional timeouts, as well as downtime when the disk filled up because the compactor did not always clean up after itself.

It then started the downsampling process, which I estimated would take another 3 weeks to complete before the cycle would start over.

I then aborted the whole process, let Prometheus start compacting the data, got a new bucket, and added a small patch to the sidecar to upload only blocks where compaction level == *flagShipperLevel. With a 3 week backlog, the whole compaction and upload process now took less than 4 hours.

While one solution is to have multiple buckets to distribute the compaction load that way, would it be possible to add a flag to the sidecar to only upload blocks at a certain compaction level, for those that have large enough Prometheus servers?
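
As a sketch of what that could look like on the command line, mirroring the *flagShipperLevel patch described above; the flag name below is hypothetical and not an upstream Thanos flag, while the other flags are taken from the sidecar example earlier in the thread:

    # Hypothetical flag, not part of upstream Thanos.
    ./thanos sidecar \
      --prometheus.url http://localhost:9090 \
      --tsdb.path /opt/prometheus/promv2/data/ \
      --shipper.upload-compaction-level=5   # hypothetical: only ship level-5 blocks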

@bwplotka
Member

Totally missed this, sorry.

While one solution is to have multiple buckets to distribute the compaction load that way, would it be possible to add a flag to the sidecar to only upload blocks at a certain compaction level, for those that have large enough Prometheus servers?

Yes, definitely a valid use case, probably for a separate issue. It is also useful as a one-time job. We are actually working on it as part of observatorium/thanos-replicate#7.

We also added sharding for the compactor, so you can deploy many compactors that operate on different blocks.
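
A sketch of such a sharded deployment is below. The --selector.relabel-config flag exists in later Thanos versions for selecting blocks by external labels; the label name "cluster", the regex, and the file paths are assumptions for illustration only.

    # One of several compactor instances, each keeping a disjoint subset of blocks
    # selected by external labels (label name and regex are illustrative).
    thanos compact \
      --data-dir=/var/thanos/compact \
      --objstore.config-file=bucket.yml \
      --selector.relabel-config='
        - action: keep
          source_labels: ["cluster"]
          regex: "prod-eu-.*"'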

@daixiang0
Member

@bwplotka would it be better to add the compactor label?

@stale

stale bot commented Feb 8, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 8, 2020
@stale stale bot closed this as completed Feb 15, 2020
@bwplotka
Member

It would be amazing to get back to this. @daixiang0: no, as it's about Prometheus compaction, not the Compactor really.

@bwplotka
Member

I think the solution is vertical compaction, which is quite stable as long as the data is 1:1.

@daixiang0
Member

/reopen

@yeya24
Contributor

yeya24 commented Oct 16, 2021

Question:
I can see that we can now safely upload overlapping or compacted blocks to the object storage, as we already support vertical compaction.
But the question remains on the shipper side:
Is it safe to enable local Prometheus compaction? Can this race condition happen? (I can see that it is possible.)

How can we solve this via the Snapshot API? Using a snapshot still seems not 100% safe, as we cannot control TSDB compaction behavior on the Prometheus side; a new block might be compacted & deleted as soon as it is created. Maybe we could have something like a compact-deletion-delay in Prometheus to keep compaction source blocks around for some time?

Besides, the current snapshot API is not flexible enough, as we usually only want to snapshot the newly created block.

@yeya24 yeya24 reopened this Oct 16, 2021
@GiedriusS
Member

Maybe my idea is stupid but in Prometheus blocks we already have:

        "compaction": {
                "level": 1,
                "sources": [
                        "01FJZD9XCD6394RK2PZ3181CP5"
                ]
        },

Given that the Thanos sidecar also "knows" locally which block IDs have already been uploaded, maybe it could look at the same meta.json file to see whether all of the sources have already been uploaded?

We already have thanos.shipper.json with the list of IDs. If a user deletes that file, then all bets are off. Also, we cannot do anything after the blocks have been uploaded to remote object storage, because they might get picked up by the Thanos compactor. So, with vertical compaction + such checking as outlined above, we could safely enable local compaction, with the caveat that vertical compaction might happen if a user deletes thanos.shipper.json?
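
A rough shell sketch of that check, assuming the shipper file has an "uploaded" list as in the earlier example and that jq is installed; the data path is the one from the beginning of the thread, and this is purely illustrative:

    # List compaction sources of local blocks that are not yet recorded as uploaded.
    cd /opt/prometheus/promv2/data
    jq -r '.compaction.sources[]' */meta.json | sort -u > /tmp/local-sources
    jq -r '.uploaded[]' thanos.shipper.json | sort -u > /tmp/uploaded-blocks
    comm -23 /tmp/local-sources /tmp/uploaded-blocks   # sources missing from the uploaded list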

@stale

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 9, 2022
@stale

stale bot commented Mar 2, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Mar 2, 2022
@GiedriusS GiedriusS removed the stale label Mar 3, 2022
@GiedriusS GiedriusS reopened this Mar 3, 2022
@prymitive

We have storage.tsdb.min-block-duration & storage.tsdb.max-block-duration set because we use the sidecar with Prometheus, and the side effect of this seems to be that all HEAD chunks reach max duration at the same time, which creates a huge number of new chunks at once, which in turn slows down the HEAD and causes timeouts in rule evaluation.
It would be great if Thanos didn't require setting storage.tsdb.min-block-duration & storage.tsdb.max-block-duration to the same value, as that would help avoid this issue.

@GiedriusS
Member

I agree. @fpetkovski is looking into this at the moment, I believe.

@fpetkovski
Contributor

fpetkovski commented May 2, 2022

This is what I have understood so far:

  1. From a Thanos compactor perspective, it should be safe to upload already compacted blocks.
  2. From a Thanos sidecar perspective, it is still unclear to me whether there is a potential race condition where the Prometheus compactor can delete a block before the sidecar has had a chance to upload it. However, even if that happens, because of 1 I don't think it should be a problem.

I might enable this setup on one of our staging clusters and monitor for a bit to see if anything suspicious comes up.

@yeya24
Contributor

yeya24 commented May 2, 2022

Prometheus excludes the most recent block from compaction planning, which means the sidecar has a 2h window to upload a block. I think 2h is usually enough.

@prymitive

Another downside of the forced --storage.tsdb.(min|max)-block-duration flags is that if we stop Prometheus from merging blocks together, then the index size becomes a very noticeable overhead.

Let's say I have the default retention of 15d and each block has 1GB of index data. Without those flags Prometheus will merge blocks up to 10% of 15d = 36h; with --storage.tsdb.(min|max)-block-duration=2h blocks will stay at 2h.
This means that with my 15d retention I'll have 10 index files of 1GB each if I don't force 2h blocks, and 12*15 = 180 index files of 1GB each if I force 2h blocks. That's 10GB vs 180GB of disk space and files to read into memory when running queries.
It could save a lot of resources if Thanos was able to get rid of those flags.

@stale

stale bot commented Sep 21, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 21, 2022