Compactor/S3: sometimes segments are not being fully downloaded ("segment doesn't include enough bytes") #2805
This is not a retryable error, since we don't have enough data on disk when compaction runs and this occurs. We have been running into this for the past month or so, but only on our staging environment. We have found after #2637 that:
My best theory, for now, is that the transfer gets cut off abruptly. After that PR, you should be able to inspect the state of files on disk and see whether this happens for you as well. Help welcome!
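To make the "inspect the state of files on disk" part concrete, here is a minimal sketch of such a check: compare each downloaded chunk segment file against the size the object store reports for it. This is only an illustration, not the code from #2637; `truncatedSegments`, the `expectedSizes` map, and the sample size in `main` are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// truncatedSegments lists chunk segment files under <blockDir>/chunks whose
// on-disk size differs from the size the object store reported for the same
// object. expectedSizes maps segment file names (e.g. "000001") to the remote
// object sizes looked up beforehand.
func truncatedSegments(blockDir string, expectedSizes map[string]int64) ([]string, error) {
	entries, err := os.ReadDir(filepath.Join(blockDir, "chunks"))
	if err != nil {
		return nil, err
	}
	var bad []string
	for _, e := range entries {
		want, ok := expectedSizes[e.Name()]
		if !ok {
			continue // no remote size known for this file, nothing to compare
		}
		info, err := e.Info()
		if err != nil {
			return nil, err
		}
		if info.Size() != want {
			bad = append(bad, fmt.Sprintf("%s: %d bytes on disk, %d bytes in object storage",
				e.Name(), info.Size(), want))
		}
	}
	return bad, nil
}

func main() {
	// Block directory taken from the error message below; the expected size
	// is a hypothetical value for illustration.
	bad, err := truncatedSegments(
		"/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B",
		map[string]int64{"000001": 41943044})
	if err != nil {
		panic(err)
	}
	fmt.Println(bad)
}
```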
I further analyzed one of the halts:
thanos-store did not have any issues with this block. With two days in between, I do not really suspect eventual consistency. This looks more like connection issues during the download, but shouldn't the client library recognize such issues? If there's anything more we can do, we're glad to jump in. For now, we'll likely switch to …
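On the client-library question: one defensive option is for the download path itself to compare the number of bytes that actually arrived with the size the object store advertised, and fail (or retry) on a mismatch. A hedged sketch of such a guard; `downloadVerified` and the `expectedSize` argument are my assumptions, not the minio-go or Thanos objstore API:

```go
package download

import (
	"fmt"
	"io"
	"os"
)

// downloadVerified copies an object stream to dst and fails loudly when fewer
// bytes than expected arrive, instead of leaving a silently truncated segment
// file behind for the compactor to trip over. expectedSize would come from a
// stat/HEAD call against the object store before the download.
func downloadVerified(r io.Reader, dst string, expectedSize int64) error {
	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()

	n, err := io.Copy(f, r)
	if err != nil {
		return fmt.Errorf("copy to %s: %w", dst, err)
	}
	if n != expectedSize {
		// A caller could delete dst and retry the download here rather
		// than handing a truncated segment to the compactor.
		return fmt.Errorf("truncated download of %s: got %d bytes, want %d", dst, n, expectedSize)
	}
	return nil
}
```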
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Still an issue for us, and automatically restarting the compactor instead of letting it halt makes compaction continue pretty much immediately without any issues.
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
We are seeing the same issue. We have 200+ compactors and this is happening in all of them. Restarting the pod fixes the issue temporarily.
Are you using S3 as well? What is implementing this API underneath, i.e. are you using AWS or something else?
@GiedriusS We are using the Ceph object store, which is S3-compatible.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
We are experiencing the same issue with Thanos v0.17.2 and the Ceph Object Gateway.
From a (very) quick look at MinIO internals, I suspect this is the case as well.
Should be fixed with #3795. |
Still caught the same error during compaction. I'm wondering whether there is a way to overcome this issue, or to delete only the bad series using the bucket rewrite tool.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Hi, |
Thanos, Prometheus and Golang version used:
Thanos 0.13 rc2, currently upgrading to 0.13, but the commits since the tag was cut do not indicate any relevant changes.
Object Storage Provider:
Internal Cloudian S3 service
What happened:
Compactor halts with the following error message:
level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"
After restarting the compactor, compaction continues successfully (usually hours later; we only get warned if the compactor does not run for quite some time, so I cannot tell whether this is a fairly short inconsistency window and a lot of bad luck).
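For what it's worth, available:41943040 in that message is exactly 40 MiB and required is only 4 bytes more, which at least fits the truncated-transfer theory from above. Below is a simplified illustration of the bounds check behind the error; it is not the actual Prometheus TSDB code, and the 35-byte chunk length is made up to reproduce the numbers:

```go
package main

import "fmt"

// chunkInBounds mimics, in simplified form, the check that produces the
// error above: a chunk reference needs bytes up to chunkOffset+chunkLen,
// but the (truncated) segment file only contains segmentSize bytes.
func chunkInBounds(chunkOffset, chunkLen, segmentSize int64) error {
	required := chunkOffset + chunkLen
	if required > segmentSize {
		return fmt.Errorf("segment doesn't include enough bytes to read the chunk - required:%d, available:%d",
			required, segmentSize)
	}
	return nil
}

func main() {
	// The chunk at offset 41943009 with a (hypothetical) 35-byte payload
	// needs 41943044 bytes, but the segment on disk stops at 41943040,
	// exactly 40 MiB.
	fmt.Println(chunkInBounds(41943009, 35, 41943040))
}
```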
What you expected to happen:
If this really is a transient issue caused by eventual consistency, retry instead of halting.
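A rough sketch of what that could look like: retry only this specific failure (after re-downloading the blocks), and keep halting on everything else. `runWithRetry`, the string match, and the backoff are my own illustration, not the compactor's actual halt/retry classification:

```go
package compactretry

import (
	"context"
	"strings"
	"time"
)

// runWithRetry retries a compaction group when the failure looks like the
// truncated-download error, instead of halting the whole compactor.
// compactGroup is assumed to re-download its blocks on each attempt; the
// string match is a placeholder for whatever the real compactor would key on.
func runWithRetry(ctx context.Context, attempts int, compactGroup func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = compactGroup(ctx); err == nil {
			return nil
		}
		// Anything other than the suspected transient truncation still
		// fails immediately, as it does today.
		if !strings.Contains(err.Error(), "segment doesn't include enough bytes") {
			return err
		}
		select {
		case <-time.After(time.Duration(i+1) * 30 * time.Second): // simple backoff
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```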
How to reproduce it (as minimally and precisely as possible):
Slightly overload your S3 service and wait. We see an estimated failure rate of about once a week per compactor. It mostly happens in installations with some network distance (ping >= 200 ms).
Full logs to relevant components:
Anything else we need to know:
Pretty sure this is related to #2128; that issue, prometheus/prometheus#6040, and our own observations make me think this might be a transient issue. Hard to tell, since we cannot really reproduce it. Is #2637 maybe fixing this already?