Fix deadlock in metadata fetcher (#8092)
This is a fix for a bug we encountered in our production deployment at $WORK, where we observed Thanos Store spontaneously stop refreshing its metadata cache whenever our (Ceph-based) object storage rate-limits Thanos's requests too aggressively. The Store then permanently stops discovering new blocks in the bucket and keeps trying to access old, long-gone blocks (which have already been compacted and removed), breaking subsequent queries to the Store and requiring a manual restart. More details about the bug and the fix below.

Initially, the `GetActiveAndPartialBlockIDs` method spawns a fixed number of 64 goroutines to issue concurrent requests to the remote storage. Each time the storage causes an `f.bkt.Exists` call to return an error, one of the goroutines exits, returning said error and reducing the concurrency with which the remaining block IDs in `metaChan` are processed. While this is not a big problem in the case of one or two errors, it is entirely possible that, during prolonged storage problems, all 64 goroutines quit, so `metaChan` fills up and blocks the `f.bkt.Iter` iterator below. The whole method then hangs indefinitely, even if the storage becomes fully operational again.

This commit fixes the issue by letting the iterator return as soon as a single processing goroutine errors out, so that the method can reliably finish and return the error as intended. The processing goroutines are adjusted as well, so that they all quit early without consuming the remaining items in `metaChan`. While the latter is not strictly necessary to fix this bug, there is no point in letting the remaining goroutines keep issuing requests to the storage once the method is already bound to return a nil result along with the first encountered error.
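To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern, not the actual Thanos code: `fetchBlocks`, `check`, `idCh`, and the small `concurrency` constant are hypothetical stand-ins for `GetActiveAndPartialBlockIDs`, `f.bkt.Exists`, `metaChan`, and the pool of 64 goroutines. It models the fix with context cancellation; the real patch may differ in mechanics, but the idea is the same: the first worker error must also unblock the producer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// fetchBlocks sketches the fixed pattern: a producer feeds IDs into a
// bounded channel while a fixed pool of workers consumes them. In the
// buggy version, workers that hit an error simply returned; once all of
// them had exited, the bounded channel filled up and the producer blocked
// forever. Here, the first error cancels a shared context, which (a) lets
// the producer stop instead of blocking on a full channel and (b) tells
// the remaining workers to quit without issuing further storage requests.
func fetchBlocks(ctx context.Context, ids []string, check func(string) error) error {
	const concurrency = 4        // stands in for the fetcher's 64 goroutines
	idCh := make(chan string, 8) // bounded, like metaChan

	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return // another worker failed; quit without draining idCh
				case id, ok := <-idCh:
					if !ok {
						return // producer finished and closed the channel
					}
					if err := check(id); err != nil {
						mu.Lock()
						if firstErr == nil {
							firstErr = err
						}
						mu.Unlock()
						cancel() // unblock the producer, stop the other workers
						return
					}
				}
			}
		}()
	}

	// Producer (the iterator's role): also watch ctx.Done(), so a full
	// channel can no longer block this loop once the workers have quit.
produce:
	for _, id := range ids {
		select {
		case idCh <- id:
		case <-ctx.Done():
			break produce
		}
	}
	close(idCh)
	wg.Wait()

	if firstErr != nil {
		return firstErr
	}
	return ctx.Err() // non-nil only if the caller's context was cancelled
}

func main() {
	err := fetchBlocks(context.Background(), []string{"a", "b", "c", "d"},
		func(id string) error {
			if id == "b" {
				return errors.New("storage rate-limited")
			}
			return nil
		})
	fmt.Println("result:", err)
}
```

The key design point is that cancellation is observed on both sides of the bounded channel: the producer can never block forever on a send, and idle workers stop issuing storage requests once the outcome of the call is already decided.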
Signed-off-by: Piotr Śliwka <[email protected]>