Skip to content

Commit

Permalink
Fix deadlock in metadata fetcher (#8092)
Browse files Browse the repository at this point in the history
This is a fix for a bug we encountered in our production deployment at
$WORK, where we experience Thanos Store to spontaneously stop refreshing
its metadata cache every time our (Ceph-based) object storage starts
rate-limiting Thanos's requests too much. This causes the Store to
permanently stop discovering new blocks in the bucket, and keep trying
to access old, long-gone blocks (which have already been compacted and
removed), breaking subsequent queries to the Store and mandating its
manual restart. More details about the bug and the fix below.

Initially, the `GetActiveAndPartialBlockIDs` method spawns fixed number
of 64 goroutines to issue concurrent requests to a remote storage. Each
time the storage causes the `f.bkt.Exists` call to return an error, one
of goroutines exits returning said error, effectively reducing
concurrency of processing remaining block IDs in `metaChan`. While it's
not a big problem in case of one or two errors, it is entirely possible
that, in case of prolonged storage problems, all 64 goroutines quit,
resulting in `metaChan` filling up and blocking the `f.bkt.Iter`
iterator below. This causes the whole method to be stuck indefinitely,
even if the storage becomes fully operational again.

This commit fixes the issue by allowing the iterator to return as soon
as a single processing goroutine errors out, so that the method can
reliably finish, returning the error as intended. Additionally, the
processing goroutines are adjusted as well, to make them all quit early
without consuming remaining items in `metaChan`. While the latter is not
strictly necessary to fix this bug, it doesn't make sense to let any
remaining goroutines keep issuing requests to the storage if the method
is already bound to return nil result along with the first encountered
error.

Signed-off-by: Piotr Śliwka <[email protected]>
  • Loading branch information
psliwka authored Feb 7, 2025
1 parent c25a356 commit a13fc75
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions pkg/block/fetcher.go
Original file line number Diff line number Diff line change
Expand Up @@ -258,8 +258,8 @@ func (f *ConcurrentLister) GetActiveAndPartialBlockIDs(ctx context.Context, ch c
continue
}
select {
case <-ctx.Done():
return ctx.Err()
case <-gCtx.Done():
return gCtx.Err()
case ch <- uid:
}
}
Expand All @@ -273,8 +273,8 @@ func (f *ConcurrentLister) GetActiveAndPartialBlockIDs(ctx context.Context, ch c
return nil
}
select {
case <-ctx.Done():
return ctx.Err()
case <-gCtx.Done():
return gCtx.Err()
case metaChan <- id:
}
return nil
Expand Down

0 comments on commit a13fc75

Please sign in to comment.