Fix deadlock in metadata fetcher (#8092)
This is a fix for a bug we encountered in our production deployment at $WORK, where we observed Thanos Store spontaneously stop refreshing its metadata cache whenever our (Ceph-based) object storage rate-limits Thanos's requests too aggressively. The Store then permanently stops discovering new blocks in the bucket and keeps trying to access old, long-gone blocks (which have already been compacted and removed), breaking subsequent queries to the Store and requiring a manual restart. More details about the bug and the fix below.

Initially, the `GetActiveAndPartialBlockIDs` method spawns a fixed number of 64 goroutines to issue concurrent requests to the remote storage. Each time the storage causes an `f.bkt.Exists` call to return an error, one of the goroutines exits, returning said error and reducing the concurrency with which the remaining block IDs in `metaChan` are processed. While this is not a big problem in the case of one or two errors, it is entirely possible that, during prolonged storage problems, all 64 goroutines quit, so `metaChan` fills up and blocks the `f.bkt.Iter` iterator below. The whole method then hangs indefinitely, even if the storage becomes fully operational again.

This commit fixes the issue by letting the iterator return as soon as a single processing goroutine errors out, so that the method can reliably finish and return the error as intended. The processing goroutines are adjusted as well, so that they all quit early without consuming the remaining items in `metaChan`. While the latter is not strictly necessary to fix this bug, there is no point in letting the remaining goroutines keep issuing requests to the storage once the method is already bound to return a nil result along with the first encountered error.
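To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern, not the actual Thanos code: `fetchBlocks`, `check`, `idCh`, and the small `concurrency` constant are hypothetical stand-ins for `GetActiveAndPartialBlockIDs`, `f.bkt.Exists`, `metaChan`, and the pool of 64 goroutines. It models the fix with context cancellation; the real patch may differ in mechanics, but the idea is the same: the first worker error must also unblock the producer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// fetchBlocks sketches the fixed pattern: a producer feeds IDs into a
// bounded channel while a fixed pool of workers consumes them. In the
// buggy version, workers that hit an error simply returned; once all of
// them had exited, the bounded channel filled up and the producer blocked
// forever. Here, the first error cancels a shared context, which (a) lets
// the producer stop instead of blocking on a full channel and (b) tells
// the remaining workers to quit without issuing further storage requests.
func fetchBlocks(ctx context.Context, ids []string, check func(string) error) error {
	const concurrency = 4        // stands in for the fetcher's 64 goroutines
	idCh := make(chan string, 8) // bounded, like metaChan

	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return // another worker failed; quit without draining idCh
				case id, ok := <-idCh:
					if !ok {
						return // producer finished and closed the channel
					}
					if err := check(id); err != nil {
						mu.Lock()
						if firstErr == nil {
							firstErr = err
						}
						mu.Unlock()
						cancel() // unblock the producer, stop the other workers
						return
					}
				}
			}
		}()
	}

	// Producer (the iterator's role): also watch ctx.Done(), so a full
	// channel can no longer block this loop once the workers have quit.
produce:
	for _, id := range ids {
		select {
		case idCh <- id:
		case <-ctx.Done():
			break produce
		}
	}
	close(idCh)
	wg.Wait()

	if firstErr != nil {
		return firstErr
	}
	return ctx.Err() // non-nil only if the caller's context was cancelled
}

func main() {
	err := fetchBlocks(context.Background(), []string{"a", "b", "c", "d"},
		func(id string) error {
			if id == "b" {
				return errors.New("storage rate-limited")
			}
			return nil
		})
	fmt.Println("result:", err)
}
```

The key design point is that cancellation is observed on both sides of the bounded channel: the producer can never block forever on a send, and idle workers stop issuing storage requests once the outcome of the call is already decided.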
Signed-off-by: Piotr Śliwka <[email protected]>