Check cached postings TTL before returning from cache + metrics #822

pracucci · 2025-01-17T16:41:31Z

In this PR I propose two changes to the PostingsForMatchers cache:

1. Check whether the TTL of cached postings is still valid before returning them from the cache. This fixes a race condition between a goroutine running expire() and another goroutine that skips expire() because it's already in progress.
2. Add metrics to the PostingsForMatchers cache, to get better visibility into it. This is something I have wanted to do for a long time. The design I picked allows passing the metrics struct via the DB options, so that we can use a single struct for all per-tenant TSDBs in a Mimir ingester.
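
For readers following along, here is a minimal sketch of the TTL check described in the first change. It is not the code in tsdb/postings_for_matchers_cache.go: the `promise`, `cache`, `lookup` and `ttlCheckExample` names are hypothetical simplifications, and only the `done` channel, `evaluationCompletedAt`, `timeNow` and `ttl` mirror identifiers visible in the diff. The idea is that a lookup treats a completed-but-expired entry as a miss even if expire() has not evicted it yet.

```go
package main

import (
	"sync"
	"time"
)

// promise is a hypothetical stand-in for a cached postings evaluation.
type promise struct {
	done                  chan struct{} // closed once the evaluation has finished
	evaluationCompletedAt time.Time     // only safe to read after done is closed
}

// cache is a simplified, hypothetical cache keyed by string.
type cache struct {
	ttl     time.Duration
	timeNow func() time.Time
	calls   sync.Map // key -> *promise
}

// lookup returns the cached promise for key. A promise that has completed
// and whose TTL has elapsed is reported as a miss, even if it has not been
// evicted yet (for example because another goroutine is still evicting).
func (c *cache) lookup(key string) (*promise, bool) {
	v, ok := c.calls.Load(key)
	if !ok {
		return nil, false
	}
	p := v.(*promise)

	select {
	case <-p.done:
		if c.timeNow().Sub(p.evaluationCompletedAt) >= c.ttl {
			// The cached promise already expired, but it has not been evicted.
			return nil, false
		}
	default:
		// Still in flight: callers can share it and wait on p.done.
	}
	return p, true
}

// ttlCheckExample stores one completed promise and looks it up before and
// after the TTL has elapsed, using a fake clock.
func ttlCheckExample() (hitBeforeTTL, hitAfterTTL bool) {
	now := time.Unix(1700000000, 0)
	c := &cache{ttl: 10 * time.Second, timeNow: func() time.Time { return now }}

	p := &promise{done: make(chan struct{}), evaluationCompletedAt: now}
	close(p.done)
	c.calls.Store("k", p)

	_, hitBeforeTTL = c.lookup("k")
	now = now.Add(10 * time.Second) // the timeNow closure observes the new value
	_, hitAfterTTL = c.lookup("k")
	return hitBeforeTTL, hitAfterTTL
}
```

In this sketch the expired entry is simply reported as a miss; what the caller does next (re-evaluating and replacing the entry) is left out.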

The following benchmark shows the difference between:

01bb37aae: the commit before Reduce PostingsForMatchersCache.expire() pressure on mutex #734
6c2603082: the commit at Reduce PostingsForMatchersCache.expire() pressure on mutex #734
main
This PR

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Apple M3 Pro
                                                          │ 01bb37aae.txt │           6c2603082.txt            │              main.txt              │               pr.txt               │
                                                          │    sec/op     │   sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │
PostingsForMatchersCache/no_evictions-11                      594.8n ± 3%   584.2n ± 1%   -1.78% (p=0.015 n=6)   602.9n ± 3%        ~ (p=0.589 n=6)   644.6n ± 1%   +8.38% (p=0.002 n=6)
PostingsForMatchersCache/high_eviction_rate-11                10.52µ ± 5%   10.39µ ± 1%   -1.28% (p=0.002 n=6)   10.85µ ± 1%        ~ (p=0.065 n=6)   10.77µ ± 0%        ~ (p=0.065 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11    1411.5n ± 2%   301.1n ± 1%  -78.67% (p=0.002 n=6)   306.2n ± 1%  -78.31% (p=0.002 n=6)   331.2n ± 2%  -76.54% (p=0.002 n=6)
geomean                                                       2.067µ        1.222µ       -40.86%                 1.260µ       -39.03%                 1.320µ       -36.15%

                                                          │ 01bb37aae.txt │             6c2603082.txt             │               main.txt                │               pr.txt                │
                                                          │     B/op      │     B/op      vs base                 │     B/op      vs base                 │     B/op      vs base               │
PostingsForMatchersCache/no_evictions-11                       958.0 ± 0%     958.0 ± 0%        ~ (p=1.000 n=6) ¹     958.0 ± 0%        ~ (p=1.000 n=6) ¹     974.0 ± 0%   +1.67% (p=0.002 n=6)
PostingsForMatchersCache/high_eviction_rate-11               26.38Ki ± 0%   26.38Ki ± 0%        ~ (p=1.000 n=6) ¹   26.38Ki ± 0%        ~ (p=1.000 n=6) ¹   26.32Ki ± 0%   -0.24% (p=0.002 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11     1441.0 ± 0%    1017.0 ± 0%  -29.42% (p=0.002 n=6)      1014.0 ± 0%  -29.63% (p=0.002 n=6)      1024.5 ± 0%  -28.90% (p=0.002 n=6)
geomean                                                      3.263Ki        2.905Ki       -10.97%                   2.902Ki       -11.05%                   2.926Ki       -10.33%
¹ all samples are equal

                                                          │ 01bb37aae.txt │            6c2603082.txt            │              main.txt               │               pr.txt                │
                                                          │   allocs/op   │ allocs/op   vs base                 │ allocs/op   vs base                 │ allocs/op   vs base                 │
PostingsForMatchersCache/no_evictions-11                       20.00 ± 0%   20.00 ± 0%        ~ (p=1.000 n=6) ¹   20.00 ± 0%        ~ (p=1.000 n=6) ¹   20.00 ± 0%        ~ (p=1.000 n=6) ¹
PostingsForMatchersCache/high_eviction_rate-11                 48.00 ± 0%   48.00 ± 0%        ~ (p=1.000 n=6) ¹   48.00 ± 0%        ~ (p=1.000 n=6) ¹   46.00 ± 0%   -4.17% (p=0.002 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11      29.00 ± 0%   21.00 ± 0%  -27.59% (p=0.002 n=6)     21.00 ± 0%  -27.59% (p=0.002 n=6)     20.00 ± 5%  -31.03% (p=0.002 n=6)
geomean                                                        30.31        27.22       -10.20%                   27.22       -10.20%                   26.40       -12.89%
¹ all samples are equal

charleskorn

Approach LGTM.

It'd be interesting to benchmark this and compare the cost of checking the done channel with the approach prior to #734.

charleskorn · 2025-01-19T22:55:19Z

tsdb/postings_for_matchers_cache.go

+			case <-oldPromise.done:
+				if c.timeNow().Sub(oldPromise.evaluationCompletedAt) >= c.ttl {
+					// The cached promise already expired, but it has not been evicted.
+					// TODO trace + metric


(not something for this PR) Metrics for the PFMC in general would be handy - would be nice to see the rate of cache hits and misses, as well as the rate of cache evictions and their reason (TTL vs cache growing too large).

> as well as the rate of cache evictions and their reason (TTL vs cache growing too large)

Will do in a separate PR.
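
As an illustration of the hit/miss tracking discussed above, and of sharing one metrics struct across several caches, here is a rough sketch. It uses plain atomic counters so that it stays self-contained; the actual PR presumably uses Prometheus counters (the diff calls `c.metrics.hits.Inc()`). The `cacheMetrics`, `tenantCache` and `sharedMetricsExample` names are hypothetical, and eviction tracking is left out.

```go
package main

import "sync/atomic"

// cacheMetrics is a hypothetical, simplified stand-in for the metrics struct.
// One instance can be shared by many caches, so the counters aggregate
// across all of them.
type cacheMetrics struct {
	requests atomic.Int64
	hits     atomic.Int64
	misses   atomic.Int64
}

// tenantCache is a hypothetical per-tenant cache. The map is not protected
// by a lock, so this sketch is only meant for single-goroutine use.
type tenantCache struct {
	metrics *cacheMetrics
	entries map[string]string
}

func newTenantCache(m *cacheMetrics) *tenantCache {
	return &tenantCache{metrics: m, entries: map[string]string{}}
}

func (c *tenantCache) put(key, value string) {
	c.entries[key] = value
}

// get looks up key and records the outcome in the shared metrics.
func (c *tenantCache) get(key string) (string, bool) {
	c.metrics.requests.Add(1)
	v, ok := c.entries[key]
	if ok {
		c.metrics.hits.Add(1)
	} else {
		c.metrics.misses.Add(1)
	}
	return v, ok
}

// sharedMetricsExample runs three lookups against two caches that share a
// single metrics struct and returns the aggregated counters.
func sharedMetricsExample() (requests, hits, misses int64) {
	m := &cacheMetrics{}
	a := newTenantCache(m)
	b := newTenantCache(m)

	a.put("k", "v")
	a.get("k") // hit
	a.get("x") // miss
	b.get("k") // miss: the caches share metrics, not entries

	return m.requests.Load(), m.hits.Load(), m.misses.Load()
}
```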

charleskorn

Overall LGTM

tsdb/postings_for_matchers_cache_test.go

charleskorn · 2025-01-23T03:53:52Z

tsdb/postings_for_matchers_cache.go

@@ -211,13 +254,15 @@ func (c *PostingsForMatchersCache) postingsForMatchersPromise(ctx context.Contex
 			}
 		}

+		c.metrics.hits.Inc()
 		span.AddEvent("using cached postingsForMatchers promise", trace.WithAttributes(
 			attribute.String("cache_key", key),


Would be good to include evaluationCompletedAt here as well, to replace what was reverted from #820.

Unfortunately we can't do it, because there's no guarantee the promise has already completed at this point, so it's not safe to access evaluationCompletedAt.

I added the timestamp in another span tho: 6da4d10
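
To make the safety concern concrete, here is a small sketch of one way to read the timestamp only when the promise has already completed. It assumes, as the diff suggests, that the timestamp is written before the `done` channel is closed; the `complete`, `completedAt` and `completedAtExample` helpers are hypothetical and not part of the PR. Under the Go memory model, a receive that returns because the channel is closed is synchronized after the close, so the earlier write is visible to the reader.

```go
package main

import "time"

// promise is a hypothetical stand-in for a cached postings evaluation.
type promise struct {
	done                  chan struct{}
	evaluationCompletedAt time.Time
}

// complete records the completion time and then closes done. The write must
// happen before the close so that readers observing the closed channel also
// observe the timestamp.
func (p *promise) complete(at time.Time) {
	p.evaluationCompletedAt = at
	close(p.done)
}

// completedAt returns the completion time only if the promise has already
// completed. It never blocks: if the evaluation is still in flight it
// reports false instead of reading a field that may be written concurrently.
func (p *promise) completedAt() (time.Time, bool) {
	select {
	case <-p.done:
		return p.evaluationCompletedAt, true
	default:
		return time.Time{}, false
	}
}

// completedAtExample checks the accessor before and after completion.
func completedAtExample() (okBefore, okAfter, sameTime bool) {
	p := &promise{done: make(chan struct{})}
	_, okBefore = p.completedAt()

	at := time.Unix(1700000000, 0)
	p.complete(at)

	got, ok := p.completedAt()
	return okBefore, ok, got.Equal(at)
}
```

A span attribute could then be added only when the second return value is true, which keeps the cache-hit path free of data races at the cost of sometimes omitting the timestamp.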

I think we need to find a way to log evaluationCompletedAt when we use a cached entry, but doing this doesn't need to block this PR.

> find a way to log evaluationCompletedAt when we use a cached entry

Added to the issue: https://github.com/grafana/pir-actions/issues/307

charleskorn reviewed Jan 19, 2025

Check cached postings TTL before returning from cache

69f78e0

Signed-off-by: Marco Pracucci <[email protected]>

pracucci force-pushed the check-cached-postings-ttl-before-returning-from-cache branch from b826f27 to 69f78e0 Compare January 22, 2025 11:47

pracucci added 4 commits January 22, 2025 16:34

Added TestPostingsForMatchersCache_ShouldNotReturnStaleEntriesWhileAnotherGoroutineIsEvictingTheCache

6feb67b

Added PostingsForMatchers cache metrics

3ca2ea6

Do not use testify/assert

5129e22

Fixed panic in tests

a2123a9

pracucci changed the title ~~Check cached postings TTL before returning from cache~~ Check cached postings TTL before returning from cache + metrics Jan 22, 2025

pracucci added 2 commits January 22, 2025 19:03

Fix TestCompactHead

725d489

Fixed performance regression

99be72a

pracucci marked this pull request as ready for review January 22, 2025 20:04

pracucci mentioned this pull request Jan 22, 2025

Check cached postings TTL before returning from cache and expose some metrics grafana/mimir#10500

Merged

4 tasks

pracucci requested a review from charleskorn January 22, 2025 20:13

charleskorn reviewed Jan 23, 2025

Fix comment and add attribute to a span

6da4d10

pracucci enabled auto-merge (squash) January 23, 2025 07:47

charleskorn approved these changes Jan 23, 2025

pracucci merged commit 0cc2978 into main Jan 23, 2025
8 checks passed

pracucci deleted the check-cached-postings-ttl-before-returning-from-cache branch January 23, 2025 07:58

This was referenced Jan 27, 2025

Track evictions in the PostingsForMatchers cache #824

Merged

Track PostingsForMatchers cached promise completion timestamp in trace #825

Merged
