Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check cached postings TTL before returning from cache + metrics #822

Merged
merged 8 commits into from
Jan 23, 2025

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Jan 17, 2025

In this PR I propose two changes to PostingsForMatchers cache:

  1. Check if the TTL for cached postings is still valid before returning it from cache, to fix a race condition that could happen between a goroutine running expire() and another one skipping the expire() execution because it's already in progress.
  2. Add metrics to PostingsForMatchers cache, to have better visibility over it. This is something I wanted to do since a long time. The design I picked is to allow to pass the metrics struct as DB options, so that we can use 1 single struct for all per-tenant TSDBs in a Mimir ingester.

The following benchmark shows the difference betweeen:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Apple M3 Pro
                                                          │ 01bb37aae.txt │           6c2603082.txt            │              main.txt              │               pr.txt               │
                                                          │    sec/op     │   sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │
PostingsForMatchersCache/no_evictions-11                      594.8n ± 3%   584.2n ± 1%   -1.78% (p=0.015 n=6)   602.9n ± 3%        ~ (p=0.589 n=6)   644.6n ± 1%   +8.38% (p=0.002 n=6)
PostingsForMatchersCache/high_eviction_rate-11                10.52µ ± 5%   10.39µ ± 1%   -1.28% (p=0.002 n=6)   10.85µ ± 1%        ~ (p=0.065 n=6)   10.77µ ± 0%        ~ (p=0.065 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11    1411.5n ± 2%   301.1n ± 1%  -78.67% (p=0.002 n=6)   306.2n ± 1%  -78.31% (p=0.002 n=6)   331.2n ± 2%  -76.54% (p=0.002 n=6)
geomean                                                       2.067µ        1.222µ       -40.86%                 1.260µ       -39.03%                 1.320µ       -36.15%

                                                          │ 01bb37aae.txt │             6c2603082.txt             │               main.txt                │               pr.txt                │
                                                          │     B/op      │     B/op      vs base                 │     B/op      vs base                 │     B/op      vs base               │
PostingsForMatchersCache/no_evictions-11                       958.0 ± 0%     958.0 ± 0%        ~ (p=1.000 n=6) ¹     958.0 ± 0%        ~ (p=1.000 n=6) ¹     974.0 ± 0%   +1.67% (p=0.002 n=6)
PostingsForMatchersCache/high_eviction_rate-11               26.38Ki ± 0%   26.38Ki ± 0%        ~ (p=1.000 n=6) ¹   26.38Ki ± 0%        ~ (p=1.000 n=6) ¹   26.32Ki ± 0%   -0.24% (p=0.002 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11     1441.0 ± 0%    1017.0 ± 0%  -29.42% (p=0.002 n=6)      1014.0 ± 0%  -29.63% (p=0.002 n=6)      1024.5 ± 0%  -28.90% (p=0.002 n=6)
geomean                                                      3.263Ki        2.905Ki       -10.97%                   2.902Ki       -11.05%                   2.926Ki       -10.33%
¹ all samples are equal

                                                          │ 01bb37aae.txt │            6c2603082.txt            │              main.txt               │               pr.txt                │
                                                          │   allocs/op   │ allocs/op   vs base                 │ allocs/op   vs base                 │ allocs/op   vs base                 │
PostingsForMatchersCache/no_evictions-11                       20.00 ± 0%   20.00 ± 0%        ~ (p=1.000 n=6) ¹   20.00 ± 0%        ~ (p=1.000 n=6) ¹   20.00 ± 0%        ~ (p=1.000 n=6) ¹
PostingsForMatchersCache/high_eviction_rate-11                 48.00 ± 0%   48.00 ± 0%        ~ (p=1.000 n=6) ¹   48.00 ± 0%        ~ (p=1.000 n=6) ¹   46.00 ± 0%   -4.17% (p=0.002 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11      29.00 ± 0%   21.00 ± 0%  -27.59% (p=0.002 n=6)     21.00 ± 0%  -27.59% (p=0.002 n=6)     20.00 ± 5%  -31.03% (p=0.002 n=6)
geomean                                                        30.31        27.22       -10.20%                   27.22       -10.20%                   26.40       -12.89%
¹ all samples are equal

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach LGTM.

It'd be interesting to benchmark this and compare the cost of checking the done channel with the approach prior to #734.

case <-oldPromise.done:
if c.timeNow().Sub(oldPromise.evaluationCompletedAt) >= c.ttl {
// The cached promise already expired, but it has not been evicted.
// TODO trace + metric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not something for this PR) Metrics for the PFMC in general would be handy - would be nice to see the rate of cache hits and misses, as well as the rate of cache evictions and their reason (TTL vs cache growing too large).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as well as the rate of cache evictions and their reason (TTL vs cache growing too large)

Will do in a separate PR.

@pracucci pracucci force-pushed the check-cached-postings-ttl-before-returning-from-cache branch from b826f27 to 69f78e0 Compare January 22, 2025 11:47
…otherGoroutineIsEvictingTheCache

Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci changed the title Check cached postings TTL before returning from cache Check cached postings TTL before returning from cache + metrics Jan 22, 2025
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

tsdb/postings_for_matchers_cache_test.go Outdated Show resolved Hide resolved
@@ -211,13 +254,15 @@ func (c *PostingsForMatchersCache) postingsForMatchersPromise(ctx context.Contex
}
}

c.metrics.hits.Inc()
span.AddEvent("using cached postingsForMatchers promise", trace.WithAttributes(
attribute.String("cache_key", key),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to include evaluationCompletedAt here as well, to replace what was reverted from #820.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we can't do it, because there's no guarantee the promise already completed at this step and so it's not safe to access evaluationCompletedAt

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the timestamp in another span tho: 6da4d10

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to find a way to log evaluationCompletedAt when we use a cached entry, but doing this doesn't need to block this PR.

Copy link
Collaborator Author

@pracucci pracucci Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find a way to log evaluationCompletedAt when we use a cached entry

Added to the issue: https://github.com/grafana/pir-actions/issues/307

@pracucci pracucci enabled auto-merge (squash) January 23, 2025 07:47
@pracucci pracucci merged commit 0cc2978 into main Jan 23, 2025
8 checks passed
@pracucci pracucci deleted the check-cached-postings-ttl-before-returning-from-cache branch January 23, 2025 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants