Add caching to package details page vulnerabilities query #8580

Merged
drewgillies merged 5 commits into dev from dg-pvacache-packagedetailspage on May 20, 2021

drewgillies
Contributor

Addresses: #8543

Note that this change is against dev and will need to be rebased to main if we wish to deploy it as a hotfix.

This introduces the PackageVulnerabilitiesCacheService which is modelled on TypoSquatting caching. I've moved the query logic into the cache service, using an arbitrary 30-minute caching length per query result.

@drewgillies drewgillies requested a review from a team as a code owner May 13, 2021 08:52
@joelverhagen
Member

Could we do caching at a higher level and use the same caching method used elsewhere?
HttpContext.Cache
e.g.

private PackageDependents GetPackageDependents(string id)
{
    PackageDependents dependents;
    var cacheDependentsCacheKey = "CacheDependents_" + id.ToLowerInvariant();
    var cacheDependents = HttpContext.Cache.Get(cacheDependentsCacheKey);
    // Cache doesn't contain PackageDependents so PackageDependents gets put in the cache
    if (cacheDependents == null)
    {
        dependents = _packageService.GetPackageDependents(id);
        // note: this is a per instance cache
        HttpContext.Cache.Add(
            cacheDependentsCacheKey,
            dependents,
            null,
            DateTime.UtcNow.AddSeconds(_contentObjectService.CacheConfiguration.PackageDependentsCacheTimeInSeconds),
            Cache.NoSlidingExpiration,
            CacheItemPriority.Default, null);
    }
    // Cache contains PackageDependents
    else
    {
        dependents = (PackageDependents)cacheDependents;
        // TODO Make cache time configurable / slidy
        // https://github.com/NuGet/NuGetGallery/issues/4718
    }
    return dependents;
}

This has expiration logic built in. There are other example usages of this API.

@drewgillies
Contributor Author

> Could we do caching at a higher level and use the same caching method used elsewhere?
> HttpContext.Cache

@joelverhagen is this preferable to the typosquatting approach (i.e. https://github.com/NuGet/NuGetGallery/blob/main/src/NuGetGallery/Services/TyposquattingCheckListCacheService.cs)? What are the merits of each?

@joelverhagen
Member

HttpContext.Cache: automatically handles concurrency and cache invalidation time. Smaller code. Perhaps more consistent with other caching scenarios, but it's hard to discover all instances of the alternative.
TyposquattingCheckListCacheService: not dependent on ASP.NET/HTTP types. Probably easier to UT.

@joelverhagen
Member

What do you think about those merits? Perhaps you can take a look at pros/cons as well? I think it's possible that HttpContext.Cache automatically evicts the cache in the event of memory pressure but I'm not 100% sure.

@zhhyu
Contributor

zhhyu commented May 13, 2021

I think the use case of the "Typosquatting" cache may be different from this one. The "Typosquatting" cache contains only one entry, a list of package names, and is used only when a package is pushed. Vulnerabilities are a very similar scenario to the "Used By" info. So I agree that using the highly optimized caching component provided by the framework may be better.

@skofman1
Contributor

Another thought: we have a perf problem with Vulnerabilities when it comes to the 'Manage Packages' page as well (#8361). This might help to fix both.
Btw @drewgillies, should we re-open #8361? It's currently disabled due to perf, right?

@drewgillies
Contributor Author

@skofman1 yes, that's right--I closed because we needed an entirely new approach. I've reopened it just now so we can investigate this idea. I'd be (pleasantly) surprised if caching is going to get us over the line on that one, as my understanding is that any regression at all is unacceptable given how poor the page's perf is already. And introducing a new table (let alone 3) regresses perf whichever way we slice it.
@joelverhagen I like the notion of http caching on the strength of what you're saying--UTs will still cover the logic, and if I can make the change more elegantly in code, I'm up for that.

@skofman1
Contributor

I made the comment before going over the code. My initial thinking was that we can cache in memory all the vulnerability information, and load it from DB on start-up. What do you think?

@drewgillies
Contributor Author

That's an interesting idea, @skofman1. I'll look into it.

@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch from 2ab3b17 to 28f6555 Compare May 17, 2021 01:16
@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch 4 times, most recently from 842628b to 2f59b6a Compare May 17, 2021 08:29
@drewgillies
Contributor Author

Interested in thoughts on this approach @joelverhagen @skofman1 @zhhyu --based on @skofman1's idea, I'm loading the cache on startup, and have chosen an arbitrary 1 day for cache invalidation. This approach could then be extended to the manage packages page if we wish to experiment. More testing is required, but there is UT coverage here and the cache loads nicely. I went for the complex LINQ statement to get the best perf out of EF, so I commented it heavily.

@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch from 2f59b6a to 2e4e873 Compare May 17, 2021 08:35

private bool ShouldCachedValueBeUpdated(string id) => !vulnerabilitiesByIdCache.ContainsKey(id) ||
vulnerabilitiesByIdCache[id].cachedAt
.AddMinutes(CachingLimitMinutes) < DateTime.Now;
Member

nit: DateTimeOffset.UtcNow instead of DateTime.Now for clarity
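
For illustration, a minimal sketch of the same expiry check written against `DateTimeOffset.UtcNow`. `CacheExpiry` and `IsExpired` are hypothetical names, not code from this PR; the clock is passed in as a parameter in one overload so the check can be tested deterministically.

```csharp
using System;

public static class CacheExpiry
{
    // True when an entry stored at cachedAt has outlived the caching limit.
    public static bool IsExpired(DateTimeOffset cachedAt, TimeSpan limit, DateTimeOffset now)
    {
        return cachedAt + limit < now;
    }

    // Convenience overload that reads the UTC clock, so the result does not
    // depend on the server's local time zone or daylight-saving changes.
    public static bool IsExpired(DateTimeOffset cachedAt, TimeSpan limit)
    {
        return IsExpired(cachedAt, limit, DateTimeOffset.UtcNow);
    }
}
```

With a 30-minute limit, an entry cached 31 minutes ago is reported as expired and one cached 29 minutes ago is not.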

= new Dictionary<string, (DateTime, Dictionary<int, IReadOnlyList<PackageVulnerability>>)>();

private readonly IPackageVulnerabilitiesManagementService _packageVulnerabilitiesManagementService;
public PackageVulnerabilitiesCacheService(IPackageVulnerabilitiesManagementService packageVulnerabilitiesManagementService)
Member

I'm wondering whether this dependency even makes much sense. The two new methods introduced on IPackageVulnerabilitiesManagementService are very small so it might be clearer to just depend on IEntitiesContext here.

Contributor Author

@drewgillies drewgillies May 17, 2021

The separation of responsibilities was initially related to DI scopes. PackageVulnerabilitiesManagementService is InstancePerLifetimeScope, and this is a SingleInstance (in order for caching to work), initially with the management service passed to caching service methods (rather than having it stored in a field). I expect I'll return to this model if I'm unable to run the Initialize method from the constructor, which in that sense I like better.

The reason we have a caching service class which is separate from PackageVulnerabilitiesService (rather than just adding caching capabilities to it) is the same --PackageVulnerabilitiesService is InstancePerLifetimeScope.

public PackageVulnerabilitiesCacheService(IPackageVulnerabilitiesManagementService packageVulnerabilitiesManagementService)
{
_packageVulnerabilitiesManagementService = packageVulnerabilitiesManagementService;
Initialize();
Member

We should not do SQL in the constructor

}

return vulnerabilitiesByIdCache[id].vulnerabilitiesById.Any()
? new ReadOnlyDictionary<int, IReadOnlyList<PackageVulnerability>>(vulnerabilitiesByIdCache[id].vulnerabilitiesById)
Member

nit: you can just return a Dictionary<TKey, TValue> which itself implements IReadOnlyDictionary<TKey, TValue>. No need to construct a ReadOnlyDictionary.
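
A small sketch of that conversion, assuming `string` as a stand-in for `PackageVulnerability`; `ReadOnlyViews` and `AsReadOnlyView` are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;

public static class ReadOnlyViews
{
    // Dictionary<TKey, TValue> implements IReadOnlyDictionary<TKey, TValue>,
    // so this is an implicit reference conversion: no copy and no wrapper object.
    public static IReadOnlyDictionary<int, IReadOnlyList<string>> AsReadOnlyView(
        Dictionary<int, IReadOnlyList<string>> source)
    {
        return source;
    }
}
```

The trade-off is that a caller can still downcast the result back to `Dictionary<,>` and mutate it, which a `ReadOnlyDictionary` wrapper would prevent.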

throw new ArgumentException("Must have a value.", nameof(id));
}

if (ShouldCachedValueBeUpdated(id))
Member

This dictionary read could operate on the cache dictionary while it is being modified on line 49. Better to use a ConcurrentDictionary. See https://stackoverflow.com/a/24586924.

.SelectMany(x => x.Packages.Select(p => new {PackageKey = p.Key, x.Vulnerability}))
.GroupBy(pv => pv.PackageKey, pv => pv.Vulnerability)
.ToDictionary(pv => pv.Key,
pv => pv.ToList().AsReadOnly() as IReadOnlyList<PackageVulnerability>);
Member

You don't need AsReadOnly. You can just cast to IReadOnlyList.
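
Roughly, the grouping step could then look like the sketch below. It uses `KeyValuePair<int, string>` as a stand-in for the anonymous `{ PackageKey, Vulnerability }` projection, and `VulnerabilityGrouping` is a hypothetical name; this is in-memory LINQ, not the EF query from the PR.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class VulnerabilityGrouping
{
    // Groups (package key, vulnerability) pairs into one read-only list per package key.
    public static Dictionary<int, IReadOnlyList<string>> GroupByPackageKey(
        IEnumerable<KeyValuePair<int, string>> pairs)
    {
        return pairs
            .GroupBy(p => p.Key, p => p.Value)
            .ToDictionary(
                g => g.Key,
                // List<T> implements IReadOnlyList<T>, so a cast is enough.
                g => (IReadOnlyList<string>)g.ToList());
    }
}
```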

@drewgillies
Contributor Author

drewgillies commented May 17, 2021

> background refresh

@joelverhagen I love this idea. That's the missing piece. Thanks for all of your granular feedback above as well.

@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch from 6ac49fa to 6f3cce4 Compare May 18, 2021 01:18
@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch from f906559 to a4032d1 Compare May 18, 2021 04:42
@drewgillies
Contributor Author

@joelverhagen @skofman1 @zhhyu I've implemented the background refresh on the cache, and I'll have it report load time telemetry like the downloads refresh does--it would be good to track this over time as the cache grows in size.

HostingEnvironment.QueueBackgroundWorkItem(_ => packageVulnerabilitiesCacheService.RefreshCache());
if (configuration.StorageType == StorageType.AzureStorage)
{
jobs.Add(new PackageVulnerabilitiesCacheRefreshJob(TimeSpan.FromMinutes(30),
Contributor

30

what's the frequency of the GitHub job? Why 30 minutes?

Contributor Author

This was arbitrary - it could easily be a day and work quite well. I thought that if we were to extend this cache's use to the manage packages page, a customer would want to see more "live" updates, so 30 minutes seemed like the happy medium.

Contributor

If the GitHub job runs once a day to get updates, there's no point in having this cache updated every 30 minutes. If the job runs frequently (say every 10 minutes), 30 minutes definitely makes sense.

@@ -91,6 +91,7 @@ public class Events
public const string ABTestEvaluated = "ABTestEvaluated";
public const string PackagePushDisconnect = "PackagePushDisconnect";
public const string SymbolPackagePushDisconnect = "SymbolPackagePushDisconnect";
public const string VulnerabilitiesCacheRefreshDuration = "VulnerabilitiesCacheRefreshDuration";
Contributor

add units to the metric name for debuggability. i.e. VulnerabilitiesCacheRefreshDurationMs

Contributor Author

Sure thing--I was modelling it off TrackDownloadJsonRefreshDuration - I can change that too if you like. Also, do you want the string value changed or just the name of the const?

Contributor

Add to both. When you look at telemetry it will be super useful to know the units without going to the code. Regarding TrackDownloadJsonRefreshDuration, it would be nice to change that, assuming no one is depending on the event name for dashboards/monitoring.

Contributor Author

Will do! I might hold off on ...DownloadJson... then, but yes, will change mine.

private readonly ITelemetryService _telemetryService;

public PackageVulnerabilitiesCacheService(
IPackageVulnerabilitiesManagementService packageVulnerabilitiesManagementService,
Contributor

packageVulnerabilitiesManagementService

if I understand correctly, the IPackageVulnerabilitiesManagementService will be created only once, which is not good because it has a dependency on EntityContext (DB connection), meaning we need a new one with every request.

Contributor Author

The lifecycle pieces of this are the real trick (and the reason the cache service exists at all)--I may have the Job pass the management service each time to get around this.

if (packageVulnerabilitiesCacheService != null)
{
// Perform initial refresh + schedule new refreshes every 30 minutes
HostingEnvironment.QueueBackgroundWorkItem(_ => packageVulnerabilitiesCacheService.RefreshCache());
Contributor

I don't know much about what happens behind the scenes here. This seems to run on a background thread, and the query may take some time to run. Could there be a race condition between the application and the query? For example, the application might start accepting requests before the query has finished running. That may not mean customer impact because we have functional/E2E tests, but if some of those tests cover vulnerabilities, they may fail because the result has not been loaded yet.

Contributor Author

If you look in the RefreshCache you'll see that this is catered for in the _isRefreshing sync'd field. The first refresh is run in AppActivator so we should have data available (and I don't believe any E2E tests cover vulnerabilities data).
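
The PR's actual `_isRefreshing` handling isn't reproduced in this thread, so the following is only a sketch of one common way to write such a guard, assuming an `Interlocked`-based flag. `RefreshGuard` and `TryRefresh` are hypothetical names.

```csharp
using System;
using System.Threading;

public class RefreshGuard
{
    // 0 = idle, 1 = a refresh is in progress.
    private int _isRefreshing;

    // Runs the refresh unless another one is already in flight.
    // Returns true if this call performed the refresh, false if it was skipped.
    public bool TryRefresh(Action refresh)
    {
        // Atomically flip 0 -> 1; any other caller sees 1 and backs off.
        if (Interlocked.CompareExchange(ref _isRefreshing, 1, 0) != 0)
        {
            return false;
        }

        try
        {
            refresh();
            return true;
        }
        finally
        {
            // Always release the flag, even if the refresh throws.
            Volatile.Write(ref _isRefreshing, 0);
        }
    }
}
```

A guard like this only prevents overlapping refreshes. Requests that arrive before the first refresh completes would still see an empty cache, which is the scenario raised in the question above.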


stopwatch.Stop();

_telemetryService.TrackVulnerabilitiesCacheRefreshDuration(stopwatch.ElapsedMilliseconds);
Contributor

Besides logging the duration, I am thinking whether it will be better if we also log the status of the query, for example, possible exceptions thrown from the DB query (Timeout, etc..).

Contributor Author

I'll trace exceptions in the query code. Good idea.
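
As a rough sketch of tracking both the duration and any failure around the refresh: `TimedRefresh` and its callback parameters are hypothetical stand-ins for the telemetry service calls, not the gallery's actual API.

```csharp
using System;
using System.Diagnostics;

public static class TimedRefresh
{
    // Runs the refresh, reports how long it took, and reports any exception
    // instead of letting it escape the background job.
    // Returns true on success, false if the refresh threw.
    public static bool Run(Action refresh, Action<TimeSpan> trackDuration, Action<Exception> trackException)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            refresh();
            return true;
        }
        catch (Exception ex)
        {
            trackException(ex);
            return false;
        }
        finally
        {
            // The duration is recorded for both successful and failed refreshes.
            stopwatch.Stop();
            trackDuration(stopwatch.Elapsed);
        }
    }
}
```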


// We need to build a dictionary of dictionaries. Breaking it down:
// - this give us a list of all vulnerable package version ranges
_vulnerabilitiesByIdCache = _packageVulnerabilitiesManagementService.GetAllVulnerableRanges()
Contributor

@zhhyu zhhyu May 19, 2021

This looks like a heavy query for the DB. I wonder whether it may cause glitches in DB performance. We may need some stress tests to understand the behavior.

Besides, my understanding is that we reload everything from the DB each time. I am not sure whether we could load only the difference (for example, based on a timestamp), since most existing vulnerabilities should not need frequent updates. The effort may not be worth it if the performance is OK, though.

Contributor Author

@drewgillies drewgillies May 19, 2021

I think it would be good to track telemetry on this and see how it performs generally, before introducing performance tweaks. It's only heavy in post-query transformation--the query itself loads records from the VulnerablePackageVersionRange table and related PackageVulnerabilities records--this should be relatively lightweight. But we should watch telemetry. If it turns out we need to find ways to speed up load, we could start looking at timestamping vulnerable range records, but of course, the table and all of the records would need to be loaded in order to test for changes (but it would reduce PackageVulnerabilities reads)--I think the wins would be small there. Let's investigate down the road if we need to.

Contributor Author

Another option for reducing DTUs is, of course, increasing the refresh interval from 30 minutes to something far longer.

{
private IDictionary<string,
Dictionary<int, IReadOnlyList<PackageVulnerability>>> _vulnerabilitiesByIdCache
= new ConcurrentDictionary<string, Dictionary<int, IReadOnlyList<PackageVulnerability>>>();
Contributor

For discussion: as far as I can see, the only write now is replacing the reference with a new dictionary when we refresh the cache. Do we still need to use ConcurrentDictionary? We never update at a finer granularity, such as individual entries in the dictionary.

Contributor Author

Good call--I'm reverting this to a Dictionary.
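
A minimal sketch of that reference-swap pattern, assuming `int` values as a stand-in for the per-package vulnerability data. `SnapshotCache` is a hypothetical name and the case-insensitive key comparer is an assumption, not something confirmed by the PR.

```csharp
using System;
using System.Collections.Generic;

public class SnapshotCache
{
    // Readers only ever see a fully built dictionary; the field is replaced, never mutated.
    private volatile IReadOnlyDictionary<string, int> _snapshot = new Dictionary<string, int>();

    public void Refresh(IEnumerable<KeyValuePair<string, int>> rows)
    {
        // Build the replacement off to the side...
        var next = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var row in rows)
        {
            next[row.Key] = row.Value;
        }

        // ...then publish it with a single reference write.
        _snapshot = next;
    }

    public bool TryGet(string id, out int value)
    {
        return _snapshot.TryGetValue(id, out value);
    }
}
```

Because a published dictionary is never modified afterwards, concurrent readers need no locking; they see either the old snapshot or the new one.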

@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch from 83f0372 to 7064b80 Compare May 19, 2021 06:39
@drewgillies
Contributor Author

@skofman1 I've worked with @joelverhagen and @agr to land on a more elegant solution here--we're caching a scope factory in the PackageVulnerabilitiesCacheService which will spin up an entirely new scope every time we refresh, and generate a new EntitiesContext instance each time we do. I believe I've addressed all feedback now (including @zhhyu's). Interested in how close you feel we are to an acceptable design. I'll be running some more tests on it before merging.

var packageVulnerabilitiesCacheService =
DependencyResolver.Current.GetService<IPackageVulnerabilitiesCacheService>() as
PackageVulnerabilitiesCacheService;
if (packageVulnerabilitiesCacheService != null)
Contributor

@skofman1 skofman1 May 19, 2021

packageVulnerabilitiesCacheService

Why would packageVulnerabilitiesCacheService ever be null? If it's null, it means we have a bug and forgot to register it. In that case, it would be better to fail fast than to get into a state where the cache isn't refreshed. I suggest dropping the null check.

Member

Why not make the necessary methods just available on the interface? I don't see why any cast is necessary.

Contributor Author

Yep--good call. The same thing is being done for cloudDownloadCountService above, but it is redundant. I'll remove that.


public PackageVulnerabilitiesCacheService(ITelemetryService telemetryService)
{
_telemetryService = telemetryService;
Contributor

telemetryService

null check

Contributor

@skofman1 skofman1 left a comment

Some comments. Otherwise looks good :)

/// Track how long it takes to populate the vulnerabilities cache
/// </summary>
/// <param name="milliseconds">Refresh duration for vulnerabilities cache</param>
void TrackVulnerabilitiesCacheRefreshDurationMs(long milliseconds);
Member

nit: pass TimeSpan here, don't mention "ms". The implementation below can make a "ms" specific metric emitted to AI but the interface can be cleaner than that. It's not a big deal though.
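
To illustrate the shape being suggested, here is a sketch with a unit-free interface and a unit-suffixed metric name in the implementation. `IRefreshTelemetry` and `RecordingTelemetry` are hypothetical; the real implementation would emit to Application Insights rather than to an in-memory list.

```csharp
using System;
using System.Collections.Generic;

public interface IRefreshTelemetry
{
    // The interface takes a TimeSpan and says nothing about units.
    void TrackVulnerabilitiesCacheRefreshDuration(TimeSpan duration);
}

public class RecordingTelemetry : IRefreshTelemetry
{
    public List<KeyValuePair<string, double>> Metrics { get; } = new List<KeyValuePair<string, double>>();

    public void TrackVulnerabilitiesCacheRefreshDuration(TimeSpan duration)
    {
        // Only the emitted metric carries the unit, so it is visible in telemetry.
        Metrics.Add(new KeyValuePair<string, double>(
            "VulnerabilitiesCacheRefreshDurationMs",
            duration.TotalMilliseconds));
    }
}
```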

Contributor Author

I'm so confused :) #8580 (comment)

Contributor Author

TimeSpan makes sense but is inconsistent with other telemetry methods. I'd like us to retrofit code if we pivot on practice, so other devs can follow the lead of what's there more efficiently. I'm happy to do this here but it may be good to consider this going forward.

Contributor Author

Thinking further, TimeSpan here would mean removing Ms from the method name, which would conflict with @skofman1's comment as well. I think I'll leave it as-is for now? When we arrive at consensus on how we want our telemetry to function for duration tracking, perhaps we can go through and change them all as a separate change.

Contributor Author

Just did a quick visual scan: TrackDownloadJsonRefreshDuration is the only outlier. Perhaps I could change both of them in this change?

Contributor Author

(I would still leave Ms off the method name as it's no longer ms, and would make them both use TimeSpan)

@@ -265,12 +265,29 @@ private static void BackgroundJobsPostStart(IAppConfiguration configuration)

if (configuration.StorageType == StorageType.AzureStorage)
{
var cloudDownloadCountService = DependencyResolver.Current.GetService<IDownloadCountService>() as CloudDownloadCountService;
var cloudDownloadCountService =
Member

nit: please minimize changes to unrelated lines.

Contributor Author

Wasn't sure of our policy on cleanup. Thanks.

// Perform initial refresh + schedule new refreshes every 30 minutes
var serviceScopeFactory = DependencyResolver.Current.GetService<IServiceScopeFactory>();
HostingEnvironment.QueueBackgroundWorkItem(_ => packageVulnerabilitiesCacheService.RefreshCache(serviceScopeFactory));
if (configuration.StorageType == StorageType.AzureStorage)
Member

Why is this predicated on AzureStorage?

Contributor Author

It didn't seem to make sense to have this running on local environments. If you think it should be all cases, I'm happy to remove it.

@@ -460,6 +460,11 @@ protected override void Load(ContainerBuilder builder)
.As<IPackageVulnerabilitiesManagementService>()
.InstancePerLifetimeScope();

builder.Register(c => new PackageVulnerabilitiesCacheService(c.Resolve<ITelemetryService>()))
Member

You can just register the type, i.e. RegisterType. You don't need an explicit ctor here, right?

Contributor Author

Yes, I'm sure there was a reason I thought this was appropriate when it was more complex, but on reflection even then I was mistaken. Will clean up.

@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch from 4452155 to 4c4f43d Compare May 20, 2021 00:21
public PackageVulnerabilitiesCacheRefreshJob(TimeSpan interval,
PackageVulnerabilitiesCacheService packageVulnerabilitiesCacheService,
IServiceScopeFactory serviceScopeFactory)
: base("", interval)
Member

please use explicit parameter name for ambiguous values like this empty string


public override Task Execute()
{
return new Task(() => _packageVulnerabilitiesCacheService.RefreshCache(_serviceScopeFactory));
Member

Does this need to be a new task? Who calls Execute? Is it already in a background task? If so, you can just synchronously execute RefreshCache and return Task.CompletedTask.
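
Worth noting that a task created with the `Task` constructor does not run until `Start()` is called on it, so returning `new Task(...)` hands the caller a task that has not begun executing. A sketch of the synchronous alternative, assuming the caller already runs `Execute` off the request path; `RefreshJob` is a hypothetical stand-in for the job class.

```csharp
using System;
using System.Threading.Tasks;

public class RefreshJob
{
    private readonly Action _refresh;

    public RefreshJob(Action refresh)
    {
        _refresh = refresh ?? throw new ArgumentNullException(nameof(refresh));
    }

    public Task Execute()
    {
        // Do the work on the calling (background) thread...
        _refresh();

        // ...and hand back an already-completed task.
        return Task.CompletedTask;
    }
}
```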

@drewgillies drewgillies force-pushed the dg-pvacache-packagedetailspage branch from 4c4f43d to 5a92c9f Compare May 20, 2021 01:34
@drewgillies drewgillies merged commit 5c5e8f9 into dev May 20, 2021
@drewgillies drewgillies deleted the dg-pvacache-packagedetailspage branch May 20, 2021 11:23
@zhhyu zhhyu mentioned this pull request May 24, 2021
23 tasks