-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add caching to package details page vulnerabilities query #8580
Conversation
Could we do caching at a higher level and use the same caching method used elsewhere? NuGetGallery/src/NuGetGallery/Controllers/PackagesController.cs Lines 1025 to 1055 in f248171
This has expiration logic built in. There are other example usages of this API. |
@joelverhagen is this preferable to the typosquatting approach (i.e. https://github.com/NuGet/NuGetGallery/blob/main/src/NuGetGallery/Services/TyposquattingCheckListCacheService.cs)? What are the merits of each? |
|
What do you think about those merits? Perhaps you can take a look at pros/cons as well? I think it's possible that |
I think the use case of "Typosquatting" cache may be different compared with this. The "Typosquatting" cache only contains one entry, which is a list of package names and only used when a package is pushed. For vulnerabilities, it is a very similar scenario as "Used By" info. So I agree that maybe using the highly-optimized caching component provided by the framework is better. |
Another thought: we have a perf problem with Vulnerabilities when it comes to the 'Manage Packages' page as well (#8361). This might help to fix both. |
@skofman1 yes, that's right--I closed because we needed an entirely new approach. I've reopened it just now so we can investigate this idea. I'd be (pleasantly) surprised if caching is going to get us over the line on that one, as my understanding is that any regression at all is unacceptable given how poor the page's perf is already. And introducing a new table (let alone 3) regresses perf whichever way we slice it. |
I made the comment before going over the code. My initial thinking was that we can cache in memory all the vulnerability information, and load it from DB on start-up. What do you think? |
That's an interesting idea, @skofman1. I'll look into it. |
2ab3b17
to
28f6555
Compare
842628b
to
2f59b6a
Compare
Interested in thoughts on this approach @joelverhagen @skofman1 @zhhyu --based on @skofman's idea, I'm loading the cache on startup, and have chosen an arbitrary 1 day for cache invalidation. This approach could then be extended to the manage packages page if we wish to experiment. More testing required but there is UT coverage here and the cache loads nicely. I went for the complex linq statement to get the best perf out of EF, so commented heavily. |
2f59b6a
to
2e4e873
Compare
|
||
private bool ShouldCachedValueBeUpdated(string id) => !vulnerabilitiesByIdCache.ContainsKey(id) || | ||
vulnerabilitiesByIdCache[id].cachedAt | ||
.AddMinutes(CachingLimitMinutes) < DateTime.Now; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: DateTimeOffset.UtcNow
instead of DateTime.Now
for clarity
= new Dictionary<string, (DateTime, Dictionary<int, IReadOnlyList<PackageVulnerability>>)>(); | ||
|
||
private readonly IPackageVulnerabilitiesManagementService _packageVulnerabilitiesManagementService; | ||
public PackageVulnerabilitiesCacheService(IPackageVulnerabilitiesManagementService packageVulnerabilitiesManagementService) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether this dependency even makes much sense. The two new methods introduced on IPackageVulnerabilitiesManagementService
are very small so it might be clearer to just depend on IEntitiesContext
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The separation of responsibilities was initially related to DI scopes. PackageVulnerabilitiesManagementService
is InstancePerLifetimeScope
, and this is a SingleInstance
(in order for caching to work), initially with the management service passed to caching service methods (rather than having it stored in a field). I expect I'll return to this model if I'm unable to run the Initialize
method from the constructor, which in that sense I like better.
The reason we have a caching service class which is separate from PackageVulnerabilitiesService
(rather than just adding caching capabilities to it) is the same --PackageVulnerabilitiesService
is InstancePerLifetimeScope
.
public PackageVulnerabilitiesCacheService(IPackageVulnerabilitiesManagementService packageVulnerabilitiesManagementService) | ||
{ | ||
_packageVulnerabilitiesManagementService = packageVulnerabilitiesManagementService; | ||
Initialize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not do SQL in the constructor
} | ||
|
||
return vulnerabilitiesByIdCache[id].vulnerabilitiesById.Any() | ||
? new ReadOnlyDictionary<int, IReadOnlyList<PackageVulnerability>>(vulnerabilitiesByIdCache[id].vulnerabilitiesById) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can just return a Dictionary<TKey, TValue>
which itself implements IReadOnlyDictionary<TKey, TValue>
. No need to construct a ReadOnlyDictionary
.
src/NuGetGallery/Services/PackageVulnerabilitiesCacheService.cs
Outdated
Show resolved
Hide resolved
throw new ArgumentException("Must have a value.", nameof(id)); | ||
} | ||
|
||
if (ShouldCachedValueBeUpdated(id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dictionary read could operate on the cache dictionary while it is being modified on line 49. Better to use a ConcurrentDictionary
. See https://stackoverflow.com/a/24586924.
.SelectMany(x => x.Packages.Select(p => new {PackageKey = p.Key, x.Vulnerability})) | ||
.GroupBy(pv => pv.PackageKey, pv => pv.Vulnerability) | ||
.ToDictionary(pv => pv.Key, | ||
pv => pv.ToList().AsReadOnly() as IReadOnlyList<PackageVulnerability>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need AsReadOnly
. You can just cast to IReadOnlyList
.
@joelverhagen I love this idea. That's the missing piece. Thanks for all of your granular feedback above as well. |
6ac49fa
to
6f3cce4
Compare
f906559
to
a4032d1
Compare
@joelverhagen @skofman1 @zhhyu I've implemented the background refresh on the cache, and I'll have it report load time telemetry like the downloads refresh does--it would be good to track this over time as the cache grows in size. |
HostingEnvironment.QueueBackgroundWorkItem(_ => packageVulnerabilitiesCacheService.RefreshCache()); | ||
if (configuration.StorageType == StorageType.AzureStorage) | ||
{ | ||
jobs.Add(new PackageVulnerabilitiesCacheRefreshJob(TimeSpan.FromMinutes(30), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was arbitrary - it could easily be a day and work quite well. I thought that if we were to extend this cache's use to the manage packages page, a customer would want to see more "live" updates, so 30 minutes seemed like the happy medium.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the GitHub job runs once a day to get updates, there's no point in having this cache updated every 30 mins. If the job runs frequently (say every 10 minutes), 30 minutes definetly makes sense.
@@ -91,6 +91,7 @@ public class Events | |||
public const string ABTestEvaluated = "ABTestEvaluated"; | |||
public const string PackagePushDisconnect = "PackagePushDisconnect"; | |||
public const string SymbolPackagePushDisconnect = "SymbolPackagePushDisconnect"; | |||
public const string VulnerabilitiesCacheRefreshDuration = "VulnerabilitiesCacheRefreshDuration"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add units to the metric name for debuggability. i.e. VulnerabilitiesCacheRefreshDurationMs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing--I was modelling it off TrackDownloadJsonRefreshDuration
- I can change that too if you like. Also, do you want the string value changed or just the name of the const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add to both. When you look at telemetry it will be super useful to know the units without going to the code. Regarding TrackDownloadJsonRefreshDuration , it will be nice to change that, assuming no one is depending on the event name for dashboards/monitoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do! I might hang off on ...DownloadJson... then, but yes, will change mine.
private readonly ITelemetryService _telemetryService; | ||
|
||
public PackageVulnerabilitiesCacheService( | ||
IPackageVulnerabilitiesManagementService packageVulnerabilitiesManagementService, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lifecycle pieces of this are the real trick (and the reason the cache service exists at all)--I may have the Job pass the management service each time to get around this.
if (packageVulnerabilitiesCacheService != null) | ||
{ | ||
// Perform initial refresh + schedule new refreshes every 30 minutes | ||
HostingEnvironment.QueueBackgroundWorkItem(_ => packageVulnerabilitiesCacheService.RefreshCache()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know too much about what happens for this behind the scene. This seems running as a background thread, and the query may take some time to run. Will it have a race condition between the application and the query? For example, the application starts accepting requests while the query has not finished running. That may not mean customer impacts because we have functional/e2e tests but if we have some tests for vulnerabilities, the test may fail because the result has not been loaded yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look in the RefreshCache
you'll see that this is catered for in the _isRefreshing
sync'd field. The first refresh is run in AppActivator
so we should have data available (and I don't believe any E2E tests cover vulnerabilities data).
|
||
stopwatch.Stop(); | ||
|
||
_telemetryService.TrackVulnerabilitiesCacheRefreshDuration(stopwatch.ElapsedMilliseconds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides logging the duration, I am thinking whether it will be better if we also log the status of the query, for example, possible exceptions thrown from the DB query (Timeout, etc..).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll trace exceptions in the query code. Good idea.
|
||
// We need to build a dictionary of dictionaries. Breaking it down: | ||
// - this give us a list of all vulnerable package version ranges | ||
_vulnerabilitiesByIdCache = _packageVulnerabilitiesManagementService.GetAllVulnerableRanges() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a heavy query for DB. I wonder it may cause glitches for DB performance. We may need some stress tests to understand the behavior.
Besides, my understanding is that each time we reload everything from DB again. I am not sure whether we can only load the difference (for example, depend on the timestamp somehow) because it sounds like most of existing vulnerabilities should not need to be updated with a high frequency, but the effort may not be worth if the performance is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to track telemetry on this and see how it performs generally, before introducing performance tweaks. It's only heavy in post-query transformation--the query itself loads records from the VulnerablePackageVersionRange
table and related PackageVulnerabilities
records--this should be relatively lightweight. But we should watch telemetry. If it turns out we need to find ways to speed up load, we could start looking at timestamping vulnerable range records, but of course, the table and all of the records would need to be loaded in order to test for changes (but it would reduce PackageVulnerabilities
reads)--I think the wins would be small there. Let's investigate down the road if we need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option for reducing DTUs is, of course, increasing the refresh interval from 30 minutes to something far longer.
{ | ||
private IDictionary<string, | ||
Dictionary<int, IReadOnlyList<PackageVulnerability>>> _vulnerabilitiesByIdCache | ||
= new ConcurrentDictionary<string, Dictionary<int, IReadOnlyList<PackageVulnerability>>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For discussions: I can see that the only write action now is to update the reference to another new dictionary when we refresh the cache. Do we still need to use "ConcurrenctDictionary"? We don't update the granularity such as some entries in the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call--I'm reverting this to a Dictionary
.
83f0372
to
7064b80
Compare
@skofman1 I've worked with @joelverhagen and @agr to land on a more elegant solution here--we're caching a scope factory in the |
var packageVulnerabilitiesCacheService = | ||
DependencyResolver.Current.GetService<IPackageVulnerabilitiesCacheService>() as | ||
PackageVulnerabilitiesCacheService; | ||
if (packageVulnerabilitiesCacheService != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make the necessary methods just available on the interface? I don't see why any cast is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep--good call. The same thing is being done for cloudDownloadCountService
above, but it is redundant. I'll remove that.
|
||
public PackageVulnerabilitiesCacheService(ITelemetryService telemetryService) | ||
{ | ||
_telemetryService = telemetryService; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments. Otherwise looks good :)
/// Track how long it takes to populate the vulnerabilities cache | ||
/// </summary> | ||
/// <param name="milliseconds">Refresh duration for vulnerabilities cache</param> | ||
void TrackVulnerabilitiesCacheRefreshDurationMs(long milliseconds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: pass TimeSpan
here, don't mention "ms". The implementation below can make a "ms" specific metric emitted to AI but the interface can be cleaner than that. It's not a big deal though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm so confused :) #8580 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TimeSpan
makes sense but is inconsistent with other telemetry methods. I'd like us to retrofit code if we pivot on practice, so other devs can follow the lead of what's there more efficiently. I'm happy to do this here but it may be good to consider this going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking further, TimeSpan
here would mean removing Ms
from the method name, which would conflict with @skofman1's comment as well. I think I'll leave it as-is for now? When we arrive at consensus on how we want our telemetry to function for duration tracking, perhaps we can go through and change them all as a separate change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a quick visual scan: TrackDownloadJsonRefreshDuration
is the only outlier. Perhaps I could change both of them in this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I would still leave Ms
off the method name as it's no longer ms, and would make them both use TimeSpan
)
@@ -265,12 +265,29 @@ private static void BackgroundJobsPostStart(IAppConfiguration configuration) | |||
|
|||
if (configuration.StorageType == StorageType.AzureStorage) | |||
{ | |||
var cloudDownloadCountService = DependencyResolver.Current.GetService<IDownloadCountService>() as CloudDownloadCountService; | |||
var cloudDownloadCountService = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please minimize changes to unrelated lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't sure of our policy on cleanup. Thanks.
// Perform initial refresh + schedule new refreshes every 30 minutes | ||
var serviceScopeFactory = DependencyResolver.Current.GetService<IServiceScopeFactory>(); | ||
HostingEnvironment.QueueBackgroundWorkItem(_ => packageVulnerabilitiesCacheService.RefreshCache(serviceScopeFactory)); | ||
if (configuration.StorageType == StorageType.AzureStorage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this predicated on AzureStorage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't seem to make sense to have this running on local environments. If you think it should be all cases, I'm happy to remove it.
@@ -460,6 +460,11 @@ protected override void Load(ContainerBuilder builder) | |||
.As<IPackageVulnerabilitiesManagementService>() | |||
.InstancePerLifetimeScope(); | |||
|
|||
builder.Register(c => new PackageVulnerabilitiesCacheService(c.Resolve<ITelemetryService>())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just registry the type, i.e. RegistryType
. You don't need explicit ctor here right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm sure there was a reason I thought this was appropriate when it was more complex, but on reflection even then I was mistaken. Will clean up.
4452155
to
4c4f43d
Compare
public PackageVulnerabilitiesCacheRefreshJob(TimeSpan interval, | ||
PackageVulnerabilitiesCacheService packageVulnerabilitiesCacheService, | ||
IServiceScopeFactory serviceScopeFactory) | ||
: base("", interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use explicit parameter name for ambiguous values like this empty string
|
||
public override Task Execute() | ||
{ | ||
return new Task(() => _packageVulnerabilitiesCacheService.RefreshCache(_serviceScopeFactory)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a new task? Who calls Execute
? Is it already in a background task? If so, you can just synchronously execute RefreshCache
and return Task.CompletedTask
.
4c4f43d
to
5a92c9f
Compare
Addresses: #8543
Note that this change is against
dev
and will need to be rebased tomain
if we wish to deploy it as a hotfix.This introduces the
PackageVulnerabilitiesCacheService
which is modelled on TypoSquatting caching. I've moved the query logic into the cache service, using an arbitrary 30-minute caching length per query result.