-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backfill repo metadata tool #6288
Conversation
|
||
<add key="Gallery.SqlServer" value="Data Source=(localdb)\mssqllocaldb; Initial Catalog=NuGetGallery; Integrated Security=True; MultipleActiveResultSets=True"/> | ||
<add key="Gallery.SupportRequestSqlServer" value="Data Source=(localdb)\mssqllocaldb; Initial Catalog=NuGetGallery; Integrated Security=True; MultipleActiveResultSets=True"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is Gallery.SupportRequestSqlServer
added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DefaultDependenciesModule can't be created without it.
Spoke offline about this, but we should have a README.md that is next to the csproj that describes how this command works. Granted, the other commands don't have docs but we can improve this as we go on. |
CommandOption fileName = config.Option("-f | --file", "File to use", CommandOptionType.SingleValue); | ||
|
||
config.HelpOption("-? | -h | --help"); | ||
config.OnExecute(() => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OnExecute
can accept a Func<Task<int>>
. This is means the whole method can be async
and you can avoid .GetAwaiter().GetResult()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
{ | ||
Log.LogMessage("Initializing"); | ||
|
||
_context = new EntitiesContext(connectionString, readOnly: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, readonly. nice.
|
||
Log.LogMessage($"Start time: {startTime.ToString("u")}"); | ||
|
||
var packagesRepository = new EntityRepository<Package>(_context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have DI above, why no use it? Makes this more flexible for the future (future backfills...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain?
|
||
var packagesRepository = new EntityRepository<Package>(_context); | ||
|
||
var allPackages = packagesRepository.GetAll().Where(p => p.Created < lastCreateTime && p.Created > startTime && p.PackageStatusKey == PackageStatus.Available); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you can miss validating packages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or less importantly failed validation packages that we reflow?
Paging over Edited
will mean you never miss a package.
Also, consider adding a 5-30 minute subtract when first starting due to clock skew.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LastEdited is null for ~1M packages. Will stay with Created. Added support for validating packages. Don't see how clock skew is a factor if we compare always to Created that's DB generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those can come in out of order. There is no guarantee that a sequence of Created timestamps are committed in the same order that they are generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can order by LastEdited, PackageKey then. But it's fine if you include validating.
|
||
private static async Task<RepositoryMetadata> GetRepositoryMetadata(string id, string normalizedVersion) | ||
{ | ||
string nuspecUri = $"{_flatContainerUri}{id.ToLowerInvariant()}/{normalizedVersion.ToLowerInvariant()}/{id.ToLowerInvariant()}.nuspec"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: trim trailing '/' and include '/' in the string format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_flatContainerUri
might have trailing /
-- or might not. It's an easy config mistake to make.
just a nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
private static string GetFlatContainerUri(Uri serviceDiscoveryUri) | ||
{ | ||
var serviceDiscoveryClient = new ServiceDiscoveryClient(serviceDiscoveryUri); | ||
return serviceDiscoveryClient.GetEndpointsForResourceType("PackageBaseAddress/3.0.0").GetAwaiter().GetResult().First().AbsoluteUri; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be async
methods. .GetAwaiter().GetResult()
should only be in one place per app. (In most cases)
|
||
private static async Task WriteMetadata(DateTime creationDate, string packageId, string packageVersion, RepositoryMetadata repositoryMetadata) | ||
{ | ||
await _metadataFileStreamWriter.WriteLineAsync( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider one JSON object per line to avoid escaping problems. Or use a CSV library. Repository metadata could have a comma (type in particular).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using CsvHelper
} | ||
|
||
private static async Task CommitBatch(DateTime cursorTime) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might run into memory problems since the EF context will hold onto all of the old entities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can recreate the EF context each batch to mitigate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is indeed an issue for the "collect" part. However, introducing batching there is too much work. Will leave as is. Thanks for the cursor the job can continue after crash.
|
||
CommandOption lastCreateTimeOption = config.Option("-l | --lastcreatetime", "The latest creation time of package we should check", CommandOptionType.SingleValue); | ||
CommandOption collectData = config.Option("-c | --collect", "Collect Repository metadata and save in file", CommandOptionType.NoValue); | ||
CommandOption updateDB = config.Option("-u | --update", "Update DB with Repository metadata", CommandOptionType.NoValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider a less aggressive starting point for how quickly updateDB
updates batches. This could be done with batchSize
and sleepSeconds
parameters. Then you can do some batches slowly on PROD before kicking it into high gear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using the same batching config as I did for "hash" Gallery tool. It saved a 100 and had no sleep. There was not DB impact.
counter = 0; | ||
} | ||
|
||
metadata = await TryReadNextMetadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do ... while
should be able to eliminate the duplicated metadata = await TryReadNextMetadata();
line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will leave as is...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⌚️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// This tool collect repository metadata for all packages in the DB from nuspec files in V3 flat container and updates DB with this data.
/// Usage:
/// 1. To collect repository metadata:
/// a. Configure app.config with DB information and service index url
/// b. Run this tool with: GalleryTools.exe -c
/// This will create a file repositoryMetadata.txt with all collected data. You can stop the job anytime and restart. cursor.txt contains current position.
///
/// 2. To update DB:
/// a. Run GalleryTools.exe -u
/// This will update DB from file repositoryMetadata.txt. You can stop the job anytime and restart.