-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fuzzy queries cause excessive network requests. #11934
Comments
I think this is personally reasonable. I was surprised when I saw what lengths it went through. |
Here's a list of names on crates.io that use a mix of dash and underscore. There are 623 entries. That's not as many as I feared. Click to expand list of 623 entries
|
I did some more analysis on the names on crates.io. The currently limit of 1024 is definitely excessive, as nothing uses more than 8 dashes/underscores. We'd need to set the limit to at least 16 in order to have fewer misses than the suggested algorithm of (original, underscores, dashes).
|
Stop using UncanonicalizedIter for QueryKind::Exact `QueryKind::Exact` causes unnecessary HTTP requests when querying for crates that don't exist. Even though the query is `Exact`, when fetching `Summaries`, Cargo still uses `UncanonicalizedIter` and requests additional possible crate names if the first one isn't found. This change moves the use of `UncanonicalizedIter` further up the call stack such that we have the `QueryKind` available, and only do the additional queries for `QueryKind::Fuzzy`. The impact is most noticeable when publishing a new crate that contains `-` or `_`. Since Cargo waits for publish by querying the registry, if the crate name is `my-example-test-crate`, Cargo currently makes 8 HTTP requests each second while waiting for the crate to be available. With this fix, Cargo makes only 1 request per second. cc #11934
What about doing the requests in parallel? This is also how If one wants to do "escalation" then one can first make a request for all names with 1 bit flipped, then wait for the answer, then if nothing is found make another request for all names with 2 bits flipped, etc, with increasing numbers of bits flipped. This is the binomial coefficients game so it decreases again after you reach the half, so maybe at that point one can make one last request with all the bit flippings exceeding n/2. |
Cargo does issue the requests in parallel. The motivation for investigating this was the high number of 404s that infra was seeing in logs. |
@arlosi I see. Is there some public discussion about the 404 issue? I suppose if only 623 entries are affected then it's okay to cut some corners and emit suggestion-free diagnostics in those cases, in all other cases the - and _ only queries will show up the right result. Ideally one would have used the migration to http indices to migrate to using the canonicalized name (as suggested in 2020), but I guess now it's too late for that. |
Yes, on the crates-io zulip. |
Sparse registries have the same format as a checkout of the git index. You can transition by checking out the git index and running a static file server in the resulting folder. That transition path was important enough to me to justify not fixing known problems with the index format. Feel free to blame me for that decision, especially as the decision was made knowingly. Another decision that complicates things is that alternative registries are allowed to have multiple packages whose names only differ by Thinking about it more, the currently stable implementation tries all kinds of files and returns the first one it finds. (All kinds of files include ridiculously incorrect things like Assuming all of this is correct, I suggest that:
|
do not try an exponential number of package names re #11934, and as discussed in the cargo team meeting, this changes the strategy to "the original, all underscore, and all dashes". I was excessively proud of the `hyphen_combination_num` based implementation when I came up with it. But it's always been a hack. I'm glad to be the one to remove it.
When cargo can't find a crate name in a registry, it will also query every permutation of
-
and_
replacements (up to 1024 times) to see if there is a different match. When this code was written, it was assumed it would just be a simple query to git, but now that the sparse index does a network round trip for each one, it probably isn't a good assumption that doing 1024 is OK.I think at a minimum it should lower the limit significantly.
Another option is to only do up to 3 queries, the original, all underscore, and all dashes.
The text was updated successfully, but these errors were encountered: