Network based dependency resolution #5597

Closed
ishitatsuyuki opened this issue May 31, 2018 · 23 comments
Labels
A-dependency-resolution Area: dependency resolution and the resolver

Comments

@ishitatsuyuki
Contributor

This is a proposal to move to API-call-based dependency resolution, rather than the Git-based index clone. This is similar to what NPM uses.

Mainly, the git-based index clone has scalability issues:

  • The initial clone is a burden. Even after squashing, it still takes a few megabytes, which is poor UX on a throttled mobile network. An NPM-like solution consumes a much more acceptable amount of traffic.
  • It doesn't scale. If we had as many packages as NPM, the index would be too large to download.

Then, here are some random notes:

  • Resolution caching should be done at the HTTP level. This approach is good because it is also honoured by the CDN (if we eventually use one).
  • When working offline, we can rely on crates we have cached locally to resolve dependencies.
  • To replace the functionality of the crates.io-index repository, we can have daily database dumps, and offer an API and/or UI to display the most recently updated crates.
@est31
Member

est31 commented Jun 8, 2018

The initial clone is a burden. Even after squashing it still takes a few megabytes
It doesn't scale. If we had as many packages as NPM, the index would be too large to download.

I don't think we'll ever get as many packages as NPM. Systems programming is a small field compared to the giant field that npm targets (basically everything). I also think that with a smart (custom) binary format, we can stay in the range of megabytes for tens of millions of packages/package versions. These days, when I visit a single website like Gitter or crates.io, I'm already downloading that much data.

As a further optimization suggestion: right now every .crate file you download contains the entire tests/ directory, sometimes with giant test assets. One of the worst offenders is rls-analysis, where each .crate file takes around 1 MB.
For Crater, this directory is interesting, so it should not be completely left out when uploading, but it should not be downloaded when you only want to compile the crate as a dependency.

@matklad
Member

matklad commented Jun 8, 2018

As a further optimization suggestion: right now every .crate file you download contains the entire tests/ directory, sometimes with giant test assets. One of the worst offenders is rls-analysis, where each .crate file takes around 1 MB.

There's an exclude key in the manifest for that. @est31 might be a good idea to send a PR to rls-analysis!
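
For reference, a Cargo.toml along these lines keeps large test assets out of the published .crate file; the paths here are illustrative, not the crate's actual layout:

```toml
[package]
name = "rls-analysis"
version = "0.13.0"
# Glob patterns skipped by `cargo package` / `cargo publish`.
# Example paths only; the real crate may exclude different directories.
exclude = ["tests/*", "test_data/*"]
```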

@ishitatsuyuki
Contributor Author

@matklad I think I've already sent one: rust-dev-tools/rls-analysis#132

@est31
Member

est31 commented Jun 8, 2018

Oh right. I was looking at rls-analysis-0.9.3.crate, which is 932K, but rls-analysis-0.13.0.crate is only 20K. Thanks @ishitatsuyuki!

Anyway, the problem exists with all crates that have test data, not just rls-analysis. I was only using rls-analysis as an example :p.

@ishitatsuyuki
Contributor Author

ishitatsuyuki commented Jun 8, 2018

@est31

I don't think we'll ever get as many packages as NPM. Systems programming is a small field compared to the giant field that npm targets (basically everything).

This sounds like an underestimation. It's true that we're different from Node, but Rust is just as flexible as Node, and it will keep improving. For example, once the futures ecosystem matures, I'd expect more web-related crates.

I also think that with a smart (custom) binary format, we can stay in the range of megabytes for tens of millions of packages/package versions.

I experimented with a custom SQLite-based format, and the result was that it didn't improve much over the squashed Git/JSON-based index. Plus, you need to handle delta transfer somehow. Also, tens of millions of entries are definitely unlikely to compress to an acceptable file size. tl;dr: it doesn't scale.

For crater, this directory is interesting so it should not completely be left out when uploading,

So far I don't know of any use cases apart from Crater, so changing the semantics of a published crate is questionable.

@matklad
Member

matklad commented Jun 10, 2018

I've chatted recently with @kornelski, and he claims that clones from GitHub are actually quite slow compared to just hosting the repository yourself. So an interesting middle ground, before embarking on the implementation of a custom binary protocol and a supporting server, is to try hosting a mirror of the index outside of GitHub and see whether that would be noticeably faster.

@kornelski
Contributor

kornelski commented Jun 10, 2018

Part of it is that GitHub discourages the use of shallow clones (e.g. brew had to move from shallow to full clones). A shallow clone would go a long way toward solving the pain of the initial download.

Another option that I think would be interesting is to use a non-git protocol for updates, e.g. http://zsync.moria.org.uk/

@dralley

dralley commented Jun 17, 2018

I work on an open source project named Pulp, which is a repository management solution similar to Artifactory. The git repository based index presents a lot of problems for us at a fundamental level.

The general idea behind repository management utilities is that they give an admin the ability to do things like:

  • create multiple repositories (e.g. development / staging / production)
  • upload your own packages (e.g. stuff you don't want to publish on crates.io)
  • sync in packages from external repositories, like crates.io, npm, PyPI, etc. as a bulk operation
    • using a whitelist of certain specific packages
    • using a blacklist of certain banned packages (like typo-squatters)
    • using multiple external repositories, to combine them together
  • be able to make a series of changes to the repository without immediately making them available to clients
    • have control over exactly when the "live" repository state changes
  • maintain a history of repository state, and be able to roll back to a known good version if necessary
  • be able to "promote" the state of the "development" repository to "staging", and "staging" to "production" and so forth
  • have one solution for managing multiple different content types, instead of using crane + cargo + gem + docker + ...
    • manage testing of a diverse "production" environment all at the same time.
    • have assurances that nothing about the production environment will change unless you want it to change

Cargo provides similar functionality in many cases (e.g. "yank"), which makes a repository management solution a little less useful than with, say, Debian or RPM packages, but there is still a potential value-add.

The problem is that by making the repository index a git repository, the general workflow of many of these items becomes difficult or impossible. For instance:

  • managing a mirror which you're adding your own packages to becomes more difficult
  • rolling back the repository becomes difficult / impossible because the git history would diverge from what the clients are expecting.
  • promoting repository state between repositories becomes difficult / impossible for the same reason

I understand that not every use case listed here is necessarily something you would want to do in a language where the tooling already provides a lot of this functionality, but as I mentioned, there's still a value add, and very very few other types of content are restricted in this same way.

I don't see any fundamental reasons why it's important for the index to be a git repository. With that being the case, I'm not really sure how to write a Cargo plugin for Pulp that wouldn't essentially be restricted to mirroring the crates.io API, providing almost no additional functionality and not integrating with any of the core features. The idea has "here be dragons" written all over it, IMO.

edit: I have no specific opinion on the proposal, just echoing an objection to the current git-based solution.

@acmcarther
Contributor

acmcarther commented Jun 18, 2018

I don't have major meaningful design input, except to say that this has wider implications than just making dependency resolution better in some ways. Due to the threadbare nature of Rust's stdlib, being able to resolve and pull down crates is essential for any would-be build tool for Rust. Burying dependency resolution in a backend would add more hoops than already exist in front of folks who want to build Rust with something that isn't Cargo.

I don't expect this to be a major consideration in either direction -- I just want to put the idea into everyone's minds that some poor fools (like me) might be trying to do crate resolution in weird ways and having that be possible 100% locally is very nice for our usecases.

EDIT/Amended: Though, conversely, this might open a door to decoupling some implementations from the cargo binary, which would be interesting.

@jonhoo
Contributor

jonhoo commented Oct 22, 2018

While I quite like the idea of moving to something that doesn't require you to download the entire index in general, it's worth pointing out that it'll make things like docs.rs much harder. docs.rs currently just keeps track of the latest index commit it built, which makes it super easy to keep track of what has or hasn't been built. /cc @onur @QuietMisdreavus

@Eh2406
Contributor

Eh2406 commented Oct 22, 2018

@dralley technically cargo does a hard checkout of master. The registry need not return the same history every time; it is allowed to diverge from what the clients are expecting, at the cost of losing the efficient diff transfer provided by git. Alternatively, keep the history consistent and linear but completely replace the contents of the registry.

@jonhoo Yes, this would have to go along with a crates.io endpoint for monitoring changes.

Another design constraint is that most of the time when generating a lockfile we are not just going to look up one package; we are going to look up hundreds to thousands of packages, and we don't know what we will need to query next. For example, to fulfill regex = "*" from our Cargo.toml (a sketch of this sequential lookup pattern follows the list):

  1. query the index for regex = "*", start investigating 1.0.5
  2. query the index for aho-corasick = "^0.6.7", start investigating 0.6.8 (possibly batched with regex v1.0.5's other direct dependencies)
  3. query the index for memchr = "^2", start investigating 2.1.0
  4. query the index for cfg-if = "^0.1.5", start investigating 0.1.6 (possibly batched with memchr v2.1.0's other direct dependencies)
    ....
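
A minimal sketch of that loop (the `Index` trait and types are invented for illustration, and the real resolver also backtracks); the point is only that each answer determines the next query, so the requests are inherently sequential:

```rust
use std::collections::{HashSet, VecDeque};

// Invented trait: one call here would be one network round trip against an
// API-based index.
trait Index {
    /// Returns (version, direct dependency names) for every published version.
    fn versions(&self, name: &str) -> Vec<(String, Vec<String>)>;
}

fn crates_we_end_up_querying(index: &dyn Index, root: &str) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    queue.push_back(root.to_string());
    while let Some(name) = queue.pop_front() {
        if !seen.insert(name.clone()) {
            continue; // already queried this crate
        }
        // Pretend we pick the newest version; only now do we learn which
        // crates we have to ask the index about next.
        if let Some((_, deps)) = index.versions(&name).into_iter().last() {
            queue.extend(deps);
        }
    }
    seen
}
```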

Some harder details that could make this faster:

  • The resolver could be asynchronous and respond to queries as they are returned. It is already ridiculously complicated, and I don't know if this is where we want to spend the complexity budget.
  • The resolver could identify batches of queries to send to the registry. Not as hard as the previous option, if the registry supports batches.
  • Crates.io could also run a resolver and return the whole tree of things you'll need to resolve your query in isolation, not just the direct answer. The risk of running an NP-hard computation as a service is that it invites denial-of-service attacks.

The prospect of O(n) network trips makes me less sure that the full index clone is so expensive by comparison. Note that if NPM uses this strategy, they must already have a solution for this, so we should find out what it is.

@dralley

dralley commented Oct 22, 2018

@Eh2406 Yeah, in the time since that post I did some experimentation and found that while a normal "git pull" took ~30 seconds, a "git pull --depth 1" only took ~2 seconds, so it's not as bad as I was initially thinking.

@ishitatsuyuki
Contributor Author

@Eh2406 The O(n) network round trips are not much of an issue; n is normally small enough that we don't need batching, and we already make that many requests when downloading packages. Also, we don't want to do dependency resolution on the server side because it's simply too much computation, even when it isn't a worst case.

So, moving on to providing a good user experience, what we need is basically to reduce latency so that the queries can run nearly as fast as against a local index.

  • Use a CDN to cache the requests for popular packages. NPM recently switched to Cloudflare (I don't know what they used before).
  • If needed, geo-replicate all index data. This can be done either by replicating the DB (NoSQL databases are better at this) or by offloading entirely to cross-region replicated S3 buckets. This reduces worst-case latency.

And finally, network latency is still higher than disk latency in most cases, so as you said we should make Cargo async-ready.

@mx-shift

mx-shift commented Oct 23, 2018 via email

@ishitatsuyuki
Contributor Author

@kc8apf Let's go through them one by one.

I travel quite a bit and routinely find myself on connections where latency can easily be 150+ ms. A moderate size dependency graph will quickly rack up many seconds of latency.

That indeed sounds like a problem. Though I'm not sure where the "acceptable" line is here - given that latency, git operations are going to be slow as well, so we should test to see whether we need additional optimizations for this.

There's also a question of how to cache the results for offline builds and then invalidate those results once back online.

I don't think that's necessary - once Cargo.lock is generated, we don't need access to the index anymore until the dependency tree is modified again.

How will a client determine that a registry supports a different mechanism without fetching the v1 index over git?

This is unfortunate indeed, since the custom registry RFC has already been accepted. Though moving formats doesn't need to be that hard - I think we can simply add a config property to specify the registry protocol version.

@kornelski
Contributor

kornelski commented Oct 23, 2018

Perhaps the fix could be to keep resolution driven by the client, but change the way the index is downloaded and updated to allow having a partial index on the client.

Roughly, change update model from git-like to svn-like. Instead of downloading the whole thing ahead of time, download and cache bits of the index as needed, but still operate on the local copy.

I think that would scale even to NPM's size, since the partial index on the client wouldn't have to be much larger than the working set of crates people actually use.

A partial fetch of the index as needed would, in the worst case, be similar to the O(n)-network-trips resolution method mentioned before, but with local caching this becomes fast and even works offline once enough of the index is cached (a rough sketch follows after the list below).

To make things faster:

  • Cargo could come preloaded with a mini partial version of the index that contains the top N most popular crates.
  • The update protocol could allow the server to push more into the client's cache than was asked for, e.g. client: "I want re/ge, because I'm looking up regex"; server: "here's re/ge and ah/o-". Being entirely opportunistic and client-driven, this doesn't require the server to do any NP-hard computation.
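
A rough sketch of that cache-on-demand model; the entry format and `fetch_from_registry` are placeholders, not an existing Cargo API:

```rust
use std::collections::HashMap;

// Local partial index: entries are fetched over the network on first use and
// then served from the cache, which also covers the offline case.
struct PartialIndex {
    cache: HashMap<String, String>, // crate name -> raw index entry (JSON lines)
    online: bool,
}

impl PartialIndex {
    fn entry(&mut self, name: &str) -> Option<&String> {
        if !self.cache.contains_key(name) {
            if !self.online {
                return None; // offline and never cached: can't resolve this crate
            }
            let fetched = fetch_from_registry(name)?;
            self.cache.insert(name.to_string(), fetched);
        }
        self.cache.get(name)
    }
}

// Placeholder for an HTTP GET of something like `index/<prefix>/<name>`.
fn fetch_from_registry(name: &str) -> Option<String> {
    let _ = name;
    None
}
```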

@Eh2406
Contributor

Eh2406 commented Oct 23, 2018

We often need to query exponentially more crates than we download (that is part of the definition of NP-hard), and we are already working on batching the download process because it has too much latency (#6005). We can get hard numbers by instrumenting the number of times we get to here vs. the length of the last section of the lockfile.

Making the resolver async-ready will require a pretty major rewrite. I will keep this new design constraint in mind in my dreams of a new resolver. So far all such dreams are early conversations about possible vaporware. To be fair, I thought the same thing about public & private dependencies, and after 8 months of thinking about how to do it, it is almost done without needing a major redesign.

@mx-shift

mx-shift commented Oct 23, 2018 via email

@mx-shift

mx-shift commented Oct 23, 2018 via email

@kornelski
Contributor

We often need to query exponentially more crates than we download (that is part of the definition of NP-hard)

I'm assuming that when you get a crate's info, you get the same file as currently in the index, which contains dependency information for all versions of the crate.

So if you had to make 1 request per 1 version of a crate, that would be lots of requests. But if one request returns info about all versions of a crate, I think that's not too bad. While different versions may have different dependencies, it's exponentially bad only in theory. In practice people don't replace all their dependencies on every release with dependencies that replace all their dependencies on every release, etc., so the worst case is that you'll uncover and cache a somewhat larger superset of dependencies.

@ishitatsuyuki
Contributor Author

But if one request returns info about all versions of a crate, I think that's not too bad.

That's exactly what NPM does. @Eh2406 Resolving dependencies is NP-hard (if two dependencies request conflicting dependencies, it may not be possible to reason about it in acceptable time), but the amount of index data is not.

we are already working on batching the download process

@Eh2406 Please don't call that batching, because it's confusing. It's just concurrency; we still make the same number of requests. Batching is more like api/crates?id=a,b,c, to eliminate all the overhead of middleboxes and maybe N+1 queries.
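
To illustrate the distinction, a small sketch; the URL shapes are made up and crates.io does not currently expose a batch endpoint like this:

```rust
// Batching: one request that asks about many crates at once.
fn batched_url(base: &str, names: &[&str]) -> String {
    format!("{base}/api/crates?id={}", names.join(","))
}

// Concurrency: the same N requests as before, just in flight at the same time.
fn individual_urls(base: &str, names: &[&str]) -> Vec<String> {
    names.iter().map(|n| format!("{base}/api/crates/{n}")).collect()
}

fn main() {
    let names = ["regex", "aho-corasick", "memchr", "cfg-if"];
    println!("{}", batched_url("https://example-registry.test", &names));
    for url in individual_urls("https://example-registry.test", &names) {
        println!("{url}");
    }
}
```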

Perhaps the fix could be to keep resolution driven by the client, but change the way index is downloaded and updated to allow having a partial index on the client.

Roughly, change update model from git-like to svn-like. Instead of downloading the whole thing ahead of time, download and cache bits of the index as needed, but still operate on the local copy.

@kornelski I'm not sure how that can be achieved. We need some degree of consistency in the index so that, for example, when num publishes a new major release we also see the new num-traits release that it depends on. That means we need to validate the cache before use; that's probably achieved by an HTTP conditional request, which requires the same round trip.
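
A sketch of that validation step, assuming the reqwest crate (with its blocking feature) and a hypothetical per-crate index URL; a 304 saves bandwidth but still costs the round trip:

```rust
// Revalidate a cached index entry with If-None-Match. Returns Ok(None) if the
// cached copy is still current, or the fresh body if the entry changed.
fn revalidate(
    client: &reqwest::blocking::Client,
    url: &str,         // e.g. "https://index.example/nu/m_/num-traits" (made up)
    cached_etag: &str, // ETag stored alongside the cached entry
) -> reqwest::Result<Option<String>> {
    let resp = client
        .get(url)
        .header(reqwest::header::IF_NONE_MATCH, cached_etag)
        .send()?;
    if resp.status() == reqwest::StatusCode::NOT_MODIFIED {
        Ok(None)
    } else {
        Ok(Some(resp.text()?))
    }
}
```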

Cargo could come preloaded with a mini partial version of the index that contains top N most popular crates.

This makes sense. I think what you propose is similar to HTTP/2 push, so there's a chance we can implement that optimization on top of the HTTP/2 spec.

@kornelski
Contributor

kornelski commented Oct 24, 2018

We need consistency to some degree on the index so for example when num publishes a new major release we also see the new num-traits release that it depends on.

This is indeed tricky. A couple of solutions come to mind.

Revision map solution

Let's assume we have an index revision, a monotonically increasing number, bumped every time any crate is updated/published anywhere.

The client could be given a mapping between crates and the latest revision in which they were modified. num => 2, num-traits => 3, so the client would see "I've last checked num-traits when the revision was 2, so I need to update".

Because it's safe for the mapping to contain a revision that's too high (it'll just cause an unnecessary check), it can be compressed in a lossy way, e.g. only send the max revision per subdirectory, or the max revision based on a short hash of the crate name. The mapping itself can be updated incrementally, sending revisions of crates that have been recently published.
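
In code, the client-side check could look roughly like this (all types invented for illustration):

```rust
use std::collections::HashMap;

// server_map: crate -> latest revision in which it changed (possibly lossy,
// i.e. an upper bound). client_checked: crate -> revision the client last
// checked that crate at. Returns the crates that need a fresh index fetch.
fn crates_to_refresh(
    server_map: &HashMap<String, u64>,
    client_checked: &HashMap<String, u64>,
) -> Vec<String> {
    let mut stale = Vec::new();
    for (name, &modified) in server_map {
        let checked = client_checked.get(name).copied().unwrap_or(0);
        if modified > checked {
            stale.push(name.clone());
        }
    }
    stale
}
```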

Merkle hash tree

Similar to mapping between crates and revisions, but it'd use content hashes instead of revisions.

The downside is that hashes won't compress in a lossy way, so the index would grow. The upside is that the hashes can be built into a tree, similar to git's tree and commit hashes, so they could be used to verify not only whether crates are up to date, but also verify their integrity.

Update-everything-since solution

The client would note that the oldest crate it has is from revision 3, and ask the server to send metadata for crates that changed between rev. 3 and whatever is latest. The response would contain irrelevant crates (just as a git update does today), but the diffs are still small compared to the whole index. The client can choose to ignore or delete the irrelevant crates.

(I'd go with that solution, since it's easiest to implement)
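
A sketch of the client side of that model; the endpoint and entry type are placeholders:

```rust
use std::collections::HashMap;

// One changed index entry, as a hypothetical `GET /index-changes?since=<rev>`
// endpoint might return it.
struct ChangedEntry {
    name: String,
    raw_index_lines: String,
}

// Apply a batch of changes to the local partial index and remember the new
// revision; entries for crates we never use just cost a little disk space.
fn apply_changes_since(
    cache: &mut HashMap<String, String>,
    last_synced: &mut u64,
    changes: Vec<ChangedEntry>,
    new_revision: u64,
) {
    for change in changes {
        cache.insert(change.name, change.raw_index_lines);
    }
    *last_synced = new_revision;
}
```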

Reverse revision information

Each crate's file could contain a revision number that is the latest revision of any of its dependencies. So if you install num, it'd say that any of its dependencies (and dependencies of dependencies) last changed in rev 3. The client would know it doesn't have to contact the server for deps it has checked more recently.

The implementation would depend on the server updating a crate's dependents every time a crate is published. Doing this is feasible; e.g. on crates.rs I resolve all dependencies for all crates and build a reverse index, so I can tell that libc is an indirect dependency of 9,795 crates. Yeah, the downside is that a new version of libc would require the server to update 9,795 files :)

@ehuss ehuss added the A-dependency-resolution Area: dependency resolution and the resolver label Sep 21, 2019
@Eh2406
Contributor

Eh2406 commented Feb 7, 2023

Closing in favor of #9069

@Eh2406 Eh2406 closed this as completed Feb 7, 2023