
[Discussion] Zero-install & repository size #180

Closed
brillout opened this issue May 19, 2019 · 25 comments

Labels
discussion Discussion about the project

Comments

@brillout

I really like the concept of getting rid of the yarn install step for deploying to production.

But I have projects that have 500MB of dependencies. Adding .yarn would dramatically increase the repo size.

The problem is that git hosts have limits on repo size. (E.g. GitHub recommends repo sizes to be <1GB.)

Git LFS could be a solution, but it seems to be fairly expensive.

I'm curious about your thoughts on this.

@arcanis (Member) commented May 19, 2019

At the time I started considering this option I contacted some folks at GitHub to make sure it wouldn't become a problem on their side. From what they told me it should be perfectly fine to host this amount of binary data (at least on GitHub).

What's nice with this approach (and maybe it would be even more true in the wake of the GitHub package registry) is that it's open to various optimizations. For example, assuming that many projects use the same version of the Lodash archive, I would expect GitHub could eventually merge them into a single copy in their "store". From the consumer perspective it wouldn't change a thing, except that their storage wouldn't be affected by the number of packages they use.

Finally, Zero-Install is optional, and as always there's a tradeoff. Smaller libraries might not really need it, and they can just use the common yarn install workflow everyone is used to. Enterprise applications and large projects with many contributors, however, will likely be fine with trading some MBs for the improved DevX and guaranteed stability.

@arcanis arcanis changed the title Zero-install & repository size [Discussion] Zero-install & repository size May 19, 2019
@arcanis arcanis added the discussion Discussion about the project label May 19, 2019
@bgotink (Member) commented May 19, 2019

While cloud-based services like github and bitbucket might support large repositories because they've got a lot of resources, on-prem solutions are more limited. Our enterprise git server becomes totally unresponsive when a new designer is onboarded and clones the designs repo, leading to developer frustration, CI failures and, most importantly, CD failures. While an initial size of ± 250 MB is still okay, this size will increase significantly once we've updated our dependencies a couple of times.

Zero install is still a great feature though. It solves the "I switched branch and now a dependency is missing" problem, it speeds up CI significantly and it lowers our dependence on our on-prem npm registry.

I've been thinking about this for a while now, mostly when trying to sleep, and this is where I'm at:

  • Commit them into git
    • + no install necessary, "clone & go"
    • - the size of the git repository increases significantly upon updating dependencies
    • - on-disk duplication of dependencies between projects
  • Use git-lfs
    • + no install necessary, "clone & go"
    • + size of the git repo stays small
    • ? is this going to work well? we don't have large files but many files (berry's .yarn contains 2,208 zip files; the entire folder is 180MB, so on average roughly 81.5kB per zip); a minimal setup sketch follows this list
    • - on-disk duplication of dependencies between projects
  • Set yarn cache folder to a system folder
    • + size of the git repo stays small
    • + no on-disk duplication of dependencies between projects
    • - install necessary
  • Use a separate git project mounted as .yarn submodule
    • + no install necessary, "clone --recurse-submodules & go"
    • + size of main git repo stays small, dependencies project can be retired and replaced once it becomes too large
    • + dependencies project can be used in multiple projects, sharing dependencies
    • - submodules are hard
    • - on-disk duplication of dependencies between projects, unless you have git fu
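
For the git-lfs option, a minimal setup sketch (assuming git-lfs is installed on every machine and the git host's LFS storage is available; the tracked path matches berry's cache layout):

git lfs install
git lfs track ".yarn/cache/*.zip"
git add .gitattributes .yarn/cache
git commit -m "Track yarn cache through git-lfs"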

@brillout (Author)

A library cache-manager that manages the git submodule would be nice.

It would manage a .cache git submodule to be used by tools such as yarn or parcel.

Symlinks would be taken care of:

$ file .yarn
.yarn: symbolic link to ./cache/yarn
$ file frontend/.parcel
frontend/.parcel: symbolic link to ../cache/parcel

The library would abstract away the git submodule complexity and, from the user perspective, it would just work. The only thing the user would need to do is to save the cache repo address in a file .cacherepo:

$ cat .cacherepo
git@github.com:brillout/my-awesome-app__cache

Every time yarn runs, and prior to any operation, it would call require('cache-manager').init('.yarn/') where .yarn/ is the path of yarn's cache directory. It then automatically sets up the symbolic link and initializes the git submodule.
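
A rough sketch of what that init step might do under the hood (hypothetical logic; the paths follow the symlink example above):

# cache-manager init, sketch only
repo="$(cat .cacherepo)"
# add the cache submodule on first use, update it afterwards
git submodule add "$repo" cache 2>/dev/null || git submodule update --init cache
# point yarn's cache directory at the submodule
mkdir -p cache/yarn
ln -sfn ./cache/yarn .yarn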

It would also occasionally git push --force and squash old commits to reduce the cache repo size.
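
The squash step could look something like this (a destructive sketch; assumes the cache repo's default branch is named master):

cd cache
# replace the whole history with a single commit holding the current tree
git checkout --orphan squashed
git commit -m "Squash cache history"
git push --force origin squashed:master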

If the user doesn't set the .cacherepo file then cache-manager is disabled and no .cache git submodule is created.

It would declutter the code repo while reaping the benefits of zero-install.

@arcanis would yarn be interested in using such a library?

@bgotink what do you think? You seem to have thought a lot about this.

Would be nice to have other tools on board, such as parcel.

@arcanis (Member) commented Jun 11, 2019

It's an interesting idea. The most complicated part would be this:

It would also occasionally git push --force and squash old commits to reduce the cache repo size.

At the moment the cache is implemented within the core and cannot be replaced, but I'd like to offer a way for plugins to replace it with whatever implementation they'd like. Technically it's not too hard; the only subtlety is that, contrary to how plugins currently work, we could only have one cache system at a time.

Under this approach you wouldn't need symlinks, etc.; your cache implementation would just use the submodule as it is.

@sheerun commented Jun 14, 2019

Heh. Let's do some quick math on what to expect with this feature:

  1. Say a project has 250MB of dependencies (not uncommon)
  2. Say during the project's history dependencies are updated 20 times (not uncommon)

It means that to clone this project you need to download around 5GB of data (20 updates × 250MB), versus downloading only 250MB (possibly cached from other projects) with yarn install. I didn't even mention branches.

Other concerning things:

  1. Cloning dependencies from GitHub is far less performant than downloading them from a CDN that lives 10ms from you instead of on the other side of the ocean. Git is also not perfect at parallelising these downloads.
  2. People who worked in data science and tried to put "big" datasets in git repositories know git operations become slow when there are big files committed: checkout, merge, rebase, you name it. It quickly becomes annoying.
  3. When working with monorepos you don't always need all dependencies of all projects available. It's quicker to clone the monorepo and install only what you need.
  4. For production you don't need devDependencies, only production dependencies.

So in short it shouldn't be named "zero-install", but "Install with git instead of yarn all dependencies that any project in this repository ever historically used, also install devDependencies even if you don't need them".

@arcanis (Member) commented Jun 14, 2019

Your analysis isn't much better than a guess. There are various factors at play. For one, you assume that each and every package will be upgraded 20 times. In my experience this is rarely the case. Yarn (and even more so the v2) is pretty good at reusing packages during upgrades. Upgrading from Webpack 3 to Webpack 4 adds only 80 packages, compared to the ~400 packages that are part of Webpack itself.

Of course, there's no denying that a git clone will be slower with more data, but it's still faster than a clone plus an install, especially when you factor in how many times per day you clone versus how many times you install. The balance might tip at some point, but it remains to be seen how long that takes in practical cases, and whether the possible solutions alleviate the issue.

People who worked in data science and tried to put "big" datasets in git repositories know git operations become slow when there are big files committed: checkout, merge, rebase, you name it. It quickly becomes annoying.

The Zero Install approach originates from another feature, the Offline Mirror, which follows the exact same principle except that the installs still need to be run. It was released in 2016. Since then, we have never once heard that this feature was causing issues. In fact, not only did we hear the exact opposite, I even witnessed it myself by working on such a codebase. So why does it work well?

  • Your tradeoffs are not everyone's tradeoffs. A large repository might be a cost that you don't wish to pay, and that's ok, but for someone else this might not be true. Deployment stability and developer experience are two areas that are typically very hard to scale, whereas repository size is easily measurable and optimizable. And git clone --depth is a thing, too. Not perfect, but as the incentives shift, so does the tooling.

  • Both the offline mirror and the zero-install approach are completely optional. If you are in the case I mentioned and don't wish to pay the cost, just put enableGlobalCache: true in the yarnrc file at the root of the repository and you won't ever have to mind it again. It's a default, not a requirement.

So in short it shouldn't be named "zero-install", but "Install with git instead of yarn all dependencies that any project in this repository ever historically used, also install devDependencies even if you don't need them".

While not directly related, I don't find it productive or very ethical to post FUD on Twitter without even waiting to hear what others have to say about your findings.

@sheerun commented Jun 14, 2019

I guess this is a good feature for Facebook-sized private repositories, where everyone is on the same page about git clone --depth and there are optimisations in place for the whole team for handling enormous repositories, but I would be annoyed if I found someone using this feature in the wild, for all the reasons I've mentioned.

Offline Mirror is a useful feature, and I don't think it's comparable, because you don't need to create the mirror directory inside the project or commit it (e.g. you can upload it to and download it from S3 for production). On the other hand it seems that "zero-install" will encourage committing big files into repositories.

Exact numbers for my analysis don't matter, because downloading even 2x more dependencies than necessary is not good. Also, you cannot avoid downloading devDependencies for production even if you use git clone --depth, and usually they weigh more than production dependencies.

You're right about the message on Twitter; I should have waited at least until you answered. Unfortunately all of my arguments still hold, and I posted about it because I would find it harmful if someone decided to do something like this on a public repository. I guess it would be fine if this feature could be enabled only with "private": true, because in that case I don't care.

@arcanis (Member) commented Jun 14, 2019

I mean, some of your points make sense, and we don't necessarily have answers to all of them. Still, my opinion, based on the people I've discussed this with, is that Yarn caters to two different audiences: independent developers, and companies. The two don't always have the same needs, and having options for both is important.

Overall, I think we agree that the feature makes sense but the messaging should be made more clear. I think a table like the one @bgotink started (with the various pros and cons) would be a good addition to the documentation (maybe on a separate page, for example behind a "Should I use Zero-Install?" link). If you're willing to give us a hand, we'd be happy to review such a PR! 🙂

@sheerun commented Jun 14, 2019

I also agree. One more comment: I think one of the reasons why Yarn implemented this feature is to somewhat decentralise package management (a good cause) by committing all code, including dependencies, into git repositories, but I think it could backfire, because some operations on such repositories would be very hard to perform without a centralized service like GitHub (for example, git blame or git log -p -- package.json needs the whole history downloaded).

@arcanis (Member) commented Jun 24, 2019

Some quick data obtained from the Berry repository (which is about 6 months old). The size on-disk of the cloned repo is 253M. After running a tree filter plus an aggressive gc, the size went down to 149M. The size of the cache before being purged is 88M. That would give 16M of extraneous data (the 104M stripped from history, minus the 88M of live cache: ~6%).

It would be interesting to run a similar experiment on a production application 🤔

@sheerun commented Jun 24, 2019

I think you might not have pruned all zip files from the berry repository. Here's how to do it properly:

git clone --mirror https://github.com/yarnpkg/berry
cd berry.git
du -sh .
161M

Then download the bfg tool: https://rtyley.github.io/bfg-repo-cleaner/

java -jar ~/Downloads/bfg-1.13.0.jar --delete-files '*.zip' --no-blob-protection .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
du -sh .
17M	.

So the overhead seems to be roughly 950% (161M vs 17M).

If you do just a shallow clone, the repository is 97MB:

git clone --mirror https://github.com/yarnpkg/berry --depth 1
cd berry.git
du -sh .
97M	.

It means a full clone downloads an extra 64MB of historical .zip dependencies (161M - 97M) on top of 80MB of current dependencies (97M - 17M).

@bgotink (Member) commented Jun 26, 2019

Some numbers from an angular repo at work:

# initial zero-install at angular 7
$ du -sh .git
108M	.git
$ du -sh .yarn
 90M	.yarn

# after updating to angular 8
$ du -sh .git
167M	.git
$ du -sh .yarn
110M	.yarn

Adding 60MB per upgrade is too much for us to safely commit into our repository. The repo would grow by at least 200 MB per year, and this is the repo with the smallest number of dependencies (it's the root of our internal stack; all other repos depend on packages from this repo). Internal dependencies especially are updated often, so I'd expect the number to be a lot more than 200MB/year for some of the other repos.

@arcanis (Member) commented Jul 22, 2019

I feel like that is being erased

Zero-install is optional. If you don't like it, don't use it. It's a bit like ranting against this new weapon called "swords" because it's easier to cut your own fingers with it than with a club.

We're not stuck in the old days anymore where npm could take several minutes to install. The modern versions of the package managers that we use are fast, especially when you have cached versions of your dependencies on your system already.

I've worked on package setups for the past 2+ years as my daily job. I've seen situations where cached installs still amounted, in aggregate, to more than twenty-four hours a day. What's the cost in terms of feedback loop? What's the cost in actual CPU time? What's the cost in build failures caused by bugs during yarn install?

Additionally, Yarn's main selling point, perhaps above its speed, is the stability of its builds. We aim to guarantee you zero surprises. With yarn install this is only partially true, because it's quite possible that you'll forget to run an install and unknowingly compile your code against bogus dependencies. Zero-installs are a way to push back the theoretical limits of this statement: you can now ensure that the state of your project is always right, regardless of where you are in the history.

Finally, whether you share those concerns isn't really the point: Yarn is used by millions nowadays. We're used by small companies, by medium companies, by very large companies. They have different use cases, and sometimes need different solutions. I believe this one is applicable to many scenarios (we're using it to develop Yarn itself, and I think it has proved great so far), but maybe you're not the target.

@aslilac commented Jul 23, 2019

At the very least, I currently only see a way to opt out on a per-project basis. Will it be possible to have a system-wide, or at least workspace-wide, opt-out?

@arcanis (Member) commented Jul 23, 2019

Sure. Just put a ~/.yarnrc.yml file with enableGlobalCache: true, and Yarn will behave the same as before (except that it'll still be using PnP).
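
For example, as a one-liner (appends to any existing config):

printf 'enableGlobalCache: true\n' >> ~/.yarnrc.yml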

@peey commented Oct 30, 2021

The size on-disk of the cloned repo is 253M.

As of today, the size for a fresh clone is 1.28 GB whereas downloading the repo (without git history) is 79 MB.

I like the idea of a self-contained project within 200 MB, but it seems that how yarn stores cached dependencies (or how library authors publish new versions, or maybe just the fact that libraries change a lot) doesn't play well with git's deduplication efforts. I think this problem needs further attention to identify the bottleneck.

This poses a practical problem: while it's zero-installs once you have the repo, getting the repo itself may become more and more challenging as it accumulates cruft from old deps.

Zero-install is optional. If you don't like it, don't use it.

This makes sense, but it'd be nice if yarn issued a best-practices guide for the community at large, so that contributors who prefer it one way (or are bandwidth/disk constrained) find it easy to contribute to repositories that prefer it the other way.

Maybe git-lfs could solve the use case where you're not interested in checking out previous versions of source code (and of cached dependencies). Maybe something else. I think #180 (comment) provides quite an insightful comparison.

It would be good to have a discussion of how future-proof the feature is.

@generalov commented Nov 28, 2021

Perhaps centralised version control systems (like TFS) are more suitable for this approach out of the box. Unlike Git, a centralized VCS client downloads just the latest version of the files to your local machine, keeping all historical data on the server. There, the size of the repository is an infrastructure-level problem rather than yours.

As for Git, a solution might be to put your cache directory into a "shallow submodule" (that is, fetch the submodule using the --depth 1 argument). This can be configured per repository:
https://stackoverflow.com/questions/2144406/how-to-make-shallow-git-submodules
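
A minimal sketch of that setup (the URL and the deps-cache path are placeholders):

# declare the submodule and mark it as shallow in .gitmodules
git submodule add https://example.com/my-app-cache.git deps-cache
git config -f .gitmodules submodule.deps-cache.shallow true
git commit -m "Add shallow cache submodule" .gitmodules deps-cache

# consumers then clone with shallow submodules
git clone --recurse-submodules --shallow-submodules https://example.com/my-app.git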

@bgotink (Member) commented Nov 28, 2021

Since this ticket was opened, git has made improvements to cloning large repositories, most importantly the introduction of clone filters. This is how I cloned yarn recently:

git clone git@github.com:yarnpkg/berry.git --filter blob:limit=200k

The --filter blob:limit=200k option tells git to only download blobs (files) smaller than 200 KiB. I've chosen 200k because it should be plenty to include all source files (200 KiB is a file of 2,048 lines of 100 characters each, which is bigger than anything yarn has in the repo) but not so big as to include the huge cached packages in .yarn/cache. Files excluded by the filter are downloaded when necessary, so you do sometimes need the network when checking out a branch or tag.

I initially had a clone with --filter blob:none to only download the necessary files, but that meant the GitLens extension I run in VS Code constantly required internet access to properly blame files.

The clone (.git folder only) on my machine is 345 MB, compared to the 1.3 GB a full clone fetches.

Only fetching blobs when needed (blob:none) yields a repo with a .git folder of 317 MB. That's not much of an improvement, especially given the downside of having to access the network a lot more when opening source files in VS Code.

Including all source files while excluding large cache files seems like a good compromise between keeping the repo size in check and ensuring enough history is downloaded for tools to work properly.

GitHub, Bitbucket (hosted and server), and GitLab support clone filters.
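
For reference, the two variants compared above, both standard git partial-clone filters:

git clone --filter=blob:none git@github.com:yarnpkg/berry.git        # fetch blobs only on demand
git clone --filter=blob:limit=200k git@github.com:yarnpkg/berry.git  # skip only the large blobs
du -sh berry/.git                                                    # compare the resulting sizes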

Some benefits of clone filters compared to

  • git lfs:
    • No setup of an extra tool is required, pass one extra option while cloning and that's it
    • The repo itself still contains all of its history, without dependency on an extra LFS server
  • a shallow submodule:
    • Submodules are, in my personal experience, painful to maintain in large projects with many developers.
    • Shallow clones are discouraged (by GitHub) when you intend to pull later on, which is the case here. You could still end up downloading your entire yarn cache repository, if you're unlucky! (https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/)
    • The history of .yarn/cache is intact even though the files themselves may be absent. Commands like git log -- .yarn/cache work as expected.

@conartist6

Is it possible to target the clone filter on anything more precise than file size?

@joepio (Contributor) commented Sep 10, 2022

Is it possible to remove the .yarn/cache directories from all but the latest commits in a repo? That would fix most of the problem, for me at least. My repo is now about 1.5GB, and it isn't even that old. The biggest issue is that playwright is included, which in turn depends on some browsers. Any time we update it, the repo grows by hundreds of megabytes. I love the idea of zero-installs, but only for the latest commits, not for all of the history.

joepio added a commit to joepio/berry that referenced this issue Sep 10, 2022
@sheerun commented Sep 10, 2022 via email

@RDIL (Member) commented Sep 12, 2022

Closing since this is now documented.

@RDIL RDIL closed this as completed Sep 12, 2022
@joepio (Contributor) commented Sep 13, 2022

Closing since this is now documented.

I'd say this issue should stay open until we have an actual solution. Documenting the pitfall does not equal fixing it.

I think we should, at least, find some way to remove folders from git history.

One idea from Stack Overflow seems compelling:

# Make a fresh clone of YOUR_REPO
git clone YOUR_REPO
cd YOUR_REPO

# Create tracking branches of all branches
for remote in `git branch -r | grep -v /HEAD`; do git checkout --track $remote ; done

# Remove DIRECTORY_NAME from all commits, then remove the refs to the old commits
# (repeat these two commands for as many directories that you want to remove)
git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch DIRECTORY_NAME/' --prune-empty --tag-name-filter cat -- --all
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

# Ensure all old refs are fully removed
rm -Rf .git/logs .git/refs/original

# Perform a garbage collection to remove commits with no refs
git gc --prune=all --aggressive

# Force push all branches to overwrite their history
# (use with caution!)
git push origin --all --force
git push origin --tags --force

@arcanis (Member) commented Sep 13, 2022

I think we can do better in the documentation (for example by documenting the remediations, as you mention), but in the absence of a concrete action item I prefer to lock this thread and move this to a Discussion (i.e. https://github.com/yarnpkg/berry/discussions). It makes more sense to use the discussions' threaded format for this, as it will help discuss multiple different options.

Personally though, I feel like Git already kinda solved this problem with partial clones, i.e. git clone --filter=blob:limit=2m.

Locking this thread in the meantime (I don't want to just convert it to a discussion, since the way GH works it would make each post in this issue a separate thread, which would kinda defeat the purpose; feel free to open a new one and link back here).

@yarnpkg yarnpkg locked as resolved and limited conversation to collaborators Sep 13, 2022
@arcanis (Member) commented Sep 13, 2022

Follow-up discussion is here: #4845!
