
[Discussion] Zero-install & repository size #180

Closed
brillout opened this issue May 19, 2019 · 25 comments

Labels
discussion Discussion about the project

Comments

@brillout

I really like the concept of getting rid of the yarn install step for deploying to production.

But I have projects that have 500MB of dependencies. Adding .yarn would dramatically increase the repo size.

The problem is that git hosts have limits on repo size. (E.g. GitHub recommends repo sizes to be <1GB.)

Git LFS could be a solution, but it seems to be fairly expensive.

I'm curious about your thoughts on this.

@arcanis (Member) commented May 19, 2019

At the time I started considering this option I contacted some folks at GitHub to make sure it wouldn't become a problem on their side. From what they told me it should be perfectly fine to host this amount of binary data (at least on GitHub).

What's nice with this approach (and maybe it would be even more true in the wake of the GitHub package registry) is that it's open to various optimizations. For example, assuming that many projects use the same version of the Lodash archive, I would expect GitHub could eventually merge them into a single copy in their "store". From the consumer perspective it wouldn't change a thing, except that their storage wouldn't be affected by the number of packages they use.

Finally, Zero-Install is optional, and as always there's a tradeoff. Smaller libraries might not really need it, and they can just use the common yarn install workflow everyone is used to. Enterprise applications and large projects with many contributors, however, will likely be fine with trading some MBs for the improved DevX and guaranteed stability.

@arcanis arcanis changed the title Zero-install & repository size [Discussion] Zero-install & repository size May 19, 2019
@arcanis arcanis added the discussion Discussion about the project label May 19, 2019
@bgotink (Member) commented May 19, 2019

While cloud-based services like github and bitbucket might support large repositories because they've got a lot of resources, on-prem solutions are more limited. Our enterprise git server becomes totally unresponsive when a new designer is onboarded and clones the designs repo, leading to developer frustration, CI failures and, most importantly, CD failures. While an initial size of ± 250 MB is still okay, this size will increase significantly once we've updated our dependencies a couple of times.

Zero install is still a great feature though. It solves the "I switched branch and now a dependency is missing" problem, it speeds up CI significantly and it lowers our dependence on our on-prem npm registry.

I've been thinking about this for a while now, mostly when trying to sleep, and this is where I'm at:

  • Commit them into git
    • + no install necessary, "clone & go"
    • - the size of the git repository increases significantly upon updating dependencies
    • - on-disk duplication of dependencies between projects
  • Use git-lfs
    • + no install necessary, "clone & go"
    • + size of the git repo stays small
    • ? is this going to work well? we don't have large files but many files (berry's .yarn contains 2,208 zip files; the entire folder is 180MB, so on average roughly 81.5kB per zip); a minimal setup sketch follows this list
    • - on-disk duplication of dependencies between projects
  • Set yarn cache folder to a system folder
    • + size of the git repo stays small
    • + no on-disk duplication of dependencies between projects
    • - install necessary
  • Use a separate git project mounted as .yarn submodule
    • + no install necessary, "clone --recurse-submodules & go"
    • + size of main git repo stays small, dependencies project can be retired and replaced once it becomes too large
    • + dependencies project can be used in multiple projects, sharing dependencies
    • - submodules are hard
    • - on-disk duplication of dependencies between projects, unless you have git fu
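
For the git-lfs option, a minimal setup sketch (assuming git-lfs is installed on every machine and the git host's LFS storage is available; the tracked path matches berry's cache layout):

git lfs install
git lfs track ".yarn/cache/*.zip"
git add .gitattributes .yarn/cache
git commit -m "Track yarn cache through git-lfs"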

@brillout (Author)

A library cache-manager that manages the git submodule would be nice.

It would manage a .cache git submodule to be used by tools such as yarn or parcel.

Symlinks would be taken care of:

$ file .yarn
.yarn: symbolic link to ./cache/yarn
$ file frontend/.parcel
frontend/.parcel: symbolic link to ../cache/parcel

The library would abstract away the git submodule complexity and, from the user perspective, it would just work. The only thing the user would need to do is to save the cache repo address in a file .cacherepo:

$ cat .cacherepo
git@github.com:brillout/my-awesome-app__cache

Every time yarn runs, and prior to any operation, it would call require('cache-manager').init('.yarn/') where .yarn/ is the path of yarn's cache directory. It then automatically sets up the symbolic link and initializes the git submodule.
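
A rough sketch of what that init step might do under the hood (hypothetical logic; the paths follow the symlink example above):

# cache-manager init, sketch only
repo="$(cat .cacherepo)"
# add the cache submodule on first use, update it afterwards
git submodule add "$repo" cache 2>/dev/null || git submodule update --init cache
# point yarn's cache directory at the submodule
mkdir -p cache/yarn
ln -sfn ./cache/yarn .yarn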

It would also occasionally git push --force and squash old commits to reduce the cache repo size.
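
The squash step could look something like this (a destructive sketch; assumes the cache repo's default branch is named master):

cd cache
# replace the whole history with a single commit holding the current tree
git checkout --orphan squashed
git commit -m "Squash cache history"
git push --force origin squashed:master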

If the user doesn't set the .cacherepo file then cache-manager is disabled and no .cache git submodule is created.

It would declutter the code repo while reaping the benefits of zero-install.

@arcanis would yarn be interested in using such a library?

@bgotink what do you think? You seem to have thought a lot about this.

Would be nice to have other tools on board, such as parcel.

@arcanis (Member) commented Jun 11, 2019

It's an interesting idea. The most complicated part would be this:

It would also occasionally git push --force and squash old commits to reduce the cache repo size.

At the moment the cache is implemented within the core and cannot be replaced, but I'd like to offer a way for plugins to replace it with whatever implementation they'd like. Technically it's not too hard; the only subtlety is that, contrary to how plugins currently work, we could only have one cache system at a time.

Under this approach you wouldn't need symlinks, etc.; your cache implementation would just use the submodule as it is.

@sheerun commented Jun 14, 2019

Heh. Let's do some quick math on what to expect with this feature:

  1. Say a project has 250MB of dependencies (not uncommon)
  2. Say during the project's history dependencies are updated 20 times (not uncommon)

It means that to clone this project you need to download around 5GB of data (20 updates × 250MB), versus downloading only 250MB (possibly cached from other projects) with yarn install. I didn't even mention branches.

Other concerning things:

  1. Cloning dependencies from GitHub is far less performant than downloading them from a CDN that lives 10ms from you instead of on the other side of the ocean. Git is also not perfect at parallelising these downloads.
  2. People who worked in data science and tried to put "big" datasets in git repositories know git operations become slow when there are big files committed: checkout, merge, rebase, you name it. It quickly becomes annoying.
  3. When working with monorepos you don't always need all dependencies of all projects available. It's quicker to clone the monorepo and install only what you need.
  4. For production you don't need devDependencies, only production dependencies.

So in short it shouldn't be named "zero-install", but "Install with git instead of yarn all dependencies that any project in this repository ever historically used, also install devDependencies even if you don't need them".

@arcanis (Member) commented Jun 14, 2019

Your analysis isn't much better than a guess. There are various factors at play. For one, you assume that each and every package will be upgraded 20 times. In my experience this is rarely the case. Yarn (and even more so the v2) is pretty good at reusing packages during upgrades. Upgrading from Webpack 3 to Webpack 4 adds only 80 packages, compared to the ~400 packages that are part of Webpack itself.

Of course, there's no denying that a git clone will be slower with more data, but it's still faster than a clone plus an install, especially when you factor in how many times per day you clone versus how many times you install. The balance might tip at some point, but it remains to be seen how long that takes in practical cases, and whether the possible solutions alleviate the issue.

People who worked in data science and tried to put "big" datasets in git repositories know git operations become slow when there are big files committed: checkout, merge, rebase, you name it. It quickly becomes annoying.

The Zero Install approach originates from another feature, the Offline Mirror, which follows the exact same principle except that the installs still need to be run. It was released in 2016. Since then, we have never once heard that this feature was causing issues. In fact, not only did we hear the exact opposite, I even witnessed it myself by working on such a codebase. So why does it work well?

  • Your tradeoffs are not everyone's tradeoffs. A large repository might be a cost that you don't wish to pay, and that's ok, but for someone else this might not be true. Deployment stability and developer experience are two areas that are typically very hard to scale, whereas repository size is easily measurable and optimizable. And git clone --depth is a thing, too. Not perfect, but as the incentives shift, so does the tooling.

  • Both the offline mirror and the zero-install approach are completely optional. If you are in the case I mentioned and don't wish to pay the cost, just put enableGlobalCache: true in the yarnrc file at the root of the repository and you won't ever have to mind it again. It's a default, not a requirement.

So in short it shouldn't be named "zero-install", but "Install with git instead of yarn all dependencies that any project in this repository ever historically used, also install devDependencies even if you don't need them".

While not directly related, I don't find it productive or very ethical to post FUD on Twitter without even waiting to hear what others have to say about your findings.

@sheerun commented Jun 14, 2019

I guess this is a good feature for Facebook-sized private repositories, where everyone is on the same page about git clone --depth and there are optimisations in place for the whole team for handling enormous repositories, but I would be annoyed if I found someone using this feature in the wild, for all the reasons I've mentioned.

Offline Mirror is a useful feature, and I don't think it's comparable, because you don't need to create the mirror directory inside the project or commit it (e.g. you can upload it to and download it from S3 for production). On the other hand it seems that "zero-install" will encourage committing big files into repositories.

Exact numbers for my analysis don't matter, because downloading even 2x more dependencies than necessary is not good. Also, you cannot avoid downloading devDependencies for production even if you use git clone --depth, and usually they weigh more than production dependencies.

You're right about the message on Twitter; I should have waited at least until you answered. Unfortunately all of my arguments still hold, and I posted about it because I would find it harmful if someone decided to do something like this on a public repository. I guess it would be fine if this feature could be enabled only with "private": true, because in that case I don't care.

@arcanis (Member) commented Jun 14, 2019

I mean, some of your points make sense, and we don't necessarily have answers to all of them. Still, my opinion, based on the people I've discussed this with, is that Yarn caters to two different audiences: independent developers, and companies. The two don't always have the same needs, and having options for both is important.

Overall, I think we agree that the feature makes sense but the messaging should be made more clear. I think a table like the one @bgotink started (with the various pros and cons) would be a good addition to the documentation (maybe on a separate page, for example behind a "Should I use Zero-Install?" link). If you're willing to give us a hand, we'd be happy to review such a PR! 🙂

@sheerun commented Jun 14, 2019

I also agree. One more comment: I think one of the reasons why Yarn implemented this feature is to somewhat decentralise package management (a good cause) by committing all code, including dependencies, into git repositories, but I think it could backfire, because some operations on such repositories would be very hard to perform without a centralized service like GitHub (for example, git blame or git log -p -- package.json needs the whole history downloaded).

@arcanis (Member) commented Jun 24, 2019

Some quick data obtained from the Berry repository (which is about 6 months old). The size on-disk of the cloned repo is 253M. After running a tree filter plus an aggressive gc, the size went down to 149M. The size of the cache before being purged is 88M. That would give 16M of extraneous data (the 104M stripped from history, minus the 88M of live cache: ~6%).

It would be interesting to run a similar experiment on a production application 🤔

@sheerun commented Jun 24, 2019

I think you might not have pruned all zip files from the berry repository. Here's how to do it properly:

git clone --mirror https://github.com/yarnpkg/berry
cd berry.git
du -sh .
161M

Then download the bfg tool: https://rtyley.github.io/bfg-repo-cleaner/

java -jar ~/Downloads/bfg-1.13.0.jar --delete-files '*.zip' --no-blob-protection .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
du -sh .
17M	.

So the overhead seems to be roughly 950% (161M vs 17M).

If you do just a shallow clone, the repository is 97MB:

git clone --mirror https://github.com/yarnpkg/berry --depth 1
cd berry.git
du -sh .
97M	.

It means a full clone downloads an extra 64MB of historical .zip dependencies (161M - 97M) on top of 80MB of current dependencies (97M - 17M).

@bgotink (Member) commented Jun 26, 2019

Some numbers from an angular repo at work:

# initial zero-install at angular 7
$ du -sh .git
108M	.git
$ du -sh .yarn
 90M	.yarn

# after updating to angular 8
$ du -sh .git
167M	.git
$ du -sh .yarn
110M	.yarn

Adding 60MB per upgrade is too much for us to safely commit into our repository. The repo would grow by at least 200 MB per year, and this is the repo with the smallest number of dependencies (it's the root of our internal stack; all other repos depend on packages from this repo). Internal dependencies especially are updated often, so I'd expect the number to be a lot more than 200MB/year for some of the other repos.

@arcanis (Member) commented Jul 22, 2019

I feel like that is being erased

Zero-install is optional. If you don't like it, don't use it. It's a bit like ranting against this new weapon called "swords" because it's easier to cut your own fingers with it than with a club.

We're not stuck in the old days anymore where npm could take several minutes to install. The modern versions of the package managers that we use are fast, especially when you have cached versions of your dependencies on your system already.

I've worked on package setups for the past 2+ years as my daily job. I've seen situations where cached installs still amounted, in aggregate, to more than twenty-four hours a day. What's the cost in terms of feedback loop? What's the cost in actual CPU time? What's the cost in build failures caused by bugs during yarn install?

Additionally, Yarn's main selling point, perhaps above its speed, is the stability of its builds. We aim to guarantee you zero surprises. With yarn install this is only partially true, because it's quite possible that you'll forget to run an install and unknowingly compile your code against bogus dependencies. Zero-installs are a way to push back the theoretical limits of this statement: you can now ensure that the state of your project is always right, regardless of where you are in the history.

Finally, whether you share those concerns isn't really the point: Yarn is used by millions nowadays. We're used by small companies, by medium companies, by very large companies. They have different use cases, and sometimes need different solutions. I believe this one is applicable to many scenarios (we're using it to develop Yarn itself, and I think it has proved great so far), but maybe you're not the target.

@aslilac commented Jul 23, 2019

At the very least, I currently only see a way to opt out on a per-project basis. Will it be possible to have a system-wide, or at least workspace-wide, opt-out?

@arcanis (Member) commented Jul 23, 2019

Sure. Just put a ~/.yarnrc.yml file with enableGlobalCache: true, and Yarn will behave the same as before (except that it'll still be using PnP).
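
For example, as a one-liner (appends to any existing config):

printf 'enableGlobalCache: true\n' >> ~/.yarnrc.yml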

@peey commented Oct 30, 2021

The size on-disk of the cloned repo is 253M.

As of today, the size for a fresh clone is 1.28 GB whereas downloading the repo (without git history) is 79 MB.

I like the idea of a self-contained project within 200 MB, but it seems that how yarn stores cached dependencies (or how library authors publish new versions, or maybe just the fact that libraries change a lot) doesn't play well with git's deduplication efforts. I think this problem needs further attention to identify the bottleneck.

This poses a practical problem: while it's zero-installs once you have the repo, getting the repo itself may become more and more challenging as it accumulates cruft from old deps.

Zero-install is optional. If you don't like it, don't use it.

This makes sense, but it'd be nice if yarn issued a best-practices guide for the community at large, so that contributors who prefer it one way (or are bandwidth/disk constrained) find it easy to contribute to repositories that prefer it the other way.

Maybe git-lfs could solve the use case where you're not interested in checking out previous versions of source code (and of cached dependencies). Maybe something else. I think #180 (comment) provides quite an insightful comparison.

It would be good to have a discussion of how future-proof the feature is.

@generalov commented Nov 28, 2021

Perhaps centralised version control systems (like TFS) are more suitable for this approach out of the box. Unlike Git, a centralized VCS client downloads just the latest version of the files to your local machine, keeping all historical data on the server. There, the size of the repository is an infrastructure-level problem rather than yours.

As for Git, a solution might be to put your cache directory into a "shallow submodule" (that is, fetch the submodule using the --depth 1 argument). This can be configured per repository:
https://stackoverflow.com/questions/2144406/how-to-make-shallow-git-submodules
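
A minimal sketch of that setup (the URL and the deps-cache path are placeholders):

# declare the submodule and mark it as shallow in .gitmodules
git submodule add https://example.com/my-app-cache.git deps-cache
git config -f .gitmodules submodule.deps-cache.shallow true
git commit -m "Add shallow cache submodule" .gitmodules deps-cache

# consumers then clone with shallow submodules
git clone --recurse-submodules --shallow-submodules https://example.com/my-app.git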

@bgotink (Member) commented Nov 28, 2021

Since this ticket was opened, git has made improvements to cloning large repositories, most importantly the introduction of clone filters. This is how I cloned yarn recently:

git clone git@github.com:yarnpkg/berry.git --filter blob:limit=200k

The --filter blob:limit=200k option tells git to only download blobs (files) smaller than 200 KiB. I've chosen 200k because it should be plenty to include all source files (200 KiB is a file of 2,048 lines of 100 characters each, which is bigger than anything yarn has in the repo) but not so big as to include the huge cached packages in .yarn/cache. Files excluded by the filter are downloaded when necessary, so you do sometimes need the network when checking out a branch or tag.

I initially had a clone with --filter blob:none to only download the necessary files, but that meant the GitLens extension I run in VS Code constantly required internet access to properly blame files.

The clone (.git folder only) on my machine is 345 MB, compared to the 1.3 GB a full clone fetches.

Only fetching blobs when needed (blob:none) yields a repo with a .git folder of 317 MB. That's not much of an improvement, especially given the downside of having to access the network a lot more when opening source files in VS Code.

Including all source files while excluding large cache files seems like a good compromise between keeping the repo size in check and ensuring enough history is downloaded for tools to work properly.

GitHub, Bitbucket (hosted and server), and GitLab support clone filters.
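
For reference, the two variants compared above, both standard git partial-clone filters:

git clone --filter=blob:none git@github.com:yarnpkg/berry.git        # fetch blobs only on demand
git clone --filter=blob:limit=200k git@github.com:yarnpkg/berry.git  # skip only the large blobs
du -sh berry/.git                                                    # compare the resulting sizes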

Some benefits of clone filters compared to

  • git lfs:
    • No setup of an extra tool is required, pass one extra option while cloning and that's it
    • The repo itself still contains all of its history, without dependency on an extra LFS server
  • a shallow submodule:
    • Submodules are, in my personal experience, painful to maintain in large projects with many developers.
    • Shallow clones are discouraged (by GitHub) when you intend to pull later on, which is the case here. You could still end up downloading your entire yarn cache repository, if you're unlucky! (https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/)
    • The history of .yarn/cache is intact even though the files themselves may be absent. Commands like git log -- .yarn/cache work as expected.

@conartist6

Is it possible to target the clone filter on anything more precise than file size?

@joepio (Contributor) commented Sep 10, 2022

Is it possible to remove the .yarn/cache directories from all but the latest commits in a repo? That would fix most of the problem, for me at least. My repo is now about 1.5GB, and it isn't even that old. The biggest issue is that playwright is included, which in turn depends on some browsers. Any time we update it, the repo grows by hundreds of megabytes. I love the idea of zero-installs, but only for the latest commits, not for all of the history.

joepio added a commit to joepio/berry that referenced this issue Sep 10, 2022
@sheerun commented Sep 10, 2022 via email

@RDIL (Member) commented Sep 12, 2022

Closing since this is now documented.

@RDIL RDIL closed this as completed Sep 12, 2022
@joepio (Contributor) commented Sep 13, 2022

Closing since this is now documented.

I'd say this issue should stay open until we have an actual solution. Documenting the pitfall does not equal fixing it.

I think we should, at least, find some way to remove folders from git history.

One idea from Stack Overflow seems compelling:

# Make a fresh clone of YOUR_REPO
git clone YOUR_REPO
cd YOUR_REPO

# Create tracking branches of all branches
for remote in `git branch -r | grep -v /HEAD`; do git checkout --track $remote ; done

# Remove DIRECTORY_NAME from all commits, then remove the refs to the old commits
# (repeat these two commands for as many directories that you want to remove)
git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch DIRECTORY_NAME/' --prune-empty --tag-name-filter cat -- --all
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

# Ensure all old refs are fully removed
rm -Rf .git/logs .git/refs/original

# Perform a garbage collection to remove commits with no refs
git gc --prune=all --aggressive

# Force push all branches to overwrite their history
# (use with caution!)
git push origin --all --force
git push origin --tags --force

@arcanis (Member) commented Sep 13, 2022

I think we can do better in the documentation (for example by documenting the remediations, as you mention), but in the absence of a concrete action item I prefer to lock this thread and move this to a Discussion (i.e. https://github.com/yarnpkg/berry/discussions). It makes more sense to use the discussions' threaded format for this, as it will help discuss multiple different options.

Personally though, I feel like Git already kinda solved this problem with partial clones, i.e. git clone --filter=blob:limit=2m.

Locking this thread in the meantime (I don't want to just convert it to a discussion, since the way GH works it would make each post in this issue a separate thread, which would kinda defeat the purpose; feel free to open a new one and link back here).

@yarnpkg yarnpkg locked as resolved and limited conversation to collaborators Sep 13, 2022
@arcanis (Member) commented Sep 13, 2022

Follow-up discussion is here: #4845!
