[Discussion] Zero-install & repository size #180
At the time I started considering this option I contacted some folks at GitHub to make sure it wouldn't become a problem on their side. From what they told me it should be perfectly fine to host this amount of binary data (at least on GitHub). What's nice with this approach (and maybe it would be even more true in the wake of the GitHub package registry) is that it's open to various optimizations. For example, assuming that many projects use the same version of the Lodash archive, I would assume GitHub could eventually merge them into a single version in their "store". From the consumer perspective it wouldn't change a thing, except that their storage wouldn't be affected by the number of packages they use. Finally, Zero-Install is optional and as always there's a tradeoff. Smaller libraries might not really need it, and they can just use the common |
While cloud-based services like GitHub and Bitbucket might support large repositories because they've got a lot of resources, on-prem solutions are more limited. Our enterprise git server becomes totally unresponsive when a new designer is onboarded and clones the designs repo, leading to developer frustration, CI failures and—most importantly—CD failures. While an initial size of ± 250 MB is still okay, this size will increase significantly once we've updated our dependencies a couple of times. Zero-install is still a great feature though. It solves the "I switched branch and now a dependency is missing" problem, it speeds up CI significantly, and it lowers our dependence on our on-prem npm registry. I've been thinking about this for a while now, mostly when trying to sleep, and this is where I'm at:
|
The idea: a small library that manages a cache repository as a git submodule. Symlinks would be taken care of:

$ file .yarn
.yarn: symbolic link to ./cache/yarn
$ file frontend/.parcel
frontend/.parcel: symbolic link to ../cache/parcel

The library would abstract away the git submodule complexity and, from the user perspective, it would just work. The only thing the user would need to do is to save the cache repo address in a file:

$ cat .cacherepo
git@github.com:brillout/my-awesome-app__cache

It would declutter the code repo while reaping the benefits of zero-install. @arcanis would yarn be interested in using such a library? @bgotink what do you think? You seem to have thought a lot about this. Would be nice to have other tools on board, such as parcel. |
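The wrapper could be quite thin. Below is a self-contained sketch of what it might do, using a local directory as a stand-in for the hosted cache repo; all names (`cache-src`, `.cacherepo`, the `cache` submodule path) are illustrative, not an actual yarn or parcel API:

```shell
#!/bin/sh
# Hypothetical sketch of the proposed wrapper: read the cache repo address
# from .cacherepo, wire it up as a submodule under ./cache, and symlink the
# tool caches into it. Everything runs against local throwaway repos.
set -e
work=$(mktemp -d); cd "$work"

# Stand-in for the hosted cache repository (a GitHub remote in practice)
git init -q cache-src && cd cache-src
mkdir yarn && touch yarn/.gitkeep
git add . && git -c user.email=ci@example -c user.name=ci commit -qm 'seed cache'
cd ..

# The application repository, holding only the cache repo's address
git init -q app && cd app
git -c user.email=ci@example -c user.name=ci commit -q --allow-empty -m init
echo "$work/cache-src" > .cacherepo

# What the wrapper would do on every install:
git -c protocol.file.allow=always submodule --quiet add "$(cat .cacherepo)" cache
ln -sfn ./cache/yarn .yarn     # .yarn now resolves inside the submodule

readlink .yarn                 # prints ./cache/yarn
```

The user-facing surface stays tiny: one address file, and the symlinks make the submodule location transparent to yarn and friends.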
It's an interesting idea. The most complicated part would be this:
At the moment the cache is implemented within the core and cannot be replaced, but I'd like to offer a way for plugins to replace it by whatever implementation they'd like. It's not too hard technically, the only subtlety is that contrary to how plugins currently work we could only have one cache system at a time. Under this approach you wouldn't need symlinks etc - your cache implementation would just use the submodule as it is. |
Heh. Let's do some quick math on what to expect with this feature:
It means that to clone this project you need to download around 5GB of data, versus downloading only 250MB (possibly cached from other projects) with a regular install. Other concerning things:
So in short it shouldn't be named "zero-install", but "Install with git instead of yarn all dependencies that any project in this repository ever historically used, also install devDependencies even if you don't need them". |
Your analysis isn't much better than a guess. There are various factors in play. For one, you assume that each and every package will be upgraded 20 times. From my experience this is rarely the case. Yarn (and even more so the v2) is pretty good at reusing packages during upgrades. Upgrading from Webpack 3 to Webpack 4 yields an addition of only 80 packages, compared to the ~400 that are part of Webpack itself. Of course, there's no denying that a git clone will be slower with more data - but it's still faster than a clone plus an install - especially when you factor in the number of times you clone per day versus the number of times you run an install. The balance might tip at some point, but it remains to be seen how much time it takes in practical cases, and whether the possible solutions alleviate the issue.
The Zero-Install approach originates from another feature, the Offline Mirror, which follows the exact same principle except that the installs still need to be run. It got released in 2016. Since then, we never heard a single time that this feature was causing issues. In fact, not only did we hear the exact opposite, but I even witnessed it myself by working on such a codebase. So why does it work well?
While not directly related, I don't find it productive or very ethical to post FUD on Twitter without even waiting to hear what others have to say about your findings. |
I guess this is a good feature for Facebook-sized private repositories where everyone is on the same page about it. Offline Mirror is a useful feature, and I don't think it's comparable, because you don't need to create the mirror directory inside the project or commit it (e.g. you can upload it to and download it from S3 for production). On the other hand it seems that "zero-install" will encourage committing big files into repositories. Exact numbers for my analysis don't matter, because downloading even 2x more dependencies than necessary is not good. Also you cannot avoid downloading devDependencies you don't need. You're right about the message on Twitter - I should have waited at least until you answered. Unfortunately all of my arguments still hold, and I posted about it because I would find it harmful if someone decided to do something like this on a public repository. I guess it would be fine if this feature could be enabled only with |
I mean, some of your points make sense, and we don't necessarily have answers to all of them. Still, my opinion based on the people I discussed with is that Yarn caters to two different audiences: independent developers, and companies. The two don't always have the same needs, and having options for both is important. Overall, I think we agree that the feature makes sense but the messaging should be made more clear. I think a table like the one @bgotink started (with the various pros and cons) would be a good addition to the documentation (maybe on a separate page, for example behind a "Should I use Zero-Install?" link). If you're willing to give us a hand, we'd be happy to review such a PR! 🙂 |
I also agree. One more comment: I think one of the reasons why Yarn implemented this feature is to somehow decentralise package management (a good cause) by committing all code including dependencies into git repositories, but I think it could backfire, because some operations on such repositories would be very hard to perform without a centralized service like GitHub (for example |
Some quick data obtained from the Berry repository (which is about 6 months old). The size on-disk of the cloned repo is 253M. After running a tree filter + an aggressive gc, the size went down to 149M. The size of the cache before being purged is 88M. That would give 16M of extraneous data (~6%). It would be interesting to run a similar experiment on a production application 🤔 |
I think you might not have pruned all zip files from the berry repository. Here's how to properly do it:
Then download bfg tool: https://rtyley.github.io/bfg-repo-cleaner/
So the overhead seems to be around 950%. If you do just a shallow clone, the repository is 97MB:
It means a full clone downloads an extra 64MB of historical .zip dependencies and 80MB of current dependencies. |
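For intuition, the gap between full and shallow clones is easy to reproduce locally. This is just a toy sketch with made-up sizes, not the berry repo: a history that keeps replacing a large binary (a stand-in for updated `.yarn/cache` archives) clones far smaller with `--depth 1`, since random data neither compresses nor deltas well.

```shell
#!/bin/sh
# Toy reproduction: five commits each replacing a 1MB binary. A full clone
# carries all five blobs; a shallow clone only carries the latest one.
set -e
work=$(mktemp -d); cd "$work"
git init -q src && cd src
for i in 1 2 3 4 5; do
  head -c 1048576 /dev/urandom > dep.zip    # "updated" 1MB dependency archive
  git add dep.zip
  git -c user.email=ci@example -c user.name=ci commit -qm "update dep ($i)"
done
cd ..
git clone -q src full                              # full history: ~5MB of blobs
git clone -q --depth 1 "file://$work/src" shallow  # latest commit only: ~1MB
du -sk full/.git shallow/.git                      # shallow is roughly 1/5 of full
```

The `file://` URL forces git to use its real transport (plain local paths ignore `--depth`), which is what makes the shallow numbers honest.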
Some numbers from an angular repo at work: # initial zero-install at angular 7
$ du -sh .git
108M .git
$ du -sh .yarn
90M .yarn
# after updating to angular 8
$ du -sh .git
167M .git
$ du -sh .yarn
110M .yarn

Adding 60MB per upgrade is too much for us to safely commit it into our repository. The repo would grow by at least 200 MB per year, and this is only our repo with the smallest number of dependencies (it's the root of our internal stack; all other repos depend on packages from this repo). |
Zero-install is optional. If you don't like it, don't use it. It's a bit like ranting against this new weapon called "swords" because it's easier to cut your own fingers with it than with a club.
I worked on package setups for the past 2+ years as my daily job. I saw situations where cached installs still amounted to more than twenty-four hours a day. What's the cost in terms of feedback loop? What's the cost in actual CPU time? What's the cost in build failures because of bugs during `yarn install`? Additionally, Yarn's main selling point, perhaps above its speed, is the stability of its builds. We aim to guarantee you zero surprises, and zero-installs push that guarantee further. Finally, whether you share those concerns isn't really the point - Yarn is used by millions nowadays. We're used by small companies, by medium companies, by very large companies. They have different use cases, and sometimes need different solutions. I believe this one is applicable to many scenarios (we're using it developing Yarn itself, and I think it has proved great so far), but maybe you're not the target. |
At the very least, I currently only see a way to opt out on a per-project basis. Will it be possible to have a system-wide or at least workspace-wide opt-out? |
Sure. Just put a |
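For readers landing here later: with today's Yarn, one machine-wide escape hatch is the global cache. A hedged sketch, assuming the `enableGlobalCache` setting (which stores archives in a shared per-user folder instead of each project's `.yarn/cache`):

```
# ~/.yarnrc.yml — user-level settings apply to every project on the machine
enableGlobalCache: true
```

Projects that want committed caches can still override this in their own `.yarnrc.yml`, since project settings take precedence over home-folder ones.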
As of today, the size of a fresh clone is … I like the idea of a self-contained project within 200 MB, but it seems that perhaps how yarn stores cached dependencies (or how library authors publish new versions, or maybe just the fact that there are a lot of changes in libraries) doesn't play well with git's de-duplication efforts. I think this problem might need further attention to identify the bottleneck. This poses a practical problem: while it's zero-install once you get the repo, getting the repo itself might become more and more challenging as it accumulates cruft from old deps.
This makes sense, but it'd be nice if yarn issued a best-practices guide for the community at large, so that contributors who prefer it one way (or are bandwidth/disk constrained) find it easy to contribute to repositories that prefer it the other way. Maybe git-lfs could solve the use case where you're not interested in checking out previous versions of source code (and of cached dependencies). Maybe something else. I think #180 (comment) provides quite an insightful comparison. It would be good to have a discussion of how future-proof the feature is. |
Perhaps centralised version control systems (like TFS) are more suitable for this approach out of the box. Unlike Git, a cVCS client downloads just the latest version of files to your local machine, keeping all historical data on the server. There, the size of the repository is an infrastructure-level problem rather than yours. As for Git, a solution might be to put your cache directory into a "shallow submodule" (that is, the same as fetching the submodule using |
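The shallow-submodule route can be sketched end to end with local throwaway repos (illustrative only; a real setup would point at a hosted cache repo):

```shell
#!/bin/sh
# Demo of shallow submodules: the app records the cache repo as a submodule,
# and consumers clone with --shallow-submodules so only the cache's latest
# commit is downloaded, however long its history grows.
set -e
work=$(mktemp -d); cd "$work"

# A cache repo with a bit of history
git init -q cache && cd cache
for i in 1 2 3; do
  echo "archive $i" > dep.txt
  git add . && git -c user.email=ci@example -c user.name=ci commit -qm "v$i"
done
cd ..

# The app repo referencing the cache as a submodule
git init -q app && cd app
git -c protocol.file.allow=always submodule --quiet add "file://$work/cache" cache
git -c user.email=ci@example -c user.name=ci commit -qm 'add cache submodule'
cd ..

# Consumers fetch only the latest cache commit
git -c protocol.file.allow=always clone -q --recurse-submodules --shallow-submodules \
  "file://$work/app" consumer
git -C consumer/cache rev-list --count HEAD   # prints 1 (vs 3 upstream)
```

One caveat worth knowing: shallow submodule fetches only work reliably when the recorded commit sits at the tip of a branch on the server, which holds for an always-advancing cache repo.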
Since the opening of this ticket git has made improvements to cloning large repositories, most importantly the introduction of clone filters. This is how I cloned yarn recently:

git clone git@github.com:yarnpkg/berry.git --filter blob:limit=200k

The filter tells the server to omit blobs larger than 200kB from the clone; git fetches them on demand when a checkout actually needs them. GitHub, Bitbucket (hosted and server), and GitLab support clone filters. Some benefits of clone filters compared to
|
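The behaviour is easy to verify locally. A sketch (note the serving repo needs `uploadpack.allowFilter`, which the hosted forges already enable):

```shell
#!/bin/sh
# Local demo of --filter=blob:limit: large blobs are omitted from the clone
# and would only be fetched lazily when a checkout needs them.
set -e
work=$(mktemp -d); cd "$work"
git init -q src && cd src
git config uploadpack.allowFilter true        # hosted forges enable this already
head -c 1048576 /dev/urandom > big.zip        # 1MB stand-in for a cached dependency
echo hello > small.txt
git add . && git -c user.email=ci@example -c user.name=ci commit -qm seed
cd ..
git clone -q --no-checkout --filter=blob:limit=200k "file://$work/src" filtered
# The big blob is a "missing" promisor object in the filtered clone
git -C filtered rev-list --objects --all --missing=print | grep -c '^?'
```

`--missing=print` marks each absent object with a `?` without triggering a lazy fetch, so the final command counts exactly the blobs the filter kept out, here the single `big.zip` blob.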
Is it possible to target the clone filter on anything more precise than filesize? |
Is it possible to remove the .yarn/cache directories from all but the latest commits in a repo? That would fix most of the problem, for me at least. My repo is now about 1.5GB, and it isn't that old yet. The biggest issue is that playwright is included, which in turn depends on some browsers. Anytime we update that, the repo grows by hundreds of megabytes. I love the idea of zero-installs, but only for the latest commits - not for all of the history. |
Not without rewriting the whole history of this repo, or creating a new one.
For future commits git lfs can be used, but it's not zero-install yet, and
I'm not aware of a proposal to make it so.
On Sat, 10 Sep 2022 at 09:30 Joep Meindertsma ***@***.***> wrote:
Is it possible to remove the .yarn/cache directories from all but the
latest commits in a repo? That would fix most of the problem, for me at
least. My repo is now about 1.5GB, and it isn't that old yet. The biggest
issue is that playwright is included, which in turn depends on some
browsers. Anytime we update that, the repo grows with hundreds of
megabytes. I love the idea of zero-installs, but only for the latest
commits - not for all of the history.
--
Best Regards,
Adam Stankiewicz
|
Closing since this is now documented. |
I'd say this issue should stay open until we have an actual solution. Documenting the pitfall does not equal fixing it. I think we should, at least, find some way to remove folders from git history. One idea from Stack Overflow seems compelling:

# Make a fresh clone of YOUR_REPO
git clone YOUR_REPO
cd YOUR_REPO
# Create tracking branches of all branches
for remote in `git branch -r | grep -v /HEAD`; do git checkout --track $remote ; done
# Remove DIRECTORY_NAME from all commits, then remove the refs to the old commits
# (repeat these two commands for as many directories that you want to remove)
git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch DIRECTORY_NAME/' --prune-empty --tag-name-filter cat -- --all
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
# Ensure all old refs are fully removed
rm -Rf .git/logs .git/refs/original
# Perform a garbage collection to remove commits with no refs
git gc --prune=now --aggressive
# Force push all branches to overwrite their history
# (use with caution!)
git push origin --all --force
git push origin --tags --force |
I think we can do better in the documentation (for example documenting the remediations, as you mention), but in the absence of a concrete action item I prefer to lock this thread and move this to Discussions (ie https://github.com/yarnpkg/berry/discussions). It makes more sense to use the discussions' threaded format for this, as it will help discuss multiple different options. Personally though, I feel like Git already kinda solved this problem with partial clones (ie clone filters). Locking this thread in the meantime (I don't want to just convert it to a discussion, since the way GH works it would make each post in this issue a separate thread, which would kinda defeat the purpose; feel free to open a new one and link back here). |
Follow-up discussion is there: #4845! |
I really like the concept of getting rid of the `yarn install` step for deploying to production. But I have projects that have 500MB of dependencies. Adding `.yarn` would dramatically increase the repo size. The problem is that git hosts have limits on repo size. (E.g. GitHub recommends repo sizes to be <1GB.)
Git LFS could be a solution but it seems to be fairly expensive.
I'm curious about your thoughts on this.