Conda servers don't seem to cache gzip compressed indexes - limiting download speed to 2-3MB/s (compressed) #637
cc @jezdez
Any progress here? It would be amazing if …
No progress so far; that would be visible here. @barabo, can you look into (or redirect to the appropriate person) what's causing this? This smells like a CDN misconfiguration to me.
I looked into this a bit last week and couldn't find anything obviously wrong in the CDN configuration for the bz2-compressed repodata files. It was curious, however, that I couldn't find any record of mamba user agents downloading the bz2 repodata files. It seems that mamba user agents exclusively download `repodata.json`.

In general, though, I think the team that runs the anaconda.org server prefers that users do not download the bz2 repodata because it takes longer for them to generate it server-side. And since the repodata.json files are generated per-request, they specify that Cloudflare not cache them (nor the bz2 files) (cache status=dynamic).

For the cloned channels (conda-forge, bioconda, pytorch, etc.) there's no problem downloading the bz2 repodata; it's just relatively uncommon for anyone to do it.
@barabo yes, we're never using the bz2 files. This is about the on-the-fly gzip compression. You might be able to cache the gzip output to serve the files faster (just a theory), since the files are not that dynamic (they change every 30 min or so).
Although it might not be so simple, from some googling around: https://community.cloudflare.com/t/how-to-serve-directly-my-brotli-and-gzip-pre-compressed-css-and-js-instead-of-the-cloudflare-compressed-ones/247288/9
Well, I don't know what to say. Cloudflare's docs say that they do compress certain content-types by default, and it looks like the …
But on the other hand, this section of the page suggests that the provided user agent can determine whether gzip or brotli (or both) will be used. I wonder if the mamba user agents aren't recognized by Cloudflare in an optimal way. I don't think I have access to inspect how this is done, though. However, in my local testing, I don't see any performance penalty for using the mamba user agent strings. I'm typically getting 20-40 MB/s down (when downloading the conda-forge linux-64 repodata.json), regardless of which agent string I provide to curl.
Gzip compression does clearly happen, but its on-the-fly nature seems to be the bottleneck. Maybe it would be possible to explicitly add pre-compressed files that the CDN can cache. Alternatively, one could implement Zstd-compressed repodata. Here's another benchmark showing that …
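To make the format comparison above concrete, here is a minimal sketch of such a benchmark using only the Python standard library (gzip and bz2; zstd is omitted because it would require the third-party `zstandard` package). The synthetic payload is an assumption, a stand-in for the real repodata.json, so the numbers only illustrate the comparison, not the thread's measurements.

```python
import bz2
import gzip
import json
import time

# Synthetic, repetitive JSON standing in for repodata.json (package
# metadata is highly repetitive, which is why it compresses so well).
records = {
    f"pkg-{i}-1.0-py_0.tar.bz2": {
        "name": f"pkg-{i}", "version": "1.0", "build": "py_0",
        "depends": ["python >=3.8"], "md5": "0" * 32,
    }
    for i in range(2000)
}
raw = json.dumps({"packages": records}).encode()

for label, compress in [
    ("gzip -6", lambda d: gzip.compress(d, compresslevel=6)),
    ("bzip2 -9", lambda d: bz2.compress(d, compresslevel=9)),
]:
    t0 = time.perf_counter()
    out = compress(raw)
    dt = time.perf_counter() - t0
    print(f"{label}: {len(raw)} -> {len(out)} bytes "
          f"({len(raw) / len(out):.1f}x) in {dt * 1000:.1f} ms")
```

On payloads like this, bzip2 typically compresses smaller but noticeably slower than gzip, which matches the server-side reluctance to generate bz2 per-request.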
This is where the compression happens: https://github.com/conda/conda-index/blob/main/conda_index/index/__init__.py#L798
Thanks @dholth! This is the explicit bz2 compression, whereas the gzip compression discussed here happens on the fly.
I do like zstd and would like to remove bz2 entirely. conda currently looks for bz2 but doesn't use it, which is weird. If we were to support zstd, that function would need to be updated, we'd need a new flag on conda-index, and we'd have to update a glob pattern on our CDN sync.
Also when's conda-forge going to produce .conda by default? |
I'm not quite sure what …
When Anaconda.org and the CDN can support them. IIUC that is currently not the case (happy to learn I'm wrong).
They should be supported now on anaconda.org and the CDN! We should make one and see how it goes. @corneliusroemer the …
Great! 🎉 For a long time that wasn't the case. There's some other work that would need to be done on the conda-forge side first (conda-forge/conda-forge.github.io#1586).
A separate question is how we would go about converting the existing packages to `.conda`.
cc @beckermr |
I would suggest not converting the existing packages. It would be slow, take a lot of disk space, and double the size of repodata.json. |
We need to do a bunch of manual testing for .conda packages before we can roll that out. I've put in PRs in many places where the extension is assumed, but for sure we are going to miss things. I don't have a ton of conda-forge dev time these days, so it has been slow going on my end. |
Maybe we could come up with a list of things to do so folks like Cornelius could help? |
Sure. See the attached issue. We need PRs to conda-smithy next. I will add a few other items. |
I would be willing to invest some time into this.

That said, this discussion is off topic; shall we split it into a separate thread?

Re gzip compression: IIUC the question of on-the-fly vs. cached gzip responses is still open. I can try to find a setup of the Cloudflare CDN that uses cached gzip responses in a personal Cloudflare account and report back here.
I'd like all conda-forge related things to appear in conda-forge repos, so yes, let's move any discussion over there. The next item, @jonashaag, is a PR to smithy to optionally turn on .conda artifacts.
Result from trying out gzip compression on a personal Cloudflare account: you cannot make Cloudflare use cached/pre-compressed responses. It will always re-compress your files. (Maybe caching works for smaller files, not sure.)

So to improve repodata load times we need to request a pre-compressed file (e.g. repodata.json.zst).

For the selection of the compression format, here are some non-scientific compression benchmarks (b = bzip2, g = gzip, z = zstd):
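Since Cloudflare won't serve cached compressed responses, the client-side fix is to prefer a pre-compressed artifact (a static file the CDN can cache) and only fall back to the on-the-fly path. A hypothetical sketch of that selection logic, assuming illustrative function and file names (this is not mamba's actual implementation):

```python
# Static artifact suffixes a channel might publish, in preference order.
# These names are assumptions based on this thread (zstd preferred, bz2 legacy).
PRECOMPRESSED_SUFFIXES = (".zst", ".bz2")

def pick_repodata_request(base_url: str, available: set) -> tuple:
    """Return (url, extra_headers) for fetching channel repodata.

    Pre-compressed files are static, so the CDN can cache them; the plain
    repodata.json with Accept-Encoding: gzip triggers the slow, per-request
    compression observed in this issue (cache status "dynamic").
    """
    for suffix in PRECOMPRESSED_SUFFIXES:
        candidate = base_url + "/repodata.json" + suffix
        if candidate in available:
            return candidate, {}  # static file, no transfer-encoding needed
    # Fall back to on-the-fly gzip of the dynamic JSON.
    return base_url + "/repodata.json", {"Accept-Encoding": "gzip"}

base = "https://conda.anaconda.org/conda-forge/linux-64"
url, headers = pick_repodata_request(base, {base + "/repodata.json.zst"})
print(url)      # the .zst artifact is preferred when available
print(headers)  # no extra headers for the static file
```

The key design point is that the decision is made by URL, not by `Accept-Encoding`, because the latter is exactly what Cloudflare re-compresses on the fly.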
So zstd is the clear winner here. It's interesting that …
@jonashaag Good to run a benchmark with repodata.json! zstd is known to be very performant, so not a huge surprise ;)

Interesting that … I think your … You can get higher compression ratios with … I'll add a decompression speed benchmark on …
So …
Try …
Oh, that's neat! Learned something again! Looks like we'll soon get … Time to add …
🤩🤩🤩 |
Are you already aware of the …
We decided on `zstd -T0 -16` for the repodata.json.zst produced by conda/conda-index. At that level the compression starts to get a bit slow; decompression is very fast at all levels.

Smaller or private channels hosted on https://anaconda.org/conda may have repodata.json generated on-demand; in that case `zstd -1` or `-3` might be warranted.
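The level trade-off described above (high levels for static artifacts, low levels for on-demand generation) can be seen even with stdlib gzip, used here as a stand-in since zstd itself needs the third-party `zstandard` package; the payload is a synthetic assumption:

```python
import gzip
import time

# Repetitive payload standing in for repodata.json.
data = b'{"name": "somepkg", "version": "1.0", "depends": ["python"]}\n' * 5000

# Low vs. high levels, analogous to zstd -1/-3 (fast, on-demand) vs. -16
# (slow compression, smaller static artifact).
for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = gzip.compress(data, compresslevel=level)
    dt = time.perf_counter() - t0
    print(f"gzip -{level}: {len(out)} bytes in {dt * 1000:.2f} ms")
```

Higher levels spend more CPU time for a smaller output, which is worth it for a file compressed once every 30 minutes but not for one generated per-request.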
WOW, on my home internet repodata.json (uncompressed) curls to /dev/null faster than repodata.json (gzip-compressed). Naturally repodata.json.zst beats them all. Thanks for reporting!
What happened?
Using mamba, I noticed that the maximum download speed of compressed package index JSONs is 2-3 MB/s, well below my connection speed.
Notably, downloading uncompressed was faster than telling the server to serve gzip.
This suggests that gzip-compressed files are not cached.
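Whether a response was cached can be checked directly: Cloudflare reports it in the `CF-Cache-Status` response header (visible with `curl -sI <url>`). The status values below are Cloudflare's documented ones; the helper itself is just an illustrative sketch:

```python
# CF-Cache-Status values that mean the CDN served a cached copy
# (Cloudflare also documents MISS, EXPIRED, BYPASS, DYNAMIC, ...).
CACHED_STATUSES = {"HIT", "STALE", "UPDATING", "REVALIDATED"}

def served_from_cache(headers: dict) -> bool:
    """True if CF-Cache-Status indicates the CDN served a cached copy.

    DYNAMIC, which this issue observes for repodata.json, means the
    response was generated per-request and never entered the cache.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("cf-cache-status", "").upper() in CACHED_STATUSES

print(served_from_cache({"CF-Cache-Status": "DYNAMIC"}))  # False: on-the-fly
print(served_from_cache({"cf-cache-status": "HIT"}))      # True: cached file
```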
Additional Context
This was discussed and investigated at length in the mamba issue; please check it for further details: mamba-org/mamba#2021
I was asked to open an issue here by @jakirkham: conda-forge/conda-forge.github.io#1835 (comment)
@jonashaag @wolfv