Conda servers don't seem to cache gzip compressed indexes - limiting download speed to 2-3MB/s (compressed) #637
cc @jezdez
Any progress here? It would be amazing if …
No progress so far; that would be visible here. @barabo, can you look into (or redirect to the appropriate person) what's causing this? This smells like a CDN misconfiguration to me.
I looked into this a bit last week and couldn't find anything obviously wrong in the CDN configuration for the bz2-compressed repodata files. It was curious, however, that I couldn't find any record of mamba user agents downloading the bz2 repodata files. It seems that mamba user agents exclusively download `repodata.json`.

In general, though, I think the team that runs the anaconda.org server prefers that users do not download the bz2 repodata because it takes longer for them to generate it server-side. And since the repodata.json files are generated per-request, they specify that Cloudflare not cache them (nor the bz2 files) (cache status=dynamic).

For the cloned channels (conda-forge, bioconda, pytorch, etc.) there's no problem downloading the bz2 repodata; it's just relatively uncommon for anyone to do it.
@barabo yes, we're never using the bz2 files. This is about the on-the-fly gzip compression. You might be able to cache the gzip output to serve the files faster (just a theory), since the files are not that dynamic (they change every 30 min or so).
Although it might not be so simple, from some googling around: https://community.cloudflare.com/t/how-to-serve-directly-my-brotli-and-gzip-pre-compressed-css-and-js-instead-of-the-cloudflare-compressed-ones/247288/9
Well, I don't know what to say. Cloudflare's docs say that they do compress certain content-types by default, and it looks like the …
But on the other hand, this section of the page suggests that the provided user agent can determine whether gzip or brotli (or both) will be used. I wonder if the mamba user agents aren't recognized by Cloudflare in an optimal way. I don't think I have access to inspect how this is done, though. However, in my local testing, I don't see any performance penalty for using the mamba user agent strings. I'm typically getting 20-40 MB/s down (when downloading the conda-forge linux-64 repodata.json), regardless of which agent string I provide to curl.
Gzip compression does clearly happen, but its on-the-fly nature seems to be the bottleneck. Maybe it would be possible to explicitly add pre-compressed files that the CDN can cache. Alternatively, one could implement Zstd-compressed repodata. Here's another benchmark showing that …
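To make the format comparison above concrete, here is a minimal sketch of such a benchmark using only the Python standard library (gzip and bz2; zstd is omitted because it would require the third-party `zstandard` package). The synthetic payload is an assumption, a stand-in for the real repodata.json, so the numbers only illustrate the comparison, not the thread's measurements.

```python
import bz2
import gzip
import json
import time

# Synthetic, repetitive JSON standing in for repodata.json (package
# metadata is highly repetitive, which is why it compresses so well).
records = {
    f"pkg-{i}-1.0-py_0.tar.bz2": {
        "name": f"pkg-{i}", "version": "1.0", "build": "py_0",
        "depends": ["python >=3.8"], "md5": "0" * 32,
    }
    for i in range(2000)
}
raw = json.dumps({"packages": records}).encode()

for label, compress in [
    ("gzip -6", lambda d: gzip.compress(d, compresslevel=6)),
    ("bzip2 -9", lambda d: bz2.compress(d, compresslevel=9)),
]:
    t0 = time.perf_counter()
    out = compress(raw)
    dt = time.perf_counter() - t0
    print(f"{label}: {len(raw)} -> {len(out)} bytes "
          f"({len(raw) / len(out):.1f}x) in {dt * 1000:.1f} ms")
```

On payloads like this, bzip2 typically compresses smaller but noticeably slower than gzip, which matches the server-side reluctance to generate bz2 per-request.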
This is where the compression happens: https://github.com/conda/conda-index/blob/main/conda_index/index/__init__.py#L798
Thanks @dholth! This is the explicit bz2 compression, whereas the gzip compression discussed here happens on the fly.
I do like zstd and would like to remove bz2 entirely. conda currently looks for bz2 but doesn't use it, which is weird. If we were to support zstd, that function would need to be updated, we'd need a new flag on conda-index, and we'd have to update a glob pattern on our CDN sync.
Also when's conda-forge going to produce .conda by default? |
I'm not quite sure what …
When Anaconda.org and the CDN can support them. IIUC that is currently not the case (happy to learn I'm wrong).
They should be supported now on anaconda.org and the CDN! We should make one and see how it goes. @corneliusroemer the …
Great! 🎉 For a long time that wasn't the case. There's some other work that would need to be done on the conda-forge side first (conda-forge/conda-forge.github.io#1586).
A separate question is how we would go about converting the existing packages to `.conda`.
cc @beckermr |
I would suggest not converting the existing packages. It would be slow, take a lot of disk space, and double the size of repodata.json. |
We need to do a bunch of manual testing for .conda packages before we can roll that out. I've put in PRs in many places where the extension is assumed, but for sure we are going to miss things. I don't have a ton of conda-forge dev time these days, so it has been slow going on my end. |
Maybe we could come up with a list of things to do so folks like Cornelius could help? |
Sure. See the attached issue. We need PRs to conda-smithy next. I will add a few other items. |
I would be willing to invest some time into this.

That said, this discussion is off topic; shall we split it into a separate thread?

Re gzip compression: IIUC the question of on-the-fly vs. cached gzip responses is still open. I can try to find a setup of the Cloudflare CDN that uses cached gzip responses in a personal Cloudflare account and report back here.
I'd like all conda-forge related things to appear in conda-forge repos, so yes, let's move any discussion over there. The next item, @jonashaag, is a PR to smithy to optionally turn on .conda artifacts.
Result from trying out gzip compression on a personal Cloudflare account: you cannot make Cloudflare use cached/pre-compressed responses. It will always re-compress your files. (Maybe caching works for smaller files, not sure.)

So to improve repodata load times we need to request a pre-compressed file (e.g. repodata.json.zst).

For the selection of the compression format, here are some non-scientific compression benchmarks (b = bzip2, g = gzip, z = zstd):
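Since Cloudflare won't serve cached compressed responses, the client-side fix is to prefer a pre-compressed artifact (a static file the CDN can cache) and only fall back to the on-the-fly path. A hypothetical sketch of that selection logic, assuming illustrative function and file names (this is not mamba's actual implementation):

```python
# Static artifact suffixes a channel might publish, in preference order.
# These names are assumptions based on this thread (zstd preferred, bz2 legacy).
PRECOMPRESSED_SUFFIXES = (".zst", ".bz2")

def pick_repodata_request(base_url: str, available: set) -> tuple:
    """Return (url, extra_headers) for fetching channel repodata.

    Pre-compressed files are static, so the CDN can cache them; the plain
    repodata.json with Accept-Encoding: gzip triggers the slow, per-request
    compression observed in this issue (cache status "dynamic").
    """
    for suffix in PRECOMPRESSED_SUFFIXES:
        candidate = base_url + "/repodata.json" + suffix
        if candidate in available:
            return candidate, {}  # static file, no transfer-encoding needed
    # Fall back to on-the-fly gzip of the dynamic JSON.
    return base_url + "/repodata.json", {"Accept-Encoding": "gzip"}

base = "https://conda.anaconda.org/conda-forge/linux-64"
url, headers = pick_repodata_request(base, {base + "/repodata.json.zst"})
print(url)      # the .zst artifact is preferred when available
print(headers)  # no extra headers for the static file
```

The key design point is that the decision is made by URL, not by `Accept-Encoding`, because the latter is exactly what Cloudflare re-compresses on the fly.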
So zstd is the clear winner here. It's interesting that …
@jonashaag Good to run a benchmark with repodata.json! zstd is known to be very performant, so not a huge surprise ;)

Interesting that … I think your … You can get higher compression ratios with … I'll add a decompression speed benchmark on …
So …
Try …
Oh, that's neat! Learned something again! Looks like we'll soon get … Time to add …
🤩🤩🤩 |
Are you already aware of the …
We decided on `zstd -T0 -16` for the repodata.json.zst produced by conda/conda-index. At that level the compression starts to get a bit slow; decompression is very fast at all levels.

Smaller or private channels hosted on https://anaconda.org/conda may have repodata.json generated on-demand; in that case `zstd -1` or `-3` might be warranted.
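The level trade-off described above (high levels for static artifacts, low levels for on-demand generation) can be seen even with stdlib gzip, used here as a stand-in since zstd itself needs the third-party `zstandard` package; the payload is a synthetic assumption:

```python
import gzip
import time

# Repetitive payload standing in for repodata.json.
data = b'{"name": "somepkg", "version": "1.0", "depends": ["python"]}\n' * 5000

# Low vs. high levels, analogous to zstd -1/-3 (fast, on-demand) vs. -16
# (slow compression, smaller static artifact).
for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = gzip.compress(data, compresslevel=level)
    dt = time.perf_counter() - t0
    print(f"gzip -{level}: {len(out)} bytes in {dt * 1000:.2f} ms")
```

Higher levels spend more CPU time for a smaller output, which is worth it for a file compressed once every 30 minutes but not for one generated per-request.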
WOW, on my home internet repodata.json (uncompressed) curls to /dev/null faster than repodata.json (gzip-compressed). Naturally repodata.json.zst beats them all. Thanks for reporting!
What happened?
Using mamba, I noticed that the maximum download speed of compressed package index JSONs is 2-3 MB/s, well below my connection speed.
Notably, downloading uncompressed was faster than telling the server to serve gzip.
This suggests that gzip-compressed files are not cached.
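Whether a response was cached can be checked directly: Cloudflare reports it in the `CF-Cache-Status` response header (visible with `curl -sI <url>`). The status values below are Cloudflare's documented ones; the helper itself is just an illustrative sketch:

```python
# CF-Cache-Status values that mean the CDN served a cached copy
# (Cloudflare also documents MISS, EXPIRED, BYPASS, DYNAMIC, ...).
CACHED_STATUSES = {"HIT", "STALE", "UPDATING", "REVALIDATED"}

def served_from_cache(headers: dict) -> bool:
    """True if CF-Cache-Status indicates the CDN served a cached copy.

    DYNAMIC, which this issue observes for repodata.json, means the
    response was generated per-request and never entered the cache.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("cf-cache-status", "").upper() in CACHED_STATUSES

print(served_from_cache({"CF-Cache-Status": "DYNAMIC"}))  # False: on-the-fly
print(served_from_cache({"cf-cache-status": "HIT"}))      # True: cached file
```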
Additional Context
This was discussed and investigated at length in the mamba issue; please check it for further details: mamba-org/mamba#2021
I was asked to open an issue here by @jakirkham: conda-forge/conda-forge.github.io#1835 (comment)
@jonashaag @wolfv