remote upload: Skip gzip compression for files that are already compressed #161
This avoids some useless work, but mostly it heads off confusion
(e.g. curl without the --compressed option) and quirks of HTTP
clients (e.g. Snakemake's HTTP remote file provider¹) when an
already-compressed file is compressed again and served with a
Content-Encoding: gzip header.²
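To make the behavior concrete, here's a minimal sketch of the idea, not the actual nextstrain-cli code: sniff a file's leading bytes for well-known compression signatures and only gzip it (with a matching Content-Encoding header) when it isn't already compressed. The function names and the direct use of boto3 here are illustrative assumptions.

```python
# Hedged sketch: skip gzip for files that are already compressed.
# upload() and COMPRESSED_MAGICS are hypothetical names, not nextstrain-cli API.
import gzip
from pathlib import Path

import boto3

# Leading bytes of formats we treat as already compressed.
COMPRESSED_MAGICS = [
    b"\x1f\x8b",          # gzip
    b"\xfd7zXZ\x00",      # xz
    b"BZh",               # bzip2
    b"\x28\xb5\x2f\xfd",  # zstd
]

def is_compressed(path: Path) -> bool:
    """Return True if the file starts with a known compression signature."""
    with path.open("rb") as f:
        head = f.read(8)
    return any(head.startswith(magic) for magic in COMPRESSED_MAGICS)

def upload(path: Path, bucket: str, key: str) -> None:
    s3 = boto3.client("s3")
    if is_compressed(path):
        # Already compressed: store the bytes as-is, with no
        # Content-Encoding, so clients receive exactly what we uploaded.
        s3.upload_file(str(path), bucket, key)
    else:
        # Not compressed: gzip once and mark it so HTTP clients can
        # transparently decode.
        body = gzip.compress(path.read_bytes())
        s3.put_object(Bucket=bucket, Key=key, Body=body, ContentEncoding="gzip")
```

Sniffing magic bytes rather than trusting file extensions means a metadata.tsv.gz is skipped even if it were renamed, and a plain metadata.tsv still gets compressed.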
This doesn't come up with Nextstrain dataset and narrative files, but
it does with adjacent input files like metadata.tsv.gz and
sequences.fasta.xz, which we also put in the same S3 buckets (e.g.
s3://nextstrain-data/files/zika/…).³
The remote family of commands isn't intended for generic S3 management
per se, but the commands are often useful in the Nextstrain ecosystem
for managing these ancillary data files. Part of this is that the
commands are handy and available; part of it is that CloudFront
invalidation remains a complication when using aws s3 directly.
Avoiding double compression doesn't take us far out of our way and
helps support this slightly off-label use case.
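For context on the CloudFront complication: after an object is overwritten, edge caches keep serving the stale copy until it's invalidated, and aws s3 alone won't do that for you. A hedged sketch of the extra step with boto3; the distribution id is a placeholder, not a real value.

```python
# Hedged sketch: invalidate cached copies after overwriting objects,
# the step that aws s3 cp on its own doesn't perform.
import time

import boto3

cloudfront = boto3.client("cloudfront")
cloudfront.create_invalidation(
    DistributionId="EXAMPLE_DISTRIBUTION_ID",  # placeholder, not a real id
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/files/zika/*"]},
        "CallerReference": str(time.time()),  # must be unique per request
    },
)
```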
¹ snakemake/snakemake#1508
² https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1647910842228169
³ nextstrain/fauna#114