remote upload: Skip gzip compression for files that are already compressed #161

Merged
merged 1 commit into master from trs/skip-compression-if-already-compressed on Mar 22, 2022

Conversation

@tsibley (Member) commented Mar 22, 2022

This avoids some useless work but mostly serves to head off confusion
(e.g. curl without the --compressed option) and/or quirks of HTTP
clients (e.g. Snakemake's HTTP remote file provider¹) when a compressed
file is compressed again and served with a Content-Encoding: gzip
header.²

This doesn't come up with Nextstrain dataset and narrative files but
does with adjacent input files like metadata.tsv.gz and
sequences.fasta.xz which we also put in the S3 buckets (e.g.
s3://nextstrain-data/files/zika/…).³

The remote family of commands is not intended for generic S3 management
per se, but these commands are often useful in the Nextstrain ecosystem
for managing these ancillary data files. Part of this is that the
commands are handy and available; part of it is that CloudFront
invalidation remains a complication when using `aws s3` directly.
Avoiding double compression doesn't take us far out of our way and
helps support this slightly off-label use case.

¹ snakemake/snakemake#1508
² https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1647910842228169
³ nextstrain/fauna#114
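
For context, the guardrail amounts to a check like the one below before gzip-compressing a file for upload. This is a minimal sketch, not the Nextstrain CLI's actual code: the helper names are hypothetical, and the magic-byte table just covers the common formats mentioned here (gzip, xz) plus bzip2 and zstd.

```python
import gzip
from pathlib import Path

# Magic bytes of formats treated as "already compressed"; for these we skip
# the extra gzip pass and the Content-Encoding: gzip header entirely.
COMPRESSED_MAGIC = (
    b"\x1f\x8b",          # gzip
    b"\xfd7zXZ\x00",      # xz
    b"BZh",               # bzip2
    b"\x28\xb5\x2f\xfd",  # zstd
)

def already_compressed(path: Path) -> bool:
    """Return True if the file starts with a known compression magic number."""
    with open(path, "rb") as f:
        header = f.read(6)
    return header.startswith(COMPRESSED_MAGIC)  # startswith accepts a tuple

def body_and_encoding(path: Path):
    """Return (bytes to upload, Content-Encoding header value or None)."""
    data = path.read_bytes()
    if already_compressed(path):
        # Upload as-is: no second compression layer, no Content-Encoding
        # header, so plain curl and Snakemake's HTTP remote see the file
        # byte-for-byte as it was given.
        return data, None
    # Otherwise compress for transfer and advertise it.
    return gzip.compress(data), "gzip"
```

With a check along these lines, files like metadata.tsv.gz and sequences.fasta.xz are stored unchanged, while uncompressed files still get gzip transfer encoding as before.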

@tsibley tsibley requested a review from a team March 22, 2022 21:20
@trvrb (Member) commented Mar 22, 2022

Thanks for including these guardrails @tsibley. Running with my original commands prevents the confusing behavior. Much appreciated.

@tsibley tsibley merged commit d04023d into master Mar 22, 2022
@tsibley tsibley deleted the trs/skip-compression-if-already-compressed branch March 22, 2022 22:48
@tsibley (Member, Author) commented Mar 22, 2022

Released with 3.2.1.
