
[batch] exclude directories for upload #219

Closed
jameshadfield opened this issue Sep 16, 2022 · 9 comments · Fixed by #370

@jameshadfield (Member)

Context

Deploying a build to AWS Batch currently zips up (almost) the entire current directory and uploads it to S3. Certain assets, e.g. .git/, are not uploaded, but the exclusions are not user-definable. I commonly have lots of large files which are unnecessary for the current build, and it would be quicker and lower-overhead to skip them for the build being deployed. Examples: ./data (sometimes >100 GB), big JSONs in ./auspice (which can take more than an hour to zip), and a huge number of intermediate files in ./results.

This causes pain during job deployment (both time to zip up, and time to upload the zip file) as well as during download (as the zip file is bigger than necessary).

My current workflow often involves creating a "temporary" directory which is a copy of the current one without those directories, or moving the directories somewhere temporary while the job is deployed. Both are a pain and prone to messing up badly!

Description

Add a --exclude-from-upload flag, which I'd commonly use like so: --exclude-from-upload auspice results data
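
For illustration only, a minimal sketch of how a repeatable option like this might be declared with argparse; the wiring below is an assumption, not the CLI's actual parser:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--exclude-from-upload",
        metavar="<pattern>",
        nargs="+",
        default=[],
        help="patterns for files/directories to leave out of the upload")  # hypothetical help text

    # e.g. parser.parse_args(["--exclude-from-upload", "auspice", "results", "data"])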

I'm not sure how this would work when downloading if you had a pre-existing auspice/ directory which wasn't part of the uploaded assets and then downloaded an (AWS Batch-generated) auspice/ directory. I'm not that familiar with the logic for working out which files to download (I typically request a subset of files to be downloaded).

jameshadfield added the enhancement (New feature or request) label on Sep 16, 2022
@jameshadfield (Member, Author)

I imagine we could reuse a lot of the code behind the --download functionality here, e.g.:

    # Patterns supplied by the user (e.g. via --exclude-from-upload), if any.
    if supplied_exclude_patterns:
        force_exclude = glob_matcher(supplied_exclude_patterns)
    else:
        force_exclude = lambda path: False

    # Assets which are always excluded from the upload.
    hardcoded_excluded = path_matcher([
        ".git/", ...
    ])

    # A path is excluded if either matcher says so.
    excluded = lambda path: hardcoded_excluded(path) or force_exclude(path)
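
For illustration, a minimal sketch of applying such a predicate while zipping the build directory (the names build_dir and archive and the zipfile usage are assumptions, not the CLI's actual upload code):

    import zipfile
    from pathlib import Path

    def make_upload_archive(build_dir: Path, archive: Path, excluded) -> None:
        """Zip up build_dir, skipping any path for which excluded(...) is True."""
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in build_dir.rglob("*"):
                relative = path.relative_to(build_dir).as_posix()
                if path.is_file() and not excluded(relative):
                    zf.write(path, arcname=relative)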

@corneliusroemer (Member) commented Sep 16, 2022

Edit: I misunderstood what the issue was about, so the struck-out text below isn't applicable.
~~I agree that we don't need to update things that are just downloaded, like data, but results are often useful for debugging or quickly rerunning a rule locally without having to rerun time-intensive steps.~~

Also, we may reduce storage by compressing with tar.zst instead of standard zip. Since we have lots of sequences, zstd compression would work extremely well.
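
As a rough illustration of that idea, using the third-party zstandard package (file names and compression level here are assumptions, not what the CLI does):

    import tarfile
    import zstandard

    # Illustrative only: stream an uncompressed tar through a zstd writer.
    with open("build.tar.zst", "wb") as out:
        compressor = zstandard.ZstdCompressor(level=19)
        with compressor.stream_writer(out) as zout:
            with tarfile.open(fileobj=zout, mode="w|") as tar:
                tar.add("auspice")
                tar.add("results")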

@jameshadfield (Member, Author)

> but results are often useful for debugging or quickly rerunning a rule locally without having to rerun time-intensive steps

Yup, I agree, but I think we're thinking of slightly different things. This is a request to avoid uploading certain artefacts via an opt-in process -- e.g. in my case ./results from previous runs which aren't relevant to the current job I want to deploy.

@tsibley (Member) commented Sep 19, 2022

I think the confusion may be that @jameshadfield is talking about the upload from the local computer → S3 before the Batch job is submitted and @corneliusroemer is talking about the upload from the remote Batch job → S3 at workflow completion.

victorlin moved this from New to Backlog in Nextstrain planning (archived) on Sep 20, 2022
@corneliusroemer (Member) commented Jan 25, 2023

Using --aws-batch for the first time from a local directory (rather than in CI), I noticed the exact same thing: previous results and data get zipped, which takes a long time (~10 min in my case). So definitely +1 for this feature request.

A good default could be to use .gitignore to determine what to ignore; that usually covers the input and output data that gets pulled or produced by other workflows. It wouldn't allow much customization, though, so --exclude-from-upload may still be needed as well.
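
A minimal sketch of the .gitignore idea, using the third-party pathspec library (which implements git's wildmatch rules); purely illustrative, not how the CLI currently decides what to upload:

    import pathspec

    # Build a matcher from the workflow's .gitignore.
    with open(".gitignore") as f:
        spec = pathspec.PathSpec.from_lines("gitwildmatch", f)

    def ignored(path: str) -> bool:
        """True if .gitignore would ignore this (relative) path."""
        return spec.match_file(path)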

@tsibley (Member) commented Jan 27, 2023

The intention is for previous results and input data to be included by default so that the AWS Batch runner works the same way as the other runners, preserving the consistent interface of nextstrain build, enabling use of partial remote builds, etc. Using .gitignore would change that behaviour, though, since the results and input files produced by a workflow are almost always ignored by version control. I think it's important for nextstrain build to behave consistently.

I do think that there should be a way to specify ignore patterns, though, probably both via a command-line option and via an ignores file which can live with the workflow.
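
A rough sketch of how patterns from a command-line option and an ignores file might be merged; the file name .nextstrain-exclude and the argument names are hypothetical, for illustration only:

    from pathlib import Path

    def exclusion_patterns(cli_patterns: list[str], ignores_file: Path) -> list[str]:
        """Merge patterns given on the command line with those from an
        ignores file kept alongside the workflow (hypothetical file name)."""
        patterns = list(cli_patterns)
        if ignores_file.exists():
            for line in ignores_file.read_text().splitlines():
                line = line.strip()
                if line and not line.startswith("#"):  # skip blanks and comments
                    patterns.append(line)
        return patterns

    # e.g. exclusion_patterns(["auspice", "results"], Path(".nextstrain-exclude"))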

More generally, I'd like to separate out the workflow as program (rules, code, etc.) from the workflow as state (config, inputs, outputs, etc.). This comes out of several other goals, but would apply to this issue too. For example, instead of relying on the implicit state combined with the code like now, you'd explicitly pass an empty starting state separate from the workflow program. (Which is effectively what James is doing in his workaround described above.)

@huddlej (Contributor) commented May 14, 2024

We just got a +1 for this functionality from @lmoncla since she often has large subdirectories in her top-level flu build directories that she doesn't want to upload during AWS Batch jobs, especially when she has a poor internet connection.

victorlin added the priority: high (To be resolved before other issues) label on May 14, 2024
tsibley self-assigned this on May 14, 2024
@tsibley (Member) commented May 14, 2024

+1 from @lmoncla and @trvrb in a Slack thread, along with some feature design discussion.

tsibley added a commit that referenced this issue May 15, 2024
@tsibley (Member) commented May 15, 2024

Working prototype: ab8b6d1

That'll be the basis for a PR sometime this week (when exactly depends on meetings and other work).

tsibley added commits that referenced this issue on May 24, 2024, with the commit message:
Excluding files from upload is very handy for ad-hoc skipping of large
ancillary files or previous output files in the build directory that the
user wants to ignore for the remote build on AWS Batch (i.e. to start it
"fresh").  Existing workarounds for the lack of an exclusion mechanism
include git worktrees to obtain a clean state and moving files or
directories temporarily out of the build directory.

A future improvement would be adding support to also specify these
patterns via a file, which can then be checked in to a build repo and
shared by all users.

As a broader improvement, I'd like to design away this issue by
separating out the workflow-as-program (rules, code, etc.) from the
workflow-as-state (config, inputs, outputs, etc.).  This comes out of
several other goals, but would apply to this "excluded files" need too.
For example, instead of relying on the implicit state combined with the
code like now in a single build directory, we'd instead explicitly pass
an empty starting state separate from the workflow program.

This change includes a small bug fix for --download to allow _only_
negated patterns.

Resolves: <#219>