[batch] exclude directories for upload #219
Comments
I imagine we could reuse a lot of the code behind the `--download` patterns:

```python
if supplied_exclude_patterns:
    force_exclude = glob_matcher(supplied_exclude_patterns)
else:
    force_exclude = lambda path: False

hardcoded_excluded = path_matcher([
    ".git/", ...
])

excluded = lambda path: hardcoded_excluded(path) or force_exclude(path)
```
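For concreteness, here's a minimal runnable sketch of what `glob_matcher` and `path_matcher` could look like; the `fnmatch`-based implementations are an assumption for illustration, not the project's actual helpers:

```python
from fnmatch import fnmatch

def glob_matcher(patterns):
    """Return a predicate that's true when a path matches any glob pattern."""
    def matches(path):
        return any(fnmatch(path, pattern) for pattern in patterns)
    return matches

def path_matcher(prefixes):
    """Return a predicate that's true when a path starts with any prefix."""
    def matches(path):
        return any(path.startswith(prefix) for prefix in prefixes)
    return matches

# Combining hardcoded and user-supplied excludes, as in the sketch above:
hardcoded_excluded = path_matcher([".git/"])
force_exclude = glob_matcher(["data/*", "results/*"])
excluded = lambda path: hardcoded_excluded(path) or force_exclude(path)

assert excluded(".git/config")
assert excluded("data/sequences.fasta")
assert not excluded("Snakefile")
```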
Edit: I misunderstood what the issue was about, so the struck-out stuff below isn't applicable.
Yup, I agree, but I think we're thinking of slightly different things. This is a request to avoid uploading certain artefacts via an opt-in process -- e.g. in my case …
I think the confusion may be that @jameshadfield is talking about the upload from the local computer → S3 before the Batch job is submitted, and @corneliusroemer is talking about the upload from the remote Batch job → S3 at workflow completion.
A good standard option could be to use …
The intention is for previous results and input data to be included by default so that the AWS Batch runner works the same way as the other runners, preserving the consistent interface of the CLI. I do think that there should be a way to specify ignore patterns, though, probably both via a command-line option and via an ignores file which can live with the workflow.

More generally, I'd like to separate out the workflow as program (rules, code, etc.) from the workflow as state (config, inputs, outputs, etc.). This comes out of several other goals, but would apply to this issue too. For example, instead of relying on the implicit state combined with the code like now, you'd explicitly pass an empty starting state separate from the workflow program. (Which is effectively what James is doing in his workaround described above.)
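As a rough illustration of the ignores-file idea, here's a minimal sketch; the filename `.nextstrain-upload-ignore` and the merging behaviour are hypothetical, not a settled design:

```python
from pathlib import Path

def load_ignore_patterns(build_dir, cli_patterns):
    """Merge ignore patterns given on the command line with patterns
    from a per-workflow ignores file (filename is hypothetical)."""
    patterns = list(cli_patterns)
    ignore_file = Path(build_dir) / ".nextstrain-upload-ignore"  # hypothetical name
    if ignore_file.exists():
        for line in ignore_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):  # skip blanks and comments
                patterns.append(line)
    return patterns

# e.g. load_ignore_patterns(".", ["auspice", "results"])
```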
We just got a +1 for this functionality from @lmoncla, since she often has large subdirectories in her top-level flu build directories that she doesn't want to upload during AWS Batch jobs, especially when she has a poor internet connection.
+1 from @lmoncla and @trvrb in a Slack thread, along with some feature design discussion. |
Working prototype: ab8b6d1. That'll be the basis for a PR sometime this week (when exactly depends on meetings and other work).
Excluding files from upload is very handy for ad-hoc skipping of large ancillary files or previous output files in the build directory that the user wants to ignore for the remote build on AWS Batch (i.e. to start it "fresh"). Existing workarounds for the lack of an exclusion mechanism include git worktrees to obtain a clean state and moving files or directories temporarily out of the build directory.

A future improvement would be adding support to also specify these patterns via a file, which can then be checked in to a build repo and shared by all users.

As a broader improvement, I'd like to design away this issue by separating out the workflow-as-program (rules, code, etc.) from the workflow-as-state (config, inputs, outputs, etc.). This comes out of several other goals, but would apply to this "excluded files" need too. For example, instead of relying on the implicit state combined with the code like now in a single build directory, we'd instead explicitly pass an empty starting state separate from the workflow program.

This change includes a small bug fix for `--download` to allow _only_ negated patterns.

Resolves: <#219>
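For illustration, a hedged sketch of what handling only-negated `--download` patterns could look like; this assumes gitignore-like "!"-prefix semantics and is not the CLI's actual implementation:

```python
from fnmatch import fnmatch

def should_download(path, patterns):
    """Decide whether to download a path, given a mix of normal and
    "!"-negated patterns. When *only* negated patterns are supplied,
    start from "everything" and subtract the negated matches -- the
    case the bug fix mentioned above is about (illustrative logic only)."""
    positive = [p for p in patterns if not p.startswith("!")]
    negated  = [p[1:] for p in patterns if p.startswith("!")]

    # No positive patterns means everything is selected by default.
    selected = any(fnmatch(path, p) for p in positive) if positive else True

    if any(fnmatch(path, p) for p in negated):
        selected = False
    return selected

# With only negated patterns, everything else is still downloaded:
# should_download("auspice/flu.json", ["!data/*"])  # -> True
# should_download("data/raw.fasta",   ["!data/*"])  # -> False
```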
Context
Deploying a build to AWS Batch currently zips up (almost) the entire current directory and uploads this to S3. Certain assets are not uploaded, e.g. `.git/`, but these are not user-definable. I commonly have lots of large files which are unnecessary for the current build, and it would be quicker, with less overhead, to ignore them for the purposes of the desired build. As examples: `./data` (sometimes >100 GB), big JSONs in `./auspice` (which can take more than an hour to zip), and a huge number of intermediate files in `./results`.

This causes pain during job deployment (both time to zip up and time to upload the zip file) as well as during download (as the zip file is bigger than necessary).

My current workflow often involves creating a "temporary" directory which is a copy of the current one without those directories, or moving the directories to a temporary place while the job is deployed. Both are a pain and prone to messing up badly!
Description
Add an `--exclude-from-upload` flag, which I'd commonly use like so: `--exclude-from-upload auspice results data`
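To make the proposed behaviour concrete, here's a minimal sketch of how such a flag might filter files while building the upload zip; the function name and the `zipfile`/`fnmatch` approach are my assumptions, not the CLI's actual code:

```python
import os
import zipfile
from fnmatch import fnmatch

def zip_build_dir(build_dir, zip_path, exclude=()):
    """Zip up build_dir for upload, skipping any path that matches an
    exclude pattern. A bare name like "auspice" is treated as a
    directory prefix so whole directories can be excluded."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(build_dir):
            for name in files:
                full = os.path.join(root, name)
                if os.path.abspath(full) == os.path.abspath(zip_path):
                    continue  # don't include the zip inside itself
                rel = os.path.relpath(full, build_dir)
                if any(fnmatch(rel, pat) or rel.startswith(pat.rstrip("/") + "/")
                       for pat in exclude):
                    continue  # excluded from the upload
                zf.write(full, rel)

# Matching the usage proposed above:
# zip_build_dir(".", "build.zip", exclude=["auspice", "results", "data"])
```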
I'm not sure how this would work when downloading if you had a previous `auspice/` directory which wasn't part of the uploaded assets, and then downloaded an (AWS Batch generated) `auspice/` directory. I'm not that familiar with the logic for working out which files to download (I typically request a subset of files to be downloaded).