[build] --cpus and --memory don't work with newer AWS Batch job definitions #144
This works! I tested with the diff below and our usual job definition, which defaults to …

```diff
diff --git a/nextstrain/cli/runner/aws_batch/jobs.py b/nextstrain/cli/runner/aws_batch/jobs.py
index 81b0c4b..b2a83be 100644
--- a/nextstrain/cli/runner/aws_batch/jobs.py
+++ b/nextstrain/cli/runner/aws_batch/jobs.py
@@ -177,8 +177,10 @@ def submit(name: str,
             *forwarded_environment(),
             *[{"name": name, "value": value} for name, value in env.items()]
         ],
-        **({ "vcpus": cpus } if cpus else {}),
-        **({ "memory": memory } if memory else {}),
+        "resourceRequirements": [
+            *([{ "type": "VCPU", "value": str(cpus) }] if cpus else []),
+            *([{ "type": "MEMORY", "value": str(memory) }] if memory else []),
+        ],
         "command": [
             "/sbin/entrypoint-aws-batch",
             *exec
```
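For reference, here's a minimal standalone sketch of what a job submission with these overrides looks like through boto3 (the job name, queue, and definition below are placeholders, not Nextstrain's actual values):

```python
# Illustrative only: a submit_job call whose containerOverrides use
# resourceRequirements, the same shape the patched submit() produces.
import boto3

batch = boto3.client("batch")

cpus, memory = 4, 7400  # memory is in MiB

batch.submit_job(
    jobName       = "example-build",           # placeholder
    jobQueue      = "example-job-queue",       # placeholder
    jobDefinition = "example-job-definition",  # placeholder
    containerOverrides = {
        # The AWS Batch API requires resourceRequirements values as strings.
        "resourceRequirements": [
            *([{"type": "VCPU",   "value": str(cpus)}]   if cpus   else []),
            *([{"type": "MEMORY", "value": str(memory)}] if memory else []),
        ],
        "command": ["echo", "hello"],
    },
)
```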
I ran a Snakefile that looked like this (containing more than was necessary to test, but it was handy for verifying #175):

```python
import os

rule:
    shell: f"""
        echo instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id)
        echo instance-type $(curl -s http://169.254.169.254/latest/meta-data/instance-type)
        echo workflow.cores {workflow.cores}
        echo nproc "$(nproc)"
        echo os.sched_getaffinity {len(os.sched_getaffinity(0))}
        echo os.cpu_count {os.cpu_count()}
        env | grep -i threads || true
    """
```

I observed that when I didn't pass …
… of .vcpu and .memory

Job submissions with resourceRequirements correctly override the defaults both from job definitions that use the newer resourceRequirements and from those that use the older, deprecated vcpus and memory fields. Job submissions using the older fields only override job definitions that also use the older fields; otherwise they're ignored, with only an easy-to-miss warning in the AWS console to alert you.

Notably, using the AWS console to modify an existing job definition which uses vcpus and memory will switch the definition to resourceRequirements automatically in the new revision. This meant that revising your old job definition in the AWS console could break --cpus and --memory for your `nextstrain build` invocations. It also means that --cpus and --memory would never have worked if your AWS job definitions were originally created with resourceRequirements.

Resolves <#144>.
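A quick way to check which style a given job definition uses is to look at its containerProperties; here's a minimal sketch with boto3 (the definition name is a placeholder):

```python
# Report whether a job definition declares its defaults with the newer
# resourceRequirements or the legacy vcpus/memory fields.
import boto3

batch = boto3.client("batch")

response = batch.describe_job_definitions(
    jobDefinitionName = "example-job-definition",  # placeholder
    status = "ACTIVE",
)

for jobdef in response["jobDefinitions"]:
    props = jobdef["containerProperties"]
    style = "resourceRequirements" if props.get("resourceRequirements") else "legacy vcpus/memory"
    print(f"{jobdef['jobDefinitionName']}:{jobdef['revision']} uses {style}")
```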
Since the update in #177 was included in the latest release of the CLI, our AWS Batch jobs have had the same override warning. The warning has gone away since I created a new revision of the job definition via the AWS console. I think this is just a UI bug on AWS's side, because our large ncov builds have been running successfully despite the warning. I just wanted to document this here since it confused me last week when our GISAID ncov-ingest run failed due to an out-of-memory issue (see Slack thread).
@joverlee521 Hmm. I'm slightly skeptical of it being an AWS Console bug. I wonder if something else is happening here. Do you have an example of a job where the warnings appeared? Did you confirm the job was submitted with Nextstrain CLI 4.0.0?
From what I've seen, the warning appears in all ncov-ingest/ncov jobs using job definition … The earliest one I can find is AWS Batch job …
Ok, I'm now in agreement that it's a Console UI bug, or at least misleading UI. The warning appears to apply to any …

Also, the warning text itself seems misleading/incorrect, because while it says (emphasis mine) …, I think the value used is actually coming from the job submission's resourceRequirements key, based on the previous testing above.
The `--cpus` and `--memory` options for `nextstrain build` don't work with newer AWS Batch job definitions that use entries in `resourceRequirements` (instead of separate `vcpus` and `memory` properties) to declare default CPU and memory requirements.

More information on this deprecation is at https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#override-resource-requirements and displayed in a tooltip in the AWS console.

Our initial troubleshooting happened in Slack.

Notably, using the AWS console to modify an existing job definition which uses `vcpus` and `memory` will switch to `resourceRequirements` in the new revision. However, you can still create definitions using `vcpus` and `memory` via the AWS API or CLI, e.g. `aws batch register-job-definition`.

Ideal result would be supporting both kinds of job definitions without much extra effort. Maybe defining overrides in `resourceRequirements` during submission will work with both kinds of definitions? Should test.
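For testing against both kinds of definitions, a legacy-style definition can still be registered programmatically; here's a minimal sketch with boto3 (the name, image, and command are placeholders, not Nextstrain's actual job definition):

```python
# Illustrative only: registers a job definition using the legacy top-level
# vcpus/memory fields in containerProperties instead of resourceRequirements.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName = "example-legacy-jobdef",  # placeholder
    type = "container",
    containerProperties = {
        "image":   "example/image",               # placeholder
        "vcpus":   4,                             # legacy field (deprecated)
        "memory":  7400,                          # legacy field, in MiB (deprecated)
        "command": ["true"],
    },
)
```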