[build] --cpus and --memory don't work with newer AWS Batch job definitions #144

Closed · tsibley opened this issue Feb 4, 2022 · 5 comments · Fixed by #177
Labels: bug

Comments


tsibley commented Feb 4, 2022

The --cpus and --memory options for nextstrain build don't work with newer AWS Batch job definitions that use entries in resourceRequirements (instead of separate vcpus and memory properties) to declare default CPU and memory requirements.

More information on this deprecation is at https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#override-resource-requirements and is also displayed in a tooltip in the AWS console:

This value was submitted using containerOverrides.vcpus which has been deprecated and was not used as an override. Instead, the VCPU value found in the job definition's resourceRequirements key was used instead. More information about the deprecated key can be found in the AWS Batch API documentation.

Our initial troubleshooting happened in Slack.

Notably, using the AWS console to modify an existing job definition which uses vcpus and memory will switch to resourceRequirements in the new revision. However, you can still create definitions using vcpus and memory via the AWS API or CLI, e.g. aws batch register-job-definition.
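For illustration, here's a minimal boto3 sketch of registering both kinds of definitions; the definition names, image, and values are made up for the example, not our real configuration:

import boto3  # assumes AWS credentials and region are configured

batch = boto3.client("batch")

# Old style: CPU and memory defaults as top-level containerProperties
# fields. Deprecated, but still accepted by the API and CLI.
batch.register_job_definition(
    jobDefinitionName="example-job-old-style",  # hypothetical name
    type="container",
    containerProperties={
        "image": "nextstrain/base",             # illustrative image
        "vcpus": 4,
        "memory": 7600,                         # MiB
    },
)

# New style: the same defaults declared as resourceRequirements
# entries, which is what newer definitions (and any definition
# revised in the AWS console) use. Note the values are strings.
batch.register_job_definition(
    jobDefinitionName="example-job-new-style",  # hypothetical name
    type="container",
    containerProperties={
        "image": "nextstrain/base",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "7600"},
        ],
    },
)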

The ideal result would be to support both kinds of job definitions without much extra effort. Maybe defining overrides in resourceRequirements during submission will work with both kinds of definitions? Should test.


tsibley commented May 25, 2022

> The ideal result would be to support both kinds of job definitions without much extra effort. Maybe defining overrides in resourceRequirements during submission will work with both kinds of definitions? Should test.

This works! I tested with

diff --git a/nextstrain/cli/runner/aws_batch/jobs.py b/nextstrain/cli/runner/aws_batch/jobs.py
index 81b0c4b..b2a83be 100644
--- a/nextstrain/cli/runner/aws_batch/jobs.py
+++ b/nextstrain/cli/runner/aws_batch/jobs.py
@@ -177,8 +177,10 @@ def submit(name: str,
                 *forwarded_environment(),
                 *[{"name": name, "value": value} for name, value in env.items()]
             ],
-            **({ "vcpus": cpus } if cpus else {}),
-            **({ "memory": memory } if memory else {}),
+            "resourceRequirements": [
+                *([{ "type": "VCPU", "value": str(cpus) }] if cpus else []),
+                *([{ "type": "MEMORY", "value": str(memory) }] if memory else []),
+            ],
             "command": [
                 "/sbin/entrypoint-aws-batch",
                 *exec

and our usual job definition, which defaults to "vcpus": 4 (i.e. the old style, not resourceRequirements), and our usual job queue. At the time of my testing, there were three Batch-managed EC2 instances already running:

  • c5.4xlarge (16 CPUs) with a 16 CPU ncov-ingest job running on it
  • c5.9xlarge (36 CPUs) with a 16 CPU ncov-ingest job running on it (20 CPUs spare!)
  • c5.24xlarge (96 CPUs) with a 72 CPU ncov trial job running on it (24 CPUs spare!)

I ran a Snakefile that looked like this (it contains more than was necessary for this test; it was left over from verifying #175):

import os

rule:
    shell: f"""
        echo instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id)
        echo instance-type $(curl -s http://169.254.169.254/latest/meta-data/instance-type)
        echo workflow.cores {workflow.cores}
        echo nproc "$(nproc)"
        echo os.sched_getaffinity {len(os.sched_getaffinity(0))}
        echo os.cpu_count {os.cpu_count()}
        env | grep -i threads || true
    """

I observed that when I didn't pass --cpus, my job quickly started on the c5.9xlarge (default of 4 < 20 spare). When I passed --cpus X for several values of X less than 20, my job also quickly started on the c5.9xlarge. When I passed --cpus 36, the job took longer to start and Batch spun up a new c5.9xlarge instance for it (and quickly spun it back down after the job was done). Concurrently, jobs submitted with --cpus X where X < 20 continued to place and start on the original c5.9xlarge with spare CPUs. This indicates to me that the new resourceRequirements in the submitted job were working as expected to override the old default vcpus of the job definition.
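At the Batch API level, the change above amounts to a submission like the following; a minimal boto3 sketch, where the job name and queue are illustrative rather than the exact jobs.py call:

import boto3

batch = boto3.client("batch")

# Submitting against an old-style definition (defaults in vcpus and
# memory): the resourceRequirements override below still takes effect,
# matching the scheduling behavior observed above.
batch.submit_job(
    jobName="example-build",         # hypothetical name
    jobQueue="example-queue",        # hypothetical queue
    jobDefinition="nextstrain-job",  # our usual definition
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "36"},     # e.g. --cpus 36
            {"type": "MEMORY", "value": "7600"}, # MiB, e.g. --memory
        ],
    },
)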

@tsibley tsibley self-assigned this May 25, 2022
tsibley added a commit that referenced this issue May 25, 2022
… of .vcpu and .memory

Job submissions with resourceRequirements correctly override the
defaults from job definitions that use the newer resourceRequirements
and those that use the older, deprecated vcpu and memory fields.  Job
submissions using the older fields only override job definitions that
also use the older fields; otherwise they're ignored with only an easy
to miss warning in the AWS console to alert you.

Notably, using the AWS console to modify an existing job definition
which uses vcpus and memory will switch the definition to
resourceRequirements automatically in the new revision.  This meant
revising your old job definition in the AWS console could break --cpus
and --memory for your `nextstrain build` invocations.  It also means
that --cpus and --memory would never have worked if your AWS job
definitions were originally created with resourceRequirements.

Resolves <#144>.
tsibley added a commit that referenced this issue May 26, 2022
… of .vcpu and .memory

Job submissions with resourceRequirements correctly override the
defaults from job definitions that use the newer resourceRequirements
and those that use the older, deprecated vcpu and memory fields.  Job
submissions using the older fields only override job definitions that
also use the older fields; otherwise they're ignored with only an easy
to miss warning in the AWS console to alert you.

Notably, using the AWS console to modify an existing job definition
which uses vcpus and memory will switch the definition to
resourceRequirements automatically in the new revision.  This meant
revising your old job definition in the AWS console could break --cpus
and --memory for your `nextstrain build` invocations.  It also means
that --cpus and --memory would never have worked if your AWS job
definitions were originally created with resourceRequirements.

Resolves <#144>.
Repository owner moved this from Prioritized to Done in Nextstrain planning (archived) May 27, 2022
@joverlee521

Since the update in #177 was included in the latest release of the CLI, our AWS Batch jobs have had the same override warning:
[screenshot: AWS console showing the deprecated-override warning, dated 2022-07-01]

The warning has gone away since I created a new revision of the job definition via the AWS console. I think this is just a UI bug on AWS's side, because our large ncov builds have been running successfully despite the warning. I just wanted to document this here since it confused me last week, when our GISAID ncov-ingest run failed due to an out-of-memory issue (see Slack thread).


tsibley commented Jul 5, 2022

@joverlee521 Hmm. I'm slightly skeptical of it being an AWS Console bug. I wonder if something else is happening here. Do you have an example of a job where the warnings appeared? Did you confirm the job was submitted with Nextstrain CLI 4.0.0?

@joverlee521

From what I've seen, the warning appears in all ncov-ingest/ncov jobs using job definition nextstrain-job:23 since the release of Nextstrain CLI 4.0.0 on Jun 24, 2022.

The earliest one I can find is AWS Batch job 6eb606e3-67a0-4bfa-88f3-42af63dbe514, which was launched by GH Action 7050978590. The GH Action logs show it was using Nextstrain CLI 4.0.0.


tsibley commented Jul 5, 2022

Ok, I'm now in agreement that it's a Console UI bug, or at least misleading UI.

The warning appears to apply to any containerOverrides values for vcpus and memory, regardless of whether they come from the job submission or the job definition. The warning is necessary because the Console UI shows the values from containerOverrides in the job definition even if the job submission used the newer resourceRequirements. I'd previously understood the warning to apply only to the reverse: the job definition using the newer resourceRequirements while the job submission used the older containerOverrides.

Also, the warning text itself seems misleading, if not incorrect. It says (emphasis mine):

This value was submitted using containerOverrides.memory which has been deprecated and was not used as an override. Instead, the MEMORY value found in the job definition’s resourceRequirements key was used instead. More information about the deprecated key can be found in the AWS Batch API documentation

but based on my testing above, I think the value used actually came from the job submission's resourceRequirements key.
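For anyone double-checking this later, one way to see which values a job actually carried is boto3's describe_jobs; a minimal sketch using the job id mentioned above (Batch only retains job details for a limited window after a run):

import boto3

batch = boto3.client("batch")

# Fetch the job detail. The container section reports the job's
# resourceRequirements alongside any legacy vcpus/memory values,
# showing where the effective CPU and memory settings came from.
job = batch.describe_jobs(jobs=["6eb606e3-67a0-4bfa-88f3-42af63dbe514"])["jobs"][0]

print(job["container"].get("resourceRequirements"))
print(job["container"].get("vcpus"), job["container"].get("memory"))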
