[build] --cpus and --memory don't work with newer AWS Batch job definitions #144

Closed · tsibley opened this issue Feb 4, 2022 · 5 comments · Fixed by #177
Labels: bug

Comments


tsibley commented Feb 4, 2022

The --cpus and --memory options for nextstrain build don't work with newer AWS Batch job definitions that use entries in resourceRequirements (instead of separate vcpus and memory properties) to declare default CPU and memory requirements.

More information on this deprecation is at https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#override-resource-requirements and is also displayed in a tooltip in the AWS console:

This value was submitted using containerOverrides.vcpus which has been deprecated and was not used as an override. Instead, the VCPU value found in the job definition's resourceRequirements key was used instead. More information about the deprecated key can be found in the AWS Batch API documentation.

Our initial troubleshooting happened in Slack.

Notably, using the AWS console to modify an existing job definition which uses vcpus and memory will switch to resourceRequirements in the new revision. However, you can still create definitions using vcpus and memory via the AWS API or CLI, e.g. aws batch register-job-definition.
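For illustration, here's a minimal boto3 sketch of registering both kinds of definitions; the definition names, image, and values are made up for the example, not our real configuration:

import boto3  # assumes AWS credentials and region are configured

batch = boto3.client("batch")

# Old style: CPU and memory defaults as top-level containerProperties
# fields. Deprecated, but still accepted by the API and CLI.
batch.register_job_definition(
    jobDefinitionName="example-job-old-style",  # hypothetical name
    type="container",
    containerProperties={
        "image": "nextstrain/base",             # illustrative image
        "vcpus": 4,
        "memory": 7600,                         # MiB
    },
)

# New style: the same defaults declared as resourceRequirements
# entries, which is what newer definitions (and any definition
# revised in the AWS console) use. Note the values are strings.
batch.register_job_definition(
    jobDefinitionName="example-job-new-style",  # hypothetical name
    type="container",
    containerProperties={
        "image": "nextstrain/base",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "7600"},
        ],
    },
)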

The ideal result would be to support both kinds of job definitions without much extra effort. Maybe defining overrides in resourceRequirements during submission will work with both kinds of definitions? Should test.


tsibley commented May 25, 2022

> The ideal result would be to support both kinds of job definitions without much extra effort. Maybe defining overrides in resourceRequirements during submission will work with both kinds of definitions? Should test.

This works! I tested with

diff --git a/nextstrain/cli/runner/aws_batch/jobs.py b/nextstrain/cli/runner/aws_batch/jobs.py
index 81b0c4b..b2a83be 100644
--- a/nextstrain/cli/runner/aws_batch/jobs.py
+++ b/nextstrain/cli/runner/aws_batch/jobs.py
@@ -177,8 +177,10 @@ def submit(name: str,
                 *forwarded_environment(),
                 *[{"name": name, "value": value} for name, value in env.items()]
             ],
-            **({ "vcpus": cpus } if cpus else {}),
-            **({ "memory": memory } if memory else {}),
+            "resourceRequirements": [
+                *([{ "type": "VCPU", "value": str(cpus) }] if cpus else []),
+                *([{ "type": "MEMORY", "value": str(memory) }] if memory else []),
+            ],
             "command": [
                 "/sbin/entrypoint-aws-batch",
                 *exec

and our usual job definition, which defaults to "vcpus": 4 (i.e. the old style, not resourceRequirements), and our usual job queue. At the time of my testing, there were three Batch-managed EC2 instances already running:

  • c5.4xlarge (16 CPUs) with a 16 CPU ncov-ingest job running on it
  • c5.9xlarge (36 CPUs) with a 16 CPU ncov-ingest job running on it (20 CPUs spare!)
  • c5.24xlarge (96 CPUs) with a 72 CPU ncov trial job running on it (24 CPUs spare!)

I ran a Snakefile that looked like this (it contains more than was necessary for this test; it was left over from verifying #175):

import os

rule:
    shell: f"""
        echo instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id)
        echo instance-type $(curl -s http://169.254.169.254/latest/meta-data/instance-type)
        echo workflow.cores {workflow.cores}
        echo nproc "$(nproc)"
        echo os.sched_getaffinity {len(os.sched_getaffinity(0))}
        echo os.cpu_count {os.cpu_count()}
        env | grep -i threads || true
    """

I observed that when I didn't pass --cpus, my job quickly started on the c5.9xlarge (default of 4 < 20 spare). When I passed --cpus X for several values of X less than 20, my job also quickly started on the c5.9xlarge. When I passed --cpus 36, the job took longer to start and Batch spun up a new c5.9xlarge instance for it (and quickly spun it back down after the job was done). Concurrently, jobs submitted with --cpus X where X < 20 continued to place and start on the original c5.9xlarge with spare CPUs. This indicates to me that the new resourceRequirements in the submitted job were working as expected to override the old default vcpus of the job definition.
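At the Batch API level, the change above amounts to a submission like the following; a minimal boto3 sketch, where the job name and queue are illustrative rather than the exact jobs.py call:

import boto3

batch = boto3.client("batch")

# Submitting against an old-style definition (defaults in vcpus and
# memory): the resourceRequirements override below still takes effect,
# matching the scheduling behavior observed above.
batch.submit_job(
    jobName="example-build",         # hypothetical name
    jobQueue="example-queue",        # hypothetical queue
    jobDefinition="nextstrain-job",  # our usual definition
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "36"},     # e.g. --cpus 36
            {"type": "MEMORY", "value": "7600"}, # MiB, e.g. --memory
        ],
    },
)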

@tsibley tsibley self-assigned this May 25, 2022
tsibley added a commit that referenced this issue May 25, 2022
… of .vcpu and .memory

Job submissions with resourceRequirements correctly override the
defaults from job definitions that use the newer resourceRequirements
and those that use the older, deprecated vcpu and memory fields.  Job
submissions using the older fields only override job definitions that
also use the older fields; otherwise they're ignored with only an easy
to miss warning in the AWS console to alert you.

Notably, using the AWS console to modify an existing job definition
which uses vcpus and memory will switch the definition to
resourceRequirements automatically in the new revision.  This meant
revising your old job definition in the AWS console could break --cpus
and --memory for your `nextstrain build` invocations.  It also means
that --cpus and --memory would never have worked if your AWS job
definitions were originally created with resourceRequirements.

Resolves <#144>.
tsibley added a commit that referenced this issue May 26, 2022
… of .vcpu and .memory

Job submissions with resourceRequirements correctly override the
defaults from job definitions that use the newer resourceRequirements
and those that use the older, deprecated vcpu and memory fields.  Job
submissions using the older fields only override job definitions that
also use the older fields; otherwise they're ignored with only an easy
to miss warning in the AWS console to alert you.

Notably, using the AWS console to modify an existing job definition
which uses vcpus and memory will switch the definition to
resourceRequirements automatically in the new revision.  This meant
revising your old job definition in the AWS console could break --cpus
and --memory for your `nextstrain build` invocations.  It also means
that --cpus and --memory would never have worked if your AWS job
definitions were originally created with resourceRequirements.

Resolves <#144>.
Repository owner moved this from Prioritized to Done in Nextstrain planning (archived) May 27, 2022
@joverlee521

Since the update in #177 was included in the latest release of the CLI, our AWS Batch jobs have had the same override warning:
[screenshot: AWS console showing the deprecated-override warning, dated 2022-07-01]

The warning has gone away since I created a new revision of the job definition via the AWS console. I think this is just a UI bug on AWS's side, because our large ncov builds have been running successfully despite the warning. I just wanted to document this here since it confused me last week, when our GISAID ncov-ingest run failed due to an out-of-memory issue (see Slack thread).


tsibley commented Jul 5, 2022

@joverlee521 Hmm. I'm slightly skeptical of it being an AWS Console bug. I wonder if something else is happening here. Do you have an example of a job where the warnings appeared? Did you confirm the job was submitted with Nextstrain CLI 4.0.0?

@joverlee521

From what I've seen, the warning appears in all ncov-ingest/ncov jobs using job definition nextstrain-job:23 since the release of Nextstrain CLI 4.0.0 on Jun 24, 2022.

The earliest one I can find is AWS Batch job 6eb606e3-67a0-4bfa-88f3-42af63dbe514, which was launched by GH Action 7050978590. The GH Action logs show it was using Nextstrain CLI 4.0.0.


tsibley commented Jul 5, 2022

Ok, I'm now in agreement that it's a Console UI bug, or at least misleading UI.

The warning appears to apply to any containerOverrides values for vcpus and memory, regardless of whether they come from the job submission or the job definition. The warning is necessary because the Console UI shows the values from containerOverrides in the job definition even if the job submission used the newer resourceRequirements. I'd previously understood the warning to apply only to the reverse: the job definition using the newer resourceRequirements while the job submission used the older containerOverrides.

Also, the warning text itself seems misleading, if not incorrect. It says (emphasis mine):

This value was submitted using containerOverrides.memory which has been deprecated and was not used as an override. Instead, the MEMORY value found in the job definition’s resourceRequirements key was used instead. More information about the deprecated key can be found in the AWS Batch API documentation

but based on my testing above, I think the value used actually came from the job submission's resourceRequirements key.
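For anyone double-checking this later, one way to see which values a job actually carried is boto3's describe_jobs; a minimal sketch using the job id mentioned above (Batch only retains job details for a limited window after a run):

import boto3

batch = boto3.client("batch")

# Fetch the job detail. The container section reports the job's
# resourceRequirements alongside any legacy vcpus/memory values,
# showing where the effective CPU and memory settings came from.
job = batch.describe_jobs(jobs=["6eb606e3-67a0-4bfa-88f3-42af63dbe514"])["jobs"][0]

print(job["container"].get("resourceRequirements"))
print(job["container"].get("vcpus"), job["container"].get("memory"))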
