Better support pathogen repositories which place workflows in subdirectories

…as this is the direction we're moving.

The main change is making the filesystem isolation boundary (i.e. what's
mapped to `/nextstrain/build` in a container) a _separate thing_ from
the workflow and initial working directory (i.e. what's given to
`nextstrain build`).  In this codebase, these two things are referred to
as the _build volume_ (aka `opts.build`) and the _working volume_.
Historically, the working volume given to the runners for `nextstrain
build` and `nextstrain shell` _was_ the build volume; now, they're
separately considered and sometimes differ.
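
As a minimal sketch of that distinction, using the new `pathogen_volumes()`
helper added to `nextstrain/cli/command/build.py` below (paths are
illustrative and assume a local `mpox/` checkout whose top level contains
`nextstrain-pathogen.yaml`):

    >>> from pathlib import Path
    >>> from nextstrain.cli.command.build import pathogen_volumes
    >>> build_volume, working_volume = pathogen_volumes(Path("mpox/phylogenetic/"))
    >>> build_volume     # doctest: +ELLIPSIS
    NamedVolume(name='build', src=...Path('.../mpox'), dir=True, writable=True)
    >>> working_volume   # doctest: +ELLIPSIS
    NamedVolume(name='build/phylogenetic', src=...Path('.../mpox/phylogenetic'), dir=True, writable=True)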

See the included changelog entry for usage details and rationale.  For
background, I made the initial proposal for this feature¹ on a PR in our
pathogen-repo-template repository and some discussion ensued.

¹ <nextstrain/pathogen-repo-guide#16 (comment)>
  or <https://github.com/tsibley/blab-standup/blob/master/2024-01-25.md>
tsibley committed Jan 31, 2024
1 parent 6b8262a commit bf858d2
Showing 7 changed files with 194 additions and 31 deletions.
61 changes: 61 additions & 0 deletions CHANGES.md
@@ -13,6 +13,67 @@ development source code and as such may not be routinely kept up to date.

# __NEXT__

## Features

* `nextstrain build` and `nextstrain shell` now better support pathogen
repositories which place workflows in subdirectories. The top-level of the
repo must contain a `nextstrain-pathogen.yaml` file for this support to
activate. The file may be empty for now, though we anticipate using it for
pathogen-level metadata in the future to aid indexing, listing, and
attribution of pathogen repos.
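
  Opting in can be as simple as creating an empty marker file at the top level
  of the repo (a sketch; for now the file's contents don't matter):

      touch nextstrain-pathogen.yaml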

As an example of the new support, consider the following repo layout

mpox/
├── nextstrain-pathogen.yaml
├── ingest/
│   ├── Snakefile
│   └── …
├── phylogenetic/
│   ├── Snakefile
│   └── …
├── shared/
│   ├── reference.fasta
│   └── …
└── …

where `ingest/` and `phylogenetic/` contain workflows that use
`shared/reference.fasta` via a relative path (i.e.
`../shared/reference.fasta`).

It's now possible to invoke those workflows with any of the following:

nextstrain build mpox/ingest/
nextstrain build mpox/phylogenetic/

cd mpox
nextstrain build ingest/
nextstrain build phylogenetic/

cd phylogenetic
nextstrain build .
nextstrain build ../ingest/

regardless of runtime.

Previously, such workflows required careful invocation, e.g.

nextstrain build mpox/ -d phylogenetic/ -s phylogenetic/Snakefile

when using runtimes with filesystem isolation (i.e. the [containerized][]
ones; Docker, Singularity, and AWS Batch) but not when using runtimes without
it.

When active, this feature makes the top-level of the pathogen repo (e.g.
`mpox/`) available in the container at `/nextstrain/build` while the
initial working directory is set to the workflow subdirectory in the
container (e.g. `/nextstrain/build/phylogenetic`). That is, the filesystem
isolation boundary is drawn at the top-level of the pathogen repo instead of
at the workflow directory (i.e. what's given to `nextstrain build`).
([#355](https://github.com/nextstrain/cli/pull/355))
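
  Under the Docker runtime, for example, the effect is roughly equivalent to
  the following invocation (an illustrative sketch, not the exact command the
  CLI constructs):

      docker run … \
        --volume …/mpox:/nextstrain/build \
        --workdir /nextstrain/build/phylogenetic \
        …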

[containerized]: https://docs.nextstrain.org/projects/cli/en/__NEXT__/runtimes/#comparison

## Improvements

* `nextstrain build` now errors if a [development overlay option][] such as
119 changes: 104 additions & 15 deletions nextstrain/cli/command/build.py
@@ -17,13 +17,16 @@
"""

import re
from pathlib import Path
from textwrap import dedent
from typing import Tuple
from .. import runner
from ..argparse import add_extended_help_flags, AppendOverwriteDefault, SKIP_AUTO_DEFAULT_IN_HELP
from ..debug import debug
from ..errors import UsageError, UserError
from ..runner import docker, singularity
from ..util import byte_quantity, runner_name, warn
from ..volume import store_volume
from ..volume import NamedVolume


def register_parser(subparser):
@@ -137,7 +140,7 @@ def register_parser(subparser):
"Required, except when the AWS Batch runtime is in use and --attach and either --no-download or --cancel are given. "
f"{SKIP_AUTO_DEFAULT_IN_HELP}",
metavar = "<directory>",
action = store_volume("build"),
type = Path,
nargs = "?")

# Register runner flags and arguments
@@ -149,7 +152,7 @@
def run(opts):
assert_overlay_volumes_support(opts)

# We must check this before the conditions under which opts.build is
# We must check this before the conditions under which opts.directory is
# optional because otherwise we could pass a missing build dir to a runner
# which ignores opts.attach.
if (opts.attach or opts.detach or opts.detach_on_interrupt or opts.cancel) and opts.__runner__ is not runner.aws_batch:
@@ -160,24 +163,23 @@ def run(opts):
Did you forget to specify --aws-batch?
""")

# Ensure our build dir exists
if opts.build is None:
# Interpret the given directory and ensure it exists if necessary.
if opts.directory is not None:
build_volume, working_volume = pathogen_volumes(opts.directory)

else:
if opts.attach and (not opts.download or opts.cancel):
# Don't require a build directory with --attach + --no-download
# or --attach + --cancel. User just wants to check status/logs or
# stop the job.
pass
build_volume = None
working_volume = None
else:
raise UsageError("Path to a pathogen build <directory> is required.")

elif not opts.build.src.is_dir():
warn("Error: Build path \"%s\" does not exist or is not a directory." % opts.build.src)

if not opts.build.src.is_absolute():
warn()
warn("Perhaps your current working directory is different than you expect?")
if build_volume:
opts.volumes.append(build_volume) # for Docker, Singularity, and AWS Batch

return 1

# Automatically pass thru appropriate resource options to Snakemake to
# avoid the user having to repeat themselves (once for us, once for
@@ -233,14 +235,14 @@ def run(opts):
based on its --memory option. This may or may not be what you expect.
""" % (snakemake_opts["--resources"][0],)))

return runner.run(opts, working_volume = opts.build, cpus = opts.cpus, memory = opts.memory)
return runner.run(opts, working_volume = working_volume, cpus = opts.cpus, memory = opts.memory)


def assert_overlay_volumes_support(opts):
"""
Check that runtime overlays are supported, if given.
"""
overlay_volumes = [v for v in opts.volumes if v is not opts.build]
overlay_volumes = opts.volumes

if overlay_volumes and opts.__runner__ not in {docker, singularity}:
raise UserError(f"""
@@ -249,6 +251,93 @@ def assert_overlay_volumes_support(opts):
""")


def pathogen_volumes(directory: Path) -> Tuple[NamedVolume, NamedVolume]:
"""
Discern the pathogen **build volume** and **working volume** for a given
*directory* path.

The **build volume** is the pathogen repo root, if discernable by the
presence of :file:`nextstrain-pathogen.yaml` in one of the parents of
*directory*. Otherwise, it's the given *directory* as-is.

The **working volume** is always the given *directory* labeled with a
volume name that reflects its relative path within the **build volume**.

Some examples:
>>> build_volume, working_volume = pathogen_volumes(Path("tests/data/pathogen-repo/ingest/"))
>>> build_volume # doctest: +ELLIPSIS
NamedVolume(name='build', src=...Path('.../tests/data/pathogen-repo'), dir=True, writable=True)
>>> working_volume # doctest: +ELLIPSIS
NamedVolume(name='build/ingest', src=...Path('.../tests/data/pathogen-repo/ingest'), dir=True, writable=True)
>>> docker.mount_point(build_volume) <= docker.mount_point(working_volume)
True
>>> build_volume, working_volume = pathogen_volumes(Path("tests/data/"))
>>> build_volume # doctest: +ELLIPSIS
NamedVolume(name='build', src=...Path('.../tests/data'), dir=True, writable=True)
>>> working_volume # doctest: +ELLIPSIS
NamedVolume(name='build', src=...Path('.../tests/data'), dir=True, writable=True)
>>> docker.mount_point(build_volume) <= docker.mount_point(working_volume)
True
"""
if not directory.is_dir():
err = f"Build path {str(directory)!r} does not exist or is not a directory."

if not directory.is_absolute():
raise UserError(f"""
{err}
Perhaps your current working directory is different than you expect?
""")
else:
raise UserError(err)

# The pathogen repo root is (optionally) indicated by the presence of
# nextstrain-pathogen.yaml. We intentionally don't use a marker tied to
# Git, i.e. .git/, because we and users sometimes run pathogen repos out of
# exports from/snapshots of Git instead of full Git clones.
#
# Also, we have ideas for leveraging nextstrain-pathogen.yaml in the future
# as a source of pathogen-level metadata for use in indexing, listing,
# attribution, etc. For now, the contents do not matter and an empty file
# works just fine as the repo root marker.
# -trs, 29 Jan 2024
marker_name = "nextstrain-pathogen.yaml"

# Search upwards for pathogen repo root to serve as the build volume.
working_dir = directory.resolve(strict = True)
debug(f"Resolved {directory} to {working_dir}")

debug(f"Looking for {marker_name} as pathogen root dir marker")

for marker in (d / marker_name for d in [working_dir, *working_dir.parents]):
if marker.exists():
debug(f"{marker}: exists")
build_volume = NamedVolume("build", marker.parent)
break
else:
debug(f"{marker}: does not exist")
else:
build_volume = NamedVolume("build", working_dir)

debug(f"Using {build_volume.src} as build volume")

# Construct the working volume name based on its relative path within the
# build volume we just determined. The working volume should always be
# within (or identical to) the build volume.
working_volume = NamedVolume(
str(build_volume.name / working_dir.relative_to(build_volume.src)),
working_dir)

debug(f"Using {working_volume.src} as working ({working_volume.name}) volume")

assert build_volume.src <= working_volume.src
assert docker.mount_point(build_volume) <= docker.mount_point(working_volume) # type: ignore[attr-defined] # for mypy

return build_volume, working_volume


def parse_snakemake_args(args):
"""
Inspects a tiny subset of Snakemake's CLI arguments in order to determine
22 changes: 9 additions & 13 deletions nextstrain/cli/command/shell.py
@@ -3,15 +3,16 @@
commands and perform debugging.
"""

from pathlib import Path
from typing import Tuple
from .. import resources
from .. import runner
from ..argparse import add_extended_help_flags
from ..paths import SHELL_HISTORY
from ..runner import docker, conda, singularity
from ..util import colored, remove_prefix, runner_name, warn
from ..volume import store_volume, NamedVolume
from .build import assert_overlay_volumes_support
from ..util import colored, remove_prefix, runner_name
from ..volume import NamedVolume
from .build import assert_overlay_volumes_support, pathogen_volumes


def register_parser(subparser):
@@ -30,7 +31,7 @@ def register_parser(subparser):
"directory",
help = "Path to pathogen build directory",
metavar = "<directory>",
action = store_volume("build"))
type = Path)

# Register runner flags and arguments; excludes ambient and AWS Batch
# runners since those don't make any sense here.
@@ -45,15 +46,10 @@ def run(opts):
def run(opts):
assert_overlay_volumes_support(opts)

# Ensure our build dir exists
if not opts.build.src.is_dir():
warn("Error: Build path \"%s\" does not exist or is not a directory." % opts.build.src)
# Interpret the given directory
build_volume, working_volume = pathogen_volumes(opts.directory)

if not opts.build.src.is_absolute():
warn()
warn("Perhaps your current working directory is different than you expect?")

return 1
opts.volumes.append(build_volume) # for Docker and Singularity

print(colored("bold", f"Entering the Nextstrain runtime ({runner_name(opts.__runner__)})"))
print()
Expand Down Expand Up @@ -107,7 +103,7 @@ def run(opts):
"NEXTSTRAIN_HISTFILE": str(history_file),
}

return runner.run(opts, working_volume = opts.build, extra_env = extra_env)
return runner.run(opts, working_volume = working_volume, extra_env = extra_env)


def ps1() -> str:
23 changes: 20 additions & 3 deletions nextstrain/cli/runner/aws_batch/__init__.py
@@ -163,14 +163,22 @@ def register_arguments(parser) -> None:


def run(opts, argv, working_volume = None, extra_env: Env = {}, cpus: int = None, memory: int = None) -> int:
build_volume = next((v for v in opts.volumes if v and v.name == "build"), None)

# Unlike other runners, the AWS Batch runner currently *requires* a working
# dir in most usages. This is ok as we only provide the AWS Batch runner
# for commands which also require a working dir (e.g. build), whereas other
# runners also work with commands that don't.
# -trs, 28 Feb 2022 (updated 24 August 2023)
assert working_volume is not None or (opts.attach and (not opts.download or opts.cancel))
assert (working_volume is not None and build_volume is not None) or (opts.attach and (not opts.download or opts.cancel))

local_workdir = working_volume.src.resolve(strict = True) if working_volume else None
# Note that "workdir" here always refers to our image's default WORKDIR,
# /nextstrain/build, since AWS Batch provides no way to override the
# initial working directory of a container, i.e. an equivalent to the
# --workdir option of `docker run`. Instead, to change the initial working
# directory, we arrange to exec-chain thru `env --chdir`.
# -trs, 29 Jan 2024
local_workdir = build_volume.src.resolve(strict = True) if build_volume else None

if opts.attach:
print_stage("Attaching to Nextstrain AWS Batch Job ID:", opts.attach)
@@ -192,6 +200,7 @@ def run(opts, argv, working_volume = None, extra_env: Env = {}, cpus: int = None
print_stage("Job is %s" % job.status)
else:
assert local_workdir is not None
assert working_volume is not None

# Generate our own unique run id since we can't know the AWS Batch job id
# until we submit it. This run id is used for workdir and run results
@@ -239,6 +248,14 @@ def run(opts, argv, working_volume = None, extra_env: Env = {}, cpus: int = None
# Memory from our caller is in bytes, but AWS expects MiB.
memory //= 1024**2

# Change working directory in the container before running our command.
# Note that this happens _after_ the build context we uploaded
# (remote_workdir) is downloaded and extracted to the default working
# directory (/nextstrain/build).
exec = [
"env", "--chdir", str(docker.mount_point(working_volume)), "--",
*argv ]
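
# For illustration (hypothetical values): if the working volume is named
# "build/phylogenetic" and argv is ["snakemake", "--cores", "2"], the
# command submitted to AWS Batch becomes roughly:
#
#   ["env", "--chdir", "/nextstrain/build/phylogenetic", "--",
#    "snakemake", "--cores", "2"]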

try:
job = jobs.submit(
name = run_id,
Expand All @@ -248,7 +265,7 @@ def run(opts, argv, working_volume = None, extra_env: Env = {}, cpus: int = None
cpus = cpus,
memory = memory,
workdir = remote_workdir,
exec = argv,
exec = exec,
env = { k: v for k, v in extra_env.items() if v is not None })
except Exception as error:
warn(error)
Empty file.
Empty file.
Empty file.
