update snakemake docs #380

Merged Jan 24, 2024 (5 commits)
Binary file added docs/source/assets/snakemake/jit-task.png
3 changes: 2 additions & 1 deletion docs/source/index.md
@@ -157,11 +157,12 @@ cli/mv
:hidden:
:maxdepth: 2
:caption: Snakemake Integration
snakemake/overview.md
snakemake/tutorial.md
snakemake/lifecycle.md
snakemake/metadata.md
snakemake/environments.md
snakemake/cloud.md
snakemake/lifecycle.md
snakemake/debugging.md
snakemake/troubleshooting.md
```
17 changes: 9 additions & 8 deletions docs/source/snakemake/cloud.md
@@ -6,10 +6,10 @@ When a Snakemake workflow is executed on Latch, each generated job is run in a s

Therefore, it may be necessary to adapt your Snakefile to address issues arising from this execution method, which were not encountered during local execution:

* Add missing rule inputs that are implicitly fulfilled when executing locally.
* Make sure shared code does not rely on input files. This is any code that is not under a rule, and so gets executed by every task
* Add `resources` directives if tasks run out of memory or disk space
* Optimize data transfer by merging tasks that have 1-to-1 dependencies
- Add missing rule inputs that are implicitly fulfilled when executing locally.
- Make sure shared code does not rely on input files; this is any code that is not under a rule, and so gets executed by every task (see the sketch below).
- Add `resources` directives if tasks run out of memory or disk space.
- Optimize data transfer by merging tasks that have 1-to-1 dependencies.

Here, we will walk through examples of each of the cases outlined above.
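
To illustrate the shared-code point from the list above, one possible mitigation is to guard top-level file reads behind an existence check, so code that runs in every task does not crash when the file has not been downloaded. A minimal sketch, assuming a hypothetical `config/samples.tsv`:

```python
# Sketch: code outside any rule runs in every task, so guard reads of
# files that may not be present in a given task's container.
import os

SAMPLES = []
if os.path.exists("config/samples.tsv"):  # hypothetical path; present at JIT time
    with open("config/samples.tsv") as f:
        SAMPLES = [line.split("\t")[0] for line in f if line.strip()]
```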

@@ -23,8 +23,8 @@ A typical example is if the index files for biological data are not explicitly s

In the example below, there are two Snakefile rules:

* `delly_s`: The rule runs Delly to call SVs and outputs an unfiltered BCF file, followed by quality filtering using `bcftools` filter to retain only the SV calls that pass certain filters. Finally, it indexes the BCF file.
* `delly_merge`: This rule merges or concatenates BCF files containing SV calls from the delly_s rule, producing a single VCF file. The rule requires the index file to be available for each corresponding BAM file.
- `delly_s`: The rule runs Delly to call SVs and outputs an unfiltered BCF file, followed by quality filtering using `bcftools filter` to retain only the SV calls that pass certain filters. Finally, it indexes the BCF file.
- `delly_merge`: This rule merges or concatenates BCF files containing SV calls from the `delly_s` rule, producing a single VCF file. The rule requires the index file to be available for each corresponding BAM file.

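A minimal sketch of the fix for `delly_merge`, making the previously implicit index files explicit inputs (the paths, the `SAMPLES` list, and the `.csi` index naming are illustrative assumptions, not taken from the diff):

```python
# Sketch: declare each BCF index as an explicit input so every Latch task
# downloads it alongside the corresponding BCF (all paths are hypothetical).
SAMPLES = ["sample1", "sample2"]

rule delly_merge:
    input:
        bcf=expand("calls/{sample}.bcf", sample=SAMPLES),
        csi=expand("calls/{sample}.bcf.csi", sample=SAMPLES),  # previously implicit
    output:
        "calls/merged.vcf"
    shell:
        "bcftools concat {input.bcf} -O v -o {output}"
```

The original rules are excerpted below.
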
```python
rule delly_s: # single-sample analysis
    ...
```

@@ -353,6 +353,7 @@ rule kraken:

```python
rule kraken:
    ...
    resources:
        mem_mb=128000,
        cpus=8
    ...
```

@@ -364,13 +365,13 @@ To optimize performance and minimize costs, it is recommended to consolidate the

#### Example

* Inefficient example with multiple rules processing the same BAM file:
- Inefficient example with multiple rules processing the same BAM file:

```python
rule all:
    input:
        "results/final_variants.vcf"

rule mark_duplicates:
    input:
        "data/sample.bam"
    ...
```
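
Consolidating such 1-to-1 steps into a single rule keeps the intermediate BAM from being uploaded and downloaded between tasks. A hypothetical merged version (the tool invocations and the `ref.fa` path are illustrative, not taken from the diff):

```python
# Sketch: run duplicate marking and variant calling in one rule so the
# intermediate BAM never crosses a task boundary on Latch.
rule mark_duplicates_and_call:
    input:
        bam="data/sample.bam",
        ref="ref.fa",  # hypothetical reference path
    output:
        "results/final_variants.vcf"
    shell:
        "picard MarkDuplicates I={input.bam} O=marked.bam M=dup_metrics.txt && "
        "bcftools mpileup -f {input.ref} marked.bam | bcftools call -mv -o {output}"
```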
6 changes: 4 additions & 2 deletions docs/source/snakemake/debugging.md
@@ -35,7 +35,9 @@ your_workflow_name_jit_register_task(
You can execute the script in your `latch develop` session like so:

```console
$ python3 scripts/dry_run.py
$ /usr/local/bin/python3 scripts/dry_run.py
```

**Note**: If you are using `conda`, your shell may activate the conda base environment by default. To ensure that you are running in the exact same envioronment as the JIT task, either run `conda deactivate` once you enter the shell or disable conda's environment auto activation in your Dockerfile: `RUN conda config --set auto_activate_base false`
**Note**: If you are running into an `ImportError`, be sure to use the version of Python in which the Latch SDK was installed.

**Note**: If you are using `conda`, your shell may activate the conda base environment by default. To ensure that you are running in the exact same environment as the JIT task, either run `conda deactivate` once you enter the shell or disable conda's environment auto activation in your Dockerfile: `RUN conda config --set auto_activate_base false`
18 changes: 15 additions & 3 deletions docs/source/snakemake/environments.md
@@ -1,8 +1,20 @@
# Environments

## Configuring Conda and Container Environments
When registering a Snakemake workflow on Latch, we need to build a single container image that contains all your runtime dependencies as well as the Latch packages. By default, all tasks (including the JIT step) will run inside this container.

Latch's Snakemake integration supports the use of both the `conda` and `container` directives in your Snakefile. To configure which environment to run tasks in (which is typically done through the use of `--use-conda` and `--use-singularity`), add the `env_config` field to your workflow's `SnakemakeMetadata` object. For example,
To generate a Dockerfile with all the Latch-specific dependencies, run the following command from inside your workflow directory:

```console
latch dockerfile . --snakemake
```

Be sure to inspect the resulting Dockerfile and add any runtime dependencies that are required for your workflow.

## Configuring Task Environments

Sometimes it is preferable to use isolated environments for each Snakemake rule via the `container` and `conda` [Snakemake directives](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#running-jobs-in-containers) instead of building one large image.

Typically, when using these directives, we must pass the `--use-conda` and `--use-singularity` flags to the `snakemake` command in order to configure which environment to activate. Similarly, to configure your environment on Latch, add the `env_config` field to your workflow's `SnakemakeMetadata` object. For example,

```
# latch_metadata.py
...
```

@@ -37,7 +49,7 @@ SnakemakeMetadata(

```
    ...
)
```
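
For reference, a hypothetical expansion of the collapsed block above; the `EnvironmentConfig` name and its `use_conda`/`use_container` flags are assumptions and should be checked against the Latch SDK:

```python
# Sketch only: EnvironmentConfig and its fields are assumed, not confirmed.
from latch.types.metadata import EnvironmentConfig, SnakemakeMetadata

SnakemakeMetadata(
    display_name="my_workflow",  # hypothetical workflow name
    env_config=EnvironmentConfig(
        use_conda=True,       # analogous to passing --use-conda
        use_container=True,   # analogous to passing --use-singularity
    ),
)
```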

If there is no `env_config` defined, Snakemake tasks on Latch will NOT use containers or conda environments by default.
**Note**: If there is no `env_config` defined, Snakemake tasks on Latch will NOT use containers or conda environments by default.

## Using Private Container Registries

70 changes: 9 additions & 61 deletions docs/source/snakemake/lifecycle.md
@@ -1,6 +1,6 @@
# Lifecycle of a Snakemake Execution on Latch

Snakemake support is currently based on JIT (Just-In-Time) registraton. This means that the workflow produced by `latch register` will only register a second workflow, which will run the actual pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided.
Snakemake support is currently based on JIT (Just-In-Time) registration. This means that the workflow produced by `latch register` will register a second workflow, which will run the actual pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided.

### JIT Workflow

@@ -13,75 +13,23 @@ The first ("JIT") workflow does the following:

Debugging:

* The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/entrypoint.py`
* Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/spec`
- The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/entrypoint.py`
- Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/spec`

### Runtime Workflow

The runtime workflow contains a task per each Snakemake job. This means that there will be a separate task per each wildcard instatiation of each rule. This can lead to workflows with hundreds of tasks. Note that the execution graph can be filtered by task status.
The runtime workflow will spawn a task for each Snakemake job. This means that there will be a separate task for each wildcard instantiation of each rule. This can lead to workflows with hundreds of tasks. Note that the execution graph can be filtered by task status.

Each task runs a modified Snakemake executable using a script from the Latch SDK which monkey-patches the appropriate parts of the Snakemake package. This executable is different in two ways:
When a task executes it will:

1. Rules that are not part of the task's target are entirely ignored

> **Review comment** (rahuldesai1, PR author): Removed a lot of the technical explanation here because I thought it was too detailed and not that helpful for users. Open to adding it back if people think there is value.

2. The target rule has all of its properties (currently inputs, outputs, benchmark, log, shellcode) replaced with the job-specific strings. This is the same as the value of these directives with all wildcards expanded and lazy values evaluated

Debugging:

* The Snakemake-compiled tasks are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/compiled_tasks`

#### Example

Snakefile rules:

```Snakemake
rule all:
    input:
        os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"),
        os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html")

rule fastqc:
    input: os.path.join(WORKDIR, "fastq", "{sample}.fastq")
    output: os.path.join(WORKDIR, "qc", "fastqc", "{sample}_fastqc.html")
    shellcmd: "fastqc {input} -o {output}"
```

Produced jobs:

1. Rule: `fastqc` Wildcards: `sample=read1`
1. Rule: `fastqc` Wildcards: `sample=read2`

Resulting single-job executable for job 1:

```py
# @workflow.rule(name='all', lineno=1, snakefile='/root/Snakefile')
# @workflow.input( # os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"),
# # os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html"),
# )
# @workflow.norun()
# @workflow.run
# def __rule_all(input, output, ...):
# pass

@workflow.rule(name='fastqc', lineno=6, snakefile='/root/Snakefile')
@workflow.input("work/fastq/read1.fastq" # os.path.join(WORKDIR, "fastq", "{sample}.fastq")
)
@workflow.shellcmd("fastqc work/fastq/read1.fastq -o work/qc/fastqc/read1_fastqc.html")
@workflow.run
def __rule_fastqc(input, output, ...):
shell("fastqc {input} -o {output}", ...)
```

Note:

* The "all" rule is entirely commented out
* The "fastqc" rule has no wildcards in its decorators
1. Download all input files that are explicitly defined in the rule
2. Execute the Snakemake task
3. Upload outputs/logs/benchmarks to Latch Data

### Limitations

1. The workflow will execute the first rule defined in the Snakefile (matching standard Snakemake behavior). There is no way to change the default rule other than by moving the desired rule up in the file (see the sketch after this list)
1. The workflow will output files that are not used by downstream tasks. This means that intermediate files cannot be included in the output. The only way to exclude an output is to write a rule that lists it as an input
1. Input files and directories are downloaded fully, even if they are not used to generate the dependency graph. This commonly leads to large directories being downloaded just to list the files contained within, significantly delaying the JIT workflow and requiring a large amount of disk space
1. Only the JIT workflow downloads input files. Rules only download their individual inputs, which can be a subset of the input files. If the Snakefile tries to read input files outside of rules it will usually fail at runtime
1. Rules only download their individual inputs, which can be a subset of the input files. If the Snakefile tries to read input files outside of rules it will usually fail at runtime
1. Large files that move between tasks will need to be uploaded by the outputting task and downloaded by each consuming task. This can take a large amount of time. Frequently it's possible to merge the producer and the consumer into one task to improve performance
1. Environment dependencies (Conda packages, Python packages, other software) must be well-specified. Missing dependencies will lead to JIT-time or runtime crashes
1. Config files are not supported and must be hard-coded into the workflow Docker image
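
For the first limitation, the only lever is rule order; a sketch with hypothetical rule and file names:

```python
# The first rule in the Snakefile is the default target on Latch, so the
# aggregate `all` rule must be defined before every other rule.
rule all:
    input:
        "results/report.html"

rule report:
    output:
        "results/report.html"
    shell:
        "echo done > {output}"
```
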
7 changes: 4 additions & 3 deletions docs/source/snakemake/metadata.md
@@ -7,17 +7,19 @@ To construct a graphical interface from a snakemake workflow, the file parameter
The `latch_metadata.py` file holds these parameter definitions, along with any styling or cosmetic modifications the developer wishes to make to each parameter.

To generate a `latch_metadata.py` file, type:

```console
latch generate-metadata <path_to_config.yaml>
```

The command automatically parses the existing `config.yaml` file in the Snakemake repository, and create a Python parameters file.
The command automatically parses the existing `config.yaml` file in the Snakemake repository, and creates a Python parameters file.

#### Examples

Below is an example `config.yaml` file from the [rna-seq-star-deseq2 workflow](https://github.com/snakemake-workflows/rna-seq-star-deseq2) in the Snakemake workflow catalog.

`config.yaml`

```yaml
# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
samples: config/samples.tsv
```

@@ -26,7 +28,6 @@ samples: config/samples.tsv

```yaml
# sample).
units: config/units.tsv

ref:
  # Ensembl species name
  species: homo_sapiens
```

@@ -84,14 +85,14 @@ diffexp:

```yaml
  # model: ~jointly_handled + treatment_1 + treatment_2
  model: ""

params:
  cutadapt-pe: ""
  cutadapt-se: ""
  star: ""
```

The `latch_metadata.py` file generated by the Latch command:

```python
from dataclasses import dataclass
import typing
...
```
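
The remainder of the generated file is collapsed in this diff. A hypothetical continuation, where the `SnakemakeParameter` fields and `latch` import paths are assumptions inferred from the config keys above rather than content of the diff:

```python
# Sketch of generated parameter definitions; the exact API is assumed.
from latch.types.file import LatchFile
from latch.types.metadata import SnakemakeMetadata, SnakemakeParameter

SnakemakeMetadata(
    display_name="rna-seq-star-deseq2",
    parameters={
        "samples": SnakemakeParameter(
            display_name="Samples",
            type=LatchFile,  # TSV sample sheet (columns: sample, condition, ...)
        ),
        "units": SnakemakeParameter(
            display_name="Units",
            type=LatchFile,  # TSV unit sheet (one row per sequencing unit)
        ),
    },
)
```
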
16 changes: 16 additions & 0 deletions docs/source/snakemake/overview.md
@@ -0,0 +1,16 @@
# Overview

## Motivation

Latch's Snakemake integration allows developers to build graphical interfaces to expose their Snakemake workflows to wet lab teams. It also provides managed cloud infrastructure for executing the workflow's jobs.

A primary design goal for the Snakemake integration is to allow developers to register existing projects with minimal added boilerplate and modifications to code. Here, we outline these changes and why they are needed.

## Snakemake Workflows on Latch

Recall that a Snakemake project consists of a `Snakefile`, which describes workflow rules in an ["extension"](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html) of Python, and associated Python code imported and called by these rules. To make this project compatible with Latch, we need to do the following:

1. Define [metadata and input file parameters](./metadata.md) for your workflow
2. Build a [container](./environments.md) with all runtime dependencies
3. Ensure your `Snakefile` is compatible with [cloud execution](./cloud.md)