Merge pull request #380 from latchbio/rahuldesai1/snakemake/update-docs
update snakemake docs
rahuldesai1 authored Jan 24, 2024
2 parents b6383d8 + 00b8331 commit a523c52
Showing 11 changed files with 277 additions and 237 deletions.
Binary file added docs/source/assets/snakemake/jit-task.png
3 changes: 2 additions & 1 deletion docs/source/index.md
@@ -157,11 +157,12 @@ cli/mv
:hidden:
:maxdepth: 2
:caption: Snakemake Integration
snakemake/overview.md
snakemake/tutorial.md
snakemake/metadata.md
snakemake/environments.md
snakemake/cloud.md
snakemake/lifecycle.md
snakemake/debugging.md
snakemake/troubleshooting.md
```
17 changes: 9 additions & 8 deletions docs/source/snakemake/cloud.md
@@ -6,10 +6,10 @@ When a Snakemake workflow is executed on Latch, each generated job is run in a s

Therefore, it may be necessary to adapt your Snakefile to address issues arising from this execution method, which were not encountered during local execution:

- Add missing rule inputs that are implicitly fulfilled when executing locally.
- Make sure shared code does not rely on input files. Shared code is any code that is not under a rule, so it is executed by every task.
- Add `resources` directives if tasks run out of memory or disk space.
- Optimize data transfer by merging tasks that have 1-to-1 dependencies.

Here, we will walk through examples of each of the cases outlined above.

@@ -23,8 +23,8 @@ A typical example is if the index files for biological data are not explicitly s

In the example below, there are two Snakefile rules:

- `delly_s`: This rule runs Delly to call SVs and outputs an unfiltered BCF file, then applies quality filtering with `bcftools filter` to retain only the SV calls that pass certain filters. Finally, it indexes the filtered BCF file.
- `delly_merge`: This rule merges or concatenates the BCF files containing SV calls from the `delly_s` rule, producing a single VCF file. The rule requires the index file to be available for each corresponding BAM file.
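For instance, here is a minimal sketch (rule name, paths, and the reference genome are hypothetical, not taken from the docs) of making an implicitly required index file an explicit input, so the cloud task that needs it actually downloads it:

```python
rule call_variants:
    input:
        bam="aligned/{sample}.bam",
        # Locally the index just happens to sit next to the BAM; on Latch it must
        # be declared as an input so it is staged into the task's working directory.
        bai="aligned/{sample}.bam.bai",
    output:
        "calls/{sample}.bcf"
    shell:
        "delly call -g ref.fa -o {output} {input.bam}"
```

The `delly` rules from the documentation are excerpted below.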

```python
rule delly_s: # single-sample analysis
@@ -353,6 +353,7 @@ rule kraken:
    ...
    resources:
        mem_mb=128000,
        cpus=8
    ...
```
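If tasks also run out of disk, the standard `disk_mb` resource can be requested in the same way. A hedged sketch (the rule, paths, and whether Latch maps `disk_mb` onto task storage are assumptions):

```python
rule sort_bam:
    input:
        "aligned/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    threads: 8
    resources:
        mem_mb=32000,
        disk_mb=200000
    shell:
        "samtools sort -@ {threads} -o {output} {input}"
```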

@@ -364,13 +365,13 @@ To optimize performance and minimize costs, it is recommended to consolidate the

#### Example

- Inefficient example with multiple rules processing the same BAM file:

```python
rule all:
    input:
        "results/final_variants.vcf"

rule mark_duplicates:
    input:
        "data/sample.bam"
6 changes: 3 additions & 3 deletions docs/source/snakemake/debugging.md
@@ -2,9 +2,9 @@

## Local Development

When debugging a Snakemake workflow, it's helpful to run the JIT step locally instead of re-registering your workflow every time you want to test a change. To address this, the Latch SDK supports local development for Snakemake workflows.
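A development session is typically started from the workflow directory; the exact invocation below is an assumption, not taken from the docs:

```console
latch develop .
```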

If you are unfamiliar with the `latch develop` command, please read about [Local Development](../basics/local_development.md) before continuing.

---

@@ -38,4 +38,4 @@ You can execute the script in your `latch develop` session like so:
$ python3 scripts/dry_run.py
```

**Note**: If you are using `conda`, your shell may activate the conda base environment by default. To ensure that you are running in the exact same environment as the JIT task, either run `conda deactivate` once you enter the shell or disable conda's environment auto-activation in your Dockerfile: `RUN conda config --set auto_activate_base false`
**Note**: If you are running into an `ImportError`, be sure to use the version of Python in which the Latch SDK was installed.
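A quick, generic way to confirm which interpreter the SDK is installed into:

```console
$ python3 -c "import sys, latch; print(sys.executable); print(latch.__file__)"
```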
24 changes: 18 additions & 6 deletions docs/source/snakemake/environments.md
@@ -1,8 +1,20 @@
# Environments

When registering a Snakemake workflow on Latch, we need to build a single container image containing all your runtime dependencies and the Latch packages. By default, all tasks (including the JIT step) will run inside this container.

To generate a Dockerfile with all the Latch-specific dependencies, run the following command from inside your workflow directory:

```console
latch dockerfile . --snakemake
```

Inspect the resulting Dockerfile and add any runtime dependencies required for your workflow.
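As a hedged illustration (the packages below are placeholders, and this assumes the generated base image provides `apt` and `pip`), runtime dependencies can be appended like so:

```Dockerfile
# Illustrative runtime dependencies; replace with whatever your rules actually call
RUN apt-get update && \
    apt-get install -y --no-install-recommends samtools bcftools && \
    rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir pysam==0.22.0
```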

## Configuring Task Environments

Sometimes, it is preferable to use isolated environments for each Snakemake rule using the `container` and `conda` [Snakemake directives](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#running-jobs-in-containers) instead of building one large image.

Typically, when using these directives, we must pass the `--use-conda` and `--use-singularity` flags to the `snakemake` command to configure which environment to activate. Similarly, to configure your environment on Latch, add the `env_config` field to your workflow's `SnakemakeMetadata` object. For example,

```
# latch_metadata.py
@@ -37,14 +49,14 @@ SnakemakeMetadata(
)
```

**Note**: If no `env_config` is defined, Snakemake tasks on Latch will NOT use containers or conda environments by default.

## Using Private Container Registries

When executing Snakemake workflows in containers, the container images may exist in a private registry that the Latch cloud cannot access. Downloading images from private registries at runtime requires two steps:

1. Upload your private container registry's password/access token to the Latch platform. See [Storing and using Secrets](../basics/adding_secrets.md).
2. Add the `docker_metadata` field to your workflow's `SnakemakeMetadata` object so the workflow engine knows where to pull your credentials. For example:

```
# latch_metadata.py
78 changes: 13 additions & 65 deletions docs/source/snakemake/lifecycle.md
@@ -1,89 +1,37 @@
# Lifecycle of a Snakemake Execution on Latch

Snakemake support is currently based on JIT (Just-In-Time) registration. This means that the workflow produced by `latch register` will register a second workflow, which will run the pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided.

### JIT Workflow

The first ("JIT") workflow does the following:

1. Download all input files
2. Import the Snakefile, calculate the dependency graph, and determine which jobs need to be run
3. Generate a Latch SDK workflow Python script for the second ("runtime") workflow and register it
4. Run the runtime workflow using the same inputs

Debugging:

- The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/entrypoint.py`
- Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/spec`
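For example, the generated entrypoint can be pulled down for inspection with `latch cp` (the workflow name here is illustrative):

```console
latch cp "latch:///.snakemake_latch/workflows/my_workflow/entrypoint.py" .
```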

### Runtime Workflow

The runtime workflow spawns a task for each Snakemake job, i.e. a separate task for each wildcard instantiation of each rule. This can lead to workflows with hundreds of tasks. Note that the execution graph can be filtered by task status.

Each task runs a modified Snakemake executable using a script from the Latch SDK which monkey-patches the appropriate parts of the Snakemake package. This executable is different in two ways:

1. Rules that are not part of the task's target are entirely ignored
2. The target rule has all of its properties (currently inputs, outputs, benchmark, log, shellcode) replaced with the job-specific strings. This is the same as the value of these directives with all wildcards expanded and lazy values evaluated

Debugging:

* The Snakemake-compiled tasks are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/compiled_tasks`

#### Example

Snakefile rules:

```Snakemake
rule all:
    input:
        os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"),
        os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html")

rule fastqc:
    input: os.path.join(WORKDIR, "fastq", "{sample}.fastq")
    output: os.path.join(WORKDIR, "qc", "fastqc", "{sample}_fastqc.html")
    shell: "fastqc {input} -o {output}"
```

Produced jobs:

1. Rule: `fastqc` Wildcards: `sample=read1`
1. Rule: `fastqc` Wildcards: `sample=read2`

Resulting single-job executable for job 1:

```py
# @workflow.rule(name='all', lineno=1, snakefile='/root/Snakefile')
# @workflow.input( # os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"),
# # os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html"),
# )
# @workflow.norun()
# @workflow.run
# def __rule_all(input, output, ...):
# pass

@workflow.rule(name='fastqc', lineno=6, snakefile='/root/Snakefile')
@workflow.input("work/fastq/read1.fastq" # os.path.join(WORKDIR, "fastq", "{sample}.fastq")
)
@workflow.shellcmd("fastqc work/fastq/read1.fastq -o work/qc/fastqc/read1_fastqc.html")
@workflow.run
def __rule_fastqc(input, output, ...):
shell("fastqc {input} -o {output}", ...)
```

Note:

* The "all" rule is entirely commented out
* The "fastqc" rule has no wildcards in its decorators

When a task executes, it will:

1. Download all input files that are defined in the rule
2. Execute the Snakemake task
3. Upload outputs/logs/benchmarks to Latch Data

### Limitations

1. The workflow will execute the first rule defined in the Snakefile (matching standard Snakemake behavior). There is no way to change the default rule other than by moving the desired rule up in the file
1. The workflow will output files that are not used by downstream tasks. This means that intermediate files cannot be included in the output. The only way to exclude an output is to write a rule that lists it as an input
1. Input files and directories are downloaded fully, even if they are not used to generate the dependency graph. This commonly leads to large directories being downloaded just to list the files they contain, which can delay the JIT workflow significantly and require a large amount of disk space
1. Rules only download their inputs, which can be a subset of the input files. If the Snakefile tries to read input files outside of the ones explicitly defined in the rule, it will usually fail at runtime
1. Large files that move between tasks need to be uploaded by the outputting task and downloaded by each consuming task. This can take a significant amount of time. Frequently, it's possible to merge the producer and the consumer into one task to improve performance
1. Environment dependencies (Conda packages, Python packages, other software) must be well-specified. Missing dependencies will lead to JIT-time or runtime crashes
1. Config files are not supported and must be hard-coded into the workflow Docker image
1. `conda` directives will frequently fail with timeouts/SSL errors because Conda does not react well to dozens of tasks trying to install Conda environments over a short period. It is recommended that all conda environments are included in the Docker image.
1. The JIT workflow hard-codes the latch paths for rule inputs, outputs, and other files. If these files are missing when the runtime workflow task runs, it will fail
7 changes: 4 additions & 3 deletions docs/source/snakemake/metadata.md
@@ -7,17 +7,19 @@ To construct a graphical interface from a snakemake workflow, the file parameter
The `latch_metadata.py` file holds these parameter definitions, along with any styling or cosmetic modifications the developer wishes to make to each parameter.

To generate a `latch_metadata.py` file, type:

```console
latch generate-metadata <path_to_config.yaml>
```

The command automatically parses the existing `config.yaml` file in the Snakemake repository, and creates a Python parameters file.

#### Examples

Below is an example `config.yaml` file from the [rna-seq-star-deseq2 workflow](https://github.com/snakemake-workflows/rna-seq-star-deseq2) from the Snakemake workflow catalog.

`config.yaml`

```yaml
# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
samples: config/samples.tsv
@@ -26,7 +28,6 @@ samples: config/samples.tsv
# sample).
units: config/units.tsv


ref:
# Ensembl species name
species: homo_sapiens
@@ -84,14 +85,14 @@ diffexp:
# model: ~jointly_handled + treatment_1 + treatment_2
model: ""


params:
cutadapt-pe: ""
cutadapt-se: ""
star: ""
```
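Assuming this config is saved at `config/config.yaml` (the path is illustrative), the metadata file below can be generated with:

```console
latch generate-metadata config/config.yaml
```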
The `latch_metadata.py` file generated by the Latch command:

```python
from dataclasses import dataclass
import typing
16 changes: 16 additions & 0 deletions docs/source/snakemake/overview.md
@@ -0,0 +1,16 @@
# Overview

## Motivation

Latch's Snakemake integration allows developers to build graphical interfaces to expose their Snakemake workflows to wet lab teams. It also provides managed cloud infrastructure for executing the workflow's jobs.

A primary design goal for the Snakemake integration is to allow developers to register existing projects with minimal added boilerplate and modifications to code. Here, we outline these changes and why they are needed.

## Snakemake Workflows on Latch

Recall that a Snakemake project consists of a `Snakefile`, which describes workflow rules in an ["extension"](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html) of Python, plus associated Python code imported and called by these rules. To make this project compatible with Latch, we need to do the following:

1. Define [metadata and input file parameters](./metadata.md) for your workflow
2. Build a [container](./environments.md) with all runtime dependencies
3. Ensure your `Snakefile` is compatible with [cloud execution](./cloud.md)
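Putting these steps together, a Latch-ready Snakemake project might look like the following sketch (only `Snakefile`, `latch_metadata.py`, and the generated Dockerfile come from the steps above; the remaining names are illustrative):

```
my-workflow/
├── Snakefile            # workflow rules, adapted for cloud execution (step 3)
├── latch_metadata.py    # parameter definitions and display metadata (step 1)
├── Dockerfile           # runtime dependencies for all tasks (step 2)
├── config/
│   └── config.yaml
└── scripts/
    └── helpers.py
```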
