update snakemake docs #380

Merged Jan 24, 2024 (5 commits)
Binary file added docs/source/assets/snakemake/jit-task.png
3 changes: 2 additions & 1 deletion docs/source/index.md
@@ -157,11 +157,12 @@ cli/mv
:hidden:
:maxdepth: 2
:caption: Snakemake Integration
snakemake/overview.md
snakemake/tutorial.md
snakemake/lifecycle.md
snakemake/metadata.md
snakemake/environments.md
snakemake/cloud.md
snakemake/lifecycle.md
snakemake/debugging.md
snakemake/troubleshooting.md
```
17 changes: 9 additions & 8 deletions docs/source/snakemake/cloud.md
@@ -6,10 +6,10 @@ When a Snakemake workflow is executed on Latch, each generated job is run in a s

Therefore, it may be necessary to adapt your Snakefile to address issues arising from this execution method, which were not encountered during local execution:

* Add missing rule inputs that are implicitly fulfilled when executing locally.
* Make sure shared code does not rely on input files. This is any code that is not under a rule, and so gets executed by every task
* Add `resources` directives if tasks run out of memory or disk space
* Optimize data transfer by merging tasks that have 1-to-1 dependencies
- Add missing rule inputs that are implicitly fulfilled when executing locally.
- Make sure shared code does not rely on input files; this is any code that is not under a rule, and so gets executed by every task (see the sketch below).
- Add `resources` directives if tasks run out of memory or disk space.
- Optimize data transfer by merging tasks that have 1-to-1 dependencies.

Here, we will walk through examples of each of the cases outlined above.
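
To illustrate the shared-code point from the list above, one possible mitigation is to guard top-level file reads behind an existence check, so code that runs in every task does not crash when the file has not been downloaded. A minimal sketch, assuming a hypothetical `config/samples.tsv`:

```python
# Sketch: code outside any rule runs in every task, so guard reads of
# files that may not be present in a given task's container.
import os

SAMPLES = []
if os.path.exists("config/samples.tsv"):  # hypothetical path; present at JIT time
    with open("config/samples.tsv") as f:
        SAMPLES = [line.split("\t")[0] for line in f if line.strip()]
```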

@@ -23,8 +23,8 @@ A typical example is if the index files for biological data are not explicitly s

In the example below, there are two Snakefile rules:

* `delly_s`: The rule runs Delly to call SVs and outputs an unfiltered BCF file, followed by quality filtering using `bcftools` filter to retain only the SV calls that pass certain filters. Finally, it indexes the BCF file.
* `delly_merge`: This rule merges or concatenates BCF files containing SV calls from the delly_s rule, producing a single VCF file. The rule requires the index file to be available for each corresponding BAM file.
- `delly_s`: The rule runs Delly to call SVs and outputs an unfiltered BCF file, followed by quality filtering using `bcftools filter` to retain only the SV calls that pass certain filters. Finally, it indexes the BCF file.
- `delly_merge`: This rule merges or concatenates BCF files containing SV calls from the `delly_s` rule, producing a single VCF file. The rule requires the index file to be available for each corresponding BAM file.

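A minimal sketch of the fix for `delly_merge`, making the previously implicit index files explicit inputs (the paths, the `SAMPLES` list, and the `.csi` index naming are illustrative assumptions, not taken from the diff):

```python
# Sketch: declare each BCF index as an explicit input so every Latch task
# downloads it alongside the corresponding BCF (all paths are hypothetical).
SAMPLES = ["sample1", "sample2"]

rule delly_merge:
    input:
        bcf=expand("calls/{sample}.bcf", sample=SAMPLES),
        csi=expand("calls/{sample}.bcf.csi", sample=SAMPLES),  # previously implicit
    output:
        "calls/merged.vcf"
    shell:
        "bcftools concat {input.bcf} -O v -o {output}"
```

The original rules are excerpted below.
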
```python
rule delly_s: # single-sample analysis
    ...
```

@@ -353,6 +353,7 @@ rule kraken:

```python
rule kraken:
    ...
    resources:
        mem_mb=128000,
        cpus=8
    ...
```

@@ -364,13 +365,13 @@ To optimize performance and minimize costs, it is recommended to consolidate the

#### Example

* Inefficient example with multiple rules processing the same BAM file:
- Inefficient example with multiple rules processing the same BAM file:

```python
rule all:
    input:
        "results/final_variants.vcf"

rule mark_duplicates:
    input:
        "data/sample.bam"
    ...
```
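
Consolidating such 1-to-1 steps into a single rule keeps the intermediate BAM from being uploaded and downloaded between tasks. A hypothetical merged version (the tool invocations and the `ref.fa` path are illustrative, not taken from the diff):

```python
# Sketch: run duplicate marking and variant calling in one rule so the
# intermediate BAM never crosses a task boundary on Latch.
rule mark_duplicates_and_call:
    input:
        bam="data/sample.bam",
        ref="ref.fa",  # hypothetical reference path
    output:
        "results/final_variants.vcf"
    shell:
        "picard MarkDuplicates I={input.bam} O=marked.bam M=dup_metrics.txt && "
        "bcftools mpileup -f {input.ref} marked.bam | bcftools call -mv -o {output}"
```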
6 changes: 4 additions & 2 deletions docs/source/snakemake/debugging.md
@@ -35,7 +35,9 @@ your_workflow_name_jit_register_task(
You can execute the script in your `latch develop` session like so:

```console
$ python3 scripts/dry_run.py
$ /usr/local/bin/python3 scripts/dry_run.py
```

**Note**: If you are using `conda`, your shell may activate the conda base environment by default. To ensure that you are running in the exact same envioronment as the JIT task, either run `conda deactivate` once you enter the shell or disable conda's environment auto activation in your Dockerfile: `RUN conda config --set auto_activate_base false`
**Note**: If you are running into an `ImportError`, be sure to use the version of Python in which the Latch SDK was installed.

**Note**: If you are using `conda`, your shell may activate the conda base environment by default. To ensure that you are running in the exact same environment as the JIT task, either run `conda deactivate` once you enter the shell or disable conda's environment auto activation in your Dockerfile: `RUN conda config --set auto_activate_base false`
18 changes: 15 additions & 3 deletions docs/source/snakemake/environments.md
@@ -1,8 +1,20 @@
# Environments

## Configuring Conda and Container Environments
When registering a Snakemake workflow on Latch, we need to build a single container image that contains all your runtime dependencies as well as the Latch packages. By default, all tasks (including the JIT step) will run inside this container.

Latch's Snakemake integration supports the use of both the `conda` and `container` directives in your Snakefile. To configure which environment to run tasks in (which is typically done through the use of `--use-conda` and `--use-singularity`), add the `env_config` field to your workflow's `SnakemakeMetadata` object. For example,
To generate a Dockerfile with all the Latch-specific dependencies, run the following command from inside your workflow directory:

```console
latch dockerfile . --snakemake
```

Be sure to inspect the resulting Dockerfile and add any runtime dependencies that are required for your workflow.

## Configuring Task Environments

Sometimes it is preferable to use isolated environments for each Snakemake rule via the `container` and `conda` [Snakemake directives](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#running-jobs-in-containers) instead of building one large image.

Typically, when using these directives, we must pass the `--use-conda` and `--use-singularity` flags to the `snakemake` command in order to configure which environment to activate. Similarly, to configure your environment on Latch, add the `env_config` field to your workflow's `SnakemakeMetadata` object. For example,

```
# latch_metadata.py
...
```

@@ -37,7 +49,7 @@ SnakemakeMetadata(

```
    ...
)
```
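
For reference, a hypothetical expansion of the collapsed block above; the `EnvironmentConfig` name and its `use_conda`/`use_container` flags are assumptions and should be checked against the Latch SDK:

```python
# Sketch only: EnvironmentConfig and its fields are assumed, not confirmed.
from latch.types.metadata import EnvironmentConfig, SnakemakeMetadata

SnakemakeMetadata(
    display_name="my_workflow",  # hypothetical workflow name
    env_config=EnvironmentConfig(
        use_conda=True,       # analogous to passing --use-conda
        use_container=True,   # analogous to passing --use-singularity
    ),
)
```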

If there is no `env_config` defined, Snakemake tasks on Latch will NOT use containers or conda environments by default.
**Note**: If there is no `env_config` defined, Snakemake tasks on Latch will NOT use containers or conda environments by default.

## Using Private Container Registries

70 changes: 9 additions & 61 deletions docs/source/snakemake/lifecycle.md
@@ -1,6 +1,6 @@
# Lifecycle of a Snakemake Execution on Latch

Snakemake support is currently based on JIT (Just-In-Time) registraton. This means that the workflow produced by `latch register` will only register a second workflow, which will run the actual pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided.
Snakemake support is currently based on JIT (Just-In-Time) registration. This means that the workflow produced by `latch register` will register a second workflow, which will run the actual pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided.

### JIT Workflow

@@ -13,75 +13,23 @@ The first ("JIT") workflow does the following:

Debugging:

* The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/entrypoint.py`
* Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/spec`
- The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/entrypoint.py`
- Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/spec`

### Runtime Workflow

The runtime workflow contains a task per each Snakemake job. This means that there will be a separate task per each wildcard instatiation of each rule. This can lead to workflows with hundreds of tasks. Note that the execution graph can be filtered by task status.
The runtime workflow will spawn a task for each Snakemake job. This means that there will be a separate task for each wildcard instantiation of each rule. This can lead to workflows with hundreds of tasks. Note that the execution graph can be filtered by task status.

Each task runs a modified Snakemake executable using a script from the Latch SDK which monkey-patches the appropriate parts of the Snakemake package. This executable is different in two ways:
When a task executes it will:

1. Rules that are not part of the task's target are entirely ignored

> **Review comment** (rahuldesai1, PR author): Removed a lot of the technical explanation here because I thought it was too detailed and not that helpful for users. Open to adding it back if people think there is value.

2. The target rule has all of its properties (currently inputs, outputs, benchmark, log, shellcode) replaced with the job-specific strings. This is the same as the value of these directives with all wildcards expanded and lazy values evaluated

Debugging:

* The Snakemake-compiled tasks are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/compiled_tasks`

#### Example

Snakefile rules:

```Snakemake
rule all:
    input:
        os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"),
        os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html")

rule fastqc:
    input: os.path.join(WORKDIR, "fastq", "{sample}.fastq")
    output: os.path.join(WORKDIR, "qc", "fastqc", "{sample}_fastqc.html")
    shellcmd: "fastqc {input} -o {output}"
```

Produced jobs:

1. Rule: `fastqc` Wildcards: `sample=read1`
1. Rule: `fastqc` Wildcards: `sample=read2`

Resulting single-job executable for job 1:

```py
# @workflow.rule(name='all', lineno=1, snakefile='/root/Snakefile')
# @workflow.input( # os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"),
# # os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html"),
# )
# @workflow.norun()
# @workflow.run
# def __rule_all(input, output, ...):
# pass

@workflow.rule(name='fastqc', lineno=6, snakefile='/root/Snakefile')
@workflow.input("work/fastq/read1.fastq" # os.path.join(WORKDIR, "fastq", "{sample}.fastq")
)
@workflow.shellcmd("fastqc work/fastq/read1.fastq -o work/qc/fastqc/read1_fastqc.html")
@workflow.run
def __rule_fastqc(input, output, ...):
shell("fastqc {input} -o {output}", ...)
```

Note:

* The "all" rule is entirely commented out
* The "fastqc" rule has no wildcards in its decorators
1. Download all input files that are explicitly defined in the rule
2. Execute the Snakemake task
3. Upload outputs/logs/benchmarks to Latch Data

### Limitations

1. The workflow will execute the first rule defined in the Snakefile (matching standard Snakemake behavior). There is no way to change the default rule other than by moving the desired rule up in the file (see the sketch after this list)
1. The workflow will output files that are not used by downstream tasks. This means that intermediate files cannot be included in the output. The only way to exclude an output is to write a rule that lists it as an input
1. Input files and directories are downloaded fully, even if they are not used to generate the dependency graph. This commonly leads to large directories being downloaded just to list the files contained within, significantly delaying the JIT workflow and requiring a large amount of disk space
1. Only the JIT workflow downloads input files. Rules only download their individual inputs, which can be a subset of the input files. If the Snakefile tries to read input files outside of rules it will usually fail at runtime
1. Rules only download their individual inputs, which can be a subset of the input files. If the Snakefile tries to read input files outside of rules it will usually fail at runtime
1. Large files that move between tasks will need to be uploaded by the outputting task and downloaded by each consuming task. This can take a large amount of time. Frequently it's possible to merge the producer and the consumer into one task to improve performance
1. Environment dependencies (Conda packages, Python packages, other software) must be well-specified. Missing dependencies will lead to JIT-time or runtime crashes
1. Config files are not supported and must be hard-coded into the workflow Docker image
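
For the first limitation, the only lever is rule order; a sketch with hypothetical rule and file names:

```python
# The first rule in the Snakefile is the default target on Latch, so the
# aggregate `all` rule must be defined before every other rule.
rule all:
    input:
        "results/report.html"

rule report:
    output:
        "results/report.html"
    shell:
        "echo done > {output}"
```
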
7 changes: 4 additions & 3 deletions docs/source/snakemake/metadata.md
@@ -7,17 +7,19 @@ To construct a graphical interface from a snakemake workflow, the file parameter
The `latch_metadata.py` file holds these parameter definitions, along with any styling or cosmetic modifications the developer wishes to make to each parameter.

To generate a `latch_metadata.py` file, type:

```console
latch generate-metadata <path_to_config.yaml>
```

The command automatically parses the existing `config.yaml` file in the Snakemake repository, and create a Python parameters file.
The command automatically parses the existing `config.yaml` file in the Snakemake repository, and creates a Python parameters file.

#### Examples

Below is an example `config.yaml` file from the [rna-seq-star-deseq2 workflow](https://github.com/snakemake-workflows/rna-seq-star-deseq2) in the Snakemake workflow catalog.

`config.yaml`

```yaml
# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
samples: config/samples.tsv
```

@@ -26,7 +28,6 @@ samples: config/samples.tsv

```yaml
# sample).
units: config/units.tsv

ref:
  # Ensembl species name
  species: homo_sapiens
```

@@ -84,14 +85,14 @@ diffexp:

```yaml
  # model: ~jointly_handled + treatment_1 + treatment_2
  model: ""

params:
  cutadapt-pe: ""
  cutadapt-se: ""
  star: ""
```

The `latch_metadata.py` file generated by the Latch command:

```python
from dataclasses import dataclass
import typing
...
```
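
The remainder of the generated file is collapsed in this diff. A hypothetical continuation, where the `SnakemakeParameter` fields and `latch` import paths are assumptions inferred from the config keys above rather than content of the diff:

```python
# Sketch of generated parameter definitions; the exact API is assumed.
from latch.types.file import LatchFile
from latch.types.metadata import SnakemakeMetadata, SnakemakeParameter

SnakemakeMetadata(
    display_name="rna-seq-star-deseq2",
    parameters={
        "samples": SnakemakeParameter(
            display_name="Samples",
            type=LatchFile,  # TSV sample sheet (columns: sample, condition, ...)
        ),
        "units": SnakemakeParameter(
            display_name="Units",
            type=LatchFile,  # TSV unit sheet (one row per sequencing unit)
        ),
    },
)
```
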
16 changes: 16 additions & 0 deletions docs/source/snakemake/overview.md
@@ -0,0 +1,16 @@
# Overview

## Motivation

Latch's Snakemake integration allows developers to build graphical interfaces to expose their Snakemake workflows to wet lab teams. It also provides managed cloud infrastructure for executing the workflow's jobs.

A primary design goal for the Snakemake integration is to allow developers to register existing projects with minimal added boilerplate and modifications to code. Here, we outline these changes and why they are needed.

## Snakemake Workflows on Latch

Recall that a Snakemake project consists of a `Snakefile`, which describes workflow rules in an ["extension"](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html) of Python, and associated Python code imported and called by these rules. To make this project compatible with Latch, we need to do the following:

1. Define [metadata and input file parameters](./metadata.md) for your workflow
2. Build a [container](./environments.md) with all runtime dependencies
3. Ensure your `Snakefile` is compatible with [cloud execution](./cloud.md)