Merge branch 'main' of https://github.com/ngs-docs/2023-snakemake-book-draft into add_big_examples
ngs-docs · Jun 5, 2023 · adc9338
2 parents 1d05812 + 18c51f1
commit adc9338
Showing 20 changed files with 761 additions and 7 deletions.
diff --git a/code/examples/config.basic/config.multi_samples.yml b/code/examples/config.basic/config.multi_samples.yml
@@ -0,0 +1,4 @@
+samples:
+- DEF_789
+- GHI_234
+- JKL_567
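To make the role of this sample list concrete: the multi-sample Snakefiles in this commit pass it to `expand("one_sample.{s}.out", s=SAMPLES)`. The following is a plain-Python sketch of what that call is expected to produce, not snakemake's own implementation. `expand_one` is a hypothetical stand-in written for illustration; it assumes a single wildcard and treats a bare string as one value.

```python
def expand_one(pattern, name, values):
    """Hypothetical stand-in for expand() with a single wildcard.

    Fills `pattern` once per value; a bare string counts as one value.
    """
    if isinstance(values, str):
        values = [values]
    return [pattern.format(**{name: v}) for v in values]


# The list that config.multi_samples.yml provides under 'samples'
SAMPLES = ["DEF_789", "GHI_234", "JKL_567"]

targets = expand_one("one_sample.{s}.out", "s", SAMPLES)
for t in targets:
    print(t)
```

Each resulting filename is then a target that the wildcard rule `make_single_sample_wc` can build.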
diff --git a/code/examples/config.basic/config.one_sample.yml b/code/examples/config.basic/config.one_sample.yml
@@ -0,0 +1 @@
+sample: XYZ_123
diff --git a/code/examples/config.basic/config.one_sample_b.yml b/code/examples/config.basic/config.one_sample_b.yml
@@ -0,0 +1 @@
+sample: ABC_456
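The two single-sample config files differ only in the value of `sample`. As a rough sketch of the override behavior described in `config.md`, the following models the default config and a `--configfile` override as plain dicts. `resolve_config` is a hypothetical helper, and the assumption that override values simply replace defaults key by key is a simplification; snakemake's actual merging may handle nested values differently.

```python
# The dicts stand in for the parsed YAML files.
default_config = {"sample": "XYZ_123"}   # config.one_sample.yml
override_config = {"sample": "ABC_456"}  # config.one_sample_b.yml


def resolve_config(default, override=None):
    """Hypothetical helper: return a new dict where override values win."""
    merged = dict(default)  # copy, so the default is left untouched
    if override:
        merged.update(override)
    return merged


# No override: the default sample is used
config = resolve_config(default_config)
print(f"one_sample.{config['sample']}.out")

# With an override: the sample from the second file replaces it
config = resolve_config(default_config, override_config)
print(f"one_sample.{config['sample']}.out")
```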
diff --git a/code/examples/config.basic/snakefile.multi_samples b/code/examples/config.basic/snakefile.multi_samples
@@ -0,0 +1,13 @@
+configfile: "config.multi_samples.yml"
+
+SAMPLES=config['samples']
+
+rule all:
+    input:
+        expand("one_sample.{s}.out", s=SAMPLES)
+
+rule make_single_sample_wc:
+    output:
+        "one_sample.{s}.out"
+    shell:
+        "touch {output}"
diff --git a/code/examples/config.basic/snakefile.multi_samples.pprint b/code/examples/config.basic/snakefile.multi_samples.pprint
@@ -0,0 +1,23 @@
+import pprint
+
+configfile: "config.multi_samples.yml"
+
+# print out the config dictionary
+print('config is:')
+pprint.pprint(config)
+
+SAMPLES=config['samples']
+
+# print out the SAMPLES variable
+print('SAMPLES is:')
+pprint.pprint(SAMPLES)
+
+rule all:
+    input:
+        expand("one_sample.{s}.out", s=SAMPLES)
+
+rule make_single_sample_wc:
+    output:
+        "one_sample.{s}.out"
+    shell:
+        "touch {output}"
diff --git a/code/examples/config.basic/snakefile.one_sample b/code/examples/config.basic/snakefile.one_sample
@@ -0,0 +1,9 @@
+configfile: "config.one_sample.yml"
+
+SAMPLE=config['sample']
+
+rule all:
+    output:
+        expand("one_sample.{s}.out", s=SAMPLE)
+    shell:
+        "touch {output}"
diff --git a/code/examples/errors.simple-fail/.gitignore b/code/examples/errors.simple-fail/.gitignore
@@ -0,0 +1 @@
+file-does-not-exist-typo
diff --git a/code/examples/errors.simple-fail/snakefile.missing-input b/code/examples/errors.simple-fail/snakefile.missing-input
@@ -0,0 +1,5 @@
+# expect_fail
+
+rule example:
+    input:
+        "file-does-not-exist"
diff --git a/code/examples/errors.simple-fail/snakefile.missing-output b/code/examples/errors.simple-fail/snakefile.missing-output
@@ -0,0 +1,8 @@
+# expect_fail
+
+rule example:
+    output:
+       "file-does-not-exist"
+    shell: """
+       touch file-does-not-exist-typo
+    """   
diff --git a/code/examples/errors.simple-fail/snakefile.shell-fail b/code/examples/errors.simple-fail/snakefile.shell-fail
@@ -0,0 +1,6 @@
+# expect_fail
+
+rule hello_fail:
+    shell: """
+        ls file-does-not-exist
+    """
diff --git a/code/examples/errors.simple-fail/snakefile.wildcard-error b/code/examples/errors.simple-fail/snakefile.wildcard-error
@@ -0,0 +1,6 @@
+# expect_fail
+
+rule example:
+    input: "{name}.input"
+    output: "{name}.output"
+    shell: "cp {input} {output}"
diff --git a/deploy.sh b/deploy.sh
@@ -1,12 +1,14 @@
 #! /bin/bash
+set -e
+set -x
 
 # create a temp directory & build book into it
 tmpdir=$(mktemp -d /tmp/bookXXX)
 mdbook build -d ${tmpdir}
 echo "build directory is: ${tmpdir}"
 
 # go to temp directory
-cd ${tmpdir}/html/
+cd ${tmpdir}
 
 # indicate that GitHub should not interpret this as a Jekyll site, i.e.
 # it's a static site.

diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -23,10 +23,12 @@
   - [Using wildcards to generalize your rules](./beginner+/wildcards.md)
   - [`params:` blocks and `{params}`](./beginner+/params-blocks.md)
   - [Using `expand` to generate filenames](./beginner+/expand.md))
+  - [Running rules and choosing targets from the command line](./beginner+/targets.md)
   - [Techniques for debugging snakemake workflows](./beginner+/debugging.md)
   - [Basic syntax rules for Snakefiles](./beginner+/syntax.md)
   - [Visualizing your workflow](./beginner+/visualizing.md)
   - [String formatting "minilanguage"](./beginner+/string-formatting.md)
+  - [Using configuration files](./beginner+/config.md)
 
 - [Section 3b - complete examples](./section_3b.md)
   - [Variant calling](./complete/variant.md)
@@ -42,6 +44,7 @@
 
 - [Section 5 - Advanced Features](./section_5.md)
   - [Beyond `-j` - parallelizing snakemake](./advanced/parallel.md)
+  - [Resource constraints and job management](./advanced/resources.md)
 
 - [Section 6 - A Reference Guide for Snakemake Features](./section_6.md)
   - [Wildcard constraints](reference/wildcard-constraints.md)

diff --git a/src/advanced/resources.md b/src/advanced/resources.md
@@ -0,0 +1,17 @@
+# Resources, constraints, and job management
+
+## Points to make / outline
+
+* impossibility of predicting exactly; a strategy
+* how to measure with benchmarks, slurm, top (??); RSS as key thing to manage
+* CPU utilization, context switching, overhead; threads, processes
+* considerations for parallelism (perhaps also see [parallel](parallel.md)).
+
+Standard resources: mem, disk, runtime, and tmpdir
+
+Your own defined resources: other things.
+
+## Examples
+
+* set up various memory constrained jobs and run with various
+  different max memories; show overlap; make figure showing total memory used.
diff --git a/src/beginner+/config.md b/src/beginner+/config.md
@@ -0,0 +1,186 @@
+# Using configuration files
+
+Configuration files are a snakemake feature that can be used to
+separate the _rules_ in the workflow from the _configuration_ of the
+workflow.  For example, suppose that we want to run the same sequence
+trimming workflow on many different samples. With the techniques we've
+seen so far, you'd need to change the Snakefile each time; with config
+files, you can keep the Snakefile the same, and just provide a different
+config file for each new sample. Config files can also be used to
+define parameters, or override default parameters, for specific programs
+being run by your workflow.
+
+## A first example - running a rule with a single sample ID
+
+Consider this Snakefile, which creates an output file based on a
+sample ID. Here the sample ID is taken from a config file and provided
+via the Python dictionary named `config`:
+```python
+{{#include ../../code/examples/config.basic/snakefile.one_sample}}
+```
+
+The default configuration file is `config.one_sample.yml`, which
+sets `config['sample']` to the value `XYZ_123`, so the workflow creates
+`one_sample.XYZ_123.out`:
+```yml
+{{#include ../../code/examples/config.basic/config.one_sample.yml}}
+```
+
+However, the `configfile:` directive in the Snakefile can be overridden
+on the command line by using `--configfile`; consider the file
+`config.one_sample_b.yml`:
+```yml
+{{#include ../../code/examples/config.basic/config.one_sample_b.yml}}
+```
+If we now run `snakemake -s snakefile.one_sample --configfile
+config.one_sample_b.yml -j 1`, `config['sample']` will be set to
+`ABC_456`, and the file `one_sample.ABC_456.out` will be created.
+
+(CTB: assert that the appropriate output files are created.)
+
+## Specifying multiple sample IDs in a config file
+
+The previous example only handles one sample at a time, but there's
+no reason we couldn't provide multiple, using YAML lists. Consider
+this Snakefile, `snakefile.multi_samples`:
+```python
+{{#include ../../code/examples/config.basic/snakefile.multi_samples}}
+```
+
+and this config file, `config.multi_samples.yml`:
+```yml
+{{#include ../../code/examples/config.basic/config.multi_samples.yml}}
+```
+
+Here, we're creating multiple output files, using a more complicated setup.
+
+First, we use `samples` from the config file. The `config['samples']` value
+is a Python list of strings, rather than a single Python string as in the
+previous example; that's because `config.multi_samples.yml` specifies
+`samples` as a YAML list.
+
+Second, we switched to using [a wildcard rule](wildcards.md) in the
+Snakefile, because we want to
+[run one rule on many files](wildcards.md#running-one-rule-on-many-files);
+this has a lot of benefits!
+
+Last but not least, we provide a [default rule](../chapter_10.md) that
+uses [the `expand` function with a single pattern and one list of values](expand.md#using-expand-with-a-single-pattern-and-one-list-of-values) to construct
+the list of output files for the wildcard rule to make.
+
+Now we can either edit the list of samples in the config file, or we can
+provide different config files with different lists of samples!
+
+## Specifying input spreadsheets via config file
+
+## Specifying command line parameters in a config file
+
+Config files aren't limited to sample IDs - you can put pretty much
+anything in a config file.
+
+Consider our `sourmash sketch` command from the workflow
+we developed in [Section 1](../chapter_0.md), where we compare genomes
+at a particular k-mer size. For example, from
+[Chapter 2](../chapter_2.md), we have:
+
+```python
+rule sketch_genomes:
+    output:
+       "GCF_000017325.1.fna.gz.sig",
+       "GCF_000020225.1.fna.gz.sig",
+       "GCF_000021665.1.fna.gz.sig"
+    shell: """
+        sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first
+    """
+```
+
+Here, `sketch dna` is run with the parameter `-p k=31`, which sets the
+k-mer size for comparison to k=31. This is a prime candidate for a
+config file!
+
+Using [a params block](params-blocks.md) and a config file, we could rewrite this
+rule as:
+```python
+rule sketch_genomes:
+    output:
+       "GCF_000017325.1.fna.gz.sig",
+       "GCF_000020225.1.fna.gz.sig",
+       "GCF_000021665.1.fna.gz.sig",
+    params:
+        ksize=config['ksize'],
+    shell: """
+        sourmash sketch dna -p k={params.ksize} genomes/*.fna.gz --name-from-first
+    """
+```
+
+This has a few nice features:
+
+* the use of `params:` makes it clear to the reader that this is a parameter!
+* the k-mer size is configurable!
+
+CTB: check that it actually works with k=21!
+
+CTB: talk about config.get and int/type validation
+
+CTB: advanced usage: conditional parameters like output=pdf for compare.
+
+note/danger, might want to have some info on parameters in output file names...
+
+note/danger, talk about the tradeoff between information in the config file vs information in the Snakefile - e.g. what programs to run, vs what parameters to use
+
+## Debugging config files and displaying the `config` dictionary
+
+I frequently want to know what the config actually is when running
+snakemake. A convenient way to do this is to use `pprint` -
+for example, see `snakefile.multi_samples.pprint`,
+```python
+{{#include ../../code/examples/config.basic/snakefile.multi_samples.pprint}}
+```
+which produces the following output:
+```
+config is:
+{'samples': ['DEF_789', 'GHI_234', 'JKL_567']}
+SAMPLES is:
+['DEF_789', 'GHI_234', 'JKL_567']
+```
+
+CTB: explain python dict/list, or link.
+
+CTB: link to debugging
+
+CTB: talk about -n, and Python statements vs rules...
+
+print, pprint
+keys
+
+using .get/providing defaults
+
+## Advanced usage
+
+### Providing config variables on the command line
+
+You can also set individual config variables on the command line:
+
+```
+snakemake -j 1 -s snakefile.one_sample -C sample=ZZZ_123
+```
+
+CTB: how to do this for lists; how to do this for multiple config variables.
+
+### Providing multiple config files
+
+`--configfiles`
+
+## Recap
+
+With config files, you can:
+
+* separate configuration from your workflow
+* provide multiple different config files for the same workflow
+* change the samples by editing a YAML file instead of a Snakefile
+* make it easy to validate your input configuration (DISCUSS)
+
+## Leftovers
+
+* Point to official snakemake docs
+* Guide to YAML and JSON syntax
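`config.md` lists `config.get` and int/type validation as topics still to cover. One possible shape for that, as a sketch only: the helper name `get_ksize` is hypothetical, the default of 31 is taken from the `-p k=31` example in the chapter, and the choice to accept integer-like strings is an assumption made for illustration.

```python
def get_ksize(config, default=31):
    """Hypothetical helper: read 'ksize' from a config dict and validate it.

    Falls back to `default` when the key is missing, and raises
    ValueError for anything that is not a positive whole number.
    """
    value = config.get("ksize", default)
    # Reject fractional floats rather than silently truncating them
    if isinstance(value, float) and not value.is_integer():
        raise ValueError(f"ksize must be a whole number, got {value!r}")
    try:
        ksize = int(value)
    except (TypeError, ValueError):
        raise ValueError(f"ksize must be an integer, got {value!r}")
    if ksize <= 0:
        raise ValueError(f"ksize must be positive, got {ksize}")
    return ksize


print(get_ksize({}))               # key missing: falls back to the default
print(get_ksize({"ksize": "21"}))  # integer-like string is converted
```

In a Snakefile, a helper like this could be called once near the top, so that a bad value fails early instead of inside the `sourmash sketch` shell command.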