…k-draft into add_big_examples
ctb committed Jun 5, 2023
2 parents 1d05812 + 18c51f1 commit adc9338
Showing 20 changed files with 761 additions and 7 deletions.
4 changes: 4 additions & 0 deletions code/examples/config.basic/config.multi_samples.yml
@@ -0,0 +1,4 @@
samples:
- DEF_789
- GHI_234
- JKL_567
1 change: 1 addition & 0 deletions code/examples/config.basic/config.one_sample.yml
@@ -0,0 +1 @@
sample: XYZ_123
1 change: 1 addition & 0 deletions code/examples/config.basic/config.one_sample_b.yml
@@ -0,0 +1 @@
sample: ABC_456
13 changes: 13 additions & 0 deletions code/examples/config.basic/snakefile.multi_samples
@@ -0,0 +1,13 @@
configfile: "config.multi_samples.yml"

SAMPLES=config['samples']

rule all:
    input:
        expand("one_sample.{s}.out", s=SAMPLES)

rule make_single_sample_wc:
    output:
        "one_sample.{s}.out"
    shell:
        "touch {output}"
23 changes: 23 additions & 0 deletions code/examples/config.basic/snakefile.multi_samples.pprint
@@ -0,0 +1,23 @@
import pprint

configfile: "config.multi_samples.yml"

# print out the config dictionary
print('config is:')
pprint.pprint(config)

SAMPLES=config['samples']

# print out the SAMPLES variable
print('SAMPLES is:')
pprint.pprint(SAMPLES)

rule all:
    input:
        expand("one_sample.{s}.out", s=SAMPLES)

rule make_single_sample_wc:
    output:
        "one_sample.{s}.out"
    shell:
        "touch {output}"
9 changes: 9 additions & 0 deletions code/examples/config.basic/snakefile.one_sample
@@ -0,0 +1,9 @@
configfile: "config.one_sample.yml"

SAMPLE=config['sample']

rule all:
    output:
        expand("one_sample.{s}.out", s=SAMPLE)
    shell:
        "touch {output}"
1 change: 1 addition & 0 deletions code/examples/errors.simple-fail/.gitignore
@@ -0,0 +1 @@
file-does-not-exist-typo
5 changes: 5 additions & 0 deletions code/examples/errors.simple-fail/snakefile.missing-input
@@ -0,0 +1,5 @@
# expect_fail

rule example:
    input:
        "file-does-not-exist"
8 changes: 8 additions & 0 deletions code/examples/errors.simple-fail/snakefile.missing-output
@@ -0,0 +1,8 @@
# expect_fail

rule example:
    output:
        "file-does-not-exist"
    shell: """
        touch file-does-not-exist-typo
    """
6 changes: 6 additions & 0 deletions code/examples/errors.simple-fail/snakefile.shell-fail
@@ -0,0 +1,6 @@
# expect_fail

rule hello_fail:
    shell: """
        ls file-does-not-exist
    """
6 changes: 6 additions & 0 deletions code/examples/errors.simple-fail/snakefile.wildcard-error
@@ -0,0 +1,6 @@
# expect_fail

rule example:
    input: "{name}.input"
    output: "{name}.output"
    shell: "cp {input} {output}"
4 changes: 3 additions & 1 deletion deploy.sh
@@ -1,12 +1,14 @@
#! /bin/bash
set -e
set -x

# create a temp directory & build book into it
tmpdir=$(mktemp -d /tmp/bookXXX)
mdbook build -d ${tmpdir}
echo "build directory is: ${tmpdir}"

# go to temp directory
cd ${tmpdir}/html/
cd ${tmpdir}

# indicate that GitHub should not interpret this as a Jekyll site, i.e.
# it's a static site.
3 changes: 3 additions & 0 deletions src/SUMMARY.md
@@ -23,10 +23,12 @@
- [Using wildcards to generalize your rules](./beginner+/wildcards.md)
- [`params:` blocks and `{params}`](./beginner+/params-blocks.md)
- [Using `expand` to generate filenames](./beginner+/expand.md)
- [Running rules and choosing targets from the command line](./beginner+/targets.md)
- [Techniques for debugging snakemake workflows](./beginner+/debugging.md)
- [Basic syntax rules for Snakefiles](./beginner+/syntax.md)
- [Visualizing your workflow](./beginner+/visualizing.md)
- [String formatting "minilanguage"](./beginner+/string-formatting.md)
- [Using configuration files](./beginner+/config.md)

- [Section 3b - complete examples](./section_3b.md)
- [Variant calling](./complete/variant.md)
@@ -42,6 +44,7 @@

- [Section 5 - Advanced Features](./section_5.md)
- [Beyond `-j` - parallelizing snakemake](./advanced/parallel.md)
- [Resource constraints and job management](./advanced/resources.md)

- [Section 6 - A Reference Guide for Snakemake Features](./section_6.md)
- [Wildcard constraints](reference/wildcard-constraints.md)
17 changes: 17 additions & 0 deletions src/advanced/resources.md
@@ -0,0 +1,17 @@
# Resources, constraints, and job management

## Points to make / outline

* impossibility of predicting exactly; a strategy
* how to measure with benchmarks, slurm, top (??); RSS as key thing to manage
* CPU utilization, context switching, overhead; threads, processes
* considerations for parallelism (perhaps also see [parallel](parallel.md)).

Standard resources: mem, disk, runtime, and tmpdir

Your own defined resources: other things.

## Examples

* set up various memory constrained jobs and run with various
different max memories; show overlap; make figure showing total memory used.
186 changes: 186 additions & 0 deletions src/beginner+/config.md
@@ -0,0 +1,186 @@
# Using configuration files

Configuration files are a snakemake feature that can be used to
separate the _rules_ in the workflow from the _configuration_ of the
workflow. For example, suppose that we want to run the same sequence
trimming workflow on many different samples. With the techniques we've
seen so far, you'd need to change the Snakefile each time; with config
files, you can keep the Snakefile the same, and just provide a different
config file for each new sample. Config files can also be used to
define parameters, or override default parameters, for specific programs
being run by your workflow.

## A first example - running a rule with a single sample ID

Consider this Snakefile, which creates an output file based on a
sample ID. Here the sample ID is taken from a config file and provided
via the Python dictionary named `config`:
```python
{{#include ../../code/examples/config.basic/snakefile.one_sample}}
```

The default configuration file is `config.one_sample.yml`, which
sets `config['sample']` to the value `XYZ_123`; running the workflow
with it creates `one_sample.XYZ_123.out`:
```yml
{{#include ../../code/examples/config.basic/config.one_sample.yml}}
```

However, the `configfile:` directive in the Snakefile can be overridden
on the command line by using `--configfile`; consider the file
`config.one_sample_b.yml`:
```yml
{{#include ../../code/examples/config.basic/config.one_sample_b.yml}}
```
If we now run
`snakemake -s snakefile.one_sample --configfile config.one_sample_b.yml -j 1`,
the value of `config['sample']` will be set to `ABC_456`, and the file
`one_sample.ABC_456.out` will be created.

(CTB: assert that the appropriate output files are created.)

## Specifying multiple sample IDs in a config file

The previous example only handles one sample at a time, but there's
no reason we couldn't provide multiple, using YAML lists. Consider
this Snakefile, `snakefile.multi_samples`:
```python
{{#include ../../code/examples/config.basic/snakefile.multi_samples}}
```

and this config file, `config.multi_samples.yml`:
```yml
{{#include ../../code/examples/config.basic/config.multi_samples.yml}}
```

Here, we're creating multiple output files, using a more complicated setup.

First, we use `samples` from the config file. The `config['samples']` value
is a Python list of strings, rather than a single Python string as in the
previous example; that's because `config.multi_samples.yml` specifies
`samples` as a YAML list.

Second, we switched to using [a wildcard rule](wildcards.md) in the
Snakefile, because we want to
[run one rule on many files](wildcards.md#running-one-rule-on-many-files);
this has a lot of benefits!

Last but not least, we provide a [default rule](../chapter_10.md) that
uses [the `expand` function with a single pattern and one list of values](expand.md#using-expand-with-a-single-pattern-and-one-list-of-values) to construct
the list of output files for the wildcard rule to make.
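
The three pieces above can be sketched in plain Python, outside of snakemake. This is only an illustration: the dictionary is written out by hand to match `config.multi_samples.yml`, and the list comprehension is a stand-in for `expand`, not snakemake's own implementation.

```python
# Plain-Python sketch; this is NOT snakemake itself.
# Hand-written copy of what `config` holds after loading
# config.multi_samples.yml.
config = {"samples": ["DEF_789", "GHI_234", "JKL_567"]}

# Same assignment as in snakefile.multi_samples.
SAMPLES = config["samples"]

# Stand-in for expand("one_sample.{s}.out", s=SAMPLES):
# fill in the pattern once per sample ID.
targets = ["one_sample.{s}.out".format(s=s) for s in SAMPLES]

for filename in targets:
    print(filename)
```

Each entry in `targets` matches the `one_sample.{s}.out` pattern of the wildcard rule, which is how the default rule ends up requesting one output file per sample.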

Now we can either edit the list of samples in the config file, or we can
provide different config files with different lists of samples!

## Specifying input spreadsheets via config file

## Specifying command line parameters in a config file

Config files aren't limited to sample IDs - you can put pretty much
anything in a config file.

Consider our `sourmash sketch` command from the workflow
we developed in [Section 1](../chapter_0.md), where we compare genomes
at a particular k-mer size. For example, from
[Chapter 2](../chapter_2.md), we have:

```python
rule sketch_genomes:
    output:
        "GCF_000017325.1.fna.gz.sig",
        "GCF_000020225.1.fna.gz.sig",
        "GCF_000021665.1.fna.gz.sig"
    shell: """
        sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first
    """
```

Here, `sketch dna` is run with the parameter `-p k=31`, which sets the
k-mer size for comparison to k=31. This is a prime candidate for a
config file!

Using [a params block](params-blocks.md) and a config file, we could rewrite this
rule as
```python
rule sketch_genomes:
    output:
        "GCF_000017325.1.fna.gz.sig",
        "GCF_000020225.1.fna.gz.sig",
        "GCF_000021665.1.fna.gz.sig",
    params:
        ksize=config['ksize'],
    shell: """
        sourmash sketch dna -p k={params.ksize} genomes/*.fna.gz --name-from-first
    """
```

This has a few nice features:

* the use of 'params' makes it clear to the reader that this is a parameter!
* the k-mer size is configurable!

CTB: check that it actually works with k=21!

CTB: talk about config.get and int/type validation
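
A minimal sketch of what `config.get` plus type validation might look like, in plain Python. The helper name `get_ksize` and the default of 31 (taken from the original `-p k=31` rule) are assumptions for illustration, and the `config` dictionary here is a hand-written stand-in for the one snakemake provides.

```python
# Hypothetical stand-in for snakemake's `config` dictionary; here the
# value is a string to show why validation can be useful.
config = {"ksize": "21"}


def get_ksize(config, default=31):
    """Return the k-mer size as a positive int, or `default` if unset.

    `get_ksize` is an illustrative helper, not part of snakemake.
    """
    # dict.get returns `default` when the key is missing.
    value = config.get("ksize", default)
    try:
        ksize = int(value)
    except (TypeError, ValueError):
        raise ValueError(f"ksize must be an integer, got {value!r}")
    if ksize <= 0:
        raise ValueError(f"ksize must be positive, got {ksize}")
    return ksize


KSIZE = get_ksize(config)
print(KSIZE)
```

In a Snakefile, a value computed this way could then be passed to the rule through the `params:` block in place of `config['ksize']`, so that a missing or malformed value fails early with a readable message.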

CTB: advanced usage: conditional parameters like output=pdf for compare.

note/danger, might want to have some info on parameters in output file names...

note/danger, talk about the tradeoff between information in the config file vs information in the Snakefile - e.g. what programs to run, vs what parameters to use

## Debugging config files and displaying the `config` dictionary

I frequently want to know what the config actually is when running
snakemake. A convenient way to do this is to use `pprint` -
for example, see `snakefile.multi_samples.pprint`,
```python
{{#include ../../code/examples/config.basic/snakefile.multi_samples.pprint}}
```
which produces the following output:
```
config is:
{'samples': ['DEF_789', 'GHI_234', 'JKL_567']}
SAMPLES is:
['DEF_789', 'GHI_234', 'JKL_567']
```

CTB: explain python dict/list, or link.

CTB: link to debugging

CTB: talk about -n, and Python statements vs rules...

print, pprint
keys

using .get/providing defaults

## Advanced usage

### Providing config variables on the command line

You can also set individual config variables on the command line:

```
snakemake -j 1 -s snakefile.one_sample -C sample=ZZZ_123
```

CTB: how to do this for lists; how to do this for multiple config variables.
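
Values set with `-C` take precedence over values from the config file. The plain-Python sketch below is a simplified model of that precedence for top-level keys only; the variable names `file_config` and `cli_overrides` are made up for illustration, and this is not how snakemake is implemented internally.

```python
# Simplified model of override precedence; NOT snakemake's own code.
# Value as loaded from config.one_sample.yml:
file_config = {"sample": "XYZ_123"}

# Value as given on the command line with `-C sample=ZZZ_123`:
cli_overrides = {"sample": "ZZZ_123"}

# Later entries win, so the command-line value replaces the file value.
config = {**file_config, **cli_overrides}

# The output file that snakefile.one_sample would then ask for.
output = "one_sample.{s}.out".format(s=config["sample"])
print(output)
```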

### Providing multiple config files

`--configfiles`

## Recap

With config files, you can:

* separate configuration from your workflow
* provide multiple different config files for the same workflow
* change the samples by editing a YAML file instead of a Snakefile
* make it easy to validate your input configuration (DISCUSS)

## Leftovers

* Point to official snakemake docs
* Guide to YAML and JSON syntax