Snakemake workflows used to assemble bacterial isolates.
These workflows were used to assemble five historical Bacillus anthracis isolates, soon to be published in Microbiology Resource Announcements.
The Bacillus anthracis assemblies have been deposited in DDBJ/ENA/GenBank under BioSample accession numbers SAMN12620928, SAMN12620929, SAMN12620930, SAMN12620931, and SAMN12620932. The raw Illumina paired-end sequencing reads have been deposited in the Sequence Read Archive under accession numbers SRR10019497, SRR10019498, SRR10019499, SRR10019500, and SRR10019501.
- Install Miniconda (the minimal Anaconda installer)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
- Download asm_tools
git clone https://github.com/bioforensics/asm_tools
OR
Download a Release
- Set up the Python environment and use conda to install the required packages (mash, fastp, etc.).
cd asm_tools/preprocess
conda env create -f preprocess_env.yml
conda activate bmap_preprocess
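After activating the environment, it can be worth confirming that the required tools are actually on the PATH before running the workflow. The snippet below is a minimal sketch: `check_tools` is a hypothetical helper (not part of asm_tools), and the tool names passed to it are only the two named above; the environment file may install others.

```shell
# Hypothetical helper: report any command-line tools missing from PATH.
# Returns non-zero if at least one of the named tools cannot be found.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (run inside the activated environment):
#   check_tools mash fastp && echo "environment looks ready"
```

If a tool is reported missing, re-check that `conda activate bmap_preprocess` succeeded in the current shell.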
- (Optional) Download databases for "mash screen" to check for contaminants.
Mash Sketch databases for RefSeq release 88:
- RefSeq88n.msh.gz: Genomes (k=21, s=1000), 1.2 GB uncompressed
- RefSeq88p.msh.gz: Proteomes (k=9, s=1000), 1.1 GB uncompressed
- Edit preprocess/config.yml to set the path to the mash database
mashdb: path/to/mashdb
- Run the read preprocessing workflow
path/to/asm_tools/preprocess/bmap_preprocess -r1 test/seq/test_R1.fastq.gz -r2 test/seq/test_R2.fastq.gz -s sample_name
singularity pull bmap_preprocess.sif library://dsommer/default/bmap/bmap_preprocess
singularity exec bmap_preprocess.sif -r1 test/seq/test_R1.fastq.gz -r2 test/seq/test_R2.fastq.gz -s test1
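To preprocess several isolates, the `bmap_preprocess` call shown above can be generated once per read pair. The sketch below assumes files follow the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming used by the test data; both function names are hypothetical helpers, not part of asm_tools. Commands are only printed, so nothing runs until the output is reviewed.

```shell
# Hypothetical helper: derive the sample name from an R1 file name,
# assuming the <sample>_R1.fastq.gz naming convention.
sample_name_from_r1() {
    basename "$1" _R1.fastq.gz
}

# Print one bmap_preprocess command per read pair found in a directory.
# This is a dry run: pipe the output to `sh` to execute the commands.
print_preprocess_commands() {
    reads_dir=$1
    for r1 in "$reads_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue    # skip the literal pattern when nothing matches
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz
        sample=$(sample_name_from_r1 "$r1")
        echo "path/to/asm_tools/preprocess/bmap_preprocess -r1 $r1 -r2 $r2 -s $sample"
    done
}

# Example usage:
#   print_preprocess_commands test/seq
```

Printing the commands first makes it easy to spot a mismatched R2 file or an unexpected sample name before any compute time is spent.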
The preprocessing.smk Snakemake workflow prepares Illumina reads to be assembled.
- Run fastp to trim adapter sequence and low-quality bases and to discard very short reads. By default, bases below Q20 at the ends of reads are trimmed, and reads shorter than 75 bases or containing Ns are removed.
- Run "mash screen" against RefSeq to check for contaminants.
- Estimate genome size by building a k-mer profile on the reads.
- Randomly downsample reads to 150× coverage of the estimated genome size using the sample-reads program.
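The arithmetic behind the last two steps can be sketched as follows. This is an illustration under stated assumptions, not the workflow's own code: genome size is approximated as total k-mers divided by the peak k-mer depth (a common rule of thumb), and the downsampling fraction is the target coverage times genome size divided by the total sequenced bases, capped at 1. Both function names are hypothetical, and the sample-reads program may compute these values differently.

```shell
# Approximate genome size from a k-mer profile (assumed rule of thumb):
#   genome_size ~= total_kmers / peak_kmer_depth
estimate_genome_size() {
    awk -v total="$1" -v peak="$2" 'BEGIN { printf "%d\n", total / peak }'
}

# Fraction of reads to keep to reach a target coverage, capped at 1:
#   fraction = (target_coverage * genome_size) / total_bases
downsample_fraction() {
    awk -v g="$1" -v b="$2" -v c="$3" 'BEGIN {
        f = (c * g) / b
        if (f > 1) f = 1    # fewer bases than the target: keep every read
        printf "%.4f\n", f
    }'
}

# Example usage:
#   size=$(estimate_genome_size 250000000 50)
#   downsample_fraction "$size" 1500000000 150
```

For a 5 Mb genome sequenced to 1.5 Gb of bases, the 150× target corresponds to keeping half of the reads; a sample sequenced below 150× is left untouched.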