L A Liggett edited this page May 16, 2019 · 19 revisions

Introduction

FERMI is used to identify mutations that occur at extremely low frequencies. FERMI contains a set of tools to analyze unique molecular identifier (UMI) tagged, amplicon-captured genomic DNA sequence data. Tools are included both for rapid identification of variants within amplicon sequencing data and for further analysis of patterns and trends within the identified variant pool.

Updating

FERMI will continue to be updated. If you installed FERMI using the Manual Install Method, update it by repeating all of the installation steps.

Updating If Installed From GitHub

1. cd FERMI
2. git pull

This is usually enough to update FERMI, but occasionally I may add a new dependency through Anaconda. If you encounter errors after updating, reinstall FERMI using the same method you used for the original installation.

Usage

Most of this information can be accessed by running:

./fermi -h

FERMI must be run with a few required input flags. Below is an example of the minimum required input.

fermi -i /inputDirectory -o /outputDirectory -b 'freebayes' -y '/referenceGenome.fa'

1. The input directory should contain unzipped paired-end fastq files.
2. The output directory can be any directory that can be written to with your given permissions.
3. The -b flag specifies the command used to run the variant caller freebayes. Unless you want to use a different freebayes build or another variant caller, pass 'freebayes' to use the copy installed automatically during the install process.
4. No reference genome is included with this pipeline by default. All testing was done with hg19 downloaded from the UCSC Genome Browser, but other reference genomes should work just fine.
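
For scripted runs, the minimum invocation above can be assembled and sanity-checked before launching. The following is a minimal sketch, not part of FERMI: the helper names `build_fermi_command` and `check_inputs` are hypothetical, and the `.fastq` extension check is an assumption about how the input files are named.

```python
import os


def build_fermi_command(indir, outdir, reference, freebayes="freebayes", fermi="fermi"):
    """Return the argument list for a minimal FERMI run (-i, -o, -b, -y)."""
    return [fermi, "-i", indir, "-o", outdir, "-b", freebayes, "-y", reference]


def check_inputs(indir, outdir, reference):
    """Return a list of problems found; an empty list means the paths look usable.

    Hypothetical pre-flight check: FERMI itself may validate differently.
    """
    problems = []
    if not os.path.isdir(indir):
        problems.append(f"input directory not found: {indir}")
    elif not any(name.endswith(".fastq") for name in os.listdir(indir)):
        # Assumption: unzipped reads carry a .fastq extension.
        problems.append(f"no unzipped .fastq files found in: {indir}")
    if os.path.isdir(outdir) and not os.access(outdir, os.W_OK):
        problems.append(f"output directory is not writable: {outdir}")
    if not os.path.isfile(reference):
        problems.append(f"reference genome not found: {reference}")
    return problems
```

If `check_inputs` returns an empty list, the command can be passed to `subprocess.run(build_fermi_command(...), check=True)`.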

Common options

  -h, --help            show this help message and exit
  --nfo NFO, -n NFO     Info writeup about a particular run that will be
                        output in the run directory.
  --largefiles, -l      Outputs all fastq files generated during
                        analysis.
  --avoidalign, -a      Only runs through initial analysis of input fastq
                        files, and does not align to reference or call
                        variants.
  --outdir OUTDIR, -o OUTDIR
                        Specifies output directory where all analysis files
                        will be written.
  --indir INDIR, -i INDIR
                        Specifies the input directory that contains the fastq
                        files to be analyzed.
  --single, -s          Only process a single set of paired end reads.
  --prevdict PREVDICT, -p PREVDICT
                        Specify a previously output pickle file containing
                        collapsed fastq data as an input instead of raw fastq
                        files.
  --umimismatch UMIMISMATCH, -u UMIMISMATCH
                        Specify the number of mismatches allowed in a UMI pair
                        to still consider as the same UMI.
  --varthresh VARTHRESH, -v VARTHRESH
                        Specify the percentage of reads that must contain a
                        particular base for that base to be used in the final
                        consensus read.
  --readsupport READSUPPORT, -r READSUPPORT
                        Specifies the number of reads that must have a given
                        UMI sequence in order to be binned as a true capture
                        event, and not be thrown out.
  --clustersubmit, -c   Submit run to cluster computing rather than running
                        locally.
  --filterao FILTERAO, -f FILTERAO
                        Specifies the AO cutoff for reported variants, where
                        -f 5 would eliminate all variants that are seen 5
                        times or less. Default == 5.
  --dpfilter DPFILTER, -d DPFILTER
                        Read depth elimination threshold. If specified as -d
                        500 only variants found in a locus read greater than
                        500 times will be reported. Default == 500.
  --freebayes FREEBAYES, -b FREEBAYES
                        Location of freebayes in the format of /dir/freebayes
  --errorrate, -e       Overall pcr amplification + sequencing error rates
                        will be estimated and returned.
  --readLength READLENGTH, -q READLENGTH
                        Manually set the read length. If this is not set,
                        length will be automatically set as the number of
                        bases found between the two UMI sequences.
  --badBaseSubstitute, -x
                        This flag will trigger replacement of bad bases with N
                        instead of invalidating an entire capture.
  --reference REFERENCE, -y REFERENCE
                        Set the location of the human reference genome hg19.fa
                        and supporting files.
  --duplexcollapse, -w  This will run duplex collapsing instead of the
                        original collapsing that treats two complementary
                        strands as different captures.
  --minimaloutput, -z   This will suppress the output of most files, and only
                        include the final vcf files and some of the info
                        files.
  --getsamplesautomatically, -g
                        This will try and automatically grab and sort all
                        files in a specified input directory so they dont need
                        to be manually specified. Samples should fit the
                        pattern x1.fastq (r1) and x2.fastq (r2) where x can be
                        any string.
  --realvsmock, -j      This flag will trigger elimination of potential errors
                        found by duplex sequencing. If duplex collapsing is
                        flagged without this flag, mock duplex sequencing will
                        be performed in order to compare with the effects of
                        eliminating potential errors.
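
To make the interaction of -u, -r, -v, -x, -f and -d more concrete, here is a rough Python sketch of how such thresholds could be applied. It is an illustration only and is not taken from the FERMI source: all function names are hypothetical, UMIs are binned greedily by Hamming distance, the consensus threshold is expressed as a fraction rather than a percentage, and read support is treated as "at least this many reads". The AO and DP comparisons follow the help text above (strictly greater than the cutoff, with defaults of 5 and 500).

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def bin_by_umi(reads, umi_mismatch):
    """Group (umi, sequence) pairs into bins.

    A read joins the first existing bin whose UMI is within umi_mismatch
    mismatches (greedy and order-dependent; an assumption of this sketch).
    """
    bins = defaultdict(list)
    for umi, seq in reads:
        key = next((k for k in bins if hamming(k, umi) <= umi_mismatch), umi)
        bins[key].append(seq)
    return dict(bins)


def consensus(seqs, var_thresh, bad_base_substitute=False):
    """Build a consensus read, one column at a time.

    A base is used only if its fraction in the column reaches var_thresh.
    Otherwise the position becomes N (like -x) or the capture is dropped (None).
    """
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        if count / len(column) >= var_thresh:
            out.append(base)
        elif bad_base_substitute:
            out.append("N")
        else:
            return None
    return "".join(out)


def collapse(reads, umi_mismatch, read_support, var_thresh, bad_base_substitute=False):
    """Collapse reads into one consensus per sufficiently supported UMI bin."""
    result = {}
    for umi, seqs in bin_by_umi(reads, umi_mismatch).items():
        if len(seqs) < read_support:
            continue  # too few reads to count as a true capture event
        cons = consensus(seqs, var_thresh, bad_base_substitute)
        if cons is not None:
            result[umi] = cons
    return result


def passes_filters(ao, dp, filter_ao=5, dp_filter=500):
    """Keep a variant only if AO and DP both exceed their cutoffs."""
    return ao > filter_ao and dp > dp_filter
```

For example, with `umi_mismatch=1` a read tagged `AAAT` is binned together with reads tagged `AAAA`, while with `umi_mismatch=0` it forms its own bin and is likely to be discarded by the read support threshold.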