<link href="style.css" media="screen" rel="stylesheet" type="text/css" />
<h1 id="seastar-user-guide-version-050">SEAStAR User Guide, version 0.5.0</h1>
<h4 id="vaughn-iverson-and-chris-berthiaume">Vaughn Iverson and Chris Berthiaume</h4>
<h5 id="armbrust-lab-school-of-oceanography-university-of-washington"><a href="http://armbrustlab.ocean.washington.edu">Armbrust Lab</a>, School of Oceanography, University of Washington</h5>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#quick_start">Quick Start</a></li>
<li><a href="#command_reference">Command Reference</a></li>
<li><a href="#file_formats">File Formats</a></li>
<li><a href="#misc_scripts">Miscellaneous Scripts</a></li>
</ul>
<h2 id="a-nameintroductionintroductiona"><a name="introduction">Introduction</a></h2>
<p>The SEAStAR tools are a collection of programs that implement many data- and processing-intensive steps in next-generation sequencing analysis pipelines. The tools quantitatively analyze alignments to reference sequence databases (including de novo assembled contigs) to produce a wide variety of statistics and relationships, including binned, assembled genomic scaffolds. At one time, SEAStAR was an acronym for <strong>S</strong>elect and <strong>E</strong>stimate <strong>A</strong>bundance from <strong>S</strong>hor<strong>t</strong> <strong>A</strong>ligned <strong>R</strong>eads. However, as described above, the tools and their capabilities have expanded well beyond that description, and so now SEAStAR is simply our name for the in-house components of our analysis pipeline.</p>
<p><a href="SEAStAR_Pipeline_Stages.png"><img src="SEAStAR_Pipeline_Stages.png" alt="SEAStAR Pipeline" title="SEAStAR Pipeline"></a></p>
<p>Our philosophy has been to create stable, modular, high-performance tools that fill important gaps we encountered between established pre-existing tools in our pipeline. As such, you will see references to other packages throughout this document. We recognize that considerable innovation is ongoing in analysis software for next-generation sequence data, and so we have maintained a modular, pipelined approach to permit experimentation with, and substitution of, new tools as they become available. Due to the wide variety of computational environments that undoubtedly exist in different labs and institutes, our focus is on releasing stable, documented, portable and high-performance <em>components</em> from our analysis pipeline, but not the pipeline scripts themselves, because those scripts must be highly customized for both the goals and the computational environment of any specific project.</p>
<p>Initially our goal was to release tools to enable others to work directly with datasets generated by the Life Technologies SOLiD™ sequencing platform. We recognized that most of the components we have developed are also valuable for use with Illumina® sequence data, or with a mixture of sequence data types, and so in this release we have generalized most components to support both colorspace and nucleotide reads.</p>
<h3 id="technical-guidance">Technical guidance</h3>
<p>The next-generation sequence datasets that the SEAStAR tools are designed to work with are typically very large (>> one billion reads). All compiled components of the SEAStAR tools are multithreaded and optimized for modern multi-core processors. Many SEAStAR tools will take advantage of 8 or more cores, particularly when compressed (gzip format) input and output files are being processed. When running on such a system, you will find that these components may run considerably more than 10x faster than equivalent tools written in a scripting language such as Python. For this reason, the operation of these tools will often be "I/O bound"; that is, the performance bottleneck is the speed of the disk(s) and/or network connections attached to the workstation, not the availability of CPU cycles. Using compressed files, and separate independent local disks for input and output files, can often relieve some of this bottleneck and further improve run times.</p>
<p>Several of the SEAStAR tools (as well as many other tools used by our pipeline) have very high memory requirements when working with typical next-generation sequence datasets. We note these cases in the discussion of each individual SEAStAR tool, and attempt to provide guidance about the impact different parameters will have on memory use. However, in general, doing this kind of work will require a workstation with at least 32 gigabytes of RAM, and depending on the dataset and the types of analysis your project requires, you may require some multiple of that amount (64, 128, 256, or even 512 gigabytes) to successfully implement an analysis pipeline that meets your needs. When running these tools, it is highly advisable to set a session limit on the memory a process can attempt to allocate, to prevent machines from crashing and impacting other users who may be logged in. For example, adding the Unix shell command:</p>
<pre><code>ulimit -v 64000000</code></pre><p>to any script that invokes high memory tools will effectively prevent a 64GB machine from being unintentionally brought down (i.e. thrashing the swap file) by an application that attempts to allocate more memory than is physically available.</p>
<h3 id="license-and-citation">License and citation</h3>
<p>The SEAStAR tools are open source and are currently publicly distributed under the <a href="http://gplv3.fsf.org/">GPL version 3 license</a>, a copy of which is provided in a file named <code>COPYING</code> in the project root directory. By using this software, you are agreeing to be bound by the terms of this license. Please contact the authors if you are interested in discussing an alternative licensing arrangement.</p>
<p>We are in the process of preparing several publications specific to the SEAStAR analysis pipeline and the custom tools it contains. In the interim, we ask that you cite the following paper (which was the first to use and describe these tools):</p>
<blockquote>
<p>Iverson, V., Morris, R. M., Frazar, C. D., Berthiaume, C. T., Morales, R. L .and Armbrust, E. V., Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota, <a href="http://www.sciencemag.org/content/335/6068/587.abstract"><em>Science</em> <strong>335</strong> pp.587-590</a> (2012)</p>
</blockquote>
<h2 id="a-namequick_startquick-starta"><a name="quick_start">Quick Start</a></h2>
<p>This section is a quick introduction to using SEAStAR tools for some common use cases. The SEAStAR tools are "command-line" tools that run on UNIX-like operating systems. If you are not familiar with how to use the UNIX command-line shell, you should work your way through a tutorial before proceeding. Here are a few possibilities:</p>
<ul>
<li><a href="http://www.ee.surrey.ac.uk/Teaching/Unix/">UNIX Tutorial for Beginners</a></li>
<li><a href="http://linuxcommand.org/learning_the_shell.php">Learning the Linux Shell</a></li>
<li><a href="http://www.matisse.net/osx/intro_unix/0_outline.html">Introduction to the OS X Unix Command Line</a></li>
</ul>
<p>This tutorial assumes that you have already installed the SEAStAR software. If you are working from the <a href="http://armbrustlab.ocean.washington.edu/node/305">SEAStAR Quick Start virtual machine appliance</a> file in VirtualBox, your environment is already set-up and ready to go, so you can skip to the next section: <a href="#read_prep">Read Preparation</a>.</p>
<p>For SEAStAR installation instructions see the <code>README</code> file in the SEAStAR root directory. Although the SEAStAR tools can be quite memory- and resource-intensive, with the data used in this tutorial you should be able to comfortably complete all of the steps below on a modern laptop with a dual-core processor and 2GB of RAM.</p>
<p>The data used in these examples are actual raw SOLiD sequence reads corresponding to the <a href="http://www.ncbi.nlm.nih.gov/nuccore/NC_001416.1">Lambda phage</a> genome, and can be found in the <code>test_data</code> subdirectory of the SEAStAR source directory.</p>
<p>Create a new working directory and copy the following files from the <code>test_data</code> directory into that directory.</p>
<pre><code># Substitute the actual path to your SEAStAR "source directory" for
# [SEASTAR_src] in the copy commands below:
mkdir Quick_start
cd Quick_start
cp [SEASTAR_src]/test_data/lambda_reads_F3.csfasta.gz .
cp [SEASTAR_src]/test_data/lambda_reads_F3_QV.qual.gz .
cp [SEASTAR_src]/test_data/lambda_reads_R3.csfasta.gz .
cp [SEASTAR_src]/test_data/lambda_reads_R3_QV.qual.gz .</code></pre><p>And make sure that your <code>PATH</code> environment variable points to the executable files in your SEAStAR binaries directory:</p>
<pre><code># Where [SEASTAR_dest] below is the fully qualified path to your
# SEAStAR build destination directory.
export PATH=$PATH:[SEASTAR_dest]/bin</code></pre><p>If the <code>PATH</code> above is working, you should be able to run <code>ref_select --version</code> and get back something like: <code>ref_select version: v0.4.x</code>. If this doesn't work, you will need to figure out where SEAStAR is installed before proceeding so that the instructions that follow in this tutorial can find the SEAStAR tools on your computer.</p>
<p><strong>NOTE:</strong> To work your way through all of the examples in this tutorial, you will need the following external tools (in addition to those that were required to build SEAStAR):</p>
<ul>
<li><a href="http://www.ebi.ac.uk/~zerbino/velvet/">Velvet</a> -- You need the <strong>colorspace</strong> version. Build it with <code>make color</code>; see the Velvet <code>README</code> file for instructions.</li>
<li><a href="http://sourceforge.net/projects/bio-bwa/files/">BWA</a> -- <strong>Version 0.5.10</strong>. <em>Important: newer versions do not support the SOLiD colorspace reads used by this tutorial.</em></li>
<li><a href="http://www.graphviz.org/Download.php">GraphViz</a></li>
</ul>
<p>Please ensure that the directories where these tools reside are also in your <code>PATH</code> environment variable before proceeding. You can test this by typing <code>velveth_de</code>, <code>bwa</code>, and <code>dot -V</code> at your command line. Each of these commands should write its version string (and perhaps other help messages) in response to being run.</p>
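<p>As a quick sanity check, a small shell loop (a sketch, assuming the tool names used in this tutorial) will report which of the required programs are visible on your <code>PATH</code>:</p>

```shell
# Report which of the required external tools are on the PATH.
# Tool names are those used in this tutorial; adjust if yours differ.
for tool in velveth_de velvetg_de bwa dot; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING (add its directory to your PATH)"
  fi
done
```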
<h3 id="a-nameread_prepread-preparationa"><a name="read_prep">Read preparation</a></h3>
<p>Convert SOLiD <code>.csfasta</code> and <code>.qual</code> files to gzipped <code>.fastq</code> files. For real projects, users with no colorspace data will omit this step in their analysis pipeline, although the FASTQ file naming conventions still need to be followed for nucleotide (e.g. Illumina) reads; see the <a href="#FASTQ">FASTQ</a> appendix for details.</p>
<pre><code>solid2fastq -z lambda_reads lambda_conv</code></pre><p>Remove presumed PCR duplicate reads, identified by mate-pairs seen more than once.</p>
<pre><code>fastq_nodup -z -l 13 -d 2 -e 3 lambda_conv lambda_dedup</code></pre><p>Trim and filter the de-duplicated FASTQ reads based on read quality, information content, and length.</p>
<pre><code>trimfastq -z --mates_file -p 0.9 -l 34 -m 34 --add_len -e 3.0 lambda_dedup lambda_trim</code></pre><p>Randomly sample reads from <code>.fastq</code> files, retaining approximately 10% of the original reads. Randomly subsampled read sets are useful for quickly tuning the parameters of <code>trimfastq</code> for a given set of reads against some reference. We won't do that here, but tuning the <code>trimfastq -p</code> setting is important for getting clean assemblies and maximizing the useful information in your sequence data.</p>
<pre><code>samplefastq -z -f 0.1 lambda_trim lambda_samp</code></pre><h3 id="de-novo-contig-assembly"><em>De novo</em> contig assembly</h3>
<p>If you have installed the colorspace aware build of the <em>de novo</em> assembly tool <a href="http://www.ebi.ac.uk/~zerbino/velvet/">Velvet</a>, it can be used to assemble lambda-phage colorspace contigs:</p>
<pre><code>velveth_de lambda_asm/ 19 -fastq.gz -shortPaired lambda_trim.mates.fastq.gz -short lambda_trim.single.fastq.gz > lambda_asm.velveth_de.log 2>&1
velvetg_de lambda_asm/ -scaffolding no -read_trkg no -ins_length auto -ins_length_sd auto -exp_cov 50 -cov_cutoff 5 -min_contig_lgth 50 > lambda_asm.velvetg_de.log 2>&1</code></pre><p>There will now be a subdirectory called <code>lambda_asm</code> with a file called <code>contigs.fa</code> in it. These are the colorspace contigs we will use in the next section.</p>
<h3 id="aligning-reads-to-a-reference">Aligning reads to a reference</h3>
<p>For this example, you need the short read aligner <a href="http://sourceforge.net/projects/bio-bwa/files/">BWA</a> (version &lt;= 0.5.10). It can be used to align the de-duplicated and trimmed lambda FASTQ reads against the lambda-phage colorspace contigs. IMPORTANT: This quickstart example will not work with BWA versions 0.6.0 or newer, because colorspace support was removed from BWA at that point. For your own work with nucleotide reads, you are free to use any alignment software that produces standard SAM files.</p>
<pre><code># Because these are colorspace contigs and BWA expects nucleotide reference sequences,
# we need to convert the colorspace contigs to nucleotides using a single starting
# nucleotide (this will be wrong 3/4 of the time, but it doesn't matter because
# BWA immediately converts the reference back to colorspace internally...)
csfasta2ntfasta.awk lambda_asm/contigs.fa > lambda_contigs.fna
bwa index -a is -c lambda_contigs.fna
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.read1.fastq.gz > lambda_trim.read1.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.read1.sai lambda_trim.read1.fastq.gz 2>lambda_trim.read1.samse.log > lambda_trim.read1.sam
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.read2.fastq.gz > lambda_trim.read2.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.read2.sai lambda_trim.read2.fastq.gz 2> lambda_trim.read2.samse.log > lambda_trim.read2.sam
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.single.fastq.gz > lambda_trim.single.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.single.sai lambda_trim.single.fastq.gz 2>lambda_trim.single.samse.log > lambda_trim.single.sam</code></pre><p>The resulting alignment files are in SAM format, which is what the next operation requires.</p>
<h3 id="constructing-an-assembly-graph">Constructing an assembly graph</h3>
<p>Now we will use the <code>ref_select</code> tool to convert the colorspace contig sequences to nucleotides and output the matepair assembly graph JSON file.</p>
<pre><code>ref_select -q -m --mp_mate_cnt=10 -r lambda_trim.read1.fastq.gz -r lambda_trim.read2.fastq.gz -r lambda_trim.single.fastq.gz lambda_trim.read1.sam lambda_trim.read2.sam lambda_trim.single.sam > lambda_asm.json</code></pre><h3 id="producing-scaffolded-sequence">Producing scaffolded sequence</h3>
<pre><code># The following chain of graph_ops commands accomplish the following steps:
# MST - Convert the assembly graph into a directed spanning tree
# PLUCK - Remove short (stub) branches from the tree, leaving a mostly linear structure
# PRUNE - Cleave any remaining branches from the tree, leaving linear backbones
# PUSH - Put back contigs removed from the ends of the scaffolds by PLUCK
# INSERT - Put back small contigs that fit between the backbone contigs
# SCAFF - Re-linearize the scaffolds based on constraints added by INSERT
# FASTA - Join all neighboring contigs together (looking for overlaps) and write FASTA to STDOUT
graph_ops lambda_asm.json MST PLUCK PRUNE PUSH INSERT SCAFF FASTA '{"scaff":true}' > lambda_scaffs.fna</code></pre><h3 id="visualizing-the-sequence-graph-and-using-script-files-with-graph_ops">Visualizing the sequence graph, and using <code>SCRIPT</code> files with <code>graph_ops</code>.</h3>
<p>If you rerun <code>graph_ops</code> using the <code>DOT</code> command at each stage of the pipeline, you can then visualize the assembly graph as it progresses through each stage of processing using <a href="http://www.graphviz.org/">Graphviz</a>.</p>
<p>First, create a text file called <code>lambda_viz.go</code> with the following contents:</p>
<pre><code># Sample script file for SEAStAR graph_ops tool
DOT {"file":"lambda_asm#.dot"}
MST
DOT {"file":"lambda_asm#.dot"}
PLUCK
DOT {"file":"lambda_asm#.dot"}
PRUNE
DOT {"file":"lambda_asm#.dot"}
PUSH
DOT {"file":"lambda_asm#.dot"}
INSERT
DOT {"file":"lambda_asm#.dot"}
SCAFF
DOT {"file":"lambda_asm#.dot"}
FASTA {"scaff":true,"file":"lambda_scaffs.fna"}</code></pre><p>Now run the following command:</p>
<pre><code>graph_ops lambda_asm.json lambda_viz.go</code></pre><p>This will produce the same scaffolded assembly as the previous example, but in addition, it will write a <code>DOT</code> (GraphViz) format file for each step in the scaffolding pipeline (with an incrementing number at the end of the filename: <code>lambda_asm0.dot</code>, <code>lambda_asm1.dot</code>, etc.). The next command converts all of these DOT files to corresponding PDF figures using the <code>neato</code> layout engine in Graphviz.</p>
<pre><code># After this command, load the output PDF files in a viewer to see the scaffold layout
# process, step by step. Circles are contigs with area proportional to sequence length
# and color by %GC. Black arrows are mate-pair connections with thickness indicating
# bitscore. Red arrows are added dependencies to produce a fully ordered layout for SCAFF.
for i in lambda_asm?.dot; do neato -Tpdf $i > ${i%.*}.pdf; done</code></pre><p>You should now be able to load the PDF files with your favorite PDF viewer in order from <code>lambda_asm0.pdf</code> through <code>lambda_asm6.pdf</code> to see the effect of each <code>graph_ops</code> command in the script above. Each graph shows the state of the assembly before and after each successive step, illustrating how each command transformed the graph for this example dataset.</p>
<h3 id="estimating-16s-sequence-abundance">Estimating 16S sequence abundance</h3>
<p>The tools in SEAStAR can be used to create a pipeline for estimating sequence abundance in a short-read data set. This is particularly useful for assessing community composition in a metagenomic sample through alignments to a 16S database. See vignettes/RDP/RDP_vignette.html for a walk-through of an implementation of this pipeline using the <a href="http://rdp.cme.msu.edu/">RDP</a> 16S database.</p>
<hr>
<h1 id="a-namecommand_referencecommand-referencea"><a name="command_reference">Command Reference</a></h1>
<p>Tools described in this user guide:</p>
<ul>
<li><strong><a href="#solid2fastq"><code>solid2fastq</code></a></strong> -- Conversion of <em>SOLiD specific</em> CSFASTA/QUAL or FASTQ files to SEAStAR style <a href="#FASTQ">FASTQ</a> format files</li>
<li><strong><a href="#fastq_nodup"><code>fastq_nodup</code></a></strong> -- Reference-free PCR deduplication of paired reads</li>
<li><strong><a href="#samplefastq"><code>samplefastq</code></a></strong> -- Subsamples reads randomly</li>
<li><strong><a href="#trimfastq"><code>trimfastq</code></a></strong> -- Tunable, pairing aware read-trimmer</li>
<li><strong><a href="#ref_select"><code>ref_select</code></a></strong> -- Converts SAM reference alignments into <a href="#JSON">JSON</a> "sequence graph" files for assembly and abundance analyses</li>
<li><strong><a href="#graph_ops"><code>graph_ops</code></a></strong> -- A toolkit for operating on the <code>ref_select</code> "sequence graph" files to produce assemblies and data tables</li>
<li><strong><a href="#seq_scaffold"><code>seq_scaffold</code></a></strong> -- Converts pre-ordered and oriented contigs into scaffolded sequence, filling gaps where possible</li>
<li><strong><a href="#tetracalc"><code>tetracalc</code></a></strong> -- Clusters sequence scaffolds into "species-level" taxonomic bins for final assembly</li>
<li><strong><a href="#misc_scripts">misc_scripts</a></strong> -- Miscellaneous scripts</li>
</ul>
<hr>
<h2 id="a-namesolid2fastqsolid2fastqa"><a name="solid2fastq"><code>solid2fastq</code></a></h2>
<p>This tool converts SOLiD colorspace <code>.csfasta</code> and <code>.qual</code> input files (raw reads transferred off the instrument) to output files that are an interoperable variant of the FASTQ file format (see SEAStAR format details in section <a href="#FASTQ">FASTQ</a>). It can also convert nucleotide FASTQ files produced by runs using Exact Call Chemistry to SEAStAR-style FASTQ files. This conversion is necessary to ensure that read pairs remain synchronized in <code>.read1.fastq</code> and <code>.read2.fastq</code>.</p>
<h4 id="usage---solid2fastq-options-in_prefix-out_prefix">Usage: <code>solid2fastq [options] &lt;in_prefix&gt; &lt;out_prefix&gt;</code></h4>
<p>Where <code>&lt;in_prefix&gt;</code> denotes the input prefix which is the part of the input filename shared in common between all <code>.csfasta</code> and <code>.qual</code> or <code>.fastq</code> files generated by the same sequence library. It is typically shown in the <code>Title:</code> comment line near the top of <code>.csfasta</code> format input files. For example:</p>
<p><code># Title: GG050409_20090604_matepair_50</code></p>
<p>is found near the top of input files:</p>
<p><code>GG050409_20090604_matepair_50_R3.csfasta</code>, <code>GG050409_20090604_matepair_50_R3_QV.qual</code>, <code>GG050409_20090604_matepair_50_F3.csfasta</code> and <code>GG050409_20090604_matepair_50_F3_QV.qual</code>.</p>
<p>So use of <code>GG050409_20090604_matepair_50</code> for <code>&lt;in_prefix&gt;</code> will specify these four files for input to the <code>solid2fastq</code> tool.</p>
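<p>Equivalently, in shell terms, the shared prefix is just an input filename with its SOLiD suffix stripped. For example (a sketch using the filenames above):</p>

```shell
# Recover the shared input prefix by stripping the SOLiD suffix
f=GG050409_20090604_matepair_50_R3.csfasta
prefix=${f%_R3.csfasta}
echo "$prefix"
# → GG050409_20090604_matepair_50
```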
<p>Note that <code>solid2fastq</code> relies on the SOLiD naming conventions for the suffixes of these files (e.g. <code>_R3.csfasta</code>). You are free to rename these files with different prefixes, but the suffixes must remain the same. You may, however, compress these files using gzip and add the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>solid2fastq</code> to decompress the files as it reads them.</p>
<p>The <code>&lt;out_prefix&gt;</code> parameter is similar to the <code>&lt;in_prefix&gt;</code> as it is used to name the output <code>.fastq</code> file(s) generated by this tool. The output prefix may be any text, except you should generally avoid whitespace and punctuation characters other than "<code>_</code>". This is because, by default, the names of all output reads are also prefixed with this string and the variant of the FASTQ format used by the SEAStAR tools depends on certain punctuation characters to quickly parse these read names (see SEAStAR <a href="#FASTQ">FASTQ</a> format). This file naming constraint may be relaxed through the use of optional parameters described below.</p>
<p>Note that <code>&lt;in_prefix&gt;</code> and/or <code>&lt;out_prefix&gt;</code> may contain directory path information, but in the case of <code>&lt;out_prefix&gt;</code> this will necessitate the use of the <code>--prefix</code> or <code>--no_prefix</code> options (described below).</p>
<p><code>solid2fastq</code> writes a single output <code>.fastq</code> file for each matching pair of <code>.csfasta</code> / <code>.qual</code> input files or for each single <code>.fastq</code> input file. For example, given a set of non-barcoded colorspace input files, the names of these output files are constructed as:</p>
<p><code>&lt;in_prefix&gt;_R3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code></p>
<p><code>&lt;in_prefix&gt;_F3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code></p>
<p>singlets from F3/R3 → <code>&lt;out_prefix&gt;.single.fastq</code></p>
<p>Note that for paired-end libraries, alternate naming suffix conventions for the R3 input are also recognized (substituting <code>F5-BC</code>, <code>F5-P2</code>, <code>F5-DNA</code>, <code>F5-RNA</code>, for <code>R3</code>).</p>
<p>Additionally, for paired libraries (mate-paired or paired-end), should <code>solid2fastq</code> encounter any singlet reads (those with a missing mate), they are by default given the appropriate suffix and then written to a file named <code>&lt;out_prefix&gt;.single.fastq</code>.</p>
<p>The complete list of file name patterns that will be recognized by <code>solid2fastq</code> are:</p>
<p><code>&lt;in_prefix&gt;_&lt;BC&gt;_R3.fastq</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA][.csfasta|_QV.qual|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_&lt;BC&gt;_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA][.csfasta|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA]_[&lt;BC&gt;.csfasta|_QV_&lt;BC&gt;.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_&lt;BC&gt;_F3.fastq</code> → <code>&lt;out_prefix&gt;.read2.fastq</code><br>
<code>&lt;in_prefix&gt;_F3[.csfasta|_QV.qual|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code><br>
<code>&lt;in_prefix&gt;_&lt;BC&gt;_F3[.csfasta|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code><br>
<code>&lt;in_prefix&gt;_F3_[&lt;BC&gt;.csfasta|_QV_&lt;BC&gt;.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code></p>
<h4 id="solid2fastq-optional-parameters-options"><code>solid2fastq</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[--prefix=&lt;string&gt;]</code> <code>[-n|--no_prefix]</code> <code>[-x|--no_suffix]</code> <code>[-z|--gzip]</code> <code>[-b &lt;BC&gt;|--bc=&lt;BC&gt;]</code> <code>[-s|--singles]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>--prefix=&lt;string&gt;</code></b></p>
<p>Specify the prefix string to add to read identifiers. For example, the <code>.csfasta</code> read identifier <code>&gt;1_68_381_R3</code> will become <code>@xx+&lt;string&gt;:1_68_381/1</code> in the <code>.fastq</code> output file. By default this is the <code>&lt;out_prefix&gt;</code> described above.</p>
<p>The use of a terse, but unique, prefix string is advisable because it helps identify the source, and processing steps, performed on a set of reads as they flow through an analysis pipeline. Keep in mind that this text will be added to the output file for every read, so short strings are preferable. The output prefix may be any text you would like, except you may not use whitespace or punctuation characters other than "<code>_</code>".</p>
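<p>The renaming itself is performed by <code>solid2fastq</code>, but the ID transformation described above can be sketched with <code>sed</code> (an illustration only, using <code>lambda</code> as a hypothetical prefix string):</p>

```shell
# Illustration only: how a .csfasta read ID maps to a SEAStAR FASTQ ID,
# with "lambda" standing in as a hypothetical prefix string
echo ">1_68_381_R3" | sed -E 's|^>(.*)_R3$|@xx+lambda:\1/1|'
# → @xx+lambda:1_68_381/1
```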
<p><b><code>-n</code> / <code>--no_prefix</code></b></p>
<p>Overrides the default read prefixing behavior and instead preserves the read identifiers as they are in the input <code>.csfasta</code> file(s).</p>
<p>For example: <code>&gt;1_68_381_R3</code> simply becomes <code>@xx+:1_68_381/1</code></p>
<p><b><code>-x</code> / <code>--no_suffix</code></b></p>
<p>Suppress the /1 or /2 suffix additions to read IDs. This ensures that paired reads have matching IDs in FASTQ files and in SAM files generated by aligners other than BWA.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output <code>.fastq</code> files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>-b &lt;BC&gt;</code> / <code>--bc=&lt;BC&gt;</code></b></p>
<p>Used only for barcoded SOLiD libraries, in addition to <code>&lt;in_prefix&gt;</code>, to specify input files for a specific barcode (<code>&lt;BC&gt;</code>).</p>
<p><b><code>-s</code> / <code>--singles</code></b></p>
<p>Write two separate singlet files that segregate the singlets by input files. This is useful for doing strand-specific analyses using paired-end libraries, where the mated reads are sequenced from opposite strands. For example:</p>
<p><code>&lt;in_prefix&gt;_R3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code></p>
<p>singlets from <code>R3</code> → <code>&lt;out_prefix&gt;.single1.fastq</code></p>
<p><code>&lt;in_prefix&gt;_F3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code></p>
<p>singlets from <code>F3</code> → <code>&lt;out_prefix&gt;.single2.fastq</code></p>
<hr>
<h2 id="a-namefastq_nodupfastq_nodupa"><a name="fastq_nodup"><code>fastq_nodup</code></a></h2>
<p>This tool reads paired <code>.fastq</code> files (with associated singlets), and removes the lowest quality pair(s) of presumed PCR duplicate reads. This operation assumes that multiple read-pairs sharing substantially the same sequences for both mates are statistically unlikely to occur frequently at random (due to the distribution of insert sizes), and therefore represent likely artifactual duplications resulting from PCR over-amplification during library construction. If this assumption does not hold for a given library, then use of this tool is inappropriate.</p>
<p>Notably, this tool is most appropriate for metagenomes (or any other sample type) where no reference genome is available for performing duplicate removal through reference alignments. This is because <code>fastq_nodup</code> works entirely through error-tolerant internal comparisons of read-pairs to each other within an entire sequenced library.</p>
<p>To accomplish this, <code>fastq_nodup</code> builds a large table of sequence prefixes, requiring a substantial amount of memory. The amount of memory required can be controlled by tuning the various parameters, but a minimum of 32 GB will probably be necessary for most read datasets.</p>
<p>Note that singlet reads are also randomly removed in the same proportion as matching mated reads. This is done assuming that PCR duplicated pairs are as likely to produce singlets as non-duplicated pairs. For example:</p>
<p><code>S1</code>, <code>S1</code>, <code>M1 -- M2</code>, <code>M1 -- M2</code>, <code>M1 -- M3</code></p>
<p>In the above example <code>Sn</code> = singlet reads, <code>Mn</code> = mated reads, and the numeric suffixes indicate matching sequences. So the lowest quality pair of <code>M1 -- M2</code> mates will be removed from the output, and each of the <code>S1</code> reads will have a 1/3 probability of being removed (disregarding quality). The 1/3 probability is calculated from the fact that 1/3 of the pairs containing sequence <code>M1</code> were presumed PCR duplicates and therefore removed.</p>
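<p>The core selection rule can be sketched in miniature: key each pair on both mate sequences, and keep only the highest-quality pair per key. This is a toy illustration of the criterion, not the tool's actual error-tolerant prefix-index algorithm; columns are read1 sequence, read2 sequence, and a made-up quality score:</p>

```shell
# Three pairs; the first two are "duplicates" (same mate sequences), so only
# the one with the higher quality score (30) survives.
printf 'ACGT TTGG 30\nACGT TTGG 25\nACGT GGCC 28\n' |
  awk '{ k = $1 " " $2; if (!(k in best) || $3 > best[k]) best[k] = $3 }
       END { for (k in best) print k, best[k] }' | sort
```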
<h4 id="usage--fastq_nodup-options-in_prefix-out_prefix">Usage: <code>fastq_nodup [options] <in_prefix> <out_prefix></code></h4>
<p>where <code><in_prefix></code> and <code><out_prefix></code> specify the input and output filename prefixes to use for input files and naming output files. For example:</p>
<p><code><in_prefix>.read1.fastq</code> → <code><out_prefix>.read1.fastq</code>
<code><in_prefix>.read2.fastq</code> → <code><out_prefix>.read2.fastq</code>
<code><in_prefix>.single.fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>The input files may be gzip format compressed files with the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>fastq_nodup</code> to decompress the files as it reads them. Multiple single files (i.e. <code>.single1.fastq</code> and <code>.single2.fastq</code> suffixes) are also recognized.</p>
<h4 id="fastq_nodup-optional-parameters-options"><code>fastq_nodup</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[-v|--verbose]</code> <code>[-z|--gzip]</code> <code>[--prefix=<string>]</code> <code>[--no_prefix]</code> <code>[-l <u>|--index_len=<u>]</code> <code>[-m <u>|--match_len=<u>]</code> <code>[-d <u>|--index_err=<u>]</code> <code>[-e <u>|--match_err=<u>]</code> <code>[--seed=<n>]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>-v</code> / <code>--verbose</code></b></p>
<p>Print additional statistics and diagnostic information about the run.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output <code>.fastq</code> files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>--prefix=<string></code></b></p>
<p>Specify the prefix string to add to read identifiers. For example: <code>@<string>:1_68_381</code></p>
<p>Note, this will replace any existing prefix. By default this is <code><out_prefix></code> described above.</p>
<p><b><code>--no_prefix</code></b></p>
<p>Overrides the default read prefixing behavior and instead preserves the read identifiers exactly as they are in the input fastq file(s).</p>
<p><b><code>-l <u></code> / <code>--index_len=<u></code></b></p>
<p>Controls the read sequence prefix length of the search index. This should be set long enough so that random prefix collisions are infrequent, but short enough so that a large number of sequence errors are unlikely to accumulate. This parameter also impacts the memory use of this tool, with each increment increasing the index size by a factor of 4. This parameter defaults to <code><u></code>=14 nucleotides and may not exceed 32 nucleotides.</p>
<p>Both mates of each read-pair are indexed, and this indexed look-up is used to find a list of mate sequences corresponding to each sequence prefix. Given that the mates also must match, it is acceptable for a low-level of prefix collisions to occur in this index. That is, assuming the level of prefix collision is low (only a small number of non-duplicated reads share any given prefix), it will be highly improbable that reads matching a given prefix will share mates with matching sequence at random.</p>
<p><b><code>-m <u></code> / <code>--match_len=<u></code></b></p>
<p>Controls the required match length of the unindexed mate in detecting a duplicate pair. Like the previous parameter (<code>--index_len</code>), this setting has an impact on memory use, although memory requirements only increase linearly with match length. This parameter should be set as high as memory permits to improve specificity.</p>
<p>This parameter defaults to the full read length, although lower settings are acceptable assuming that the <code>--index_len</code> is appropriately set, and the sum of <code>--index_len</code> and <code>--match_len</code> exceeds about 35 nucleotides (that is, random 35 nucleotide joint index-match collisions are extremely unlikely assuming each match is sufficiently long).</p>
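<p>The "about 35 nucleotides" guideline follows from simple arithmetic: the probability that two unrelated sequences agree over 35 positions by chance is 4^-35:</p>

```shell
# Chance that two random 35 nt stretches match exactly: 4^-35, i.e. ~8.5e-22.
awk 'BEGIN { printf "%.3g\n", 4 ^ (-35) }'
```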
<p><b><code>-d <u></code> / <code>--index_err=<u></code></b></p>
<p>Controls the number of mismatches tolerated in the indexed sequence prefix while still permitting a sequence match. This value defaults to <code><u></code>= 1 and may be set in the range 0-2. This setting has a major impact on run time, with each increment increasing it by a factor of about 4. Duplicated reads with more than <code>--index_err</code> mismatches in the <code>--index_len</code> prefix may still be found by the mate's indexed prefix, assuming that it does not also have more than <code>--index_err</code> errors.</p>
<p><b><code>-e <u></code> / <code>--match_err=<u></code></b></p>
<p>Controls the number of mismatches tolerated in a matched mate sequence (in the <code>--match_len</code> prefix). This value defaults to 3 and must be > 0. This setting has little impact on run time. Setting <code>--match_err</code> to higher values will reduce the specificity of the mate-matching and should therefore be accompanied by correspondingly larger values of <code>--match_len</code>.</p>
<p><b><code>--seed=<n></code></b></p>
<p>Integer seed used by the internal pseudo-random number generator. This only affects the random selection of matching singlets (as described above). Defaults to <code><n></code>= 12345.</p>
<hr>
<h2 id="a-namesamplefastqsamplefastqa"><a name="samplefastq"><code>samplefastq</code></a></h2>
<p>This tool randomly samples reads from a given set of fastq files, producing a corresponding set of output fastq files. This is useful for more quickly tuning parameters for the <code>trimfastq</code> tool (described below) and for producing simulated read datasets for various other purposes.</p>
<h4 id="usage-samplefastq--f--frac_samplefloatint-options-in_prefix-out_prefix">Usage: <code>samplefastq -f|--frac_sample=<float>|<int> [options] <in_prefix> <out_prefix></code></h4>
<p>where the mandatory <code>-f</code> parameter indicates how the input files should be sampled and <code><in_prefix></code> and <code><out_prefix></code> specify the input and output filename prefixes to use for input files and naming output files. For example:</p>
<p><code><in_prefix>.read1.fastq</code> → <code><out_prefix>.read1.fastq</code>
<code><in_prefix>.read2.fastq</code> → <code><out_prefix>.read2.fastq</code>
<code><in_prefix>.single.fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>The input files may be gzip format compressed files with the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>samplefastq</code> to decompress the files as it reads them. Multiple single files (i.e. <code>.single1.fastq</code> and <code>.single2.fastq</code> suffixes) are also recognized.</p>
<p>Note that for paired sequences (as shown above) read pairs are always sampled together; that is, each pair of reads counts as a single read unit for sampling purposes, as does each singlet read.</p>
<h4 id="samplefastq-mandatory-parameter"><code>samplefastq</code> mandatory parameter:</h4>
<p><code>[-f|--frac_sample=<float>|<int>]</code></p>
<p><b><code>-f <float></code> / <code>-f <int></code> / <code>--frac_sample=<float></code> / <code>--frac_sample=<int></code></b></p>
<p>The <code>-f</code> parameter tells <code>samplefastq</code>, either directly or indirectly, how many output reads to randomly sample from the input <code>.fastq</code> files. Values < 1.0 are interpreted as a fraction of the input reads to retain, whereas values > 1.0 are rounded to an integer count of reads to retain.</p>
<p>Note that when a read count is provided, the input files must be read twice by <code>samplefastq</code>: once to count the number of input reads, and then again to randomly sample those reads. Once the number of input reads is known, the requested read count is converted into a fraction, and operation proceeds as though that fraction had been initially provided to the <code>-f</code> parameter. Therefore, if a given set of inputs is to be sampled repeatedly, it will be considerably more efficient to obtain the count of reads once and calculate the fraction to retain for each subsequent sampling. The requested (or calculated) fraction of reads to sample is used internally as the probability of retaining each read (or pair), and each is considered an independent trial. Because of this, the precise fraction (or count) of reads requested is unlikely to be generated exactly, although the sampled output will typically be very close to that requested.</p>
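<p>For repeated sampling, the read count can be obtained once with standard tools (FASTQ records are four lines each) and converted to a fraction for subsequent runs. The file contents below are a made-up four-read example, and the commented <code>samplefastq</code> invocation shows where the computed fraction would be used:</p>

```shell
# Count reads once (lines / 4), then derive the fraction needed to keep 3 of them.
reads=$(printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n@r4\nCCCC\n+\nIIII\n' | wc -l)
reads=$((reads / 4))
frac=$(awk -v want=3 -v n="$reads" 'BEGIN { printf "%.4f", want / n }')
echo "$reads reads, fraction $frac"
# samplefastq -f "$frac" sample subsampled    # reuse $frac for each later sampling
```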
<h4 id="samplefastq-optional-parameters-options"><code>samplefastq</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[--seed=<int>]</code> <code>[-z|--gzip]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output fastq files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>--seed=<n></code></b></p>
<p>Integer seed used by the internal pseudo-random number generator. To obtain different samples from the same input file(s), set this parameter to different values. Repeated runs with all parameters the same will produce identical outputs. Defaults to <code><n></code>= 12345.</p>
<hr>
<h2 id="a-nametrimfastqtrimfastqa"><a name="trimfastq"><code>trimfastq</code></a></h2>
<p>This tool is used to trim or reject individual reads on the basis of quality scores. Reads may also be rejected based on low information content (entropy). When run on paired reads, this operation is performed on mates independently, and when one mate of a pair is rejected, the other becomes a singlet.</p>
<p>Trimming based on quality scores is performed with the goal of producing trimmed reads that have a calculated probability of error below some threshold. That is, bases are trimmed off the 3' end of the read until the remaining bases have a joint probability of error below a threshold. For example, if a threshold of 0.5 is selected, that means the output reads will be expected to have, on average, 0.5 erroneous nucleotides per read.</p>
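<p>The trimming rule described above can be sketched numerically: summing the per-base Phred error probabilities 10^(-Q/10) from the 5' end gives the expected error count, and the read is cut at the longest prefix still under the threshold. The quality scores below are illustrative; the real tool works from the FASTQ quality string:</p>

```shell
# Keep the longest prefix whose expected number of errors stays below 0.5.
echo "40 40 40 10 5 3 2" | awk -v thresh=0.5 '{
  sum = 0; keep = 0
  for (i = 1; i <= NF; i++) {
    p = 10 ^ (-$i / 10)            # per-base error probability from Phred Q
    if (sum + p >= thresh) break
    sum += p; keep = i
  }
  print "kept " keep " of " NF " bases"
}'
```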
<p>For colorspace (SOLiD differentially encoded) reads, the probability of error is calculated as the joint probability of error in each of two consecutive color positions. That is, because single color errors will typically be correctable, the probability of a nucleotide error at a given position is the probability that both corresponding color positions are erroneous (i.e. the pairwise product of color error probabilities yields the probability of uncorrectable nucleotide errors).</p>
<p>A minimum read length may also be specified, so that reads which would need to be trimmed shorter than this minimum to meet the error probability threshold are instead rejected.</p>
<p>Reads may also be independently rejected for failing to have a minimum mean information content. <code>trimfastq</code> optionally calculates the mean entropy of each read (as average bits per dinucleotide) and rejects reads that fail to meet a given threshold. A common error mode for next-generation sequencers is to produce junk reads where all or part of the read contains highly repetitive (low information) sequence. Random DNA sequence contains 4 bits per dinucleotide of entropy, and we have observed that (non-repeat) natural sequences typically contain more than 3 bits per dinucleotide.</p>
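<p>The entropy measure can be approximated by tallying overlapping dinucleotides and computing Shannon entropy in bits (the exact windowing used by <code>trimfastq</code> is an assumption here). Note that even a perfectly periodic sequence scores well below the ~3 bits typical of natural sequence:</p>

```shell
# Mean dinucleotide entropy (bits) of a repetitive sequence: about 2 bits,
# which an --entropy_filter=3.0 setting would reject.
echo "ACGTACGTACGTACGT" | awk '{
  n = 0
  for (i = 1; i < length($0); i++) { c[substr($0, i, 2)]++; n++ }
  H = 0
  for (d in c) { p = c[d] / n; H -= p * log(p) / log(2) }
  printf "%.2f bits/dinucleotide\n", H
}'
```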
<h4 id="usage--trimfastq-options-in_prefix-out_prefix">Usage: <code>trimfastq [options] <in_prefix> <out_prefix></code></h4>
<p>where <code><in_prefix></code> and <code><out_prefix></code> specify the input and output filename prefixes to use for input files and naming output files. For example:</p>
<p><code><in_prefix>.read1.fastq</code> → <code><out_prefix>.read1.fastq</code></p>
<p><code><in_prefix>.read2.fastq</code> → <code><out_prefix>.read2.fastq</code></p>
<p><code><in_prefix>.single.fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>The input files may be gzip format compressed files with the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>trimfastq</code> to decompress the files as it reads them. Multiple single files (i.e. <code>.single1.fastq</code> and <code>.single2.fastq</code> suffixes) are also recognized. It is also possible to write FASTA format files (i.e. lacking quality scores) and interleaved paired-read files (for the Velvet assembler). See the options below for more information.</p>
<h4 id="trimfastq-optional-parameters-options"><code>trimfastq</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[-c|--color_space]</code> <code>[-z|--gzip]</code> <code>[-v|--invert_singles]</code> <code>[-s|--singles]</code> <code>[-p <d>|--correct_prob=<d>]</code> <code>[-l <u>|--min_read_len=<u>]</code> <code>[-m <u>|--min_mate_len=<u>]</code> <code>[-f <u>|--fixed_len=<u>]</code> <code>[--prefix=<string>]</code> <code>[--add_len]</code> <code>[--no_prefix]</code> <code>[-e <d>|--entropy_filter=<d>]</code> <code>[--entropy_strict]</code> <code>[--mates_file]</code> <code>[--no_rev]</code> <code>[--force_rev]</code> <code>[--only_mates]</code> <code>[--fasta]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output fastq files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>--prefix=<string></code></b></p>
<p>Specify the prefix string to add to read identifiers. For example: <code>@<string>:1_68_381</code></p>
<p>Note, this will replace any existing prefix. By default this is <code><out_prefix></code> described above.</p>
<p><b><code>-n</code> / <code>--no_prefix</code></b></p>
<p>Overrides the default read prefixing behavior and instead preserves the read identifiers exactly as they are in the input fastq file(s).</p>
<p><b><code>--add_len</code></b></p>
<p>Add the final trimmed length value to the read prefix. For example:</p>
<pre><code>@lambda:1_81_912
TGTTGTGCGCGCCATAGCGAGGGGCTCAGCACGCGTCCCTCCGCCCCAC
+
5/0)358)%57*)%881/(/4'.'53*&#91*-'&,&%$''%&'46-+#`</code></pre><p>Trims to: ("37" in the first line indicates the final trimmed length)</p>
<pre><code>@37|lambda_trim:1_81_912
TGTTGTGCGCGCCATAGCGAGGGGCTCAGCACGCGTC
+
5/0)358)%57*)%881/(/4'.'53*&#91*-'&,&</code></pre><p><b><code>-v</code> / <code>--invert_singles</code></b></p>
<p>Causes singlet file(s) output to be the opposite configuration of the input. That is:</p>
<p><code><in_prefix>.single.fastq</code> → <code><out_prefix>.single[12].fastq</code></p>
<p>or</p>
<p><code><in_prefix>.single[12].fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>Note that this option is only valid when there are input singlet reads.</p>
<p><b><code>-s</code> / <code>--singles</code></b></p>
<p>Write two singlet files (<code>.single1.fastq</code> and <code>.single2.fastq</code>), one for new singlets generated from each paired input file. Note that this option is only valid when there are no input singlet reads. The default behavior in this case is to write a single combined singlet file (<code>.single.fastq</code>).</p>
<p><b><code>-p <d></code> / <code>--correct_prob=<d></code></b></p>
<p>Minimum mean probability that trimmed output reads are free of uncorrectable errors (or all errors with <code>-c</code>). Default value is <code><d></code>= 0.5. Setting <code>-p</code> to 0.0 completely disables quality trimming, which is sometimes appropriate for file format conversion and interleaving operations (e.g. see <code>--fasta</code> and <code>--mates_file</code>).</p>
<p><b><code>-l <u></code> / <code>--min_read_len=<u></code></b></p>
<p>Minimum length of a singlet or longest-mate of a pair, in nucleotides. Default value is <code><u></code>= 24 bases.</p>
<p><b><code>-m <u></code> / <code>--min_mate_len=<u></code></b></p>
<p>Minimum length of the shortest mate of a pair, in nucleotides. Default value is the value provided for <code>--min_read_len</code>. This value may not exceed the value used for <code>--min_read_len</code>. Use of this parameter allows paired reads shorter than <code>--min_read_len</code> to be retained as long as their mate satisfies <code>--min_read_len</code>.</p>
<p><b><code>-f <u></code> / <code>--fixed_len=<u></code></b></p>
<p>Trim all reads to a fixed length, still rejecting reads that don't meet specified quality thresholds at this length. Default is no fixed length.</p>
<p><b><code>-e <d></code> / <code>--entropy_filter=<d></code></b></p>
<p>Reject reads with mean per position measured information (entropy) below the given value (in bits per dinucleotide). The range of valid values is 0.0-4.0 inclusive. By default this filter is off.</p>
<p><b><code>--entropy_strict</code></b></p>
<p>Reject reads for low entropy overall, not just the retained part after trimming. By default this setting is off. Use of this setting requires use of <code>--entropy_filter</code>.</p>
<p><b><code>--mates_file</code></b></p>
<p>In addition to other outputs, produce a Velvet compatible interleaved paired output file (named <code><out_prefix>_mates.fastq</code>) with read2 mates reversed by default (to support SOLiD mate-paired reads.)</p>
<p><b><code>--no_rev</code></b></p>
<p>By default, for SOLiD colorspace read-pairs, the second read of each pair is reversed in the <code>mates</code> file produced by <code>--mates_file</code> (to put the mates on opposite strands, as Velvet expects). <code>--no_rev</code> disables this reversing, which is useful for SOLiD colorspace paired-end files. Nucleotide inputs are by default not reversed in the output generated by <code>--mates_file</code>, so this option has no effect in that case.</p>
<p><b><code>--force_rev</code></b></p>
<p>By default, for nucleotide (non-colorspace) read-pairs, the second read of each pair isn't reversed in the <code>mates</code> file produced by <code>--mates_file</code> (to keep the mates on opposite strands, as Velvet expects, assuming Illumina sequence). <code>--force_rev</code> enables this reversing, which is useful for SOLiD 5500+ nucleotide mate-paired files (nucleotide sequences are correctly reverse complemented). Colorspace inputs are reversed by default in the output generated by <code>--mates_file</code>, so this option has no effect in that case.</p>
<p><b><code>--only_mates</code></b></p>
<p>Suppress writing <code><out_prefix>.read1.fastq</code> and <code><out_prefix>.read2.fastq</code> output files. <code>--only_mates</code> may only be used with the <code>--mates_file</code> option. This option is useful when reads are being trimmed exclusively for assembly, to avoid unnecessary processing for files that will not be used. See also the <code>--fasta</code> option below to save even more space.</p>
<p><b><code>--fasta</code></b></p>
<p>Write FASTA format files instead of FASTQ files for all outputs. FASTA files are about half the size of FASTQ, so for some purposes (e.g. assembly, alignment) that may not take quality scores into account, FASTA may be a preferable output format to save disk space. When <code>--fasta</code> is used, no <code>.fastq</code> files will be written, and the output files will instead end with <code>.fasta[.gz]</code>.</p>
<hr>
<h2 id="a-nameref_selectref_selecta"><a name="ref_select"><code>ref_select</code></a></h2>
<p><code>ref_select</code> is a utility for calculating a wide variety of useful information from alignments of next-generation sequence reads to large reference sequence datasets. For example, it can be used to build a meta-genomic assembly graph from paired-reads aligned back to <em>de novo</em> assembled contigs. It can also be used to precisely select (and accurately estimate the relative abundances of) sequences from an alignment with a reference library (e.g. metagenomic reads to 16S rDNA database, or mRNA transcript reads to a database of genes from one or more genomes.)</p>
<p><code>ref_select</code> processes <a href="http://samtools.sourceforge.net">SAM</a> format alignment files into a <a href="#JSON">JSON format "sequence graph"</a> file ready to be processed by the <code>graph_ops</code> tool. Optionally, <code>ref_select</code> will also read FASTQ files and filter out reads (and mates) which are or aren't present in the provided SAM alignment. The JSON output file is structured as a graph of "nodes" (selected sequences) and "edges" (connections between nodes based on read-pairing). A variety of useful information for the nodes and edges is included in the output JSON file, which can be accessed using the <code>graph_ops</code> tool or custom code written in any language with a JSON library.</p>
<h4 id="usage-ref_select-options-sam_file1-sam_file2--outfilejson">Usage: <code>ref_select [options] <sam_file1> [<sam_file2>]... > <outfile.json></code></h4>
<p><code>ref_select</code> has a lot of options, and depending on what type of analysis you are doing, you may use quite a few of them. However, at its simplest, <code>ref_select</code> reads one or more SAM alignment files and writes a JSON "sequence graph" to STDOUT:</p>
<pre><code>ref_select alignment.single.sam > seq_graph.json</code></pre><p><code>ref_select</code> requires any SAM (or FASTQ) format input files to be named using the <a href="#FASTQ">SEAStAR naming convention used for FASTQ files</a>. Crucially, individual SAM alignment files must not mix alignments for reads from different categories (<code>read1</code>, <code>read2</code>, <code>single</code>), and the corresponding filename suffixes of the SAM files must match those of the FASTQ files used to generate them:</p>
<p><code>sample.read1.fastq</code> → <code>alignment.read1.sam</code>
<code>sample.read2.fastq</code> → <code>alignment.read2.sam</code>
<code>sample.single.fastq</code> → <code>alignment.single.sam</code></p>
<p>Note that <code>ref_select</code> may also write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when saving the output to a file, or the resulting saved JSON data file syntax may be corrupted by messages written to STDERR.</p>
<p>Of course, since these types of files may be very large, you may prefer to keep everything compressed:</p>
<pre><code>ref_select alignment.single.sam.gz | gzip -c > seq_graph.json.gz</code></pre><p>Because <code>ref_select</code> has many options, and produces many different kinds of outputs, we made the decision to organise all of the output data into a single output stream in a standard text-based format (<a href="http://www.json.org">JSON</a>), that is flexible and straightforward to process in any language. The JSON formatted output produced by <code>ref_select</code> is documented in the appendix: <a href="#JSON">JSON Sequence Graph File Format Reference</a>. However, for most tasks you will not need to look directly into these files because the <code>graph_ops</code> tool will handle that work for you (including the work of extracting data into other standard file formats).</p>
<p><code>ref_select</code> has four main pieces of functionality:</p>
<ol>
<li><p>Quantitatively analyzing reads aligned to a known reference database (e.g. metagenomic reads to a 16S rDNA database, or reads from mRNA transcripts to the gene models of one or more genomes):</p>
<pre><code> # Select only reference sequences with reads aligned scoring 100 or more bits
ref_select -t 100.0 alignment.single.sam > ref_stats.json
# Convert ref_select output to a tab separated (TSV) table for use downstream
graph_ops ref_stats.json TABLE > ref_stats.tsv</code></pre></li>
<li><p>Analyzing reads mapped back to <em>de novo</em> assembled contigs, to build a "sequence graph" for downstream quality checking, scaffolding, binning, visualization, etc.</p>
<pre><code> # Construct a sequence graph from the alignment, inserting the assembled contig seqs into the graph
ref_select --ref=contigs.fna -m alignment.read1.sam alignment.read2.sam > seq_graph.json
# Produce scaffolded sequence (where .go file is graph_ops SCRIPT)...
graph_ops seq_graph.json run_scaffold_pipeline.go > scaffolds.fna</code></pre></li>
<li><p>Converting SOLiD colorspace contigs into nucleotide contigs based on aligned reads</p>
<pre><code> # Rather than inserting contigs as above, they are reconstructed from the alignment
# and the provided read FASTQ files
ref_select -q -m -r mates.read1.fastq -r mates.read2.fastq alignment.read1.sam alignment.read2.sam > seq_graph.json
# Produce scaffolded sequence (where .go file is graph_ops SCRIPT)...
graph_ops seq_graph.json run_scaffold_pipeline.go > scaffolds.fna</code></pre></li>
<li><p>Filtering out a sub-set of reads (and their mates) based on the results of an alignment.</p>
<pre><code> # Write two new read files for all reads (and their mates) that did not map in the alignment
# NOTE: this is the one exception to the "all data goes into the JSON stream" principle described above.
# Filtered read output data is written to `FASTQ` format files, with a prefix specified using --read_output
ref_select --read_output=filtered_ --output_nomatch -r mates.read1.fastq -r mates.read2.fastq alignment.read1.sam alignment.read2.sam</code></pre></li>
</ol>
<p>The approaches used by <code>ref_select</code> to accomplish these tasks are described in more detail in the methods section of (<a href="http://www.sciencemag.org/content/335/6068/587.abstract">Iverson, <em>et al.</em> 2012</a>).</p>
<h3 id="the-options-reference-below-organizes-the-various-parameters-into-general-categories-related-to-the-above-tasks">The <code>[options]</code> reference below organizes the various parameters into general categories related to the above tasks:</h3>
<ul>
<li><a href="#ref_select_basic">Basic parameters</a></li>
<li><a href="#ref_select_bitscore">Bitscore calculation and reference selection parameters</a></li>
<li><a href="#ref_select_coverage">Coverage calculation parameters</a></li>
<li><a href="#ref_select_recon">Sequence reconstruction and read filtering parameters</a></li>
<li><a href="#ref_select_pairing">Read pairing statistical parameters</a></li>
<li><a href="#ref_select_inputs">Input file parameters</a><br>
<br>
</li>
</ul>
<h3 id="a-nameref_select_basicref_select-basic-parametersa-options-"><a name="ref_select_basic"><code>ref_select</code> basic parameters</a> <code>[options]</code> :</h3>
<p><code>[-h|--help]</code> <code>[-v|--version]</code> <code>[--verbose]</code> <code>[-d <seq_id>|--detail=<seq_id>]...</code> <code>[--detail_file=<detail_file>]</code> <code>[-q|--recon_seq]</code> <code>[--ref=<ref_file>]</code> <code>[-m|--mate_pair]</code> <code>[--split=<n>]</code> <code>[--separate_strands]</code> <code>[--rollup]</code> <code>[--seed=<n>]</code> <code>[--num_threads=<n>]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Request help.</p>
<p><b><code>-v</code> / <code>--version</code></b></p>
<p>Print the build version and exit.</p>
<p><b><code>--verbose</code></b></p>
<p>Print detailed diagnostic status to stderr during run. Off by default.</p>
<p><b><code>-d <seq_id></code> / <code>--detail=<seq_id></code></b></p>
<p>Only produce per-base statistics and reconstructed sequence for the listed seq ids. By default, this is done for all selected sequences. For very large reference sequence databases, <code>ref_select</code> needs to perform a lot of work and can use a lot of memory and produce a very large output. If you want to analyze the entire reference dataset, but only produce detailed results for a small subset of sequences, the <code>--detail</code> option allows you to specify just those sequences (by <code><seq_id></code>, one per use of <code>-d</code> / <code>--detail</code>). To specify more than a few sequences, use <code>--detail_file</code> instead.</p>
<p><b><code>--detail_file=<detail_file></code></b></p>
<p>Filename of a file containing sequence ids to be treated as with <code>--detail</code>. The format is a plain text file with one <code><seq_id></code> per line.</p>
<p><b><code>-q</code> / <code>--recon_seq</code></b></p>
<p>Enable reconstruction of contig sequences using alignment data (SAM) and reads (FASTQ). This works for both colorspace and nucleotide data (or a combination of the two), and the resulting sequence is always nucleotide sequence. The reconstructed contigs may contain ambiguity codes (e.g. for SNPs or variable positions) and the sensitivity of this process is controlled with the <code>ambig_tol</code> parameter. NOTES: Requires use of <code>--read_file</code>. If <code>--detail</code> is used, only the specified sequences are reconstructed.</p>
<p><b><code>--ref=<ref_file></code></b></p>
<p>For assemblies where the contigs and alignments are already nucleotide sequences, this option allows you to specify a FASTA file containing the reference contigs used in the alignment (as an alternative to using <code>--recon_seq</code>). In this case, the reference sequences are simply copied into the output JSON sequence graph for use in downstream analysis (scaffolding, binning, etc).</p>
<p><b><code>-m</code> / <code>--mate_pair</code></b></p>
<p>This option is required to enable the pairing analysis between mated reads (<code>.read1</code> and <code>.read2</code> alignments). When paired reads are available and <code>--mate_pair</code> is used, "edges" will be included in the output JSON data file, connecting the "nodes" into a full sequence graph structure. Enable this option when you want to produce scaffolds for a <em>de novo</em> assembly.</p>
<p><b><code>--split=<n></code></b></p>
<p>Split reference sequences into n-base pieces for analysis. The default is no splitting. This is a specialized option that performs all <code>ref_select</code> analyses as though large reference sequences (e.g. whole genomes) are actually split into many separate chunks. This can be useful for testing, but also for performing genome connection and rearrangement analyses when pairing information is available and the sequence read data is from a different organism (e.g. strain or closely related species) than the reference sequence used.</p>
<p><b><code>--separate_strands</code></b></p>
<p>Top and bottom strands of reference sequences are considered separately. Each reference sequence effectively becomes two references, with reads mapping to each strand contributing to the analysis for that strand's reference only. NOTE: <code>--separate_strands</code> may not be used with: mate-pairing (<code>--mate_pair</code>), sequence reconstruction (<code>--recon_seq</code>) or splitting (<code>--split</code>)</p>
<p><b><code>--rollup</code></b></p>
<p>When <code>--split</code> is used, or when a <code>--catalog</code> file is specified (e.g. to join reference scaffolds or chromosomes into whole genomes), the output JSON data will include a <code>rollup_stats</code> object with information about each parent reference sequence (e.g. whole genome) calculated from the stats of the individual underlying sequences.</p>
<p><b><code>--seed=<n></code></b></p>
<p>SEAStAR uses its own random number generator (for reproducibility among different platforms). You may use a different seed to investigate what difference randomized decisions made in the analysis are having on the results. By default <code>--seed=12345</code>.</p>
<p><b><code>--num_threads=<n></code></b></p>
<p><code>ref_select</code> is highly multithreaded to support modern multi-core processors. Sometimes you will want to restrict the number of cores it uses (e.g. to prevent resource competition on clusters or shared computers, or when you are running more than one instance of <code>ref_select</code> at a time on a given computer.) By default <code>--num_threads</code> is the number of cores on the machine (including "hyper-threads"). The minimum number of threads required for proper operation is 2 or 3 depending on the parameters being used, and <code>ref_select</code> checks to ensure that the proper minimum number of threads is allocated.</p>
<h3 id="a-nameref_select_bitscoreref_select-parameters-for-bitscore-calculation-and-reference-selectiona-options-"><a name="ref_select_bitscore"><code>ref_select</code> parameters for bitscore calculation and reference selection</a> <code>[options]</code> :</h3>
<p><code>[-t <n>|--bit_thresh=<n>]</code> <code>[-f <n>|--bit_fraction=<n>]</code> <code>[-l <n>|--read_map_limit=<n>]</code> <code>[-s|--second_chance_reads]</code> <code>[--relax_read_sharing]</code> <code>[--all_taxa]</code> <code>[-a|--absolute_bitscores]</code> <code>[--no_rand]</code> <code>[--sim_frac=<n>]</code> <code>[-e <seq_id>|--exclude=<seq_id>]...</code> <code>[--exclude_file=<exclude_file>]</code> <code>[--invert_exclude]</code></p>
<p><b><code>-t <n></code> / <code>--bit_thresh=<n></code></b></p>
<p>Minimum bitscore value for a ref sequence to be selected for output. The default is > 0.0, i.e. all sequences with a positive bitscore. NOTE: if both <code>--bit_fraction</code> and <code>--bit_thresh</code> are used, the higher resulting threshold will be used. Selection thresholds are useful for establishing a "noise floor" to exclude sequences that have only a small number of uniquely aligning reads, which in a large sequence database (e.g. 16S rDNA databases) may reflect nothing more than sequencing errors.</p>
<p><b><code>-f <n></code> / <code>--bit_fraction=<n></code></b></p>
<p>Minimum bitscore value for a ref sequence to be selected for output, as a fraction of the bitscore for the top scoring sequence. NOTE: if both --bit_fraction and --bit_thresh are used, the highest resulting threshold will be used.</p>
<p><b><code>-l <n></code> / <code>--read_map_limit=<n></code></b></p>
<p>Maximum number of reference mappings before a read is rejected from consideration. Default is 50000 reference sequences. Reads that map with a large fraction of the sequences in a reference database (e.g. low complexity reads, or highly conserved sequence) carry very little information but require a lot of computational effort and memory to process. If your sequence database has a lot of conserved or redundant sequence (e.g. 16S rDNA), then this parameter will save a lot of computational expense with little if any effect on the results. If desired, a 2-pass approach can be used where reads are aligned to the full database and run through <code>ref_select</code> first to eliminate the vast majority of the database from consideration. Then in a second pass, reads are re-aligned to just the sequences selected in the first round, so that all aligning reads can be accounted for (e.g. in coverage and relative abundance calculations).</p>
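<p>The two-pass approach described above might be sketched as follows; the threshold value and all filenames are hypothetical placeholders (the alignment steps, which depend on your aligner, are omitted):</p>

```shell
# Pass 1: align reads to the full database (aligner-specific, not shown),
# then select reference sequences above a noise-floor bitscore threshold
ref_select -t 100.0 full_alignment.sam > pass1.json

# Extract a reduced FASTA containing only the selected sequences
graph_ops pass1.json FASTA '{"file":"selected.fna"}'

# Pass 2: re-align the reads to selected.fna only (not shown), then re-run
# ref_select so that all aligning reads are accounted for in coverage
# and relative abundance calculations
ref_select second_alignment.sam > final.json
```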
<p><b><code>-s</code> / <code>--second_chance_reads</code></b></p>
<p>Allow reads to be mapped to edit dist +1 mappings when their best ref is eliminated. By default, if a read maps to a sequence that ultimately doesn't meet the bitscore threshold being used to limit selection, that read is removed from further downstream consideration, if it didn't also map at least as well with some other sequence that is ultimately selected for output. When <code>--second_chance_reads</code> is used, such reads gain a "second chance" to be reallocated to reference sequence(s) where they map with edit distance +1 from the eliminated best match(es). This parameter causes <code>ref_select</code> to use significantly more memory. Note: this has no effect when <code>--bit_fraction</code> and <code>--bit_thresh</code> are at their default values, or when <code>--relax_read_sharing</code> is used.</p>
<p><b><code>--relax_read_sharing</code></b></p>
<p>Allow reads to be mapped to all sequences, regardless of edit distance. By default, reads are only associated with the sequence(s) that they align with at the minimum edit distance. <code>--relax_read_sharing</code> causes reads to be associated with all sequence alignments, even if they are suboptimal relative to the best match.</p>
<p><b><code>--all_taxa</code></b></p>
<p>Use count of all taxa in catalog to calculate bitscores. By default, any taxa in the reference database that have zero read alignments are excluded from consideration in the information theoretic bitscore calculations (i.e. the calculations proceed as though those sequences didn't exist in the reference database at all).</p>
<p><b><code>-a</code> / <code>--absolute_bitscores</code></b></p>
<p>Use absolute bitscores to score and select reference sequences. By default length normalized scores (bits/nt) are used to control for differences in reference sequence length.</p>
<p><b><code>--no_rand</code></b></p>
<p>By default, reads that map equally well to multiple locations within a single reference sequence are randomly assigned to one of the possible positions. The <code>--no_rand</code> option changes this behavior so that such a read will be mapped to the first position in the reference seen in the alignment file. Counterintuitively, it is likely that using <code>--no_rand</code> will make the output of <code>ref_select</code> <strong><em>less</em></strong> deterministic. This is because the pseudo-random number generator used by SEAStAR is cross-platform deterministic, whereas the process by which such multiple alignments are ordered in an alignment SAM file may be at least partially random (say through multi-thread race conditions in the alignment software.)</p>
<p><b><code>--sim_frac=<n></code></b></p>
<p>Fraction of aligned reads to use. Reads (and their mates) are randomly sampled from the input alignment(s) in this proportion. Primarily for use with simulated datasets.</p>
<p><b><code>-e <seq_id></code> / <code>--exclude=<seq_id></code></b></p>
<p>This option is conceptually similar to the <code>--detail</code> option, except that reference sequences identified with <code>--exclude</code> are discarded from the analysis entirely, as though they were not in the reference database to begin with. This is useful if a long, expensive alignment to a large reference sequence database has been completed, and then some downstream analysis calls for processing alignments with only a subset of that database.</p>
<p><b><code>--exclude_file=<exclude_file></code></b></p>
<p>Filename of file containing sequence ids to treat as in <code>--exclude</code>. Format is a plain text file with one <code><seq_id></code> per line.</p>
<p><b><code>--invert_exclude</code></b></p>
<p><code>--exclude</code> or <code>--exclude_file</code> sequences are inverted; that is, exclude all sequences except those specified.</p>
<h3 id="a-nameref_select_coverageref_select-parameters-for-coverage-calculationsa-options-"><a name="ref_select_coverage"><code>ref_select</code> parameters for coverage calculations</a> <code>[options]</code> :</h3>
<p><code>[--per_base]</code> <code>[-w <n>|--cov_window=<n>]</code> <code>[--gc_window=<n>]</code> <code>[--detect_dups]</code></p>
<p><b><code>--per_base</code></b></p>
<p>Per base statistics (coverage; and when <code>--mate_pair</code> is enabled, physical coverage and mean insert length) will be generated and output for all selected reference sequences (or only those specified by <code>--detail</code>).</p>
<p><b><code>-w <n></code> / <code>--cov_window=<n></code></b></p>
<p>Fixed read length for use in calculating per position coverage. By default the actual length of each read is used when available in the FASTQ readnames (<a href="#FASTQ">see the SEAStAR FASTQ format guide</a> for details.) If the readnames do not include length information and <code>--cov_window</code> isn't specified, then 49 nucleotides is the default fixed read length used.</p>
<p><b><code>--gc_window=<n></code></b></p>
<p>Enables per position %GC moving average calculation, within the specified window size. The results are output in the <code>per_nt_mean_gc</code> array in the JSON output for each selected node (or only those specified by <code>--detail</code>). Use of this option requires <code>--per_base</code> statistics and either <code>--recon_seq</code> or <code>--ref</code> to provide the sequence data necessary for the calculation. If <code>--mp_circular</code> is used, then the %GC calculation wraps around at the sequence ends with no discontinuity, otherwise the N/2 %GC values at each end of the sequence all have the same value as the first (or last) valid window.</p>
<p><b><code>--detect_dups</code></b></p>
<p>Perform additional analysis to detect likely collapsed duplications. Assumes relatively uniform coverage (i.e. a single isolate genome). Results are reported in the <code>contig_problems</code> section of the <a href="#JSON">JSON sequence graph</a>, a summary of which is available via the <code>graph_ops</code> <a href="#PROBS"><code>PROBS</code></a> command.</p>
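<p>Putting the coverage options together, a run producing per-base statistics and a 1000 nt %GC moving average might look like the following sketch (all filenames are hypothetical; note that <code>--gc_window</code> requires both <code>--per_base</code> and a source of sequence, here <code>--recon_seq</code>):</p>

```shell
# Per-base coverage plus %GC moving average, reconstructing sequence
# from the reads so the %GC calculation has nucleotide data to work with
ref_select --per_base --gc_window=1000 -q \
    -r sample.read1.fastq -r sample.read2.fastq \
    sample.read1.sam sample.read2.sam > per_base_graph.json
```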
<h3 id="a-nameref_select_reconref_select-parameters-for-sequence-reconstruction-and-read-filteringa-options-"><a name="ref_select_recon"><code>ref_select</code> parameters for sequence reconstruction and read filtering</a> <code>[options]</code> :</h3>
<p><code>[--ambig_tol=<n>]</code> <code>[--read_output=<prefix>]</code> <code>[--output_nomatch]</code> <code>[--read_output_gzip]</code></p>
<p><b><code>--ambig_tol=<n></code></b></p>
<p>Tolerance adjusting the sensitivity for generating ambiguity codes in output sequence reconstructed from read alignments. Range: 0.0 - 1.0, where 1.0 produces no ambiguities (majority-rules base calling). Default is 0.2.</p>
<p><b><code>--read_output=<prefix></code></b></p>
<p>Enables output of reads aligning with the reference database (and their mates, regardless of whether the mates align.) <code><prefix></code> indicates the FASTQ output filename prefix to use in writing the new output files. Must be used with <code>--read_file</code> and/or <code>--rev_read_file</code>.</p>
<p><b><code>--output_nomatch</code></b></p>
<p>Inverts the meaning of <code>--read_output</code> to write only reads that do not align with any reference sequence (and whose mates also do not align).</p>
<p><b><code>--read_output_gzip</code></b></p>
<p>Modifies <code>--read_output</code> to write gzip compressed output files.</p>
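<p>For example, the three read-filtering options above might be combined to write only the non-aligning reads (and their mates) as gzip compressed FASTQ files; the filenames here are hypothetical:</p>

```shell
# Write reads (and mates) that do NOT align with any reference,
# as gzipped FASTQ files with the prefix "unmatched"
ref_select --read_output=unmatched --output_nomatch --read_output_gzip \
    -r sample.read1.fastq -r sample.read2.fastq \
    sample.read1.sam sample.read2.sam > filtered.json
```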
<h3 id="a-nameref_select_pairingref_select-parameters-for-generating-read-pairing-statisticsa-options-"><a name="ref_select_pairing"><code>ref_select</code> parameters for generating read pairing statistics</a> <code>[options]</code> :</h3>
<p><code>[--mp_mate_cnt=<n>]</code> <code>[--mp_mate_lim=<n>]</code> <code>[--mp_share_lim=<n>]</code> <code>[--mp_strict]</code> <code>[--mp_inserts]</code> <code>[--mp_circular]</code> <code>[--mp_cutoff=<n>]</code></p>
<p><b><code>--mp_mate_cnt=<n></code></b></p>
<p>Minimum count of mapping read-pairs for generation of a mate-pair linking edge in the output graph (normal or internal). Default is 2, meaning that no edges consisting of a single read-pair will be output.</p>
<p><b><code>--mp_mate_lim=<n></code></b></p>
<p>Minimum bitscore threshold for generation of a mate-pair linking edge in the output graph (normal or internal). Default is 0.0 bits; however, <code>--mp_mate_cnt</code> above puts a floor on the lowest scoring edges output.</p>
<p><b><code>--mp_share_lim=<n></code></b></p>
<p>Minimum bitscore threshold for generation of a shared sequence edge in the output graph. Default is 500.0 bits.</p>
<p><b><code>--mp_strict</code></b></p>
<p>Do not consider paired reads that span reference sequences when the same pair of reads also map entirely within some single reference sequence.</p>
<p><b><code>--mp_inserts</code></b></p>
<p>Perform insert size estimation for all reference sequences where at least one pair of reads both map within that sequence.</p>
<p><b><code>--mp_circular</code></b></p>
<p>Allow circular self-linking mate-pairs to join the opposite ends of a single sequence.</p>
<p><b><code>--mp_cutoff=<n></code></b></p>
<p>Insert size cutoff for inclusion of an individual mate-pair in per base statistics. Setting this option to some reasonable upper limit on the length of a valid insert will prevent sequence duplications from improperly skewing estimation of the true mean insert size.</p>
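<p>A typical pairing-enabled run for <em>de novo</em> scaffolding might combine these options as below; the parameter values and filenames are illustrative only, not recommended defaults:</p>

```shell
# Enable mate-pairing, sequence reconstruction, insert-size estimation
# and circular self-links; require at least 3 read-pairs per linking edge
# and ignore apparent inserts longer than 5000 nt in per-base statistics
ref_select -m -q --mp_inserts --mp_circular \
    --mp_mate_cnt=3 --mp_cutoff=5000 \
    -r asm.read1.fastq -r asm.read2.fastq \
    asm.read1.sam asm.read2.sam > asm_graph.json
```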
<h3 id="a-nameref_select_inputsref_select-parameters-for-specifying-input-filesa-options-"><a name="ref_select_inputs"><code>ref_select</code> parameters for specifying input files</a> <code>[options]</code> :</h3>
<p><code>[-r <read_file>|--read_file=<read_file>]...</code> <code>[--rev_read_file=<read_file>]...</code> <code>[--old_bwa_samse]</code> <code>[-c <catalog_file>|--catalog=<catalog_file>]</code> <code>[--rev_align=<sam_file>]...</code></p>
<p><strong>NOTE!</strong> All inputs to <code>ref_select</code> involving per read data (FASTQ and SAM files) must be named using the SEAStAR <a href="#FASTQ">FASTQ naming conventions</a>; also see the <a href="#ref_select">introductory description of ref_select</a> for an example.</p>
<p><b><code>-r <read_file></code> / <code>--read_file=<read_file></code></b></p>
<p>Input filename(s) for FASTQ read files used for sequence reconstruction and/or read filtering.</p>
<p><b><code>--rev_read_file=<read_file></code></b></p>
<p>Same as <code>--read_file</code> above, but used for paired reads that are on the opposite strand from each other. By default, <code>ref_select</code> assumes that paired reads are from the same DNA strand. When this is not true, then one of the two readsets (read1/read2) must be specified using <code>--rev_read_file</code>. Note, this must be done consistently with the use of <code>--rev_align</code>.</p>
<p><b><code>--old_bwa_samse</code></b></p>
<p>Use the old BWA-style "samse" alignment format instead of SAM. This format was used by the BWA <code>samse -n</code> command until BWA version 0.5.6.</p>
<p><b><code>-c <catalog_file></code> / <code>--catalog=<catalog_file></code></b></p>
<p>Filename of a reference catalog file. Required for use with <code>--old_bwa_samse</code>. This is optional otherwise, although it is required to associate sequence descriptions with reference sequences in the JSON output, or for use of the <code>--rollup</code> functionality when reference sequences are actually subsequences of parent sequences (e.g. scaffolds or chromosomes of a whole genome). A catalog is a tab separated (TSV) plain text file with the following columns:</p>
<pre><code> seq_id seq_len parent_id seq_description</code></pre><p>For example:</p>
<pre><code> ABCD123 1102938 ORG1 Organism 1, Scaffold 1
EFGH987 482273 ORG1 Organism 1, Scaffold 2
ZYXW765 193475 ORG2 Organism 2, Scaffold 1
RTSP345 782273 ORG2 Organism 2, Scaffold 2</code></pre><p>Note that each <code>seq_id</code> must be unique, but duplicated <code>parent_id</code> values indicate relationships among sequences (a shared parent sequence).</p>
<p><b><code>--rev_align=<sam_file></code></b></p>
<p>Same as the main <sam_file> input filenames, but used for paired reads that are on the opposite strand from each other. By default, <code>ref_select</code> assumes that paired reads are from the same DNA strand. When this is not true, then alignments for one of the two readsets must be specified using <code>--rev_align</code>. Note, this must be done consistently, and also with the use of <code>--rev_read_file</code> when input FASTQ files are also used.</p>
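<p>For example, if the <code>.read2</code> reads are on the opposite strand from their mates, both those reads and their alignments must be supplied through the reverse-input options together; the filenames here are hypothetical:</p>

```shell
# read2 reads and their alignments both go through the rev_* options,
# keeping --rev_read_file and --rev_align consistent with each other
ref_select -m -q \
    -r sample.read1.fastq --rev_read_file=sample.read2.fastq \
    sample.read1.sam --rev_align=sample.read2.sam > graph.json
```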
<hr>
<h2 id="a-namegraph_opsgraph_opsa"><a name="graph_ops"><code>graph_ops</code></a></h2>
<p><code>graph_ops</code> reads <a href="#JSON">JSON sequence graph files</a> produced by the <code>ref_select</code> program. It implements a variety of commands for manipulating this data for assembly, visualization or output to a variety of file formats (as detailed for each subcommand below). Because the JSON sequence graph objects are often very large, and therefore may take significant time to read and write, we designed <code>graph_ops</code> as a federated suite of tools that can pass the current sequence graph state from one tool to the next in-memory. This saves considerable time and I/O traffic relative to an alternative design that used separate command line tools for each operation performed by <code>graph_ops</code>. A consequence of this design is that <code>graph_ops</code> may be used in either a scripted mode, where all commands are supplied via the command-line or a script (<code>.go</code>) file; or in an interactive mode, where commands are manually entered one at a time.</p>
<h4 id="usage-graph_ops--h--help-inputjson-scriptgo-command-params">Usage: <code>graph_ops [-h|--help] [<input.json>] [<script.go>] [<command> ['{params}']]...</code></h4>
<p>where:</p>
<ul>
<li><p><b><code>-h</code> / <code>--help</code></b> provides command line help regarding program usage and version information</p>
<pre><code> # Provide help about shell command line usage
graph_ops --help
# Provide help about graph_ops commands
graph_ops HELP # HELP is actually a graph_ops command</code></pre></li>
<li><p><b><code><command></code></b> is some valid command and <b><code>'{params}'</code></b> optionally specifies parameters for each command, in the form: <code>'{"parm1":value1,"parm2":value2...}'</code>. Multiple commands with optional parameters may be provided in succession for execution. NOTE: the parameters for each command must be single quoted to ensure that it is passed to <code>graph_ops</code> as a single item, and to prevent interactions with special characters used by the UNIX shell.</p>
<pre><code> # Read a sequence graph and output all of the sequences to a FASTA file
graph_ops LOAD '{"file":"assembly.json"}' FASTA '{"file":"sequence.fna"}'</code></pre><p> Like most of the other tools in SEAStAR, <code>graph_ops</code> can read and write gzip compressed versions of all of its input and output files:</p>
<pre><code> # Read a compressed sequence graph and output all of the sequences to a compressed FASTA file
graph_ops LOAD '{"file":"assembly.json.gz"}' FASTA '{"file":"sequence.fna.gz"}'</code></pre></li>
<li><p><b><code><input.json></code></b> is an optional shortcut specifying a sequence graph file to initially <code>LOAD</code></p>
<pre><code> # Load the sequence graph file and go interactive
graph_ops assembly.json
# Equivalent to
graph_ops LOAD '{"file":"assembly.json"}' SCRIPT</code></pre></li>
<li><p><b><code><script.go></code></b> is an optional <code>SCRIPT</code> to initially execute</p>
<pre><code> # Load the sequence graph file and run the scaffold script
graph_ops assembly.json write_fasta.go
# Equivalent to
graph_ops LOAD '{"file":"assembly.json"}' SCRIPT '{"file":"write_fasta.go"}'
# If the write_fasta script contains the LOAD command, then just this will suffice
graph_ops write_fasta.go</code></pre><p> In this last case, the <code>write_fasta.go</code> script could be (note that single quotes around the parameters in a <code>.go</code> file are not necessary):</p>
<pre><code> # This is the write_fasta.go file
LOAD {"file":"assembly.json"}
FASTA {"file":"sequence.fna"}</code></pre></li>
</ul>
<p>Running <code>graph_ops</code> without <strong><em>any</em></strong> options will launch in interactive mode, equivalent to: <code>graph_ops SCRIPT</code></p>
<p>Note that many <code>graph_ops</code> commands may write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when using commands such as <code>DUMP</code>, <code>DOT</code>, <code>TABLE</code>, or <code>FASTA</code> (without specifying an output filename) and redirecting the STDOUT of <code>graph_ops</code> to a file. Likewise, only one such output command may use STDOUT per run of <code>graph_ops</code> (e.g. redirecting the output of both <code>FASTA</code> and <code>DUMP</code> to the same file will result in a file that is both invalid JSON and invalid FASTA format.)</p>
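<p>For example, when capturing <code>FASTA</code> output from STDOUT, redirecting STDERR elsewhere keeps warnings and diagnostics out of the sequence file:</p>

```shell
# FASTA sequence goes to contigs.fna; diagnostics go to graph_ops.log
graph_ops assembly.json FASTA > contigs.fna 2> graph_ops.log
```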
<h3 id="graph_ops-command-summary"><code>graph_ops</code> <code><command></code> summary:</h3>
<h4 id="file-io-commands">File I/O commands</h4>
<ul>
<li><strong><a href="#LOAD"><code>LOAD</code></a></strong> -- Input current data structure from a file, or by default from STDIN</li>
<li><strong><a href="#DUMP"><code>DUMP</code></a></strong> -- Output current data structure to a file, or by default to STDOUT</li>
<li><strong><a href="#TABLE"><code>TABLE</code></a></strong> -- Write <code>ref_select</code> statistics for selected reference sequences to a TSV file</li>
<li><strong><a href="#FASTA"><code>FASTA</code></a></strong> -- Write sequences contained in the selected graph to the FASTA format</li>
<li><strong><a href="#DOT"><code>DOT</code></a></strong> -- Write the current assembly graph to the graphviz DOT format</li>
</ul>
<h4 id="assembly-pipeline-commands-in-this-order">Assembly pipeline commands (in this order)</h4>
<ul>
<li><strong><a href="#MST"><code>MST</code></a></strong> -- Calculate the Maximal Spanning Tree of all connected components</li>
<li><strong><a href="#SST"><code>SST</code></a></strong> -- Calculate the improved Scaffold Spanning Tree of all connected components. <code>SST</code> is an optional replacement for <code>MST</code>, useful for metagenomes or relatively short contigs.</li>
<li><strong><a href="#PLUCK"><code>PLUCK</code></a></strong> -- Remove all leaf contig nodes (in- or out-degree == 0) from the graph</li>
<li><strong><a href="#PRUNE"><code>PRUNE</code></a></strong> -- Split the assembly graph at all contig nodes with in or out degree > 1</li>
<li><strong><a href="#SLICE"><code>SLICE</code></a></strong> -- Break linear scaffolds at a GC / coverage discontinuity. <code>SLICE</code> is optional, but recommended for metagenomes.</li>
<li><strong><a href="#PUSH"><code>PUSH</code></a></strong> -- Extends scaffold ends (reversing the action of PLUCK at scaffold ends)</li>
<li><strong><a href="#INSERT"><code>INSERT</code></a></strong> -- Reconnect contigs with ambiguous placement within a scaffold using mapped pair position information to resolve ambiguities</li>
<li><strong><a href="#SCAFF"><code>SCAFF</code></a></strong> -- Lay out a fully linear scaffold from contigs in unambiguously ordered connected components</li>
<li><strong><a href="#CLUST"><code>CLUST</code></a></strong> -- Cluster scaffolds using the <code>tetracalc</code> tool</li>
</ul>
<h4 id="assembly-graph-filter--edit-utilities">Assembly graph filter / edit utilities</h4>
<ul>
<li><strong><a href="#CIRCLE"><code>CIRCLE</code></a></strong> -- Find circular sequences among scaffolds.</li>
<li><strong><a href="#CCOMPS"><code>CCOMPS</code></a></strong> -- Calculate the current graph connected components</li>
<li><strong><a href="#SELCC"><code>SELCC</code></a></strong> -- Select specific connected components for further processing</li>
<li><strong><a href="#SELCLUST"><code>SELCLUST</code></a></strong> -- Select clusters of scaffolds for further processing</li>
<li><strong><a href="#SELND"><code>SELND</code></a></strong> -- Select contigs from the connected neighborhoods of the given contigs</li>
<li><strong><a href="#SCAFLNK"><code>SCAFLNK</code></a></strong> -- Find edge connections linking scaffold ends</li>
<li><strong><a href="#RELINK"><code>RELINK</code></a></strong> -- Reconnect previously removed edge connections between contigs</li>
<li><strong><a href="#EDGFLT"><code>EDGFLT</code></a></strong> -- Remove all edges scoring less than some number of bits</li>
<li><strong><a href="#PROBS"><code>PROBS</code></a></strong> -- Generate a report of contigs with likely internal assembly issues</li>
<li><strong><a href="#EDIT"><code>EDIT</code></a></strong> -- Make manual explicit edits to the selected graph structure</li>
<li><strong><a href="#CUTND"><code>CUTND</code></a></strong> -- Create a new node from an existing node using the given coordinates</li>
</ul>
<h4 id="information-and-control">Information and control</h4>
<ul>
<li><strong><a href="#GC"><code>GC</code></a></strong> -- Generate statistics about the assembly graph</li>
<li><strong><a href="#GCC"><code>GCC</code></a></strong> -- Generate statistics about the assembly graph, with details about each connected component</li>
<li><strong><a href="#STASH"><code>STASH</code></a></strong> -- Copy (push) the current graph to the top of a stack in memory</li>
<li><strong><a href="#UNSTASH"><code>UNSTASH</code></a></strong> -- Restore (pop) the current graph from the top of the stack (undo changes)</li>
<li><strong><a href="#SCRIPT"><code>SCRIPT</code></a></strong> -- Use contents of file as a series of commands, or read from the console if no file provided</li>
<li><strong><a href="#HELP"><code>HELP</code></a></strong> -- List all valid commands, or provide detailed help for a specific command with <code>HELP <COMMAND></code></li>
</ul>
<hr>
<h3 id="graph_ops-detailed-command-reference"><code>graph_ops</code> detailed <code><command></code> reference:</h3>
<h3 id="a-nameloadloada"><a name="LOAD"><code>LOAD</code></a></h3>
<p>Input JSON sequence graph data from a file, or by default from STDIN.</p>
<p>On the command line, if the first parameter isn't a valid command string, and it ends in <code>.json[.gz]</code>, then it is assumed to be the name of a JSON file, and an implicit <code>LOAD</code> command will be run using that filename.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><b><code>file : "filename.json[.gz]"</code></b></p>
<p> Specify the name of a JSON format sequence graph file.</p>
<p> Examples:</p>
<pre><code> # Read data from the file my_assembly.json
LOAD {"file":"my_assembly.json"}
# Read data piped from STDIN (default)
LOAD {"file":"-"}</code></pre></li>
<li><p><b><code>files : ["filename1.json[.gz]",...]</code></b></p>
<p> Specify the names of multiple JSON format sequence graph files to merge together.</p>
<p> NOTE: The files must have uniquely named sequence nodes (contigs) with no nodes in common among the multiple files being loaded. Only the processing history from the first JSON file is retained (the other histories may still be found in their original files.) Other processing output such as scaffolds and clusters are lost.</p>
<p> Example:</p>
<pre><code> # Read and merge data from the files my_assembly_1.json and my_assembly_2.json
LOAD {"files":["my_assembly_1.json","my_assembly_2.json"]}</code></pre></li>
<li><p><b><code>edges : "edge_file.json[.gz]"</code></b></p>
<p> Specify the name of a JSON format sequence graph file to load edges and mate-pair statistics from.</p>
<p> This option works with the <code>file</code> option, and requires that the edge file contain a graph with a subset of the nodes of the main graph being loaded with <code>file</code>. Any edges and mate-pair statistics present in the main graph are discarded in favor of those from the <code>edges</code> file.</p>
<p> Example:</p>
<pre><code> # Read nodes and sequence data from main_file.json, and edges and connection
# data from edge_file.json
LOAD {"file":"main_file.json","edges":"edge_file.json"}</code></pre></li>
</ul>
<h3 id="a-namedumpdumpa"><a name="DUMP"><code>DUMP</code></a></h3>
<p>Output current sequence graph data to a JSON file, or by default to STDOUT</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><b><code>file : "filename.json[.gz]"</code></b></p>
<p> Specify a file name to write the JSON format sequence graph to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name.</p>
<p> Examples:</p>
<pre><code> # Write data to the file my_assembly.json
DUMP {"file":"my_assembly.json"}
# Write data piped to STDOUT (default)
DUMP {"file":"-"}
# Write data to the file: my_assembly_000.json (first time this is run)
DUMP {"file":"my_assembly_###.json"}
# Run again, write data to the file: my_assembly_001.json
DUMP {"file":"my_assembly_###.json"}</code></pre></li>
</ul>
<h3 id="a-nametabletablea"><a name="TABLE"><code>TABLE</code></a></h3>
<p>Write <code>ref_select</code> statistics for selected reference sequences to a TSV file. The fields are (left to right by column):</p>
<ul>
<li><code>bitscore</code> - Information content of reads aligning with the ref sequence</li>
<li><code>read_cnt</code> - Number of (possibly fractional) reads aligning with the ref sequence</li>
<li><code>norm_cnt</code> - Read_cnt normalized to ref sequence length</li>
<li><code>rel_abun</code> - Relative fractional abundance (copy number) of the ref sequence relative to all selected sequences (those with bitscores above the selection threshold)</li>
<li><code>mean_cov</code> - Mean coverage of the reference sequence by (possibly fractional) reads</li>
<li><code>read_len</code> - Mean length of reads aligning with this sequence</li>
<li><code>seq_len</code> - Length of the ref sequence</li>
<li><code>pct_gc</code> - Percent GC content of ref sequence (NA if not calculated)</li>
<li><code>name</code> - Catalog name of the ref sequence</li>
<li><code>desc</code> - Catalog description of the ref sequence</li>
</ul>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>file : "filename.tsv[.gz]"</code></p>
<p> Specify a file name to write the TSV format stats table to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name. (See the <code>DUMP</code> command for an example of this behavior.)</p>
<p> Examples:</p>
<pre><code> # Write stats to the file my_seq.tsv
TABLE {"file":"my_seq.tsv"}
# Write stats to STDOUT (default)
TABLE {"file":"-"}</code></pre></li>
<li><p><code>header : true</code></p>
<p> Write a header row in the output table with labels for each column.</p>
<p> Example:</p>
<pre><code> # Write a header row (false is default).
TABLE {"header":true}</code></pre></li>
</ul>
<h3 id="a-namefastafastaa"><a name="FASTA"><code>FASTA</code></a></h3>
<p>Write sequences contained in the current graph data to a FASTA format file.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>file : "filename.fasta[.gz]"</code></p>
<p> Specify a file name to write the FASTA format sequence data to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name. (See the <code>DUMP</code> command for an example using this behavior.)</p>
<p> Examples:</p>
<pre><code> # Write sequences to the file my_seq.fasta
FASTA {"file":"my_seq.fasta"}
# Write sequence to STDOUT (default)
FASTA {"file":"-"}</code></pre></li>
<li><p><code>scaff : true</code> or <code>scaff : "[options]"</code></p>
<p> Output fully scaffolded sequences (using the <code>seq_scaffold</code> tool). Using this parameter with the value <code>true</code> runs <code>seq_scaffold</code> with its default settings. As an advanced option, this parameter can also accept a string argument, which is passed to the external <code>seq_scaffold</code> tool as its <code>[options]</code> parameter string. Run with the option string <code>"--help"</code> for help with the settings offered by <code>seq_scaffold</code> and their default values.</p>
<p> Examples:</p>
<pre><code> # Output scaffolded contig sequences using default settings.
FASTA {"scaff":true}
# Get seq_scaffold help
FASTA {"scaff":"--help"}
# Pass seq_scaffold some options
FASTA {"scaff":"--overlap=7 --heal=othercontigs.fna"}</code></pre></li>
<li><p><code>no_merge_scaffs : true</code></p>
<p> Write contig sequences in scaffold order with a scaffold ID in each header, ready to be provided to the <code>seq_scaffold</code> tool (but do not run it).</p>
<p> Example:</p>
<pre><code> # Output ordered contig sequences
FASTA {"no_merge_scaffs":true}</code></pre></li>
<li><p><code>abundance : true</code></p>
<p> Append relative abundance values to the FASTA sequence IDs. This is for use with downstream classification and visualization tools, such as for the 16S rDNA analysis pipeline vignette.</p>
<p> Example:</p>
<pre><code> # Attach abundance values
FASTA {"abundance":true}</code></pre></li>
</ul>
<h3 id="a-namedotdota"><a name="DOT"><code>DOT</code></a></h3>
<p>Write the current graph data to a <a href="http://www.graphviz.org/">graphviz</a> format <a href="http://www.graphviz.org/content/dot-language">DOT</a> file for visualization or other processing. Some of the parameters listed below allow you to change node and edge parameters that will affect the output style of graphs rendered using the <a href="http://www.graphviz.org/pdf/dotguide.pdf"><code>dot</code></a> or <a href="http://www.graphviz.org/pdf/neatoguide.pdf"><code>neato</code></a> layout tools that are part of the graphviz package. The parameters provided here give only very limited control of the most common style choices available. The <a href="http://www.graphviz.org/pdf/gvpr.1.pdf"><code>gvpr</code></a> language supplied by graphviz is much better suited for more detailed manipulation of the myriad of rendering options available.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>file : "filename.dot[.gz]"</code></p>
<p> Specify a file name to write the DOT format graph data to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name. (See the <code>DUMP</code> command for an example using this behavior.)</p>
<p> Example:</p>
<pre><code> # Write graph data to the file my_graph.dot
DOT {"file":"my_graph.dot"}
# Write graph to STDOUT (default)
DOT {"file":"-"}</code></pre></li>
<li><p><code>detail : true</code></p>
<p> Draw labelled contig nodes</p>
<p> Example:</p>
<pre><code> # Draw contigs as ovals containing the node names of each contig.
# Note that this disables scaling the node size by contig length.
DOT {"detail":true}</code></pre></li>
<li><p><code>arrowtype : "arrowtype_string"</code></p>
<p> Control the type of arrowheads drawn. See graphviz documentation at: <a href="http://www.graphviz.org/doc/info/attrs.html#k:arrowType">http://www.graphviz.org/doc/info/attrs.html#k:arrowType</a></p>
<p> Example:</p>
<pre><code> # Draw normal arrows (default)
DOT {"arrowtype":"normal"}</code></pre></li>
<li><p><code>pen_scale : <float></code></p>
<p> Control the relative thickness of the arrow lines. See graphviz documentation at: <a href="http://www.graphviz.org/doc/info/attrs.html#d:penwidth">http://www.graphviz.org/doc/info/attrs.html#d:penwidth</a></p>
<p> Example:</p>
<pre><code> # Scale the arrow line thickness by a factor of 0.1
DOT {"pen_scale":0.1}</code></pre></li>
<li><p><code>const_edge : true</code></p>
<p> Draw connection arrow lines at constant width</p>
<p> Example:</p>
<pre><code> # Draw edge arrows of constant width regardless of bitscore.
DOT {"const_edge":true}</code></pre></li>
<li><p><code>colored_edges : true</code></p>
<p> Draw edges colored by the GC% of the connected contigs. Note that this overrides any other edge coloring that would otherwise have been applied.</p>
<p> Example:</p>
<pre><code> # Draw colored arrows
DOT {"colored_edges":true}</code></pre></li>
<li><p><code>problems : true</code></p>
<p> Draw nodes with potential assembly problems as red double circles with the node label text inside. Note that this overrides the normal node text that would otherwise have been rendered.</p>
<p> Example:</p>
<pre><code> # Highlight problem nodes
DOT {"problems":true}</code></pre></li>
</ul>
<h3 id="a-namemstmsta"><a name="MST"><code>MST</code></a></h3>
<p>Calculate the <strong>M</strong>aximal <strong>S</strong>panning <strong>T</strong>ree of all connected components. This command reduces the connection graph to a tree structure that maximizes the total remaining edge bitscore, preserving all previously connected components of the graph, and directs all remaining edges of the graph based on the pairing information.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>bits : true</code></p>
<p> Use raw connection bitscores instead of the default heuristically adjusted scores (based on GC% / coverage differences)</p>
<p> Example:</p>
<pre><code> # Use raw connection bitscores from mate-pairing
MST {"bits":true}</code></pre></li>
</ul>
<h3 id="a-namesstssta"><a name="SST"><code>SST</code></a></h3>
<p>As an alternative to the <code>MST</code> command, calculate the <strong>S</strong>caffold <strong>S</strong>panning <strong>T</strong>ree of all connected components, producing a sub-optimal tree from the perspective of maximizing edge bitscores, but with important properties for successful contig scaffolding in the following steps. This command is generally preferable to the MST command when median contig length is less than the mean distance between paired reads. That is, when a relatively large mate-pair insert size was selected for the sequencing library, relative to the size of the contigs that could ultimately be assembled. This condition is often the case for metagenomic samples, where inter-strain variability adversely affects the median contig length that can be achieved.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>bits : true</code></p>
<p> Use raw connection bitscores instead of the default heuristically adjusted scores (based on GC% / coverage differences)</p>
<p> Example:</p>
<pre><code> # Use raw connection bitscores from mate-pairing
SST {"bits":true}</code></pre></li>
</ul>
<h3 id="a-namepluckplucka"><a name="PLUCK"><code>PLUCK</code></a></h3>
<p>Remove all leaf contig nodes (in- or out-degree == 0) from the (tree-structured) graph. Two (or sometimes three) iterations are usually sufficient to achieve the goal of removing all non-terminal short branches from the tree. These contigs will be added back to the assembly in later operations once the "backbone" scaffolds have been determined. NOTE: <code>PLUCK</code> does not remove two "leaf" contig nodes that form a connected pair; that
is, multiple iterations of <code>PLUCK</code> will not completely remove contigs that happen to form short chains. Also,
<code>PLUCK</code> preserves entirely unconnected contigs with more than 20000 bases of sequence by default. See the <code>min_len</code> parameter below.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>iterate : <int></code></p>
<p> Specify the number of iterations of PLUCK to run. Each is equivalent to running PLUCK again as a separate command. By default <code><int></code>= 3.</p>
<p> Examples:</p>
<pre><code> # Default is {"iterate":3}
PLUCK
# This is the equivalent of the above command
PLUCK {"iterate":1}
PLUCK {"iterate":1}
PLUCK {"iterate":1}</code></pre></li>
<li><p><code>min_len : <int></code></p>
<p> Minimum length of isolated nodes to keep. Default <code><int></code>= 20000 bases.</p>
<p> Example:</p>
<pre><code> # Do not "pluck away" an unconnected contig containing 20000 or more bases of sequence.
PLUCK {"min_len":20000}</code></pre></li>
</ul>
<h3 id="a-namepruneprunea"><a name="PRUNE"><code>PRUNE</code></a></h3>
<p>Split the assembly graph at all contig nodes with in or out degree > 1. This command prunes remaining branches from the assembly tree by selectively removing the lowest scoring extra in- or out- edges from a node. After this operation is complete, all remaining connected components will be linear and properly directed.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>strict : true</code></p>
<p> Always cut all but the strongest link when there are more than two in + out links</p>
<p> Examples:</p>
<pre><code> # The default case: attempt to determine which branch(es) to remove in order to
# preserve one high weight path across a multiply linked contig node.
PRUNE {"strict":false}
# Always strictly prune. Equivalent to {"ratio":0.0} below
PRUNE {"strict":true}</code></pre></li>
<li><p><code>ratio : <float></code></p>
<p> Threshold ratio of the next-strongest to the strongest link score above which strict pruning is triggered</p>
<p> Examples:</p>
<pre><code> # Always strictly prune.
PRUNE {"ratio":0.0}
# Never strictly prune.
PRUNE {"ratio":1.0}
# Default. Don't strictly prune when the score of the highest scoring edge
# is at least 5x greater than that of the next highest scoring link of the
# same direction (for both in- and out-links).
PRUNE {"ratio":0.2}</code></pre></li>
<li><p><code>verbose : true</code></p>
<p> Output diagnostics on STDERR</p>
<p> Example:</p>
<pre><code> # Generate extra output information
PRUNE {"verbose":true}</code></pre></li>
</ul>
<h3 id="a-namesliceslicea"><a name="SLICE"><code>SLICE</code></a></h3>
<p>Break linear scaffolds at a %GC / Coverage discontinuity. Each run of <code>SLICE</code> will break a given scaffold at no more than one position. <code>SLICE</code> should be run multiple times if multiple misassemblies per scaffold are suspected.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>thresh : <float></code></p>
<p> Threshold used to determine whether or not to break a connection</p>
<p> Example:</p>
<pre><code> # Default. Medium strength heuristic score based on GC% / Coverage statistics of a
# scaffold on either side of a given contig's connection edge. The lower the
# threshold, the more sensitive the filter is to such discontinuities.
SLICE {"thresh":0.5}</code></pre></li>
</ul>
<h3 id="a-namepushpusha"><a name="PUSH"><code>PUSH</code></a></h3>
<p>Extend scaffold ends, reversing the (undesired) action of <code>PLUCK</code> at scaffold ends. Because the <code>PLUCK</code> command can't distinguish short branches from what will ultimately become scaffold ends, it essentially erodes a contig off of each end with every iteration. To reverse this, <code>PUSH</code> should be run for the same number of iterations as <code>PLUCK</code> was previously.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>iterate : <int></code></p>
<p> Number of iterations of PUSH to run. Each is equivalent to running <code>PUSH</code> again as a separate command. By default <code><int></code>= 3.</p>
<p> Examples:</p>
<pre><code> # Default case.
PUSH {"iterate":3}
# Equivalent to above.
PUSH {"iterate":2}
PUSH {"iterate":1}</code></pre></li>
</ul>
<h3 id="a-nameinsertinserta"><a name="INSERT"><code>INSERT</code></a></h3>
<p>Reconnect contigs with ambiguous placement within a scaffold using relative mean read-pair mapping position information to resolve ambiguities. By default this will run on all scaffolds. <code>INSERT</code> works by fully connecting contigs within the scaffold when their relative position can be unambiguously determined, and adding new "constraint" edges (supported only by the relative mapped position information) between such contigs so that they can be ordered correctly by the <code>SCAFF</code> command.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>ccname : "contig_name"</code></p>
<p> Use only the connected component (scaffold) containing the named contig.</p>
<p> Example:</p>
<pre><code> # Insert removed contigs within the connected component containing this contig
INSERT {"ccname":"NODE_1234"}</code></pre></li>
<li><p><code>ccnames : ["contig_name1","contig_name2",...]</code></p>
<p> Insert removed contigs within only the connected components (scaffolds) containing these contigs.</p>
<p> Example:</p>
<pre><code> INSERT {"ccnames":["NODE_1234","NODE_567"]}</code></pre></li>
<li><p><code>thresh : <float></code></p>
<p> Minimum contig connection bitscore to consider</p>
<p> Example:</p>
<pre><code> # Default. Only connections scoring at least 250.0 bits
# will count in the calculations
INSERT {"thresh":250.0}</code></pre></li>
<li><p><code>min_pos_diff : <int></code></p>
<p> Minimum mapping position difference (within a neighboring contig) considered reliable enough to use for resolving positional ambiguities.</p>
<p> Example:</p>
<pre><code> # Default. Only pairing information yielding a relative position difference of
# at least 75 nucleotides (between candidate contigs and a neighboring existing
# contig) will be considered significant enough to use in determining the
# relative ordering of the candidate contigs
INSERT {"min_pos_diff":75}</code></pre></li>
<li><p><code>dup_kmer : <int></code></p>
<p> Length of k-mers to use for detecting variant duplicate contigs</p>
<p> Examples:</p>
<pre><code> # Default. Use 14-mers to detect duplicate variant contigs before inserting
# them between scaffold backbone contigs.
INSERT {"dup_kmer":14}
# Disable duplicate variant contig checks.
INSERT {"dup_kmer":0}</code></pre></li>
<li><p><code>dup_thresh : <float></code></p>
<p> Fraction of k-mers from a contig with hits to the scaffold backbone, above which the contig will not be inserted.</p>
<p> Example:</p>
<pre><code> # Default. If more than 15% of kmers for a contig hit kmers from the scaffold
# backbone contigs, then do not insert that contig.
INSERT {"dup_thresh":0.15}</code></pre></li>
<li><p><code>external : true</code></p>
<p> Controls whether contigs may be inserted outside of an input scaffold.</p>
<p> Examples:</p>
<pre><code> # Default.
# Contigs will not be added outside of the first and last contigs in the input
# scaffold; that is, the only inserts performed will be between existing connected
# contigs.
INSERT {"external":false}
# Contigs may be added to the scaffold before the first, or after the last, contig
# in an input scaffold.
INSERT {"external":true}</code></pre></li>
<li><p><code>verbose : true</code></p>
<p> Output diagnostics on STDERR</p>
<p> Example:</p>
<pre><code> # Generate extra output information
INSERT {"verbose":true}</code></pre></li>
</ul>
<h3 id="a-namescaffscaffa"><a name="SCAFF"><code>SCAFF</code></a></h3>
<p>Lay out a fully linear scaffold from contigs in unambiguously ordered connected components (as produced by the <code>INSERT</code> command)</p>
<p><strong>Parameters: NONE.</strong></p>
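<p>As an illustrative sketch (not a prescription), the commands described above might be combined into a scaffolding sequence like the following; the iteration counts are examples and should be chosen to suit your data:</p>
<pre><code> # Build a directed spanning tree from the connection graph
SST
# Remove short branches from the tree
PLUCK {"iterate":3}
# Cut remaining branch points, leaving linear connected components
PRUNE
# Break scaffolds at GC% / coverage discontinuities
SLICE
# Restore contigs eroded from scaffold ends (same iteration count as PLUCK)
PUSH {"iterate":3}
# Re-place ambiguously positioned contigs, then lay out final linear scaffolds
INSERT
SCAFF</code></pre>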
<h3 id="a-nameclustclusta"><a name="CLUST"><code>CLUST</code></a></h3>
<p>Cluster scaffolds using the external <code>tetracalc</code> tool</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>options : <string></code></p>
<p> Command-line options for the <code>tetracalc</code> tool; if omitted, the tool's defaults are used. Run with the option string <code>"--help"</code> for help with the settings offered by <code>tetracalc</code>.</p>
<p> Examples:</p>
<pre><code> CLUST {"options":"--merge_tar=0.95 -m 7500"}
CLUST {"options":"--help"}
CLUST {"options":"--fixed -t 0.9 -s 0.8 -r 0.9"}</code></pre></li>
</ul>
<h3 id="a-namecirclecirclea"><a name="CIRCLE"><code>CIRCLE</code></a></h3>
<p>Find circular sequences among scaffolds</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>circular : true</code></p>
<p> Return circularized scaffolds.</p>
<p> Default (<code>false</code>): Produce linear scaffolds placing the ends within the largest
contig of each connected component. The node with the longest sequence will be
divided equally into two and the mate-pair connections also divided, such that
they form two clean ends for the new scaffold.</p>
<p> Example:</p>
<pre><code> # Produce circular scaffolds.
CIRCLE {"circular":true}</code></pre></li>
<li><p><code>thresh : <float></code></p>
<p> Minimum bitscore of end-connecting pairing information</p>
<p>Example:</p>
<pre><code> # Default. Circle-completing end connections must have a bitscore of at least
# 0.0 to qualify.
CIRCLE {"thresh":0.0}</code></pre></li>
</ul>
<h3 id="a-nameccompsccompsa"><a name="CCOMPS"><code>CCOMPS</code></a></h3>
<p>Calculate the current assembly graph connected components. Note: this is a low-level operation that is typically performed automatically by other commands as required.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>sortby : <string></code></p>
<p> Specify how to sort the resulting list of CCs. Must be: "nodes" or "sequences".</p>
<p> Examples:</p>
<pre><code> # Default. Sort CCs in descending order of number of nodes.
CCOMPS {"sortby":"nodes"}
# Sort CCs in descending order of amount of sequence.
CCOMPS {"sortby":"sequences"}</code></pre></li>
</ul>
<h3 id="a-nameselccselcca"><a name="SELCC"><code>SELCC</code></a></h3>
<p>Select specific connected components for further processing. "Unselected" connected components are saved in a "removed" data structure and may be recalled with other commands.</p>
<p>NOTE: There are two types of parameters below. "Selection parameters" allow a subset of connected components to be selected by contig names or component number. "Filter parameters" are applied to the set of "selected" connected components (or, by default, to the set of all connected components). These filters may also be used in combination, resulting in a logical "AND" relationship (i.e. connected components not satisfying all of the filters are removed).</p>
<p><strong>Selection Parameters:</strong></p>
<ul>
<li><p><code>ccname : "contig_name"</code></p>
<p> Select the connected component containing the named contig</p>
<p> Example:</p>
<pre><code> # Select the connected component containing the contig named NODE_1234
SELCC {"ccname":"NODE_1234"}</code></pre></li>
<li><p><code>ccnames : ["contig_name1","contig_name2",...]</code></p>
<p> Select the connected component(s) containing the named contigs</p>
<p> Example:</p>
<pre><code> # Select the connected components containing the contigs named NODE_1234 and NODE_5678
SELCC {"ccnames":["NODE_1234","NODE_5678"]}</code></pre></li>
<li><p><code>ccnum : <int></code></p>
<p> Select connected component number <code><int></code></p>
<p> Example:</p>
<pre><code> # Default. Select connected component number 0 (the first CC).
SELCC {"ccnum":0}</code></pre></li>
<li><p><code>ccnums : [<int>,<int>,...]</code></p>
<p> Select the connected components from the list of numbers</p>
<p> Example:</p>
<pre><code> # Select the second and third connected components (numbering is zero based)
SELCC {"ccnums":[1,2]}</code></pre></li>
<li><p><code>ccrange : [<int1>,<int2>]</code></p>
<p> Select the connected components numbered in the range <code><int1>..<int2></code> (inclusive). <code><int></code> may be negative, indicating positions at the end of the list of connected components.</p>
<p> Examples:</p>
<pre><code> # Select the first 6 connected components
SELCC {"ccrange":[0,5]}
# Select the last 5 connected components
SELCC {"ccrange":[-5,-1]}</code></pre></li>
<li><p><code>shift : true</code></p>
<p> Select all connected components except the first one. This is like <code>SELCC {"ccrange":[1,-1]}</code> except it doesn't generate an error when there is only one remaining connected component, allowing processing to potentially continue in any calling <code>SCRIPT</code> commands.</p>
<p> Example:</p>
<pre><code> # Drop the first connected component.
SELCC {"shift":true}</code></pre></li>
</ul>
<p><strong>Filter Parameters:</strong></p>
<ul>
<li><p><code>min_nodes : <int></code></p>
<p> Select connected components with <code><int></code> or more nodes.</p>
<p> Example:</p>
<pre><code> # Select connected components with 2 or more nodes.
SELCC {"min_nodes":2}</code></pre></li>
<li><p><code>min_seqlen : <int></code></p>
<p> Select connected components with <code><int></code> or more bases of sequence within their nodes.</p>
<p> Example:</p>
<pre><code> # Select connected components containing at least 1000 bases of sequence.
SELCC {"min_seqlen":1000}</code></pre></li>
<li><p><code>sequence : <string></code></p>
<p> Select connected components containing the provided DNA sequence. NOTE: this is not a BLAST-style search; the sequence must match exactly. Any differences, including ambiguity codes, will prevent matching. The only additional handling is that the reverse complement of the provided sequence is also searched.</p>
<p> Example:</p>
<pre><code> # Select connected components containing the provided sequence
# (or its reverse complement).
SELCC {"sequence":"AGACTAGCAGATATACGATAACGATACGATACGAT"}</code></pre></li>
<li><p><code>sequences : [<string>,...]</code></p>
<p> Like <code>sequence</code> parameter above, but using a list of sequences</p>
<p> Example:</p>
<pre><code> # Select connected components containing the provided sequences
# (or their reverse complements).
SELCC {"sequences":["AGACTAGCAGATATACG","ATAACGATACGATACGAT"]}</code></pre></li>
<li><p><code>soft : true</code></p>
<p> Determines whether unselected ccomp nodes are deleted or kept as "removed"</p>
<p> Examples:</p>
<pre><code> # The contigs of the first connected component will remain selected; all
# others are retained as unselected.
SELCC {"soft":true}
# The contigs of the first connected component will remain selected;
# neighboring contigs (one hop away) are unselected and added to the
# "removed" pool. All others are permanently deleted.
SELCC {"soft":false} # Default</code></pre></li>
</ul>
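<p>Because filter parameters combine with a logical "AND", several filters may be supplied in a single command. For example (a sketch combining the filters documented above):</p>
<pre><code> # Select only connected components with 2 or more nodes AND
# at least 1000 bases of sequence.
SELCC {"min_nodes":2,"min_seqlen":1000}</code></pre>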
<h3 id="a-nameselclustselclusta"><a name="SELCLUST"><code>SELCLUST</code></a></h3>