<link href="style.css" media="screen" rel="stylesheet" type="text/css" />
<h1 id="seastar-user-guide-version-050">SEAStAR User Guide, version 0.5.0</h1>
<h4 id="vaughn-iverson-and-chris-berthiaume">Vaughn Iverson and Chris Berthiaume</h4>
<h5 id="armbrust-lab-school-of-oceanography-university-of-washington"><a href="http://armbrustlab.ocean.washington.edu">Armbrust Lab</a>, School of Oceanography, University of Washington</h5>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#quick_start">Quick Start</a></li>
<li><a href="#command_reference">Command Reference</a></li>
<li><a href="#file_formats">File Formats</a></li>
<li><a href="#misc_scripts">Miscellaneous Scripts</a></li>
</ul>
<h2 id="a-nameintroductionintroductiona"><a name="introduction">Introduction</a></h2>
<p>The SEAStAR tools are a collection of programs that implement many data- and processing-intensive steps in next-generation sequencing analysis pipelines. The tools quantitatively analyze alignments to reference sequence databases (including de novo assembled contigs) to produce a wide variety of statistics and relationships, including binned, assembled genomic scaffolds. At one time, SEAStAR was an acronym for <strong>S</strong>elect and <strong>E</strong>stimate <strong>A</strong>bundance from <strong>S</strong>hor<strong>t</strong> <strong>A</strong>ligned <strong>R</strong>eads. However, as described above, the tools and their capabilities have expanded well beyond that description, and so now SEAStAR is simply our name for the in-house components of our analysis pipeline.</p>
<p><a href="SEAStAR_Pipeline_Stages.png"><img src="SEAStAR_Pipeline_Stages.png" alt="SEAStAR Pipeline" title="SEAStAR Pipeline"></a></p>
<p>Our philosophy has been to create stable, modular, high-performance tools that fill important gaps we encountered between established pre-existing tools in our pipeline. As such, you will see references to other packages throughout this document. We recognize that considerable innovation is ongoing in analysis software for next-generation sequence data, and so we have maintained a modular, pipelined approach to permit experimentation with, and substitution of, new tools as they become available. Due to the wide variety of computational environments that undoubtedly exist in different labs and institutes, our focus is on releasing stable, documented, portable and high-performance <em>components</em> from our analysis pipeline, but not the pipeline scripts themselves, because those scripts must be highly customized for both the goals and the computational environment of any specific project.</p>
<p>Initially our goal was to release tools to enable others to work directly with datasets generated by the Life Technologies SOLiD™ sequencing platform. We recognized that most of the components we have developed are also valuable for use with Illumina® sequence data, or with a mixture of sequence data types, and so in this release we have generalized most components to support both colorspace and nucleotide reads.</p>
<h3 id="technical-guidance">Technical guidance</h3>
<p>The next-generation sequence datasets that the SEAStAR tools are designed to work with are typically very large (>> one billion reads). All compiled components of the SEAStAR tools are multithreaded and optimized for modern multi-core processors. Many SEAStAR tools will take advantage of 8 or more cores, particularly when compressed (gzip format) input and output files are being processed. When running on such a system, you will find that these components may run considerably more than 10x faster than equivalent tools written in a scripting language such as Python. For this reason, the operation of these tools will often be "I/O bound"; that is, the performance bottleneck is the speed of the disk(s) and/or network connections attached to the workstation, not the availability of CPU cycles. Using compressed files, and separate independent local disks for input and output files, can often relieve some of this bottleneck and further improve run times.</p>
<p>Several of the SEAStAR tools (as well as many other tools used by our pipeline) have very high memory requirements when working with typical next-generation sequence datasets. We note these cases in the discussion of each individual SEAStAR tool, and attempt to provide guidance about the impact different parameters will have on memory use. However, in general, doing this kind of work will require a workstation with at least 32 gigabytes of RAM, and depending on the dataset and the types of analysis your project requires, you may require some multiple of that amount (64, 128, 256, or even 512 gigabytes) to successfully implement an analysis pipeline that meets your needs. When running these tools, it is highly advisable to set a session limit on the memory a process can attempt to allocate, to prevent machines from crashing and impacting other users who may be logged in. For example, adding the Unix shell command:</p>
<pre><code>ulimit -v 64000000</code></pre><p>to any script that invokes high memory tools will effectively prevent a 64GB machine from being unintentionally brought down (i.e. thrashing the swap file) by an application that attempts to allocate more memory than is physically available.</p>
<h3 id="license-and-citation">License and citation</h3>
<p>The SEAStAR tools are open source and are currently publicly distributed under the <a href="http://gplv3.fsf.org/">GPL version 3 license</a>, a copy of which is provided in a file named <code>COPYING</code> in the project root directory. By using this software, you are agreeing to be bound by the terms of this license. Please contact the authors if you are interested in discussing an alternative licensing arrangement.</p>
<p>We are in the process of preparing several publications specific to the SEAStAR analysis pipeline and the custom tools it contains. In the interim, we ask that you cite the following paper (which was the first to use and describe these tools):</p>
<blockquote>
<p>Iverson, V., Morris, R. M., Frazar, C. D., Berthiaume, C. T., Morales, R. L .and Armbrust, E. V., Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota, <a href="http://www.sciencemag.org/content/335/6068/587.abstract"><em>Science</em> <strong>335</strong> pp.587-590</a> (2012)</p>
</blockquote>
<h2 id="a-namequick_startquick-starta"><a name="quick_start">Quick Start</a></h2>
<p>This section is a quick introduction to using SEAStAR tools for some common use cases. The SEAStAR tools are "command-line" tools that run on UNIX-like operating systems. If you are not familiar with how to use the UNIX command-line shell, you should work your way through a tutorial before proceeding. Here are a few possibilities:</p>
<ul>
<li><a href="http://www.ee.surrey.ac.uk/Teaching/Unix/">UNIX Tutorial for Beginners</a></li>
<li><a href="http://linuxcommand.org/learning_the_shell.php">Learning the Linux Shell</a></li>
<li><a href="http://www.matisse.net/osx/intro_unix/0_outline.html">Introduction to the OS X Unix Command Line</a></li>
</ul>
<p>This tutorial assumes that you have already installed the SEAStAR software. If you are working from the <a href="http://armbrustlab.ocean.washington.edu/node/305">SEAStAR Quick Start virtual machine appliance</a> file in VirtualBox, your environment is already set-up and ready to go, so you can skip to the next section: <a href="#read_prep">Read Preparation</a>.</p>
<p>For SEAStAR installation instructions see the <code>README</code> file in the SEAStAR root directory. Although the SEAStAR tools can be quite memory- and resource-intensive, with the data used in this tutorial you should be able to comfortably complete all of the steps below on a modern laptop with a dual-core processor and 2GB of RAM.</p>
<p>The data used in these examples are actual raw SOLiD sequence reads corresponding to the <a href="http://www.ncbi.nlm.nih.gov/nuccore/NC_001416.1">Lambda phage</a> genome, and can be found in the <code>test_data</code> subdirectory of the SEAStAR source directory.</p>
<p>Create a new working directory and copy the following files from the <code>test_data</code> directory into that directory.</p>
<pre><code># Substitute the actual path to your SEAStAR "source directory" for
# [SEASTAR_src] in the copy commands below:
mkdir Quick_start
cd Quick_start
cp [SEASTAR_src]/test_data/lambda_reads_F3.csfasta.gz .
cp [SEASTAR_src]/test_data/lambda_reads_F3_QV.qual.gz .
cp [SEASTAR_src]/test_data/lambda_reads_R3.csfasta.gz .
cp [SEASTAR_src]/test_data/lambda_reads_R3_QV.qual.gz .</code></pre><p>And make sure that your <code>PATH</code> environment variable points to the executable files in your SEAStAR binaries directory:</p>
<pre><code># Where [SEASTAR_dest] below is the fully qualified path to your
# SEAStAR build destination directory.
export PATH=$PATH:[SEASTAR_dest]/bin</code></pre><p>If the <code>PATH</code> above is working, you should be able to run <code>ref_select --version</code> and get back something like: <code>ref_select version: v0.4.x</code>. If this doesn't work, you will need to figure out where SEAStAR is installed before proceeding so that the instructions that follow in this tutorial can find the SEAStAR tools on your computer.</p>
<p><strong>NOTE:</strong> To work your way through all of the examples in this tutorial, you will need the following external tools (in addition to those that were required to build SEAStAR):</p>
<ul>
<li><a href="http://www.ebi.ac.uk/~zerbino/velvet/">Velvet</a> -- You need the <strong>colorspace</strong> version. Build it with <code>make color</code>; see the Velvet <code>README</code> file for instructions.</li>
<li><a href="http://sourceforge.net/projects/bio-bwa/files/">BWA</a> -- <strong>Version 0.5.10</strong>. <em>Important: newer versions do not support the SOLiD colorspace reads used by this tutorial.</em></li>
<li><a href="http://www.graphviz.org/Download.php">GraphViz</a></li>
</ul>
<p>Please ensure that the directories where these tools reside are also in your <code>PATH</code> environment variable before proceeding. You can test this by typing <code>velveth_de</code>, <code>bwa</code>, and <code>dot -V</code> at your command line. Each of these commands should write its version string (and perhaps other help messages) in response to being run.</p>
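<p>As a quick sanity check, a small shell loop (a sketch, assuming the tool names used in this tutorial) will report which of the required programs are visible on your <code>PATH</code>:</p>

```shell
# Report which of the required external tools are on the PATH.
# Tool names are those used in this tutorial; adjust if yours differ.
for tool in velveth_de velvetg_de bwa dot; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING (add its directory to your PATH)"
  fi
done
```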
<h3 id="a-nameread_prepread-preparationa"><a name="read_prep">Read preparation</a></h3>
<p>Convert SOLiD <code>.csfasta</code> and <code>.qual</code> files to gzipped <code>.fastq</code> files. For real projects, users with no colorspace data will omit this step in their analysis pipeline, although the FASTQ file naming conventions still need to be followed for nucleotide (e.g. Illumina) reads; see the <a href="#FASTQ">FASTQ</a> appendix for details.</p>
<pre><code>solid2fastq -z lambda_reads lambda_conv</code></pre><p>Remove presumed PCR duplicate reads, identified by mate-pairs seen more than once.</p>
<pre><code>fastq_nodup -z -l 13 -d 2 -e 3 lambda_conv lambda_dedup</code></pre><p>Trim and filter the de-duplicated FASTQ reads based on read quality, information content, and length.</p>
<pre><code>trimfastq -z --mates_file -p 0.9 -l 34 -m 34 --add_len -e 3.0 lambda_dedup lambda_trim</code></pre><p>Randomly sample reads from <code>.fastq</code> files, retaining approximately 10% of the original reads. Randomly subsampled read sets are useful for quickly tuning the parameters of <code>trimfastq</code> for a given set of reads against some reference. We won't do that here, but tuning the <code>trimfastq -p</code> setting is important for getting clean assemblies and maximizing the useful information in your sequence data.</p>
<pre><code>samplefastq -z -f 0.1 lambda_trim lambda_samp</code></pre><h3 id="de-novo-contig-assembly"><em>De novo</em> contig assembly</h3>
<p>If you have installed the colorspace aware build of the <em>de novo</em> assembly tool <a href="http://www.ebi.ac.uk/~zerbino/velvet/">Velvet</a>, it can be used to assemble lambda-phage colorspace contigs:</p>
<pre><code>velveth_de lambda_asm/ 19 -fastq.gz -shortPaired lambda_trim.mates.fastq.gz -short lambda_trim.single.fastq.gz > lambda_asm.velveth_de.log 2>&1
velvetg_de lambda_asm/ -scaffolding no -read_trkg no -ins_length auto -ins_length_sd auto -exp_cov 50 -cov_cutoff 5 -min_contig_lgth 50 > lambda_asm.velvetg_de.log 2>&1</code></pre><p>There will now be a subdirectory called <code>lambda_asm</code> with a file called <code>contigs.fa</code> in it. These are the colorspace contigs we will use in the next section.</p>
<h3 id="aligning-reads-to-a-reference">Aligning reads to a reference</h3>
<p>For this example, you need the short read aligner <a href="http://sourceforge.net/projects/bio-bwa/files/">BWA</a> (version &lt;= 0.5.10). It can be used to align the de-duplicated and trimmed lambda FASTQ reads against the lambda-phage colorspace contigs. IMPORTANT: This quickstart example will not work with BWA versions 0.6.0 or newer, because colorspace support was removed from BWA at that point. For your own work with nucleotide reads, you are free to use any alignment software that produces standard SAM files.</p>
<pre><code># Because these are colorspace contigs and BWA expects nucleotide reference sequences,
# we need to convert the colorspace contigs to nucleotides using a single starting
# nucleotide (this will be wrong 3/4 of the time, but it doesn't matter because
# BWA immediately converts the reference back to colorspace internally...)
csfasta2ntfasta.awk lambda_asm/contigs.fa > lambda_contigs.fna
bwa index -a is -c lambda_contigs.fna
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.read1.fastq.gz > lambda_trim.read1.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.read1.sai lambda_trim.read1.fastq.gz 2>lambda_trim.read1.samse.log > lambda_trim.read1.sam
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.read2.fastq.gz > lambda_trim.read2.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.read2.sai lambda_trim.read2.fastq.gz 2> lambda_trim.read2.samse.log > lambda_trim.read2.sam
bwa aln -c -n 0.001 -l 18 lambda_contigs.fna lambda_trim.single.fastq.gz > lambda_trim.single.sai
bwa samse -n 1000000 lambda_contigs.fna lambda_trim.single.sai lambda_trim.single.fastq.gz 2>lambda_trim.single.samse.log > lambda_trim.single.sam</code></pre><p>The resulting alignment files are in SAM format, which is what the next operation requires.</p>
<h3 id="constructing-an-assembly-graph">Constructing an assembly graph</h3>
<p>Now we will use the <code>ref_select</code> tool to convert the colorspace contig sequences to nucleotides and output the matepair assembly graph JSON file.</p>
<pre><code>ref_select -q -m --mp_mate_cnt=10 -r lambda_trim.read1.fastq.gz -r lambda_trim.read2.fastq.gz -r lambda_trim.single.fastq.gz lambda_trim.read1.sam lambda_trim.read2.sam lambda_trim.single.sam > lambda_asm.json</code></pre><h3 id="producing-scaffolded-sequence">Producing scaffolded sequence</h3>
<pre><code># The following chain of graph_ops commands accomplish the following steps:
# MST - Convert the assembly graph into a directed spanning tree
# PLUCK - Remove short (stub) branches from the tree, leaving a mostly linear structure
# PRUNE - Cleave any remaining branches from the tree, leaving linear backbones
# PUSH - Put back contigs removed from the ends of the scaffolds by PLUCK
# INSERT - Put back small contigs that fit between the backbone contigs
# SCAFF - Re-linearize the scaffolds based on constraints added by INSERT
# FASTA - Join all neighboring contigs together (looking for overlaps) and write FASTA to STDOUT
graph_ops lambda_asm.json MST PLUCK PRUNE PUSH INSERT SCAFF FASTA '{"scaff":true}' > lambda_scaffs.fna</code></pre><h3 id="visualizing-the-sequence-graph-and-using-script-files-with-graph_ops">Visualizing the sequence graph, and using <code>SCRIPT</code> files with <code>graph_ops</code>.</h3>
<p>If you rerun <code>graph_ops</code> using the <code>DOT</code> command at each stage of the pipeline, you can then visualize the assembly graph as it progresses through each stage of processing using <a href="http://www.graphviz.org/">Graphviz</a>.</p>
<p>First, create a text file called <code>lambda_viz.go</code> with the following contents:</p>
<pre><code># Sample script file for SEAStAR graph_ops tool
DOT {"file":"lambda_asm#.dot"}
MST
DOT {"file":"lambda_asm#.dot"}
PLUCK
DOT {"file":"lambda_asm#.dot"}
PRUNE
DOT {"file":"lambda_asm#.dot"}
PUSH
DOT {"file":"lambda_asm#.dot"}
INSERT
DOT {"file":"lambda_asm#.dot"}
SCAFF
DOT {"file":"lambda_asm#.dot"}
FASTA {"scaff":true,"file":"lambda_scaffs.fna"}</code></pre><p>Now run the following command:</p>
<pre><code>graph_ops lambda_asm.json lambda_viz.go</code></pre><p>This will produce the same scaffolded assembly as the previous example, but in addition, it will write a <code>DOT</code> (GraphViz) format file for each step in the scaffolding pipeline (with an incrementing number at the end of the filename: <code>lambda_asm0.dot</code>, <code>lambda_asm1.dot</code>, etc.). The next command converts all of these DOT files to corresponding PDF figures using the <code>neato</code> layout engine in Graphviz.</p>
<pre><code># After this command, load the output PDF files in a viewer to see the scaffold layout
# process, step by step. Circles are contigs with area proportional to sequence length
# and color by %GC. Black arrows are mate-pair connections with thickness indicating
# bitscore. Red arrows are added dependencies to produce a fully ordered layout for SCAFF.
for i in lambda_asm?.dot; do neato -Tpdf $i > ${i%.*}.pdf; done</code></pre><p>You should now be able to load the PDF files with your favorite PDF viewer in order from <code>lambda_asm0.pdf</code> through <code>lambda_asm6.pdf</code> to see the effect of each <code>graph_ops</code> command in the script above. Each graph shows the state of the assembly before and after each successive step, illustrating how each command transformed the graph for this example dataset.</p>
<h3 id="estimating-16s-sequence-abundance">Estimating 16S sequence abundance</h3>
<p>The tools in SEAStAR can be used to create a pipeline for estimating sequence abundance in a short-read data set. This is particularly useful for assessing community composition in a metagenomic sample through alignments to a 16S database. See vignettes/RDP/RDP_vignette.html for a walk-through of an implementation of this pipeline using the <a href="http://rdp.cme.msu.edu/">RDP</a> 16S database.</p>
<hr>
<h1 id="a-namecommand_referencecommand-referencea"><a name="command_reference">Command Reference</a></h1>
<p>Tools described in this user guide:</p>
<ul>
<li><strong><a href="#solid2fastq"><code>solid2fastq</code></a></strong> -- Conversion of <em>SOLiD specific</em> CSFASTA/QUAL or FASTQ files to SEAStAR style <a href="#FASTQ">FASTQ</a> format files</li>
<li><strong><a href="#fastq_nodup"><code>fastq_nodup</code></a></strong> -- Reference-free PCR deduplication of paired reads</li>
<li><strong><a href="#samplefastq"><code>samplefastq</code></a></strong> -- Subsamples reads randomly</li>
<li><strong><a href="#trimfastq"><code>trimfastq</code></a></strong> -- Tunable, pairing aware read-trimmer</li>
<li><strong><a href="#ref_select"><code>ref_select</code></a></strong> -- Converts SAM reference alignments into <a href="#JSON">JSON</a> "sequence graph" files for assembly and abundance analyses</li>
<li><strong><a href="#graph_ops"><code>graph_ops</code></a></strong> -- A toolkit for operating on the <code>ref_select</code> "sequence graph" files to produce assemblies and data tables</li>
<li><strong><a href="#seq_scaffold"><code>seq_scaffold</code></a></strong> -- Converts pre-ordered and oriented contigs into scaffolded sequence, filling gaps where possible</li>
<li><strong><a href="#tetracalc"><code>tetracalc</code></a></strong> -- Clusters sequence scaffolds into "species-level" taxonomic bins for final assembly</li>
<li><strong><a href="#misc_scripts">misc_scripts</a></strong> -- Miscellaneous scripts</li>
</ul>
<hr>
<h2 id="a-namesolid2fastqsolid2fastqa"><a name="solid2fastq"><code>solid2fastq</code></a></h2>
<p>This tool converts SOLiD colorspace <code>.csfasta</code> and <code>.qual</code> input files (raw reads transferred off the instrument) to output files that are an interoperable variant of the FASTQ file format (see SEAStAR format details in section <a href="#FASTQ">FASTQ</a>). It can also convert nucleotide FASTQ files produced by runs using Exact Call Chemistry to SEAStAR-style FASTQ files. This conversion is necessary to ensure that read pairs remain synchronized in <code>.read1.fastq</code> and <code>.read2.fastq</code>.</p>
<h4 id="usage---solid2fastq-options-in_prefix-out_prefix">Usage: <code>solid2fastq [options] &lt;in_prefix&gt; &lt;out_prefix&gt;</code></h4>
<p>Where <code>&lt;in_prefix&gt;</code> denotes the input prefix which is the part of the input filename shared in common between all <code>.csfasta</code> and <code>.qual</code> or <code>.fastq</code> files generated by the same sequence library. It is typically shown in the <code>Title:</code> comment line near the top of <code>.csfasta</code> format input files. For example:</p>
<p><code># Title: GG050409_20090604_matepair_50</code></p>
<p>is found near the top of input files:</p>
<p><code>GG050409_20090604_matepair_50_R3.csfasta</code>, <code>GG050409_20090604_matepair_50_R3_QV.qual</code>, <code>GG050409_20090604_matepair_50_F3.csfasta</code> and <code>GG050409_20090604_matepair_50_F3_QV.qual</code>.</p>
<p>So use of <code>GG050409_20090604_matepair_50</code> for <code>&lt;in_prefix&gt;</code> will specify these four files for input to the <code>solid2fastq</code> tool.</p>
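<p>Equivalently, in shell terms, the shared prefix is just an input filename with its SOLiD suffix stripped. For example (a sketch using the filenames above):</p>

```shell
# Recover the shared input prefix by stripping the SOLiD suffix
f=GG050409_20090604_matepair_50_R3.csfasta
prefix=${f%_R3.csfasta}
echo "$prefix"
# → GG050409_20090604_matepair_50
```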
<p>Note that <code>solid2fastq</code> relies on the SOLiD naming conventions for the suffixes of these files (e.g. <code>_R3.csfasta</code>). You are free to rename these files with different prefixes, but the suffixes must remain the same. You may, however, compress these files using gzip and add the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>solid2fastq</code> to decompress the files as it reads them.</p>
<p>The <code>&lt;out_prefix&gt;</code> parameter is similar to the <code>&lt;in_prefix&gt;</code> as it is used to name the output <code>.fastq</code> file(s) generated by this tool. The output prefix may be any text, except you should generally avoid whitespace and punctuation characters other than "<code>_</code>". This is because, by default, the names of all output reads are also prefixed with this string and the variant of the FASTQ format used by the SEAStAR tools depends on certain punctuation characters to quickly parse these read names (see SEAStAR <a href="#FASTQ">FASTQ</a> format). This file naming constraint may be relaxed through the use of optional parameters described below.</p>
<p>Note that <code>&lt;in_prefix&gt;</code> and/or <code>&lt;out_prefix&gt;</code> may contain directory path information, but in the case of <code>&lt;out_prefix&gt;</code> this will necessitate the use of the <code>--prefix</code> or <code>--no_prefix</code> options (described below).</p>
<p><code>solid2fastq</code> writes a single output <code>.fastq</code> file for each matching pair of <code>.csfasta</code> / <code>.qual</code> input files or for each single <code>.fastq</code> input file. For example, given a set of non-barcoded colorspace input files, the names of these output files are constructed as:</p>
<p><code>&lt;in_prefix&gt;_R3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code></p>
<p><code>&lt;in_prefix&gt;_F3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code></p>
<p>singlets from F3/R3 → <code>&lt;out_prefix&gt;.single.fastq</code></p>
<p>Note that for paired-end libraries, alternate naming suffix conventions for the R3 input are also recognized (substituting <code>F5-BC</code>, <code>F5-P2</code>, <code>F5-DNA</code>, <code>F5-RNA</code>, for <code>R3</code>).</p>
<p>Additionally, for paired libraries (mate-paired or paired-end), should <code>solid2fastq</code> encounter any singlet reads (those with a missing mate), they are by default given the appropriate suffix and then written to a file named <code>&lt;out_prefix&gt;.single.fastq</code>.</p>
<p>The complete list of file name patterns that will be recognized by <code>solid2fastq</code> are:</p>
<p><code>&lt;in_prefix&gt;_&lt;BC&gt;_R3.fastq</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA][.csfasta|_QV.qual|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_&lt;BC&gt;_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA][.csfasta|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_[R3|F5-BC|F5-P2|F5-DNA|F5-RNA]_[&lt;BC&gt;.csfasta|_QV_&lt;BC&gt;.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code><br>
<code>&lt;in_prefix&gt;_&lt;BC&gt;_F3.fastq</code> → <code>&lt;out_prefix&gt;.read2.fastq</code><br>
<code>&lt;in_prefix&gt;_F3[.csfasta|_QV.qual|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code><br>
<code>&lt;in_prefix&gt;_&lt;BC&gt;_F3[.csfasta|.QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code><br>
<code>&lt;in_prefix&gt;_F3_[&lt;BC&gt;.csfasta|_QV_&lt;BC&gt;.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code></p>
<h4 id="solid2fastq-optional-parameters-options"><code>solid2fastq</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[--prefix=&lt;string&gt;]</code> <code>[-n|--no_prefix]</code> <code>[-x|--no_suffix]</code> <code>[-z|--gzip]</code> <code>[-b &lt;BC&gt;|--bc=&lt;BC&gt;]</code> <code>[-s|--singles]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>--prefix=&lt;string&gt;</code></b></p>
<p>Specify the prefix string to add to read identifiers. For example, the <code>.csfasta</code> read identifier <code>&gt;1_68_381_R3</code> will become <code>@xx+&lt;string&gt;:1_68_381/1</code> in the <code>.fastq</code> output file. By default this is the <code>&lt;out_prefix&gt;</code> described above.</p>
<p>The use of a terse, but unique, prefix string is advisable because it helps identify the source, and processing steps, performed on a set of reads as they flow through an analysis pipeline. Keep in mind that this text will be added to the output file for every read, so short strings are preferable. The output prefix may be any text you would like, except you may not use whitespace or punctuation characters other than "<code>_</code>".</p>
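<p>The renaming itself is performed by <code>solid2fastq</code>, but the ID transformation described above can be sketched with <code>sed</code> (an illustration only, using <code>lambda</code> as a hypothetical prefix string):</p>

```shell
# Illustration only: how a .csfasta read ID maps to a SEAStAR FASTQ ID,
# with "lambda" standing in as a hypothetical prefix string
echo ">1_68_381_R3" | sed -E 's|^>(.*)_R3$|@xx+lambda:\1/1|'
# → @xx+lambda:1_68_381/1
```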
<p><b><code>-n</code> / <code>--no_prefix</code></b></p>
<p>Overrides the default read prefixing behavior and instead preserves the read identifiers as they are in the input <code>.csfasta</code> file(s).</p>
<p>For example: <code>&gt;1_68_381_R3</code> simply becomes <code>@xx+:1_68_381/1</code></p>
<p><b><code>-x</code> / <code>--no_suffix</code></b></p>
<p>Suppress the /1 or /2 suffix additions to read IDs. This ensures that paired reads have matching IDs in FASTQ files and in SAM files generated by aligners other than BWA.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output <code>.fastq</code> files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>-b &lt;BC&gt;</code> / <code>--bc=&lt;BC&gt;</code></b></p>
<p>Used only for barcoded SOLiD libraries, in addition to <code>&lt;in_prefix&gt;</code>, to specify input files for a specific barcode (<code>&lt;BC&gt;</code>).</p>
<p><b><code>-s</code> / <code>--singles</code></b></p>
<p>Write two separate singlet files that segregate the singlets by input files. This is useful for doing strand-specific analyses using paired-end libraries, where the mated reads are sequenced from opposite strands. For example:</p>
<p><code>&lt;in_prefix&gt;_R3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read1.fastq</code></p>
<p>singlets from <code>R3</code> → <code>&lt;out_prefix&gt;.single1.fastq</code></p>
<p><code>&lt;in_prefix&gt;_F3[.csfasta|_QV.qual]</code> → <code>&lt;out_prefix&gt;.read2.fastq</code></p>
<p>singlets from <code>F3</code> → <code>&lt;out_prefix&gt;.single2.fastq</code></p>
<hr>
<h2 id="a-namefastq_nodupfastq_nodupa"><a name="fastq_nodup"><code>fastq_nodup</code></a></h2>
<p>This tool reads paired <code>.fastq</code> files (with associated singlets), and removes the lowest quality pair(s) of presumed PCR duplicate reads. This operation assumes that multiple read-pairs sharing substantially the same sequences for both mates are statistically unlikely to occur frequently at random (due to the distribution of insert sizes), and therefore represent likely artifactual duplications resulting from PCR over-amplification during library construction. If this assumption does not hold for a given library, then use of this tool is inappropriate.</p>
<p>Notably, this tool is most appropriate for metagenomes (or any other sample type) where no reference genome is available for performing duplicate removal through reference alignments. This is because <code>fastq_nodup</code> works entirely through error-tolerant internal comparisons of read-pairs to each other within an entire sequenced library.</p>
<p>To accomplish this, <code>fastq_nodup</code> builds a large table of sequence prefixes, requiring a substantial amount of memory. The amount of memory required can be controlled by tuning the various parameters, but a minimum of 32 GB will probably be necessary for most read datasets.</p>
<p>Note that singlet reads are also randomly removed in the same proportion as matching mated reads. This is done assuming that PCR duplicated pairs are as likely to produce singlets as non-duplicated pairs. For example:</p>
<p><code>S1</code>, <code>S1</code>, <code>M1 -- M2</code>, <code>M1 -- M2</code>, <code>M1 -- M3</code></p>
<p>In the above example <code>Sn</code> = singlet reads, <code>Mn</code> = mated reads, and the numeric suffixes indicate matching sequences. So the lowest quality pair of <code>M1 -- M2</code> mates will be removed from the output, and each of the <code>S1</code> reads will have a 1/3 probability of being removed (disregarding quality). The 1/3 probability is calculated from the fact that 1/3 of the pairs containing sequence <code>M1</code> were presumed PCR duplicates and therefore removed.</p>
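<p>The core selection rule can be sketched in miniature: key each pair on both mate sequences, and keep only the highest-quality pair per key. This is a toy illustration of the criterion, not the tool's actual error-tolerant prefix-index algorithm; columns are read1 sequence, read2 sequence, and a made-up quality score:</p>

```shell
# Three pairs; the first two are "duplicates" (same mate sequences), so only
# the one with the higher quality score (30) survives.
printf 'ACGT TTGG 30\nACGT TTGG 25\nACGT GGCC 28\n' |
  awk '{ k = $1 " " $2; if (!(k in best) || $3 > best[k]) best[k] = $3 }
       END { for (k in best) print k, best[k] }' | sort
```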
<h4 id="usage--fastq_nodup-options-in_prefix-out_prefix">Usage: <code>fastq_nodup [options] <in_prefix> <out_prefix></code></h4>
<p>where <code><in_prefix></code> and <code><out_prefix></code> specify the input and output filename prefixes to use for input files and naming output files. For example:</p>
<p><code><in_prefix>.read1.fastq</code> → <code><out_prefix>.read1.fastq</code>
<code><in_prefix>.read2.fastq</code> → <code><out_prefix>.read2.fastq</code>
<code><in_prefix>.single.fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>The input files may be gzip format compressed files with the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>fastq_nodup</code> to decompress the files as it reads them. Multiple single files (i.e. <code>.single1.fastq</code> and <code>.single2.fastq</code> suffixes) are also recognized.</p>
<h4 id="fastq_nodup-optional-parameters-options"><code>fastq_nodup</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[-v|--verbose]</code> <code>[-z|--gzip]</code> <code>[--prefix=<string>]</code> <code>[--no_prefix]</code> <code>[-l <u>|--index_len=<u>]</code> <code>[-m <u>|--match_len=<u>]</code> <code>[-d <u>|--index_err=<u>]</code> <code>[-e <u>|--match_err=<u>]</code> <code>[--seed=<n>]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>-v</code> / <code>--verbose</code></b></p>
<p>Print additional statistics and diagnostic information about the run.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output <code>.fastq</code> files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>--prefix=<string></code></b></p>
<p>Specify the prefix string to add to read identifiers. For example: <code>@<string>:1_68_381</code></p>
<p>Note, this will replace any existing prefix. By default this is <code><out_prefix></code> described above.</p>
<p><b><code>--no_prefix</code></b></p>
<p>Overrides the default read prefixing behavior and instead preserves the read identifiers exactly as they are in the input fastq file(s).</p>
<p><b><code>-l <u></code> / <code>--index_len=<u></code></b></p>
<p>Controls the read sequence prefix length of the search index. This should be set long enough so that random prefix collisions are infrequent, but short enough so that a large number of sequence errors are unlikely to accumulate. This parameter also impacts the memory use of this tool, with each increment increasing the index size by a factor of 4. This parameter defaults to <code><u></code>=14 nucleotides and may not exceed 32 nucleotides.</p>
<p>Both mates of each read-pair are indexed, and this indexed look-up is used to find a list of mate sequences corresponding to each sequence prefix. Given that the mates also must match, it is acceptable for a low-level of prefix collisions to occur in this index. That is, assuming the level of prefix collision is low (only a small number of non-duplicated reads share any given prefix), it will be highly improbable that reads matching a given prefix will share mates with matching sequence at random.</p>
<p><b><code>-m <u></code> / <code>--match_len=<u></code></b></p>
<p>Controls the required match length of the unindexed mate in detecting a duplicate pair. Like the previous parameter (<code>--index_len</code>), this setting has an impact on memory use, although memory requirements only increase linearly with match length. This parameter should be set as high as memory permits to improve specificity.</p>
<p>This parameter defaults to the full read length, although lower settings are acceptable assuming that the <code>--index_len</code> is appropriately set, and the sum of <code>--index_len</code> and <code>--match_len</code> exceeds about 35 nucleotides (that is, random 35 nucleotide joint index-match collisions are extremely unlikely assuming each match is sufficiently long).</p>
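<p>The "about 35 nucleotides" guideline follows from simple arithmetic: the probability that two unrelated sequences agree over 35 positions by chance is 4^-35:</p>

```shell
# Chance that two random 35 nt stretches match exactly: 4^-35, i.e. ~8.5e-22.
awk 'BEGIN { printf "%.3g\n", 4 ^ (-35) }'
```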
<p><b><code>-d <u></code> / <code>--index_err=<u></code></b></p>
<p>Controls the number of mismatches tolerated in the indexed sequence prefix while still permitting a sequence match. This value defaults to <code><u></code>= 1 and may be set in the range 0-2. This setting has a major impact on run time, with each increment increasing it by a factor of about 4. Duplicated reads with more than <code>--index_err</code> mismatches in the <code>--index_len</code> prefix may still be found by the mate's indexed prefix, assuming that it does not also have more than <code>--index_err</code> errors.</p>
<p><b><code>-e <u></code> / <code>--match_err=<u></code></b></p>
<p>Controls the number of mismatches tolerated in a matched mate sequence (in the <code>--match_len</code> prefix). This value defaults to 3 and must be > 0. This setting has little impact on run time. Setting <code>--match_err</code> to higher values will reduce the specificity of the mate-matching and should therefore be accompanied by correspondingly larger values of <code>--match_len</code>.</p>
<p><b><code>--seed=<n></code></b></p>
<p>Integer seed used by the internal pseudo-random number generator. This only affects the random selection of matching singlets (as described above). Defaults to <code><n></code>= 12345.</p>
<hr>
<h2 id="a-namesamplefastqsamplefastqa"><a name="samplefastq"><code>samplefastq</code></a></h2>
<p>This tool randomly samples reads from a given set of fastq files, producing a corresponding set of output fastq files. This is useful for more quickly tuning parameters for the <code>trimfastq</code> tool (described below) and for producing simulated read datasets for various other purposes.</p>
<h4 id="usage-samplefastq--f--frac_samplefloatint-options-in_prefix-out_prefix">Usage: <code>samplefastq -f|--frac_sample=<float>|<int> [options] <in_prefix> <out_prefix></code></h4>
<p>where the mandatory <code>-f</code> parameter indicates how the input files should be sampled and <code><in_prefix></code> and <code><out_prefix></code> specify the input and output filename prefixes to use for input files and naming output files. For example:</p>
<p><code><in_prefix>.read1.fastq</code> → <code><out_prefix>.read1.fastq</code>
<code><in_prefix>.read2.fastq</code> → <code><out_prefix>.read2.fastq</code>
<code><in_prefix>.single.fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>The input files may be gzip format compressed files with the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>samplefastq</code> to decompress the files as it reads them. Multiple single files (i.e. <code>.single1.fastq</code> and <code>.single2.fastq</code> suffixes) are also recognized.</p>
<p>Note that for paired sequences (as shown above) read pairs are always sampled together; that is, each pair of reads counts as a single read unit for sampling purposes, as does each singlet read.</p>
<h4 id="samplefastq-mandatory-parameter"><code>samplefastq</code> mandatory parameter:</h4>
<p><code>[-f|--frac_sample=<float>|<int>]</code></p>
<p><b><code>-f <float></code> / <code>-f <int></code> / <code>--frac_sample=<float></code> / <code>--frac_sample=<int></code></b></p>
<p>The <code>-f</code> parameter tells <code>samplefastq</code>, either directly or indirectly, how many output reads to randomly sample from the input <code>.fastq</code> files. Values < 1.0 are interpreted as a fraction of the input reads to retain, whereas values > 1.0 are rounded to an integer count of reads to retain.</p>
<p>Note that when a read count is provided, the input files must be read twice by <code>samplefastq</code>: once to count the number of input reads, and then again to randomly sample those reads. Once the number of input reads is known, the requested read count is converted into a fraction, and operation proceeds as though that fraction had been initially provided to the <code>-f</code> parameter. Therefore, if a given set of inputs is to be sampled repeatedly, it will be considerably more efficient to obtain the count of reads once and calculate the fraction to retain for each subsequent sampling. The requested (or calculated) fraction of reads to sample is used internally as the probability of retaining each read (or pair), and each is considered an independent trial. Because of this, the precise fraction (or count) of reads requested is unlikely to be generated exactly, although the sampled output will typically be very close to that requested.</p>
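<p>For repeated sampling, the read count can be obtained once with standard tools (FASTQ records are four lines each) and converted to a fraction for subsequent runs. The file contents below are a made-up four-read example, and the commented <code>samplefastq</code> invocation shows where the computed fraction would be used:</p>

```shell
# Count reads once (lines / 4), then derive the fraction needed to keep 3 of them.
reads=$(printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n@r4\nCCCC\n+\nIIII\n' | wc -l)
reads=$((reads / 4))
frac=$(awk -v want=3 -v n="$reads" 'BEGIN { printf "%.4f", want / n }')
echo "$reads reads, fraction $frac"
# samplefastq -f "$frac" sample subsampled    # reuse $frac for each later sampling
```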
<h4 id="samplefastq-optional-parameters-options"><code>samplefastq</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[--seed=<int>]</code> <code>[-z|--gzip]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output fastq files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>--seed=<n></code></b></p>
<p>Integer seed used by the internal pseudo-random number generator. To obtain different samples from the same input file(s), set this parameter to different values. Repeated runs with all parameters the same will produce identical outputs. Defaults to <code><n></code>= 12345.</p>
<hr>
<h2 id="a-nametrimfastqtrimfastqa"><a name="trimfastq"><code>trimfastq</code></a></h2>
<p>This tool is used to trim or reject individual reads on the basis of quality scores. Reads may also be rejected based on low information content (entropy). When run on paired reads, this operation is performed on mates independently, and when one mate of a pair is rejected, the other becomes a singlet.</p>
<p>Trimming based on quality scores is performed with the goal of producing trimmed reads that have a calculated probability of error below some threshold. That is, bases are trimmed off the 3' end of the read until the remaining bases have a joint probability of error below a threshold. For example, if a threshold of 0.5 is selected, that means the output reads will be expected to have, on average, 0.5 erroneous nucleotides per read.</p>
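<p>The trimming rule described above can be sketched numerically: summing the per-base Phred error probabilities 10^(-Q/10) from the 5' end gives the expected error count, and the read is cut at the longest prefix still under the threshold. The quality scores below are illustrative; the real tool works from the FASTQ quality string:</p>

```shell
# Keep the longest prefix whose expected number of errors stays below 0.5.
echo "40 40 40 10 5 3 2" | awk -v thresh=0.5 '{
  sum = 0; keep = 0
  for (i = 1; i <= NF; i++) {
    p = 10 ^ (-$i / 10)            # per-base error probability from Phred Q
    if (sum + p >= thresh) break
    sum += p; keep = i
  }
  print "kept " keep " of " NF " bases"
}'
```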
<p>For colorspace (SOLiD differentially encoded) reads, the probability of error is calculated as the joint probability of error in each of two consecutive color positions. That is, because single color errors will typically be correctable, the probability of a nucleotide error at a given position is the probability that both corresponding color positions are erroneous (i.e. the pairwise product of color error probabilities yields the probability of uncorrectable nucleotide errors).</p>
<p>A minimum read length may also be specified, so that reads which would need to be trimmed shorter than this minimum to meet the error probability threshold are instead rejected.</p>
<p>Reads may also be independently rejected for failing to have a minimum mean information content. <code>trimfastq</code> optionally calculates the mean entropy of each read (as average bits per dinucleotide) and rejects reads that fail to meet a given threshold. A common error mode for next-generation sequencers is to produce junk reads where all or part of the read contains highly repetitive (low information) sequence. Random DNA sequence contains 4 bits per dinucleotide of entropy, and we have observed that (non-repeat) natural sequences typically contain more than 3 bits per dinucleotide.</p>
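<p>The entropy measure can be approximated by tallying overlapping dinucleotides and computing Shannon entropy in bits (the exact windowing used by <code>trimfastq</code> is an assumption here). Note that even a perfectly periodic sequence scores well below the ~3 bits typical of natural sequence:</p>

```shell
# Mean dinucleotide entropy (bits) of a repetitive sequence: about 2 bits,
# which an --entropy_filter=3.0 setting would reject.
echo "ACGTACGTACGTACGT" | awk '{
  n = 0
  for (i = 1; i < length($0); i++) { c[substr($0, i, 2)]++; n++ }
  H = 0
  for (d in c) { p = c[d] / n; H -= p * log(p) / log(2) }
  printf "%.2f bits/dinucleotide\n", H
}'
```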
<h4 id="usage--trimfastq-options-in_prefix-out_prefix">Usage: <code>trimfastq [options] <in_prefix> <out_prefix></code></h4>
<p>where <code><in_prefix></code> and <code><out_prefix></code> specify the input and output filename prefixes to use for input files and naming output files. For example:</p>
<p><code><in_prefix>.read1.fastq</code> → <code><out_prefix>.read1.fastq</code></p>
<p><code><in_prefix>.read2.fastq</code> → <code><out_prefix>.read2.fastq</code></p>
<p><code><in_prefix>.single.fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>The input files may be gzip format compressed files with the customary <code>.gz</code> suffix at the very end of the filename, which tells <code>trimfastq</code> to decompress the files as it reads them. Multiple single files (i.e. <code>.single1.fastq</code> and <code>.single2.fastq</code> suffixes) are also recognized. It is also possible to write FASTA format files (i.e. lacking quality scores) and interleaved paired-read files (for the Velvet assembler). See the options below for more information.</p>
<h4 id="trimfastq-optional-parameters-options"><code>trimfastq</code> optional parameters <code>[options]</code>:</h4>
<p><code>[-h|--help]</code> <code>[--version]</code> <code>[-c|--color_space]</code> <code>[-z|--gzip]</code> <code>[-v|--invert_singles]</code> <code>[-s|--singles]</code> <code>[-p <d>|--correct_prob=<d>]</code> <code>[-l <u>|--min_read_len=<u>]</code> <code>[-m <u>|--min_mate_len=<u>]</code> <code>[-f <u>|--fixed_len=<u>]</code> <code>[--prefix=<string>]</code> <code>[--add_len]</code> <code>[--no_prefix]</code> <code>[-e <d>|--entropy_filter=<d>]</code> <code>[--entropy_strict]</code> <code>[--mates_file]</code> <code>[--no_rev]</code> <code>[--force_rev]</code> <code>[--only_mates]</code> <code>[--fasta]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Print a guide to valid command line parameters and their correct usage to the terminal, and then exit.</p>
<p><b><code>--version</code></b></p>
<p>Print the version number of the executing program and then exit.</p>
<p><b><code>-z</code> / <code>--gzip</code></b></p>
<p>Write output fastq files compressed in the gzip format, and with filenames suffixed <code>.gz</code>.</p>
<p><b><code>--prefix=<string></code></b></p>
<p>Specify the prefix string to add to read identifiers. For example: <code>@<string>:1_68_381</code></p>
<p>Note, this will replace any existing prefix. By default this is <code><out_prefix></code> described above.</p>
<p><b><code>-n</code> / <code>--no_prefix</code></b></p>
<p>Overrides the default read prefixing behavior and instead preserves the read identifiers exactly as they are in the input fastq file(s).</p>
<p><b><code>--add_len</code></b></p>
<p>Add the final trimmed length value to the read prefix. For example:</p>
<pre><code>@lambda:1_81_912
TGTTGTGCGCGCCATAGCGAGGGGCTCAGCACGCGTCCCTCCGCCCCAC
+
5/0)358)%57*)%881/(/4'.'53*&#91*-'&,&%$''%&'46-+#`</code></pre><p>Trims to: ("37" in the first line indicates the final trimmed length)</p>
<pre><code>@37|lambda_trim:1_81_912
TGTTGTGCGCGCCATAGCGAGGGGCTCAGCACGCGTC
+
5/0)358)%57*)%881/(/4'.'53*&#91*-'&,&</code></pre><p><b><code>-v</code> / <code>--invert_singles</code></b></p>
<p>Causes singlet file(s) output to be the opposite configuration of the input. That is:</p>
<p><code><in_prefix>.single.fastq</code> → <code><out_prefix>.single[12].fastq</code></p>
<p>or</p>
<p><code><in_prefix>.single[12].fastq</code> → <code><out_prefix>.single.fastq</code></p>
<p>Note that this option is only valid when there are input singlet reads.</p>
<p><b><code>-s</code> / <code>--singles</code></b></p>
<p>Write two singlet files (<code>.single1.fastq</code> and <code>.single2.fastq</code>), one for new singlets generated from each paired input file. Note that this option is only valid when there are no input singlet reads. The default behavior in this case is to write a single combined singlet file (<code>.single.fastq</code>).</p>
<p><b><code>-p <d></code> / <code>--correct_prob=<d></code></b></p>
<p>Minimum mean probability that trimmed output reads are free of uncorrectable errors (or all errors with <code>-c</code>). Default value is <code><d></code>= 0.5. Setting <code>-p</code> to 0.0 completely disables quality trimming, which is sometimes appropriate for file format conversion and interleaving operations (e.g. see <code>--fasta</code> and <code>--mates_file</code>).</p>
<p><b><code>-l <u></code> / <code>--min_read_len=<u></code></b></p>
<p>Minimum length of a singlet or longest-mate of a pair, in nucleotides. Default value is <code><u></code>= 24 bases.</p>
<p><b><code>-m <u></code> / <code>--min_mate_len=<u></code></b></p>
<p>Minimum length of the shortest mate of a pair, in nucleotides. Default value is the value provided for <code>--min_read_len</code>. This value may not exceed the value used for <code>--min_read_len</code>. Use of this parameter allows paired reads shorter than <code>--min_read_len</code> to be retained as long as their mate satisfies <code>--min_read_len</code>.</p>
<p><b><code>-f <u></code> / <code>--fixed_len=<u></code></b></p>
<p>Trim all reads to a fixed length, still rejecting reads that don't meet specified quality thresholds at this length. Default is no fixed length.</p>
<p><b><code>-e <d></code> / <code>--entropy_filter=<d></code></b></p>
<p>Reject reads with mean per position measured information (entropy) below the given value (in bits per dinucleotide). The range of valid values is 0.0-4.0 inclusive. By default this filter is off.</p>
<p><b><code>--entropy_strict</code></b></p>
<p>Reject reads for low entropy overall, not just the retained part after trimming. By default this setting is off. Use of this setting requires use of <code>--entropy_filter</code>.</p>
<p><b><code>--mates_file</code></b></p>
<p>In addition to other outputs, produce a Velvet compatible interleaved paired output file (named <code><out_prefix>_mates.fastq</code>) with read2 mates reversed by default (to support SOLiD mate-paired reads.)</p>
<p><b><code>--no_rev</code></b></p>
<p>By default, for SOLiD colorspace read-pairs, the second read of each pair is reversed in the <code>mates</code> file produced by <code>--mates_file</code> (to put the mates on opposite strands, as Velvet expects). <code>--no_rev</code> disables this reversing, which is useful for SOLiD colorspace paired-end files. Nucleotide inputs are by default not reversed in the output generated by <code>--mates_file</code>, so this option has no effect in that case.</p>
<p><b><code>--force_rev</code></b></p>
<p>By default, for nucleotide (non-colorspace) read-pairs, the second read of each pair isn't reversed in the <code>mates</code> file produced by <code>--mates_file</code> (to keep the mates on opposite strands, as Velvet expects, assuming Illumina sequence). <code>--force_rev</code> enables this reversing, which is useful for SOLiD 5500+ nucleotide mate-paired files (nucleotide sequences are correctly reverse complemented). Colorspace inputs are reversed by default in the output generated by <code>--mates_file</code>, so this option has no effect in that case.</p>
<p><b><code>--only_mates</code></b></p>
<p>Suppress writing <code><out_prefix>.read1.fastq</code> and <code><out_prefix>.read2.fastq</code> output files. <code>--only_mates</code> may only be used with the <code>--mates_file</code> option. This option is useful when reads are being trimmed exclusively for assembly, to avoid unnecessary processing for files that will not be used. See also the <code>--fasta</code> option below to save even more space.</p>
<p><b><code>--fasta</code></b></p>
<p>Write FASTA format files instead of FASTQ files for all outputs. FASTA files are about half the size of FASTQ, so for some purposes (e.g. assembly, alignment) that may not take quality scores into account, FASTA may be a preferable output format to save disk space. When <code>--fasta</code> is used, no <code>.fastq</code> files will be written, and the output files will instead end with <code>.fasta[.gz]</code>.</p>
<hr>
<h2 id="a-nameref_selectref_selecta"><a name="ref_select"><code>ref_select</code></a></h2>
<p><code>ref_select</code> is a utility for calculating a wide variety of useful information from alignments of next-generation sequence reads to large reference sequence datasets. For example, it can be used to build a meta-genomic assembly graph from paired-reads aligned back to <em>de novo</em> assembled contigs. It can also be used to precisely select (and accurately estimate the relative abundances of) sequences from an alignment with a reference library (e.g. metagenomic reads to 16S rDNA database, or mRNA transcript reads to a database of genes from one or more genomes.)</p>
<p><code>ref_select</code> processes <a href="http://samtools.sourceforge.net">SAM</a> format alignment files into a <a href="#JSON">JSON format "sequence graph"</a> file ready to be processed by the <code>graph_ops</code> tool. Optionally, <code>ref_select</code> will also read FASTQ files and filter out reads (and mates) which are or aren't present in the provided SAM alignment. The JSON output file is structured as a graph of "nodes" (selected sequences) and "edges" (connections between nodes based on read-pairing). A variety of useful information for the nodes and edges is included in the output JSON file, which can be accessed using the <code>graph_ops</code> tool or custom code written in any language with a JSON library.</p>
<h4 id="usage-ref_select-options-sam_file1-sam_file2--outfilejson">Usage: <code>ref_select [options] <sam_file1> [<sam_file2>]... > <outfile.json></code></h4>
<p><code>ref_select</code> has a lot of options, and depending on what type of analysis you are doing, you may use quite a few of them. However, at its simplest, <code>ref_select</code> reads one or more SAM alignment files and writes a JSON "sequence graph" to STDOUT:</p>
<pre><code>ref_select alignment.single.sam > seq_graph.json</code></pre><p><code>ref_select</code> requires any SAM (or FASTQ) format input files to be named using the <a href="#FASTQ">SEAStAR naming convention used for FASTQ files</a>. Crucially, individual SAM alignment files must not mix alignments for reads from different categories (<code>read1</code>, <code>read2</code>, <code>single</code>), and the corresponding filename suffixes of the SAM files must match those of the FASTQ files used to generate them:</p>
<p><code>sample.read1.fastq</code> → <code>alignment.read1.sam</code>
<code>sample.read2.fastq</code> → <code>alignment.read2.sam</code>
<code>sample.single.fastq</code> → <code>alignment.single.sam</code></p>
<p>Note that <code>ref_select</code> may also write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when saving the output to a file, or the resulting saved JSON data file syntax may be corrupted by messages written to STDERR.</p>
<p>Of course, since these types of files may be very large, you may prefer to keep everything compressed:</p>
<pre><code>ref_select alignment.single.sam.gz | gzip -c > seq_graph.json.gz</code></pre><p>Because <code>ref_select</code> has many options, and produces many different kinds of outputs, we made the decision to organise all of the output data into a single output stream in a standard text-based format (<a href="http://www.json.org">JSON</a>), that is flexible and straightforward to process in any language. The JSON formatted output produced by <code>ref_select</code> is documented in the appendix: <a href="#JSON">JSON Sequence Graph File Format Reference</a>. However, for most tasks you will not need to look directly into these files because the <code>graph_ops</code> tool will handle that work for you (including the work of extracting data into other standard file formats).</p>
<p><code>ref_select</code> has four main pieces of functionality:</p>
<ol>
<li><p>Quantitatively analyzing reads aligned to a known reference database (e.g. metagenomic reads to a 16S rDNA database, or reads from mRNA transcripts to the gene models of one or more genomes):</p>
<pre><code> # Select only reference sequences with reads aligned scoring 100 or more bits
ref_select -t 100.0 alignment.single.sam > ref_stats.json
# Convert ref_select output to a tab separated (TSV) table for use downstream
graph_ops ref_stats.json TABLE > ref_stats.tsv</code></pre></li>
<li><p>Analyzing reads mapped back to <em>de novo</em> assembled contigs, to build a "sequence graph" for downstream quality checking, scaffolding, binning, visualization, etc.</p>
<pre><code> # Construct a sequence graph from the alignment, inserting the assembled contig seqs into the graph
ref_select --ref=contigs.fna -m alignment.read1.sam alignment.read2.sam > seq_graph.json
# Produce scaffolded sequence (where .go file is graph_ops SCRIPT)...
graph_ops seq_graph.json run_scaffold_pipeline.go > scaffolds.fna</code></pre></li>
<li><p>Converting SOLiD colorspace contigs into nucleotide contigs based on aligned reads</p>
<pre><code> # Rather than inserting contigs as above, they are reconstructed from the alignment
# and the provided read FASTQ files
ref_select -q -m -r mates.read1.fastq -r mates.read2.fastq alignment.read1.sam alignment.read2.sam > seq_graph.json
# Produce scaffolded sequence (where .go file is graph_ops SCRIPT)...
graph_ops seq_graph.json run_scaffold_pipeline.go > scaffolds.fna</code></pre></li>
<li><p>Filtering out a sub-set of reads (and their mates) based on the results of an alignment.</p>
<pre><code> # Write two new read files for all reads (and their mates) that did not map in the alignment
# NOTE: this is the one exception to the "all data goes into the JSON stream" principle described above.
# Filtered read output data is written to `FASTQ` format files, with a prefix specified using --read_output
ref_select --read_output=filtered_ --output_nomatch -r mates.read1.fastq -r mates.read2.fastq alignment.read1.sam alignment.read2.sam</code></pre></li>
</ol>
<p>The approaches used by <code>ref_select</code> to accomplish these tasks are described in more detail in the methods section of (<a href="http://www.sciencemag.org/content/335/6068/587.abstract">Iverson, <em>et al.</em> 2012</a>).</p>
<h3 id="the-options-reference-below-organizes-the-various-parameters-into-general-categories-related-to-the-above-tasks">The <code>[options]</code> reference below organizes the various parameters into general categories related to the above tasks:</h3>
<ul>
<li><a href="#ref_select_basic">Basic parameters</a></li>
<li><a href="#ref_select_bitscore">Bitscore calculation and reference selection parameters</a></li>
<li><a href="#ref_select_coverage">Coverage calculation parameters</a></li>
<li><a href="#ref_select_recon">Sequence reconstruction and read filtering parameters</a></li>
<li><a href="#ref_select_pairing">Read pairing statistical parameters</a></li>
<li><a href="#ref_select_inputs">Input file parameters</a><br>
<br>
</li>
</ul>
<h3 id="a-nameref_select_basicref_select-basic-parametersa-options-"><a name="ref_select_basic"><code>ref_select</code> basic parameters</a> <code>[options]</code> :</h3>
<p><code>[-h|--help]</code> <code>[-v|--version]</code> <code>[--verbose]</code> <code>[-d <seq_id>|--detail=<seq_id>]...</code> <code>[--detail_file=<detail_file>]</code> <code>[-q|--recon_seq]</code> <code>[--ref=<ref_file>]</code> <code>[-m|--mate_pair]</code> <code>[--split=<n>]</code> <code>[--separate_strands]</code> <code>[--rollup]</code> <code>[--seed=<n>]</code> <code>[--num_threads=<n>]</code></p>
<p><b><code>-h</code> / <code>--help</code></b></p>
<p>Request help.</p>
<p><b><code>-v</code> / <code>--version</code></b></p>
<p>Print the build version and exit.</p>
<p><b><code>--verbose</code></b></p>
<p>Print detailed diagnostic status to stderr during run. Off by default.</p>
<p><b><code>-d <seq_id></code> / <code>--detail=<seq_id></code></b></p>
<p>Only produce per-base statistics and reconstructed sequence for the listed seq ids. By default, this is done for all selected sequences. For very large reference sequence databases, <code>ref_select</code> needs to perform a lot of work and can use a lot of memory and produce a very large output. If you want to analyze the entire reference dataset, but only produce detailed results for a small subset of sequences, the <code>--detail</code> option allows you to specify just those sequences (by <code><seq_id></code>, one per use of <code>-d</code> / <code>--detail</code>). To specify more than a few sequences, use <code>--detail_file</code> instead.</p>
<p><b><code>--detail_file=<detail_file></code></b></p>
<p>Filename of a file containing sequence ids to be treated as with <code>--detail</code>. The format is a plain text file with one <code><seq_id></code> per line.</p>
<p><b><code>-q</code> / <code>--recon_seq</code></b></p>
<p>Enable reconstruction of contig sequences using alignment data (SAM) and reads (FASTQ). This works for both colorspace and nucleotide data (or a combination of the two), and the resulting sequence is always nucleotide sequence. The reconstructed contigs may contain ambiguity codes (e.g. for SNPs or variable positions) and the sensitivity of this process is controlled with the <code>ambig_tol</code> parameter. NOTES: Requires use of <code>--read_file</code>. If <code>--detail</code> is used, only the specified sequences are reconstructed.</p>
<p><b><code>--ref=<ref_file></code></b></p>
<p>For assemblies where the contigs and alignments are already nucleotide sequences, this option allows you to specify a FASTA file containing the reference contigs used in the alignment (as an alternative to using <code>--recon_seq</code>). In this case, the reference sequences are simply copied into the output JSON sequence graph for use in downstream analysis (scaffolding, binning, etc).</p>
<p><b><code>-m</code> / <code>--mate_pair</code></b></p>
<p>This option is required to enable the pairing analysis between mated reads (<code>.read1</code> and <code>.read2</code> alignments). When paired reads are available and <code>--mate_pair</code> is used, "edges" will be included in the output JSON data file, connecting the "nodes" into a full sequence graph structure. Enable this option when you want to produce scaffolds for a <em>de novo</em> assembly.</p>
<p><b><code>--split=<n></code></b></p>
<p>Split reference sequences into n-base pieces for analysis. The default is no splitting. This is a specialized option that performs all <code>ref_select</code> analyses as though large reference sequences (e.g. whole genomes) are actually split into many separate chunks. This can be useful for testing, but also for performing genome connection and rearrangement analyses when pairing information is available and the sequence read data is from a different organism (e.g. strain or closely related species) than the reference sequence used.</p>
<p><b><code>--separate_strands</code></b></p>
<p>Top and bottom strands of reference sequences are considered separately. Each reference sequence effectively becomes two references, with reads mapping to each strand contributing to the analysis for that strand's reference only. NOTE: <code>--separate_strands</code> may not be used with: mate-pairing (<code>--mate_pair</code>), sequence reconstruction (<code>--recon_seq</code>) or splitting (<code>--split</code>)</p>
<p><b><code>--rollup</code></b></p>
<p>When <code>--split</code> is used, or when a <code>--catalog</code> file is specified (e.g. to join reference scaffolds or chromosomes into whole genomes), the output JSON data will include a <code>rollup_stats</code> object with information about each parent reference sequence (e.g. whole genome) calculated from the stats of the individual underlying sequences.</p>
<p><b><code>--seed=<n></code></b></p>
<p>SEAStAR uses its own random number generator (for reproducibility among different platforms). You may use a different seed to investigate what difference randomized decisions made in the analysis are having on the results. By default <code>--seed=12345</code>.</p>
<p><b><code>--num_threads=<n></code></b></p>
<p><code>ref_select</code> is highly multithreaded to support modern multi-core processors. Sometimes you will want to restrict the number of cores it uses (e.g. to prevent resource competition on clusters or shared computers, or when you are running more than one instance of <code>ref_select</code> at a time on a given computer.) By default <code>--num_threads</code> is the number of cores on the machine (including "hyper-threads"). The minimum number of threads required for proper operation is 2 or 3 depending on the parameters being used, and <code>ref_select</code> checks to ensure that the proper minimum number of threads is allocated.</p>
<h3 id="a-nameref_select_bitscoreref_select-parameters-for-bitscore-calculation-and-reference-selectiona-options-"><a name="ref_select_bitscore"><code>ref_select</code> parameters for bitscore calculation and reference selection</a> <code>[options]</code> :</h3>
<p><code>[-t <n>|--bit_thresh=<n>]</code> <code>[-f <n>|--bit_fraction=<n>]</code> <code>[-l <n>|--read_map_limit=<n>]</code> <code>[-s|--second_chance_reads]</code> <code>[--relax_read_sharing]</code> <code>[--all_taxa]</code> <code>[-a|--absolute_bitscores]</code> <code>[--no_rand]</code> <code>[--sim_frac=<n>]</code> <code>[-e <seq_id>|--exclude=<seq_id>]...</code> <code>[--exclude_file=<exclude_file>]</code> <code>[--invert_exclude]</code></p>
<p><b><code>-t <n></code> / <code>--bit_thresh=<n></code></b></p>
<p>Minimum bitscore value for a ref sequence to be selected for output. The default is > 0.0, i.e. all sequences with a positive bitscore. NOTE: if both <code>--bit_fraction</code> and <code>--bit_thresh</code> are used, the higher resulting threshold will be used. Selection thresholds are useful for establishing a "noise floor" to exclude sequences that have only a small number of uniquely aligning reads, which in a large sequence database (e.g. 16S rDNA databases) may reflect nothing more than sequencing errors.</p>
<p><b><code>-f <n></code> / <code>--bit_fraction=<n></code></b></p>
<p>Minimum bitscore value for a ref sequence to be selected for output, as a fraction of the bitscore for the top scoring sequence. NOTE: if both --bit_fraction and --bit_thresh are used, the highest resulting threshold will be used.</p>
<p><b><code>-l <n></code> / <code>--read_map_limit=<n></code></b></p>
<p>Maximum number of reference mappings before a read is rejected from consideration. Default is 50000 reference sequences. Reads that map with a large fraction of the sequences in a reference database (e.g. low complexity reads, or highly conserved sequence) carry very little information but require a lot of computational effort and memory to process. If your sequence database has a lot of conserved or redundant sequence (e.g. 16S rDNA), then this parameter will save a lot of computational expense with little if any effect on the results. If desired, a 2-pass approach can be used where reads are aligned to the full database and run through <code>ref_select</code> first to eliminate the vast majority of the database from consideration. Then in a second pass, reads are re-aligned to just the sequences selected in the first round, so that all aligning reads can be accounted for (e.g. in coverage and relative abundance calculations).</p>
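<p>The two-pass approach described above might be sketched as follows; the threshold value and all filenames are hypothetical placeholders (the alignment steps, which depend on your aligner, are omitted):</p>

```shell
# Pass 1: align reads to the full database (aligner-specific, not shown),
# then select reference sequences above a noise-floor bitscore threshold
ref_select -t 100.0 full_alignment.sam > pass1.json

# Extract a reduced FASTA containing only the selected sequences
graph_ops pass1.json FASTA '{"file":"selected.fna"}'

# Pass 2: re-align the reads to selected.fna only (not shown), then re-run
# ref_select so that all aligning reads are accounted for in coverage
# and relative abundance calculations
ref_select second_alignment.sam > final.json
```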
<p><b><code>-s</code> / <code>--second_chance_reads</code></b></p>
<p>Allow reads to be mapped to edit dist +1 mappings when their best ref is eliminated. By default, if a read maps to a sequence that ultimately doesn't meet the bitscore threshold being used to limit selection, that read is removed from further downstream consideration, if it didn't also map at least as well with some other sequence that is ultimately selected for output. When <code>--second_chance_reads</code> is used, such reads gain a "second chance" to be reallocated to reference sequence(s) where they map with edit distance +1 from the eliminated best match(es). This parameter causes <code>ref_select</code> to use significantly more memory. Note: this has no effect when <code>--bit_fraction</code> and <code>--bit_thresh</code> are at their default values, or when <code>--relax_read_sharing</code> is used.</p>
<p><b><code>--relax_read_sharing</code></b></p>
<p>Allow reads to be mapped to all sequences, regardless of edit distance. By default, reads are only associated with the sequence(s) that they align with at the minimum edit distance. <code>--relax_read_sharing</code> causes reads to be associated with all sequence alignments, even if they are suboptimal relative to the best match.</p>
<p><b><code>--all_taxa</code></b></p>
<p>Use count of all taxa in catalog to calculate bitscores. By default, any taxa in the reference database that have zero read alignments are excluded from consideration in the information theoretic bitscore calculations (i.e. the calculations proceed as though those sequences didn't exist in the reference database at all).</p>
<p><b><code>-a</code> / <code>--absolute_bitscores</code></b></p>
<p>Use absolute bitscores to score and select reference sequences. By default length normalized scores (bits/nt) are used to control for differences in reference sequence length.</p>
<p><b><code>--no_rand</code></b></p>
<p>By default, reads that map equally well to multiple locations within a single reference sequence are randomly assigned to one of the possible positions. The <code>--no_rand</code> option changes this behavior so that such a read will be mapped to the first position in the reference seen in the alignment file. Counterintuitively, it is likely that using <code>--no_rand</code> will make the output of <code>ref_select</code> <strong><em>less</em></strong> deterministic. This is because the pseudo-random number generator used by SEAStAR is cross-platform deterministic, whereas the process by which such multiple alignments are ordered in an alignment SAM file may be at least partially random (say through multi-thread race conditions in the alignment software.)</p>
<p><b><code>--sim_frac=<n></code></b></p>
<p>Fraction of aligned reads to use. Reads (and their mates) are randomly sampled from the input alignment(s) in this proportion. Primarily for use with simulated datasets.</p>
<p><b><code>-e <seq_id></code> / <code>--exclude=<seq_id></code></b></p>
<p>This option is conceptually similar to the <code>--detail</code> option, except that reference sequences identified with <code>--exclude</code> are discarded from the analysis entirely, as though they were not in the reference database to begin with. This is useful if a long, expensive alignment to a large reference sequence database has been completed, and then some downstream analysis calls for processing alignments with only a subset of that database.</p>
<p><b><code>--exclude_file=<exclude_file></code></b></p>
<p>Filename of file containing sequence ids to treat as in <code>--exclude</code>. Format is a plain text file with one <code><seq_id></code> per line.</p>
<p><b><code>--invert_exclude</code></b></p>
<p><code>--exclude</code> or <code>--exclude_file</code> sequences are inverted; that is, exclude all sequences except those specified.</p>
<h3 id="a-nameref_select_coverageref_select-parameters-for-coverage-calculationsa-options-"><a name="ref_select_coverage"><code>ref_select</code> parameters for coverage calculations</a> <code>[options]</code> :</h3>
<p><code>[--per_base]</code> <code>[-w <n>|--cov_window=<n>]</code> <code>[--gc_window=<n>]</code> <code>[--detect_dups]</code></p>
<p><b><code>--per_base</code></b></p>
<p>Per base statistics (coverage; and when <code>--mate_pair</code> is enabled, physical coverage and mean insert length) will be generated and output for all selected reference sequences (or only those specified by <code>--detail</code>).</p>
<p><b><code>-w <n></code> / <code>--cov_window=<n></code></b></p>
<p>Fixed read length for use in calculating per position coverage. By default the actual length of each read is used when available in the FASTQ readnames (<a href="#FASTQ">see the SEAStAR FASTQ format guide</a> for details.) If the readnames do not include length information and <code>--cov_window</code> isn't specified, then 49 nucleotides is the default fixed read length used.</p>
<p><b><code>--gc_window=<n></code></b></p>
<p>Enables per position %GC moving average calculation, within the specified window size. The results are output in the <code>per_nt_mean_gc</code> array in the JSON output for each selected node (or only those specified by <code>--detail</code>). Use of this option requires <code>--per_base</code> statistics and either <code>--recon_seq</code> or <code>--ref</code> to provide the sequence data necessary for the calculation. If <code>--mp_circular</code> is used, then the %GC calculation wraps around at the sequence ends with no discontinuity, otherwise the N/2 %GC values at each end of the sequence all have the same value as the first (or last) valid window.</p>
<p><b><code>--detect_dups</code></b></p>
<p>Perform additional analysis to detect likely collapsed duplications. Assumes relatively uniform coverage (i.e. a single isolate genome). Results are reported in the <code>contig_problems</code> section of the <a href="#JSON">JSON sequence graph</a>, a summary of which is available via the <code>graph_ops</code> <a href="#PROBS"><code>PROBS</code></a> command.</p>
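<p>Putting the coverage options together, a run producing per-base statistics and a 1000 nt %GC moving average might look like the following sketch (all filenames are hypothetical; note that <code>--gc_window</code> requires both <code>--per_base</code> and a source of sequence, here <code>--recon_seq</code>):</p>

```shell
# Per-base coverage plus %GC moving average, reconstructing sequence
# from the reads so the %GC calculation has nucleotide data to work with
ref_select --per_base --gc_window=1000 -q \
    -r sample.read1.fastq -r sample.read2.fastq \
    sample.read1.sam sample.read2.sam > per_base_graph.json
```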
<h3 id="a-nameref_select_reconref_select-parameters-for-sequence-reconstruction-and-read-filteringa-options-"><a name="ref_select_recon"><code>ref_select</code> parameters for sequence reconstruction and read filtering</a> <code>[options]</code> :</h3>
<p><code>[--ambig_tol=<n>]</code> <code>[--read_output=<prefix>]</code> <code>[--output_nomatch]</code> <code>[--read_output_gzip]</code></p>
<p><b><code>--ambig_tol=<n></code></b></p>
<p>Tolerance adjusting the sensitivity for generating ambiguity codes in output sequence reconstructed from read alignments. Range: 0.0 - 1.0, where 1.0 produces no ambiguities (majority-rules base calling). Default is 0.2.</p>
<p><b><code>--read_output=<prefix></code></b></p>
<p>Enables output of reads aligning with the reference database (and their mates, regardless of whether the mates align.) <code><prefix></code> indicates the FASTQ output filename prefix to use in writing the new output files. Must be used with <code>--read_file</code> and/or <code>--rev_read_file</code>.</p>
<p><b><code>--output_nomatch</code></b></p>
<p>Inverts the meaning of <code>--read_output</code> to write only reads that do not align with any reference sequence (and whose mates also do not align).</p>
<p><b><code>--read_output_gzip</code></b></p>
<p>Modifies <code>--read_output</code> to write gzip compressed output files.</p>
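<p>For example, the three read-filtering options above might be combined to write only the non-aligning reads (and their mates) as gzip compressed FASTQ files; the filenames here are hypothetical:</p>

```shell
# Write reads (and mates) that do NOT align with any reference,
# as gzipped FASTQ files with the prefix "unmatched"
ref_select --read_output=unmatched --output_nomatch --read_output_gzip \
    -r sample.read1.fastq -r sample.read2.fastq \
    sample.read1.sam sample.read2.sam > filtered.json
```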
<h3 id="a-nameref_select_pairingref_select-parameters-for-generating-read-pairing-statisticsa-options-"><a name="ref_select_pairing"><code>ref_select</code> parameters for generating read pairing statistics</a> <code>[options]</code> :</h3>
<p><code>[--mp_mate_cnt=<n>]</code> <code>[--mp_mate_lim=<n>]</code> <code>[--mp_share_lim=<n>]</code> <code>[--mp_strict]</code> <code>[--mp_inserts]</code> <code>[--mp_circular]</code> <code>[--mp_cutoff=<n>]</code></p>
<p><b><code>--mp_mate_cnt=<n></code></b></p>
<p>Minimum count of mapping read-pairs for generation of a mate-pair linking edge in the output graph (normal or internal). Default is 2, meaning that no edges consisting of a single read-pair will be output.</p>
<p><b><code>--mp_mate_lim=<n></code></b></p>
<p>Minimum bitscore threshold for generation of a mate-pair linking edge in the output graph (normal or internal). Default is 0.0 bits; however, <code>--mp_mate_cnt</code> above puts a floor on the lowest scoring edges output.</p>
<p><b><code>--mp_share_lim=<n></code></b></p>
<p>Minimum bitscore threshold for generation of a shared sequence edge in the output graph. Default is 500.0 bits.</p>
<p><b><code>--mp_strict</code></b></p>
<p>Do not consider paired reads that span reference sequences when the same pair of reads also map entirely within some single reference sequence.</p>
<p><b><code>--mp_inserts</code></b></p>
<p>Perform insert size estimation for all reference sequences where at least one pair of reads both map within that sequence.</p>
<p><b><code>--mp_circular</code></b></p>
<p>Allow circular self-linking mate-pairs to join the opposite ends of a single sequence.</p>
<p><b><code>--mp_cutoff=<n></code></b></p>
<p>Insert size cutoff for inclusion of an individual mate-pair in per base statistics. Setting this option to some reasonable upper limit on the length of a valid insert will prevent sequence duplications from improperly skewing estimation of the true mean insert size.</p>
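<p>A typical pairing-enabled run for <em>de novo</em> scaffolding might combine these options as below; the parameter values and filenames are illustrative only, not recommended defaults:</p>

```shell
# Enable mate-pairing, sequence reconstruction, insert-size estimation
# and circular self-links; require at least 3 read-pairs per linking edge
# and ignore apparent inserts longer than 5000 nt in per-base statistics
ref_select -m -q --mp_inserts --mp_circular \
    --mp_mate_cnt=3 --mp_cutoff=5000 \
    -r asm.read1.fastq -r asm.read2.fastq \
    asm.read1.sam asm.read2.sam > asm_graph.json
```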
<h3 id="a-nameref_select_inputsref_select-parameters-for-specifying-input-filesa-options-"><a name="ref_select_inputs"><code>ref_select</code> parameters for specifying input files</a> <code>[options]</code> :</h3>
<p><code>[-r <read_file>|--read_file=<read_file>]...</code> <code>[--rev_read_file=<read_file>]...</code> <code>[--old_bwa_samse]</code> <code>[-c <catalog_file>|--catalog=<catalog_file>]</code> <code>[--rev_align=<sam_file>]...</code></p>
<p><strong>NOTE!</strong> All inputs to <code>ref_select</code> involving per read data (FASTQ and SAM files) must be named using the SEAStAR <a href="#FASTQ">FASTQ naming conventions</a>; also see the <a href="#ref_select">introductory description of ref_select</a> for an example.</p>
<p><b><code>-r <read_file></code> / <code>--read_file=<read_file></code></b></p>
<p>Input filename(s) for FASTQ read files used for sequence reconstruction and/or read filtering.</p>
<p><b><code>--rev_read_file=<read_file></code></b></p>
<p>Same as <code>--read_file</code> above, but used for paired reads that are on the opposite strand from each other. By default, <code>ref_select</code> assumes that paired reads are from the same DNA strand. When this is not true, then one of the two readsets (read1/read2) must be specified using <code>--rev_read_file</code>. Note, this must be done consistently with the use of <code>--rev_align</code>.</p>
<p><b><code>--old_bwa_samse</code></b></p>
<p>Use the old BWA-style "samse" alignment format instead of SAM. This format was used by the BWA <code>samse -n</code> command until BWA version 0.5.6.</p>
<p><b><code>-c <catalog_file></code> / <code>--catalog=<catalog_file></code></b></p>
<p>Filename of a reference catalog file. Required for use with <code>--old_bwa_samse</code>. This is optional otherwise, although it is required to associate sequence descriptions with reference sequences in the JSON output, or for use of the <code>--rollup</code> functionality when reference sequences are actually subsequences of parent sequences (e.g. scaffolds or chromosomes of a whole genome). A catalog is a tab separated (TSV) plain text file with the following columns:</p>
<pre><code> seq_id seq_len parent_id seq_description</code></pre><p>For example:</p>
<pre><code> ABCD123 1102938 ORG1 Organism 1, Scaffold 1
EFGH987 482273 ORG1 Organism 1, Scaffold 2
ZYXW765 193475 ORG2 Organism 2, Scaffold 1
RTSP345 782273 ORG2 Organism 2, Scaffold 2</code></pre><p>Note that each <code>seq_id</code> must be unique, but duplicated <code>parent_id</code> values indicate relationships among sequences (a shared parent sequence).</p>
<p><b><code>--rev_align=<sam_file></code></b></p>
<p>Same as the main <sam_file> input filenames, but used for paired reads that are on the opposite strand from each other. By default, <code>ref_select</code> assumes that paired reads are from the same DNA strand. When this is not true, then alignments for one of the two readsets must be specified using <code>--rev_align</code>. Note, this must be done consistently, and also with the use of <code>--rev_read_file</code> when input FASTQ files are also used.</p>
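<p>For example, if the <code>.read2</code> reads are on the opposite strand from their mates, both those reads and their alignments must be supplied through the reverse-input options together; the filenames here are hypothetical:</p>

```shell
# read2 reads and their alignments both go through the rev_* options,
# keeping --rev_read_file and --rev_align consistent with each other
ref_select -m -q \
    -r sample.read1.fastq --rev_read_file=sample.read2.fastq \
    sample.read1.sam --rev_align=sample.read2.sam > graph.json
```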
<hr>
<h2 id="a-namegraph_opsgraph_opsa"><a name="graph_ops"><code>graph_ops</code></a></h2>
<p><code>graph_ops</code> reads <a href="#JSON">JSON sequence graph files</a> produced by the <code>ref_select</code> program. It implements a variety of commands for manipulating this data for assembly, visualization or output to a variety of file formats (as detailed for each subcommand below). Because the JSON sequence graph objects are often very large, and therefore may take significant time to read and write, we designed <code>graph_ops</code> as a federated suite of tools that can pass the current sequence graph state from one tool to the next in-memory. This saves considerable time and I/O traffic relative to an alternative design that used separate command line tools for each operation performed by <code>graph_ops</code>. A consequence of this design is that <code>graph_ops</code> may be used in either a scripted mode, where all commands are supplied via the command-line or a script (<code>.go</code>) file; or in an interactive mode, where commands are manually entered one at a time.</p>
<h4 id="usage-graph_ops--h--help-inputjson-scriptgo-command-params">Usage: <code>graph_ops [-h|--help] [<input.json>] [<script.go>] [<command> ['{params}']]...</code></h4>
<p>where:</p>
<ul>
<li><p><b><code>-h</code> / <code>--help</code></b> provides command line help regarding program usage and version information</p>
<pre><code> # Provide help about shell command line usage
graph_ops --help
# Provide help about graph_ops commands
graph_ops HELP # HELP is actually a graph_ops command</code></pre></li>
<li><p><b><code><command></code></b> is some valid command and <b><code>'{params}'</code></b> optionally specifies parameters for each command, in the form: <code>'{"parm1":value1,"parm2":value2...}'</code>. Multiple commands with optional parameters may be provided in succession for execution. NOTE: the parameters for each command must be single quoted to ensure that it is passed to <code>graph_ops</code> as a single item, and to prevent interactions with special characters used by the UNIX shell.</p>
<pre><code> # Read a sequence graph and output all of the sequences to a FASTA file
graph_ops LOAD '{"file":"assembly.json"}' FASTA '{"file":"sequence.fna"}'</code></pre><p> Like most of the other tools in SEAStAR, <code>graph_ops</code> can read and write gzip compressed versions of all of its input and output files:</p>
<pre><code> # Read a compressed sequence graph and output all of the sequences to a compressed FASTA file
graph_ops LOAD '{"file":"assembly.json.gz"}' FASTA '{"file":"sequence.fna.gz"}'</code></pre></li>
<li><p><b><code><input.json></code></b> is an optional shortcut specifying a sequence graph file to initially <code>LOAD</code></p>
<pre><code> # Load the sequence graph file and go interactive
graph_ops assembly.json
# Equivalent to
graph_ops LOAD '{"file":"assembly.json"}' SCRIPT</code></pre></li>
<li><p><b><code><script.go></code></b> is an optional <code>SCRIPT</code> to initially execute</p>
<pre><code> # Load the sequence graph file and run the scaffold script
graph_ops assembly.json write_fasta.go
# Equivalent to
graph_ops LOAD '{"file":"assembly.json"}' SCRIPT '{"file":"write_fasta.go"}'
# If the write_fasta script contains the LOAD command, then just this will suffice
graph_ops write_fasta.go</code></pre><p> In this last case, the <code>write_fasta.go</code> script could be (note that single quotes around the parameters in a <code>.go</code> file are not necessary):</p>
<pre><code> # This is the write_fasta.go file
LOAD {"file":"assembly.json"}
FASTA {"file":"sequence.fna"}</code></pre></li>
</ul>
<p>Running <code>graph_ops</code> without <strong><em>any</em></strong> options will launch in interactive mode, equivalent to: <code>graph_ops SCRIPT</code></p>
<p>Note that many <code>graph_ops</code> commands may write warnings, errors and diagnostic information to STDERR, so it is important to keep STDOUT and STDERR separate when using commands such as <code>DUMP</code>, <code>DOT</code>, <code>TABLE</code>, or <code>FASTA</code> (without specifying an output filename) and redirecting the STDOUT of <code>graph_ops</code> to a file. Likewise, only one such output command may use STDOUT per run of <code>graph_ops</code> (e.g. redirecting the output of both <code>FASTA</code> and <code>DUMP</code> to the same file will result in a file that is both invalid JSON and invalid FASTA format.)</p>
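<p>For example, when capturing <code>FASTA</code> output from STDOUT, redirecting STDERR elsewhere keeps warnings and diagnostics out of the sequence file:</p>

```shell
# FASTA sequence goes to contigs.fna; diagnostics go to graph_ops.log
graph_ops assembly.json FASTA > contigs.fna 2> graph_ops.log
```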
<h3 id="graph_ops-command-summary"><code>graph_ops</code> <code><command></code> summary:</h3>
<h4 id="file-io-commands">File I/O commands</h4>
<ul>
<li><strong><a href="#LOAD"><code>LOAD</code></a></strong> -- Input current data structure from a file, or by default from STDIN</li>
<li><strong><a href="#DUMP"><code>DUMP</code></a></strong> -- Output current data structure to a file, or by default to STDOUT</li>
<li><strong><a href="#TABLE"><code>TABLE</code></a></strong> -- Write <code>ref_select</code> statistics for selected reference sequences to a TSV file</li>
<li><strong><a href="#FASTA"><code>FASTA</code></a></strong> -- Write sequences contained in the selected graph to the FASTA format</li>
<li><strong><a href="#DOT"><code>DOT</code></a></strong> -- Write the current assembly graph to the graphviz DOT format</li>
</ul>
<h4 id="assembly-pipeline-commands-in-this-order">Assembly pipeline commands (in this order)</h4>
<ul>
<li><strong><a href="#MST"><code>MST</code></a></strong> -- Calculate the Maximal Spanning Tree of all connected components</li>
<li><strong><a href="#SST"><code>SST</code></a></strong> -- Calculate the improved Scaffold Spanning Tree of all connected components. <code>SST</code> is an optional replacement for <code>MST</code>, useful for metagenomes or relatively short contigs.</li>
<li><strong><a href="#PLUCK"><code>PLUCK</code></a></strong> -- Remove all leaf contig nodes (in- or out-degree == 0) from the graph</li>
<li><strong><a href="#PRUNE"><code>PRUNE</code></a></strong> -- Split the assembly graph at all contig nodes with in or out degree > 1</li>
<li><strong><a href="#SLICE"><code>SLICE</code></a></strong> -- Break linear scaffolds at a GC / coverage discontinuity. <code>SLICE</code> is optional, but recommended for metagenomes.</li>
<li><strong><a href="#PUSH"><code>PUSH</code></a></strong> -- Extends scaffold ends (reversing the action of PLUCK at scaffold ends)</li>
<li><strong><a href="#INSERT"><code>INSERT</code></a></strong> -- Reconnect contigs with ambiguous placement within a scaffold using mapped pair position information to resolve ambiguities</li>
<li><strong><a href="#SCAFF"><code>SCAFF</code></a></strong> -- Lay out a fully linear scaffold from contigs in unambiguously ordered connected components</li>
<li><strong><a href="#CLUST"><code>CLUST</code></a></strong> -- Cluster scaffolds using the <code>tetracalc</code> tool</li>
</ul>
<h4 id="assembly-graph-filter--edit-utilities">Assembly graph filter / edit utilities</h4>
<ul>
<li><strong><a href="#CIRCLE"><code>CIRCLE</code></a></strong> -- Find circular sequences among scaffolds.</li>
<li><strong><a href="#CCOMPS"><code>CCOMPS</code></a></strong> -- Calculate the current graph connected components</li>
<li><strong><a href="#SELCC"><code>SELCC</code></a></strong> -- Select specific connected components for further processing</li>
<li><strong><a href="#SELCLUST"><code>SELCLUST</code></a></strong> -- Select clusters of scaffolds for further processing</li>
<li><strong><a href="#SELND"><code>SELND</code></a></strong> -- Select contigs from the connected neighborhoods of the given contigs</li>
<li><strong><a href="#SCAFLNK"><code>SCAFLNK</code></a></strong> -- Find edge connections linking scaffold ends</li>
<li><strong><a href="#RELINK"><code>RELINK</code></a></strong> -- Reconnect previously removed edge connections between contigs</li>
<li><strong><a href="#EDGFLT"><code>EDGFLT</code></a></strong> -- Remove all edges scoring less than some number of bits</li>
<li><strong><a href="#PROBS"><code>PROBS</code></a></strong> -- Generate a report of contigs with likely internal assembly issues</li>
<li><strong><a href="#EDIT"><code>EDIT</code></a></strong> -- Make manual explicit edits to the selected graph structure</li>
<li><strong><a href="#CUTND"><code>CUTND</code></a></strong> -- Create a new node from an existing node using the given coordinates</li>
</ul>
<h4 id="information-and-control">Information and control</h4>
<ul>
<li><strong><a href="#GC"><code>GC</code></a></strong> -- Generate statistics about the assembly graph</li>
<li><strong><a href="#GCC"><code>GCC</code></a></strong> -- Generate statistics about the assembly graph, with details about each connected component</li>
<li><strong><a href="#STASH"><code>STASH</code></a></strong> -- Copy (push) the current graph to the top of a stack in memory</li>
<li><strong><a href="#UNSTASH"><code>UNSTASH</code></a></strong> -- Restore (pop) the current graph from the top of the stack (undo changes)</li>
<li><strong><a href="#SCRIPT"><code>SCRIPT</code></a></strong> -- Use contents of file as a series of commands, or read from the console if no file provided</li>
<li><strong><a href="#HELP"><code>HELP</code></a></strong> -- List all valid commands, or provide detailed help for a specific command with <code>HELP <COMMAND></code></li>
</ul>
<hr>
<h3 id="graph_ops-detailed-command-reference"><code>graph_ops</code> detailed <code><command></code> reference:</h3>
<h3 id="a-nameloadloada"><a name="LOAD"><code>LOAD</code></a></h3>
<p>Input JSON sequence graph data from a file, or by default from STDIN.</p>
<p>On the command line, if the first parameter isn't a valid command string, and it ends in <code>.json[.gz]</code>, then it is assumed to be the name of a JSON file, and an implicit <code>LOAD</code> command will be run using that filename.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><b><code>file : "filename.json[.gz]"</code></b></p>
<p> Specify the name of a JSON format sequence graph file.</p>
<p> Examples:</p>
<pre><code> # Read data from the file my_assembly.json
LOAD {"file":"my_assembly.json"}
# Read data piped from STDIN (default)
LOAD {"file":"-"}</code></pre></li>
<li><p><b><code>files : ["filename1.json[.gz]",...]</code></b></p>
<p> Specify the names of multiple JSON format sequence graph files to merge together.</p>
<p> NOTE: The files must have uniquely named sequence nodes (contigs) with no nodes in common among the multiple files being loaded. Only the processing history from the first JSON file is retained (the other histories may still be found in their original files.) Other processing output such as scaffolds and clusters are lost.</p>
<p> Example:</p>
<pre><code> # Read and merge data from the files my_assembly_1.json and my_assembly_2.json
LOAD {"files":["my_assembly_1.json","my_assembly_2.json"]}</code></pre></li>
<li><p><b><code>edges : "edge_file.json[.gz]"</code></b></p>
<p> Specify the name of a JSON format sequence graph file to load edges and mate-pair statistics from.</p>
<p> This option works with the <code>file</code> option, and requires that the edge file contain a graph with a subset of the nodes of the main graph being loaded with <code>file</code>. Any edges and mate-pair statistics present in the main graph are discarded in favor of those from the <code>edges</code> file.</p>
<p> Example:</p>
<pre><code> # Read nodes and sequence data from main_file.json, and edges and connection
# data from edge_file.json
LOAD {"file":"main_file.json","edges":"edge_file.json"}</code></pre></li>
</ul>
<h3 id="a-namedumpdumpa"><a name="DUMP"><code>DUMP</code></a></h3>
<p>Output current sequence graph data to a JSON file, or by default to STDOUT</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><b><code>file : "filename.json[.gz]"</code></b></p>
<p> Specify a file name to write the JSON format sequence graph to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name.</p>
<p> Examples:</p>
<pre><code> # Write data to the file my_assembly.json
DUMP {"file":"my_assembly.json"}
# Write data piped to STDOUT (default)
DUMP {"file":"-"}
# Write data to the file: my_assembly_000.json (first time this is run)
DUMP {"file":"my_assembly_###.json"}
# Run again, write data to the file: my_assembly_001.json
DUMP {"file":"my_assembly_###.json"}</code></pre></li>
</ul>
<h3 id="a-nametabletablea"><a name="TABLE"><code>TABLE</code></a></h3>
<p>Write <code>ref_select</code> statistics for selected reference sequences to a TSV file. The fields are (left to right by column):</p>
<ul>
<li><code>bitscore</code> - Information content of reads aligning with the ref sequence</li>
<li><code>read_cnt</code> - Number of (possibly fractional) reads aligning with the ref sequence</li>
<li><code>norm_cnt</code> - Read_cnt normalized to ref sequence length</li>
<li><code>rel_abun</code> - Relative fractional abundance (copy number) of the ref sequence relative to all selected sequences (those with bitscores above the selection threshold)</li>
<li><code>mean_cov</code> - Mean coverage of the reference sequence by (possibly fractional) reads</li>
<li><code>read_len</code> - Mean length of reads aligning with this sequence</li>
<li><code>seq_len</code> - Length of the ref sequence</li>
<li><code>pct_gc</code> - Percent GC content of ref sequence (NA if not calculated)</li>
<li><code>name</code> - Catalog name of the ref sequence</li>
<li><code>desc</code> - Catalog description of the ref sequence</li>
</ul>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>file : "filename.tsv[.gz]"</code></p>
<p> Specify a file name to write the TSV format stats table to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name. (See the <code>DUMP</code> command for an example of this behavior.)</p>
<p> Examples:</p>
<pre><code> # Write stats to the file my_seq.tsv
TABLE {"file":"my_seq.tsv"}
# Write stats to STDOUT (default)
TABLE {"file":"-"}</code></pre></li>
<li><p><code>header : true</code></p>
<p> Write a header row in the output table with labels for each column.</p>
<p> Example:</p>
<pre><code> # Write a header row (false is default).
TABLE {"header":true}</code></pre></li>
</ul>
<h3 id="a-namefastafastaa"><a name="FASTA"><code>FASTA</code></a></h3>
<p>Write sequences contained in the current graph data to a FASTA format file.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>file : "filename.fasta[.gz]"</code></p>
<p> Specify a file name to write the FASTA format sequence data to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name. (See the <code>DUMP</code> command for an example using this behavior.)</p>
<p> Examples:</p>
<pre><code> # Write sequences to the file my_seq.fasta
FASTA {"file":"my_seq.fasta"}
# Write sequence to STDOUT (default)
FASTA {"file":"-"}</code></pre></li>
<li><p><code>scaff : true</code> or <code>scaff : "[options]"</code></p>
<p> Output fully scaffolded sequences (using the <code>seq_scaffold</code> tool). Using this parameter with the value <code>true</code> runs <code>seq_scaffold</code> with its default settings. As an advanced option, this parameter can also accept a string argument, which is passed to the external <code>seq_scaffold</code> tool as its <code>[options]</code> parameter string. Run with the option string <code>"--help"</code> for help with the settings offered by <code>seq_scaffold</code> and their default values.</p>
<p> Examples:</p>
<pre><code> # Output scaffolded contig sequences using default settings.
FASTA {"scaff":true}
# Get seq_scaffold help
FASTA {"scaff":"--help"}
# Pass seq_scaffold some options
FASTA {"scaff":"--overlap=7 --heal=othercontigs.fna"}</code></pre></li>
<li><p><code>no_merge_scaffs : true</code></p>
<p> Write contig sequences in scaffold order with a scaffold ID in each header, ready to be provided to the <code>seq_scaffold</code> tool (but do not run it).</p>
<p> Example:</p>
<pre><code> # Output ordered contig sequences
FASTA {"no_merge_scaffs":true}</code></pre></li>
<li><p><code>abundance : true</code></p>
<p> Append relative abundance values to the FASTA sequence IDs. This is for use with downstream classification and visualization tools, such as for the 16S rDNA analysis pipeline vignette.</p>
<p> Example:</p>
<pre><code> # Attach abundance values
FASTA {"abundance":true}</code></pre></li>
</ul>
<h3 id="a-namedotdota"><a name="DOT"><code>DOT</code></a></h3>
<p>Write the current graph data to a <a href="http://www.graphviz.org/">graphviz</a> format <a href="http://www.graphviz.org/content/dot-language">DOT</a> file for visualization or other processing. Some of the parameters listed below allow you to change node and edge parameters that will affect the output style of graphs rendered using the <a href="http://www.graphviz.org/pdf/dotguide.pdf"><code>dot</code></a> or <a href="http://www.graphviz.org/pdf/neatoguide.pdf"><code>neato</code></a> layout tools that are part of the graphviz package. The parameters provided here give only very limited control of the most common style choices available. The <a href="http://www.graphviz.org/pdf/gvpr.1.pdf"><code>gvpr</code></a> language supplied by graphviz is much better suited for more detailed manipulation of the myriad of rendering options available.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>file : "filename.dot[.gz]"</code></p>
<p> Specify a file name to write the DOT format graph data to. If the filename contains one or more <code>#</code> characters in a row, these positions are replaced with zero-padded digits that will increment each time a file is written to this filename pattern. If no <code>#</code> characters are present, then this command overwrites any existing file of the same name. (See the <code>DUMP</code> command for an example using this behavior.)</p>
<p> Example:</p>
<pre><code> # Write graph data to the file my_graph.dot
DOT {"file":"my_graph.dot"}
# Write graph to STDOUT (default)
DOT {"file":"-"}</code></pre></li>
<li><p><code>detail : true</code></p>
<p> Draw labelled contig nodes</p>
<p> Example:</p>
<pre><code> # Draw contigs as ovals containing the node names of each contig.
# Note that this disables scaling the node size by contig length.
DOT {"detail":true}</code></pre></li>
<li><p><code>arrowtype : "arrowtype_string"</code></p>
<p> Control the type of arrowheads drawn. See graphviz documentation at: <a href="http://www.graphviz.org/doc/info/attrs.html#k:arrowType">http://www.graphviz.org/doc/info/attrs.html#k:arrowType</a></p>
<p> Example:</p>
<pre><code> # Draw normal arrows (default)
DOT {"arrowtype":"normal"}</code></pre></li>
<li><p><code>pen_scale : <float></code></p>
<p> Control the relative thickness of the arrow lines. See graphviz documentation at: <a href="http://www.graphviz.org/doc/info/attrs.html#d:penwidth">http://www.graphviz.org/doc/info/attrs.html#d:penwidth</a></p>
<p> Example:</p>
<pre><code> # Scale the arrow line thickness by a factor of 0.1
DOT {"pen_scale":0.1}</code></pre></li>
<li><p><code>const_edge : true</code></p>
<p> Draw connection arrow lines at constant width</p>
<p> Example:</p>
<pre><code> # Draw edge arrows of constant width regardless of bitscore.
DOT {"const_edge":true}</code></pre></li>
<li><p><code>colored_edges : true</code></p>
<p> Draw edges colored by the GC% of the connected contigs. Note that this overrides any other edge coloring that would otherwise have been applied.</p>
<p> Example:</p>
<pre><code> # Draw colored arrows
DOT {"colored_edges":true}</code></pre></li>
<li><p><code>problems : true</code></p>
<p> Draw nodes with potential assembly problems as red double circles with the node label text inside. Note that this overrides the normal node text that would otherwise have been rendered.</p>
<p> Example:</p>
<pre><code> # Highlight problem nodes
DOT {"problems":true}</code></pre></li>
</ul>
<h3 id="a-namemstmsta"><a name="MST"><code>MST</code></a></h3>
<p>Calculate the <strong>M</strong>aximal <strong>S</strong>panning <strong>T</strong>ree of all connected components. This command reduces the connection graph to a tree structure that maximizes the total remaining edge bitscore, preserving all previously connected components of the graph, and directs all remaining edges of the graph based on the pairing information.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>bits : true</code></p>
<p> Use raw connection bitscores instead of the default heuristically adjusted scores (based on GC% / coverage differences)</p>
<p> Example:</p>
<pre><code> # Use raw connection bitscores from mate-pairing
MST {"bits":true}</code></pre></li>
</ul>
<h3 id="a-namesstssta"><a name="SST"><code>SST</code></a></h3>
<p>As an alternative to the <code>MST</code> command, calculate the <strong>S</strong>caffold <strong>S</strong>panning <strong>T</strong>ree of all connected components, producing a sub-optimal tree from the perspective of maximizing edge bitscores, but with important properties for successful contig scaffolding in the following steps. This command is generally preferable to the MST command when median contig length is less than the mean distance between paired reads. That is, when a relatively large mate-pair insert size was selected for the sequencing library, relative to the size of the contigs that could ultimately be assembled. This condition is often the case for metagenomic samples, where inter-strain variability adversely affects the median contig length that can be achieved.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>bits : true</code></p>
<p> Use raw connection bitscores instead of the default heuristically adjusted scores (based on GC% / coverage differences)</p>
<p> Example:</p>
<pre><code> # Use raw connection bitscores from mate-pairing
SST {"bits":true}</code></pre></li>
</ul>
<h3 id="a-namepluckplucka"><a name="PLUCK"><code>PLUCK</code></a></h3>
<p>Remove all leaf contig nodes (in- or out-degree == 0) from the (tree-structured) graph. Two (or sometimes three) iterations are usually sufficient to achieve the goal of removing all non-terminal short branches from the tree. These contigs will be added back to the assembly in later operations once the "backbone" scaffolds have been determined. NOTE: <code>PLUCK</code> does not remove two "leaf" contig nodes that form a connected pair; that
is, multiple iterations of <code>PLUCK</code> will not completely remove contigs that happen to form short chains. Also,
<code>PLUCK</code> preserves entirely unconnected contigs with more than 20000 bases of sequence by default. See the <code>min_len</code> parameter below.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>iterate : <int></code></p>
<p> Specify the number of iterations of PLUCK to run. Each is equivalent to running PLUCK again as a separate command. By default <code><int></code>= 3.</p>
<p> Examples:</p>
<pre><code> # Default is {"iterate":3}
PLUCK
# This is the equivalent of the above command
PLUCK {"iterate":1}
PLUCK {"iterate":1}
PLUCK {"iterate":1}</code></pre></li>
<li><p><code>min_len : <int></code></p>
<p> Minimum length of isolated nodes to keep. Default <code><int></code>= 20000 bases.</p>
<p> Example:</p>
<pre><code> # Do not "pluck away" an unconnected contig containing 20000 or more bases of sequence.
PLUCK {"min_len":20000}</code></pre></li>
</ul>
<h3 id="a-namepruneprunea"><a name="PRUNE"><code>PRUNE</code></a></h3>
<p>Split the assembly graph at all contig nodes with in or out degree > 1. This command prunes remaining branches from the assembly tree by selectively removing the lowest scoring extra in- or out- edges from a node. After this operation is complete, all remaining connected components will be linear and properly directed.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>strict : true</code></p>
<p> Always cut all but the strongest link when there are more than two in + out links</p>
<p> Examples:</p>
<pre><code> # The default case: attempt to determine which branch(es) to remove in order to
# preserve one high weight path across a multiply linked contig node.
PRUNE {"strict":false}
# Always strictly prune. Equivalent to {"ratio":0.0} below
PRUNE {"strict":true}</code></pre></li>
<li><p><code>ratio : <float></code></p>
<p> Threshold ratio of the next-strongest to the strongest link score above which strict pruning is triggered</p>
<p> Examples:</p>
<pre><code> # Always strictly prune.
PRUNE {"ratio":0.0}
# Never strictly prune.
PRUNE {"ratio":1.0}
# Default. Don't strictly prune when the score of the highest scoring edge
# is at least 5x greater than that of the next highest scoring link of the
# same direction (for both in- and out-links).
PRUNE {"ratio":0.2}</code></pre></li>
<li><p><code>verbose : true</code></p>
<p> Output diagnostics on STDERR</p>
<p> Example:</p>
<pre><code> # Generate extra output information
PRUNE {"verbose":true}</code></pre></li>
</ul>
<h3 id="a-namesliceslicea"><a name="SLICE"><code>SLICE</code></a></h3>
<p>Break linear scaffolds at a %GC / Coverage discontinuity. Each run of <code>SLICE</code> will break a given scaffold at no more than one position. <code>SLICE</code> should be run multiple times if multiple misassemblies per scaffold are suspected.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>thresh : <float></code></p>
<p> Threshold used to determine whether or not to break a connection</p>
<p> Example:</p>
<pre><code> # Default. Medium strength heuristic score based on GC% / Coverage statistics of a
# scaffold on either side of a given contig's connection edge. The lower the
# threshold, the more sensitive the filter is to such discontinuities.
SLICE {"thresh":0.5}</code></pre></li>
</ul>
<h3 id="a-namepushpusha"><a name="PUSH"><code>PUSH</code></a></h3>
<p>Extend scaffold ends, reversing the (undesired) action of <code>PLUCK</code> at scaffold ends. Because the <code>PLUCK</code> command can't distinguish short branches from what will ultimately become scaffold ends, it essentially erodes a contig off of each end with every iteration. To reverse this, <code>PUSH</code> should be run for the same number of iterations as <code>PLUCK</code> was previously.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>iterate : <int></code></p>
<p> Number of iterations of PUSH to run. Each is equivalent to running <code>PUSH</code> again as a separate command. By default <code><int></code>= 3.</p>
<p> Examples:</p>
<pre><code> # Default case.
PUSH {"iterate":3}
# Equivalent to above.
PUSH {"iterate":2}
PUSH {"iterate":1}</code></pre></li>
</ul>
<h3 id="a-nameinsertinserta"><a name="INSERT"><code>INSERT</code></a></h3>
<p>Reconnect contigs with ambiguous placement within a scaffold using relative mean read-pair mapping position information to resolve ambiguities. By default this will run on all scaffolds. <code>INSERT</code> works by fully connecting contigs within the scaffold when their relative position can be unambiguously determined, and adding new "constraint" edges (supported only by the relative mapped position information) between such contigs so that they can be ordered correctly by the <code>SCAFF</code> command.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>ccname : "contig_name"</code></p>
<p> Use only the connected component (scaffold) containing the named contig.</p>
<p> Example:</p>
<pre><code> # Insert removed contigs within the connected component containing this contig
INSERT {"ccname":"NODE_1234"}</code></pre></li>
<li><p><code>ccnames : ["contig_name1","contig_name2",...]</code></p>
<p> Insert removed contigs within only the connected components (scaffolds) containing these contigs.</p>
<p> Example:</p>
<pre><code> INSERT {"ccnames":["NODE_1234","NODE_567"]}</code></pre></li>
<li><p><code>thresh : <float></code></p>
<p> Minimum contig connection bitscore to consider</p>
<p> Example:</p>
<pre><code> # Default. Only connections scoring at least 250.0 bits
# will count in the calculations
INSERT {"thresh":250.0}</code></pre></li>
<li><p><code>min_pos_diff : <int></code></p>
<p> Minimum mapping position difference (within a neighboring contig) considered reliable enough to use for resolving positional ambiguities.</p>
<p> Example:</p>
<pre><code> # Default. Only pairing information yielding a relative position difference of
# at least 75 nucleotides (between candidate contigs and a neighboring existing
# contig) will be considered significant enough to use in determining the
# relative ordering of the candidate contigs
INSERT {"min_pos_diff":75}</code></pre></li>
<li><p><code>dup_kmer : <int></code></p>
<p> Length of k-mers to use for detecting variant duplicate contigs</p>
<p> Examples:</p>
<pre><code> # Default. Use 14-mers to detect duplicate variant contigs before inserting
# them between scaffold backbone contigs.
INSERT {"dup_kmer":14}
# Disable duplicate variant contig checks.
INSERT {"dup_kmer":0}</code></pre></li>
<li><p><code>dup_thresh : <float></code></p>
<p> Fraction of k-mers from a contig with hits to the scaffold backbone, above which the contig will not be inserted.</p>
<p> Example:</p>
<pre><code> # Default. If more than 15% of kmers for a contig hit kmers from the scaffold
# backbone contigs, then do not insert that contig.
INSERT {"dup_thresh":0.15}</code></pre></li>
<li><p><code>external : true</code></p>
<p> Controls whether contigs may be inserted outside of an input scaffold.</p>
<p> Examples:</p>
<pre><code> # Default.
# Contigs will not be added outside of the first and last contigs in the input
# scaffold; that is, the only inserts performed will be between existing connected
# contigs.
INSERT {"external":false}
# Contigs may be added to the scaffold before the first, or after the last, contig
# in an input scaffold.
INSERT {"external":true}</code></pre></li>
<li><p><code>verbose : true</code></p>
<p> Output diagnostics on STDERR</p>
<p> Example:</p>
<pre><code> # Generate extra output information
INSERT {"verbose":true}</code></pre></li>
</ul>
<h3 id="a-namescaffscaffa"><a name="SCAFF"><code>SCAFF</code></a></h3>
<p>Lay out a fully linear scaffold from contigs in unambiguously ordered connected components (as produced by the <code>INSERT</code> command)</p>
<p><strong>Parameters: NONE.</strong></p>
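<p>As an illustrative sketch (not a prescription), the commands described above might be combined into a scaffolding sequence like the following; the iteration counts are examples and should be chosen to suit your data:</p>
<pre><code> # Build a directed spanning tree from the connection graph
SST
# Remove short branches from the tree
PLUCK {"iterate":3}
# Cut remaining branch points, leaving linear connected components
PRUNE
# Break scaffolds at GC% / coverage discontinuities
SLICE
# Restore contigs eroded from scaffold ends (same iteration count as PLUCK)
PUSH {"iterate":3}
# Re-place ambiguously positioned contigs, then lay out final linear scaffolds
INSERT
SCAFF</code></pre>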
<h3 id="a-nameclustclusta"><a name="CLUST"><code>CLUST</code></a></h3>
<p>Cluster scaffolds using the external <code>tetracalc</code> tool</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>options : <string></code></p>
<p> Command-line options for the <code>tetracalc</code> tool; if omitted, the tool's defaults are used. Run with the option string <code>"--help"</code> for help with the settings offered by <code>tetracalc</code>.</p>
<p> Examples:</p>
<pre><code> CLUST {"options":"--merge_tar=0.95 -m 7500"}
CLUST {"options":"--help"}
CLUST {"options":"--fixed -t 0.9 -s 0.8 -r 0.9"}</code></pre></li>
</ul>
<h3 id="a-namecirclecirclea"><a name="CIRCLE"><code>CIRCLE</code></a></h3>
<p>Find circular sequences among scaffolds</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>circular : true</code></p>
<p> Return circularized scaffolds.</p>
<p> Default (<code>false</code>): Produce linear scaffolds placing the ends within the largest
contig of each connected component. The node with the longest sequence will be
divided equally into two and the mate-pair connections also divided, such that
they form two clean ends for the new scaffold.</p>
<p> Example:</p>
<pre><code> # Produce circular scaffolds.
CIRCLE {"circular":true}</code></pre></li>
<li><p><code>thresh : <float></code></p>
<p> Minimum bitscore of end-connecting pairing information</p>
<p>Example:</p>
<pre><code> # Default. Circle-completing end connections must have a bitscore of at least
# 0.0 to qualify.
CIRCLE {"thresh":0.0}</code></pre></li>
</ul>
<h3 id="a-nameccompsccompsa"><a name="CCOMPS"><code>CCOMPS</code></a></h3>
<p>Calculate the current assembly graph connected components. Note: this is a low-level operation that is typically performed automatically by other commands as required.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li><p><code>sortby : <string></code></p>
<p> Specify how to sort the resulting list of CCs. Must be: "nodes" or "sequences".</p>
<p> Examples:</p>
<pre><code> # Default. Sort CCs in descending order of number of nodes.
CCOMPS {"sortby":"nodes"}
# Sort CCs in descending order of amount of sequence.
CCOMPS {"sortby":"sequences"}</code></pre></li>
</ul>
<h3 id="a-nameselccselcca"><a name="SELCC"><code>SELCC</code></a></h3>
<p>Select specific connected components for further processing. "Unselected" connected components are saved in a "removed" data structure and may be recalled with other commands.</p>
<p>NOTE: There are two types of parameters below. "Selection parameters" allow a subset of connected components to be selected by contig names or component number. "Filter parameters" are applied to the set of "selected" connected components (or, by default, to the set of all connected components). These filters may also be used in combination, resulting in a logical "AND" relationship (i.e. connected components not satisfying all of the filters are removed).</p>
<p><strong>Selection Parameters:</strong></p>
<ul>
<li><p><code>ccname : "contig_name"</code></p>
<p> Select the connected component containing the named contig</p>
<p> Example:</p>
<pre><code> # Select the connected component containing the contig named NODE_1234
SELCC {"ccname":"NODE_1234"}</code></pre></li>
<li><p><code>ccnames : ["contig_name1","contig_name2",...]</code></p>
<p> Select the connected component(s) containing the named contigs</p>
<p> Example:</p>
<pre><code> # Select the connected components containing the contigs named NODE_1234 and NODE_5678
SELCC {"ccnames":["NODE_1234","NODE_5678"]}</code></pre></li>
<li><p><code>ccnum : <int></code></p>
<p> Select connected component number <code><int></code></p>
<p> Example:</p>
<pre><code> # Default. Select connected component number 0 (the first CC).
SELCC {"ccnum":0}</code></pre></li>
<li><p><code>ccnums : [<int>,<int>,...]</code></p>
<p> Select the connected components from the list of numbers</p>
<p> Example:</p>
<pre><code> # Select the second and third connected components (numbering is zero based)
SELCC {"ccnums":[1,2]}</code></pre></li>
<li><p><code>ccrange : [<int1>,<int2>]</code></p>
<p> Select the connected components numbered in the range <code><int1>..<int2></code> (inclusive). <code><int></code> may be negative, indicating positions at the end of the list of connected components.</p>
<p> Examples:</p>
<pre><code> # Select the first 6 connected components
SELCC {"ccrange":[0,5]}
# Select the last 5 connected components
SELCC {"ccrange":[-5,-1]}</code></pre></li>
<li><p><code>shift : true</code></p>
<p> Select all connected components except the first one. This is like <code>SELCC {"ccrange":[1,-1]}</code> except it doesn't generate an error when there is only one remaining connected component, allowing processing to potentially continue in any calling <code>SCRIPT</code> commands.</p>
<p> Example:</p>
<pre><code> # Drop the first connected component.
SELCC {"shift":true}</code></pre></li>
</ul>
<p><strong>Filter Parameters:</strong></p>
<ul>
<li><p><code>min_nodes : <int></code></p>
<p> Select connected components with <code><int></code> or more nodes.</p>
<p> Example:</p>
<pre><code> # Select connected components with 2 or more nodes.
SELCC {"min_nodes":2}</code></pre></li>
<li><p><code>min_seqlen : <int></code></p>
<p> Select connected components with <code><int></code> or more bases of sequence within their nodes.</p>
<p> Example:</p>
<pre><code> # Select connected components containing at least 1000 bases of sequence.
SELCC {"min_seqlen":1000}</code></pre></li>
<li><p><code>sequence : <string></code></p>
<p> Select connected components containing the provided DNA sequence. NOTE: this is not a BLAST-style search; the sequence must match exactly. Any differences, including ambiguity codes, will prevent matching. The only additional handling is that the reverse complement of the provided sequence is also searched.</p>
<p> Example:</p>
<pre><code> # Select connected components containing the provided sequence
# (or its reverse complement).
SELCC {"sequence":"AGACTAGCAGATATACGATAACGATACGATACGAT"}</code></pre></li>
<li><p><code>sequences : [<string>,...]</code></p>
<p> Like <code>sequence</code> parameter above, but using a list of sequences</p>
<p> Example:</p>
<pre><code> # Select connected components containing the provided sequences
# (or their reverse complements).
SELCC {"sequences":["AGACTAGCAGATATACG","ATAACGATACGATACGAT"]}</code></pre></li>
<li><p><code>soft : true</code></p>
<p> Determines whether unselected ccomp nodes are deleted or kept as "removed"</p>
<p> Examples:</p>
<pre><code> # The contigs of the first connected component will remain selected; all
# others are retained as unselected.
SELCC {"soft":true}
# The contigs of the first connected component will remain selected;
# neighboring contigs (one hop away) are unselected and added to the
# "removed" pool. All others are permanently deleted.
SELCC {"soft":false} # Default</code></pre></li>
</ul>
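<p>Because filter parameters combine with a logical "AND", several filters may be supplied in a single command. For example (a sketch combining the filters documented above):</p>
<pre><code> # Select only connected components with 2 or more nodes AND
# at least 1000 bases of sequence.
SELCC {"min_nodes":2,"min_seqlen":1000}</code></pre>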
<h3 id="a-nameselclustselclusta"><a name="SELCLUST"><code>SELCLUST</code></a></h3>