Kvg hierarchical models (#173)

* Refactored models into a hierarchical format
* Updated `inspect` to have a BLAST-like display. Now includes option to ingest a second bam file to visualize corrected CBC and UMI.
* Sift now verifies that the entirety of the cDNA structure is present.
* Added `SC` tag, containing segment cigar strings.
* Added `umi_correct` tool. (#172)
* Added `sift` documentation
* Changed default # threads for `correct` to 1 because of memory use.
* Refactoring into a hierarchical HMM constitutes a #minor version bump.
* Removed `scsplit`.
* Updated `annotate` to only read in one model from a bam file.
* Further updates for PR review.
* Replaced mas10 test data.
* Regenerated test data.
* Now only supports one model per bam file.

Co-authored-by: Jonn Smith <[email protected]>
Co-authored-by: James Webber <[email protected]>
Co-authored-by: BumpVersion Action <bumpversion@github-actions>
Co-authored-by: Jonn Smith <[email protected]>
broadinstitute · Nov 5, 2022 · 09e7dd2
1 parent b5cc346

Showing 113 changed files with 4,398 additions and 7,816 deletions.
diff --git a/README.rst b/README.rst
@@ -51,17 +51,17 @@ The commands below illustrate the Longbow workflow on a small library of SIRVs (
     wget https://github.com/broadinstitute/longbow/raw/main/tests/test_data/resources/SIRV_Library.fasta
 
     # Basic processing workflow
-    longbow annotate -m mas15v2 mas15_test_input.bam | \  # Annotate reads according to the mas15v2 model
-      tee ann.bam | \                                     # Save annotated BAM for later
-      longbow filter | \                                  # Filter out improperly-constructed arrays
-      longbow segment | \                                 # Segment reads according to the model
-      longbow extract -o filter_passed.bam                # Extract adapter-free cDNA sequences
+    longbow annotate -m mas_15+sc_10x5p mas15_test_input.bam | \  # Annotate reads according to the mas_15+sc_10x5p model
+      tee ann.bam | \                                             # Save annotated BAM for later
+      longbow filter | \                                          # Filter out improperly-constructed arrays
+      longbow segment | \                                         # Segment reads according to the model
+      longbow extract -o filter_passed.bam                        # Extract adapter-free cDNA sequences
 
     # Align reads with long read aligner (e.g. minimap2, pbmm2)
     samtools fastq filter_passed.bam | \
-        minimap2 -ayYL --MD -x splice:hq SIRV_Library.fasta - | \
-        samtools sort > align.bam &&
-        samtools index align.bam
+      minimap2 -ayYL --MD -x splice:hq SIRV_Library.fasta - | \
+      samtools sort > align.bam &&
+      samtools index align.bam
 
 
 Getting help

diff --git a/docs/commands.md b/docs/commands.md
@@ -23,19 +23,21 @@ Commands:
   annotate     Annotate reads in a BAM file with segments from the model.
   convert      Convert reads from fastq{,.gz} files for use with `annotate`.
   correct      Correct tag to values provided in barcode allowlist.
+  correct_umi  Correct UMIs with Set Cover algorithm.
   demultiplex  Separate reads into files based on which model they fit best.
   extract      Extract coding segments from the reads in the given bam.
   filter       Filter reads by conformation to expected segment order.
   inspect      Inspect the classification results on specified reads.
-  model        Get information about built-in Longbow models.
+  models       Get information about built-in Longbow models.
   pad          Pad tag by specified number of adjacent bases from the read.
   peek         Guess the best pre-built array model to use for annotation.
-  scsplit      Create files for use in `alevin` for single-cell analysis.
   segment      Segment pre-annotated reads from an input BAM file.
+  sift         Filter segmented reads by conformation to expected cDNA.
   stats        Calculate and produce stats on the given input bam file.
   tagfix       Update longbow read tags after alignment.
   train        Train transition and emission probabilities on real data.
   version      Print the version of longbow.
+
 ```
 
 ## Help for individual commands

diff --git a/docs/commands/correct_umi.md b/docs/commands/correct_umi.md
@@ -0,0 +1,55 @@
+---
+layout: default
+title: correct_umi 
+description: "Correct UMIs with Set Cover algorithm."
+nav_order: 4
+parent: Commands
+---
+
+# Correct_UMI
+
+## Description
+
+Correct UMIs with Set Cover algorithm.
+
+Corrects all UMIs in the given bam file.  
+
+Algorithm originally developed by Victoria Popic.
+
+### Data Requirements:
+
+- Bam file should be aligned and annotated with genes and transcript equivalence classes prior to running.
+- It is critical that you give the proper input for the `--pre-extracted` flag.
+  - If the file has been run through `longbow extract`, use `--pre-extracted`
+  - If the file has _NOT_ been run through `longbow extract` _DO NOT USE_ `--pre-extracted`
+
+The following tags are required in the input file:
+
+- `CB` (Cell Barcode)
+- `JX` (Adjusted UMI)
+- `eq` (Equivalence class assignment)
+- `XG` (Gene assignment)
+- `rq` (Read Quality: [-1.0, 1.0])
+- `JB` (Back / UMI trailing segment Smith-Waterman alignment score)
+
+## Command help
+
+```shell
+$ longbow correct_umi --help
+Usage: longbow correct_umi [OPTIONS] INPUT_BAM
+
+  Correct UMIs with Set Cover algorithm.
+
+Options:
+  -v, --verbosity LVL       Either CRITICAL, ERROR, WARNING, INFO or DEBUG
+  -l, --umi-length INTEGER  Length of the UMI for this sample.  [default: 10]
+  -o, --output-bam PATH     Corrected UMI bam output [default: stdout].
+  -x, --reject-bam PATH     Filtered bam output (failing reads only).
+  -f, --force               Force overwrite of the output files if they exist.
+                            [default: False]
+  --pre-extracted           Whether the input file has been processed with
+                            `longbow extract`  [default: False]
+  --help                    Show this message and exit.
+```
+
+
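
Reviewer note: the new `correct_umi` page names a Set Cover algorithm but does not describe it. The sketch below is a generic greedy set cover over observed UMIs, included only to illustrate the idea. It is not Longbow's implementation: the function names `hamming` and `greedy_set_cover_umis`, the `max_dist` parameter, and the use of Hamming distance as the coverage rule are all assumptions made for this example, and it ignores the gene, equivalence-class, and quality tags that the documented tool requires.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_set_cover_umis(umis, max_dist=1):
    """Map each observed UMI to a representative chosen by greedy set cover.

    Hypothetical helper for illustration; not part of Longbow.
    """
    counts = Counter(umis)
    # Each candidate "covers" every observed UMI within max_dist mismatches
    # (assumed coverage rule; a candidate always covers itself).
    covers = {
        c: {u for u in counts if len(u) == len(c) and hamming(c, u) <= max_dist}
        for c in counts
    }
    uncovered = set(counts)
    assignment = {}
    while uncovered:
        # Greedy step: pick the candidate covering the most still-uncovered
        # reads; break ties by the candidate's own read count, then
        # alphabetically (sorted() fixes the iteration order for max()).
        best = max(
            sorted(covers),
            key=lambda c: (sum(counts[u] for u in covers[c] & uncovered), counts[c]),
        )
        newly_covered = covers[best] & uncovered
        for u in newly_covered:
            assignment[u] = best
        uncovered -= newly_covered
    return assignment
```

Calling `greedy_set_cover_umis` on a list of UMI strings from one cell and gene returns a dict from each observed UMI to its chosen representative. Greedy selection is an approximation to the minimum set cover, so the result is not guaranteed to use the fewest possible representatives.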
diff --git a/docs/commands/demultiplex.md b/docs/commands/demultiplex.md
@@ -2,7 +2,7 @@
 layout: default
 title: demultiplex
 description: "Demultiplex reads from different array structures."
-nav_order: 4
+nav_order: 5
 parent: Commands
 ---
 

diff --git a/docs/commands/extract.md b/docs/commands/extract.md
@@ -2,7 +2,7 @@
 layout: default
 title: extract
 description: "Extract reads."
-nav_order: 5
+nav_order: 6
 parent: Commands
 ---
 

diff --git a/docs/commands/filter.md b/docs/commands/filter.md
@@ -2,7 +2,7 @@
 layout: default
 title: filter
 description: "Filter reads."
-nav_order: 6
+nav_order: 7
 parent: Commands
 ---
 

diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
@@ -2,7 +2,7 @@
 layout: default
 title: inspect 
 description: "Inspect reads."
-nav_order: 7
+nav_order: 8
 parent: Commands
 ---
 

diff --git a/docs/commands/model.md b/docs/commands/model.md
diff --git a/docs/commands/models.md b/docs/commands/models.md
@@ -0,0 +1,85 @@
+---
+layout: default
+title: models
+description: "Get info on built-in models."
+nav_order: 9
+parent: Commands
+---
+
+# Models
+
+## Description
+
+Get information about built-in Longbow models.
+
+Lists all built-in models with their versions and descriptions, or dumps the details of a single model to a set of files describing that model.
+
+## Command help
+
+```shell
+$ longbow models --help
+Usage: longbow models [OPTIONS]
+
+  Get information about built-in Longbow models.
+
+Options:
+  -v, --verbosity LVL  Either CRITICAL, ERROR, WARNING, INFO or DEBUG
+  -l, --list-models    List the names of all models supported natively by this
+                       version of Longbow. NOTE: This argument is mutually
+                       exclusive with  arguments: [dump].
+  -d, --dump TEXT      Dump the details of a given model.  This command
+                       creates a set of files corresponding to the given model
+                       including: json representation, dot file
+                       representations, transmission matrix, state emission
+                       json file. NOTE: This argument is mutually exclusive
+                       with  arguments: [list_models].
+  --help               Show this message and exit.
+
+```
+
+## Examples
+
+```shell
+$ longbow models --list-models
+Longbow includes the following models:
+
+Array models
+============
+Name      Version  Description
+mas_15    3.0.0    15-element MAS-ISO-seq array
+mas_10    3.0.0    10-element MAS-ISO-seq array
+isoseq    3.0.0    PacBio IsoSeq model
+
+cDNA models
+===========
+Name                Version  Description
+sc_10x3p            3.0.0    single-cell 10x 3' kit
+sc_10x5p            3.0.0    single-cell 10x 5' kit
+bulk_10x5p          3.0.0    bulk 10x 5' kit
+bulk_teloprimeV2    3.0.0    Lexogen TeloPrime V2 kit
+spatial_slideseq    3.0.0    Slide-seq protocol
+
+Specify a fully combined model via '<array model>+<cDNA model>' syntax, e.g. 'mas_15+sc_10x5p'.
+
+```
+
+```shell
+$ longbow models --dump mas_15+sc_10x5p
+[INFO 2022-11-02 20:33:28   models] Generating model: mas_15+sc_10x5p
+[INFO 2022-11-02 20:33:29   models] Dumping mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
+[INFO 2022-11-02 20:33:29   models] Dumping dotfile: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dot
+[INFO 2022-11-02 20:33:29   models] Dumping simple dotfile: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.simple.dot
+[INFO 2022-11-02 20:33:29   models] Dumping json model specification: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.spec.json
+[INFO 2022-11-02 20:33:30   models] Dumping dense transition matrix: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dense_transition_matrix.pickle
+[INFO 2022-11-02 20:33:30   models] Dumping emission distributions: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.emission_distributions.txt
+[INFO 2022-11-02 20:33:30   models] Creating model graph from 1109 states...
+[INFO 2022-11-02 20:33:47   models] Rendering model graph now...
+[INFO 2022-11-02 20:33:57   models] Writing model graph now to longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.graph.png ...
+
+$ ls 
+longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dense_transition_matrix.pickle  longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.graph.png
+longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dot                             longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.simple.dot
+longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.emission_distributions.txt      longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.spec.json
+```
+
+
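
Reviewer note: since model names now use the `<array model>+<cDNA model>` syntax, a small sketch of how such a name could be split and checked may help readers migrating from names like `mas15v2`. This is an illustration only, not Longbow's parser: `split_model_name`, `ARRAY_MODELS`, and `CDNA_MODELS` are hypothetical names, and the two sets are copied from the `--list-models` output shown in this diff.

```python
# Model names copied from the `longbow models --list-models` output above.
ARRAY_MODELS = {"mas_15", "mas_10", "isoseq"}
CDNA_MODELS = {
    "sc_10x3p",
    "sc_10x5p",
    "bulk_10x5p",
    "bulk_teloprimeV2",
    "spatial_slideseq",
}


def split_model_name(name: str) -> tuple[str, str]:
    """Split '<array model>+<cDNA model>' into its two parts.

    Hypothetical helper for illustration; not part of Longbow.
    Raises ValueError if the syntax or either component is unrecognized.
    """
    array_name, sep, cdna_name = name.partition("+")
    # Require exactly one '+' separating two non-empty parts.
    if not sep or not array_name or not cdna_name or "+" in cdna_name:
        raise ValueError(f"expected '<array model>+<cDNA model>', got {name!r}")
    if array_name not in ARRAY_MODELS:
        raise ValueError(f"unknown array model: {array_name!r}")
    if cdna_name not in CDNA_MODELS:
        raise ValueError(f"unknown cDNA model: {cdna_name!r}")
    return array_name, cdna_name
```

For example, `split_model_name("mas_15+sc_10x5p")` yields the array and cDNA components separately, while an old-style name such as `mas15v2` is rejected because it has no `+` separator.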
diff --git a/docs/commands/pad.md b/docs/commands/pad.md
@@ -2,7 +2,7 @@
 layout: default
 title: pad
 description: "Pad tag by specified number of adjacent bases from the read."
-nav_order: 9
+nav_order: 10
 parent: Commands
 ---
 

diff --git a/docs/commands/peek.md b/docs/commands/peek.md
@@ -2,7 +2,7 @@
 layout: default
 title: pad
 description: "Guess the best pre-built array model to use for annotation."
-nav_order: 10
+nav_order: 11
 parent: Commands
 ---