Kvg hierarchical models (#173)
* Refactored models into a hierarchical format

* Updated `inspect` to have a BLAST-like display.  Now includes option to ingest a second bam file to visualize corrected CBC and UMI.

* Sift now verifies that the entirety of the cDNA structure is present.

* Added `SC` tag, containing segment cigar strings.

* Added `umi_correct` tool. (#172)

* Added `sift` documentation

* Changed default # threads for `correct` to 1 because of memory use.

* Refactoring into a hierarchical HMM constitutes a #minor version bump.

* Removed `scsplit`.
* Updated `annotate` to only read in one model from a bam file.

* Further updates for PR review.

* Replaced mas10 test data.
* Regenerated test data.
* Now only supports one model per bam file.

Co-authored-by: Jonn Smith <[email protected]>
Co-authored-by: James Webber <[email protected]>
Co-authored-by: BumpVersion Action <bumpversion@github-actions>
5 people authored Nov 5, 2022
1 parent b5cc346 commit 09e7dd2
Showing 113 changed files with 4,398 additions and 7,816 deletions.
16 changes: 8 additions & 8 deletions README.rst
@@ -51,17 +51,17 @@ The commands below illustrate the Longbow workflow on a small library of SIRVs (
wget https://github.com/broadinstitute/longbow/raw/main/tests/test_data/resources/SIRV_Library.fasta

# Basic processing workflow
-longbow annotate -m mas15v2 mas15_test_input.bam | \ # Annotate reads according to the mas15v2 model
-tee ann.bam | \ # Save annotated BAM for later
-longbow filter | \ # Filter out improperly-constructed arrays
-longbow segment | \ # Segment reads according to the model
-longbow extract -o filter_passed.bam # Extract adapter-free cDNA sequences
+longbow annotate -m mas_15+sc_10x5p mas15_test_input.bam | \ # Annotate reads according to the mas_15+sc_10x5p model
+tee ann.bam | \ # Save annotated BAM for later
+longbow filter | \ # Filter out improperly-constructed arrays
+longbow segment | \ # Segment reads according to the model
+longbow extract -o filter_passed.bam # Extract adapter-free cDNA sequences

# Align reads with long read aligner (e.g. minimap2, pbmm2)
samtools fastq filter_passed.bam | \
-minimap2 -ayYL --MD -x splice:hq SIRV_Library.fasta - | \
-samtools sort > align.bam &&
-samtools index align.bam
+minimap2 -ayYL --MD -x splice:hq SIRV_Library.fasta - | \
+samtools sort > align.bam &&
+samtools index align.bam


Getting help
6 changes: 4 additions & 2 deletions docs/commands.md
@@ -23,19 +23,21 @@ Commands:
annotate Annotate reads in a BAM file with segments from the model.
convert Convert reads from fastq{,.gz} files for use with `annotate`.
correct Correct tag to values provided in barcode allowlist.
correct_umi Correct UMIs with Set Cover algorithm.
demultiplex Separate reads into files based on which model they fit best.
extract Extract coding segments from the reads in the given bam.
filter Filter reads by conformation to expected segment order.
inspect Inspect the classification results on specified reads.
model Get information about built-in Longbow models.
models Get information about built-in Longbow models.
pad Pad tag by specified number of adjacent bases from the read.
peek Guess the best pre-built array model to use for annotation.
scsplit Create files for use in `alevin` for single-cell analysis.
segment Segment pre-annotated reads from an input BAM file.
sift Filter segmented reads by conformation to expected cDNA.
stats Calculate and produce stats on the given input bam file.
tagfix Update longbow read tags after alignment.
train Train transition and emission probabilities on real data.
version Print the version of longbow.

```

## Help for individual commands
55 changes: 55 additions & 0 deletions docs/commands/correct_umi.md
@@ -0,0 +1,55 @@
---
layout: default
title: correct_umi
description: "Correct UMIs with Set Cover algorithm."
nav_order: 4
parent: Commands
---

# Correct_UMI

## Description

Correct UMIs with Set Cover algorithm.

Corrects all UMIs in the given bam file.

Algorithm originally developed by Victoria Popic.
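
For readers unfamiliar with the term, the following is a minimal sketch of a generic greedy set-cover heuristic applied to a toy UMI list. It is not Longbow's implementation: the function names, the use of Hamming distance, and the distance threshold of 1 are all assumptions made purely for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_set_cover(universe, candidates):
    """Greedily pick candidate sets until every item in `universe` is covered.

    `candidates` maps a name to the set of items it covers. This is the
    textbook greedy heuristic, shown only to illustrate the idea of set cover.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Take the candidate covering the most uncovered items;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("candidates cannot cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy data (hypothetical): each observed UMI "covers" all UMIs within one mismatch.
umis = ["AAAA", "AAAT", "AATA", "CCCC", "CCCG"]
candidates = {u: {v for v in umis if hamming(u, v) <= 1} for u in umis}
centers = greedy_set_cover(umis, candidates)
print(centers)
```

In this toy setting each chosen "center" stands in for the UMIs it covers, which is the intuition behind collapsing near-identical UMIs.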

### Data Requirements:

- Bam file should be aligned and annotated with genes and transcript equivalence classes prior to running.
- It is critical that you give the proper input for the `--pre-extracted` flag.
  - If the file has been run through `longbow extract`, use `--pre-extracted`.
  - If the file has _NOT_ been run through `longbow extract`, _DO NOT USE_ `--pre-extracted`.

The following tags are required in the input file:

- `CB` (Cell Barcode)
- `JX` (Adjusted UMI)
- `eq` (Equivalence class assignment)
- `XG` (Gene assignment)
- `rq` (Read Quality: [-1.0, 1.0])
- `JB` (Back / UMI trailing segment Smith-Waterman alignment score)
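
The tag list above can be turned into a quick pre-flight check. The sketch below works on a plain dictionary of a read's tags, so it runs without any BAM library; the function names are hypothetical and the `rq` range check simply mirrors the `[-1.0, 1.0]` range stated above.

```python
# Required tags, copied from the list above.
REQUIRED_TAGS = ("CB", "JX", "eq", "XG", "rq", "JB")


def missing_tags(tags):
    """Return the required tags absent from a read's tag mapping, in documented order."""
    return [tag for tag in REQUIRED_TAGS if tag not in tags]


def check_read(tags):
    """Return a list of human-readable problems for one read (empty if none)."""
    problems = [f"missing tag {tag}" for tag in missing_tags(tags)]
    rq = tags.get("rq")
    # The documented range for read quality is [-1.0, 1.0].
    if rq is not None and not -1.0 <= rq <= 1.0:
        problems.append(f"rq {rq} outside [-1.0, 1.0]")
    return problems


# Hypothetical tag values for illustration only.
good_read = {"CB": "ACGTACGTACGTACGT", "JX": "AACCGGTTAA", "eq": 7, "XG": "GENE1", "rq": 0.99, "JB": 20}
print(check_read(good_read))
```

If reads are loaded with pysam, `dict(read.get_tags())` should produce a suitable mapping for each `AlignedSegment`.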

## Command help

```shell
$ longbow correct_umi --help
Usage: longbow correct_umi [OPTIONS] INPUT_BAM

Correct UMIs with Set Cover algorithm.

Options:
-v, --verbosity LVL Either CRITICAL, ERROR, WARNING, INFO or DEBUG
-l, --umi-length INTEGER Length of the UMI for this sample. [default: 10]
-o, --output-bam PATH Corrected UMI bam output [default: stdout].
-x, --reject-bam PATH Filtered bam output (failing reads only).
-f, --force Force overwrite of the output files if they exist.
[default: False]
--pre-extracted Whether the input file has been processed with
`longbow extract` [default: False]
--help Show this message and exit.
```
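
Because the correct value of `--pre-extracted` depends on how the input was produced, it can help to make that choice explicit when scripting the command. The following is a minimal sketch that only assembles an argument list from the options shown in the help text above; the function name and file names are hypothetical, and nothing is executed.

```python
def correct_umi_command(input_bam, output_bam, *, pre_extracted, umi_length=10,
                        reject_bam=None, force=False):
    """Build an argument list for `longbow correct_umi`.

    `pre_extracted` is keyword-only and has no default, so the caller must
    state whether the input has been through `longbow extract`.
    """
    cmd = ["longbow", "correct_umi", "-l", str(umi_length), "-o", output_bam]
    if reject_bam is not None:
        cmd += ["-x", reject_bam]      # failing reads go to a separate bam
    if force:
        cmd.append("-f")               # overwrite existing outputs
    if pre_extracted:
        cmd.append("--pre-extracted")  # only when the input came from `longbow extract`
    cmd.append(input_bam)
    return cmd


# Hypothetical file names.
cmd = correct_umi_command("aligned.annotated.bam", "umi_corrected.bam", pre_extracted=False)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once Longbow is installed.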
2 changes: 1 addition & 1 deletion docs/commands/demultiplex.md
@@ -2,7 +2,7 @@
layout: default
title: demultiplex
description: "Demultiplex reads from different array structures."
-nav_order: 4
+nav_order: 5
parent: Commands
---

2 changes: 1 addition & 1 deletion docs/commands/extract.md
@@ -2,7 +2,7 @@
layout: default
title: extract
description: "Extract reads."
-nav_order: 5
+nav_order: 6
parent: Commands
---

2 changes: 1 addition & 1 deletion docs/commands/filter.md
@@ -2,7 +2,7 @@
layout: default
title: filter
description: "Filter reads."
-nav_order: 6
+nav_order: 7
parent: Commands
---

2 changes: 1 addition & 1 deletion docs/commands/inspect.md
@@ -2,7 +2,7 @@
layout: default
title: inspect
description: "Inspect reads."
-nav_order: 7
+nav_order: 8
parent: Commands
---

79 changes: 0 additions & 79 deletions docs/commands/model.md

This file was deleted.

85 changes: 85 additions & 0 deletions docs/commands/models.md
@@ -0,0 +1,85 @@
---
layout: default
title: models
description: "Get info on built-in models."
nav_order: 9
parent: Commands
---

# Models

## Description

Get information about built-in Longbow models.

Can list all built-in models with their version and descriptions or can dump the details of a single model to several files that contain information about that model.

## Command help

```shell
$ longbow models --help
Usage: longbow models [OPTIONS]

Get information about built-in Longbow models.

Options:
-v, --verbosity LVL Either CRITICAL, ERROR, WARNING, INFO or DEBUG
-l, --list-models List the names of all models supported natively by this
version of Longbow. NOTE: This argument is mutually
exclusive with arguments: [dump].
-d, --dump TEXT Dump the details of a given model. This command
creates a set of files corresponding to the given model
including: json representation, dot file
representations, transmission matrix, state emission
json file. NOTE: This argument is mutually exclusive
with arguments: [list_models].
--help Show this message and exit.

```

## Examples

```shell
$ longbow models --list-models
Longbow includes the following models:

Array models
============
Name Version Description
mas_15 3.0.0 15-element MAS-ISO-seq array
mas_10 3.0.0 10-element MAS-ISO-seq array
isoseq 3.0.0 PacBio IsoSeq model

cDNA models
===========
Name Version Description
sc_10x3p 3.0.0 single-cell 10x 3' kit
sc_10x5p 3.0.0 single-cell 10x 5' kit
bulk_10x5p 3.0.0 bulk 10x 5' kit
bulk_teloprimeV2 3.0.0 Lexogen TeloPrime V2 kit
spatial_slideseq 3.0.0 Slide-seq protocol
Specify a fully combined model via '<array model>+<cDNA model>' syntax, e.g. 'mas_15+sc_10x5p'.
```
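
The `<array model>+<cDNA model>` syntax shown in the listing above can be validated before a long run is started. This is a small sketch, not part of Longbow's API: the model names are copied from the example output for this version, and the function name is hypothetical.

```python
# Names copied from the `--list-models` output above; other Longbow versions may differ.
ARRAY_MODELS = {"mas_15", "mas_10", "isoseq"}
CDNA_MODELS = {"sc_10x3p", "sc_10x5p", "bulk_10x5p", "bulk_teloprimeV2", "spatial_slideseq"}


def split_model_name(name):
    """Split '<array model>+<cDNA model>' and check both halves against the listing."""
    array, sep, cdna = name.partition("+")
    if not sep or "+" in cdna:
        raise ValueError(f"expected '<array model>+<cDNA model>', got {name!r}")
    if array not in ARRAY_MODELS:
        raise ValueError(f"unknown array model {array!r}")
    if cdna not in CDNA_MODELS:
        raise ValueError(f"unknown cDNA model {cdna!r}")
    return array, cdna


print(split_model_name("mas_15+sc_10x5p"))
```

A pre-hierarchical name such as `mas15v2` is rejected by this check because it has no `+` separator.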
```shell
$ longbow models --dump mas_15+sc_10x5p
[INFO 2022-11-02 20:33:28 models] Generating model: mas_15+sc_10x5p
[INFO 2022-11-02 20:33:29 models] Dumping mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
[INFO 2022-11-02 20:33:29 models] Dumping dotfile: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dot
[INFO 2022-11-02 20:33:29 models] Dumping simple dotfile: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.simple.dot
[INFO 2022-11-02 20:33:29 models] Dumping json model specification: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.spec.json
[INFO 2022-11-02 20:33:30 models] Dumping dense transition matrix: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dense_transition_matrix.pickle
[INFO 2022-11-02 20:33:30 models] Dumping emission distributions: longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.emission_distributions.txt
[INFO 2022-11-02 20:33:30 models] Creating model graph from 1109 states...
[INFO 2022-11-02 20:33:47 models] Rendering model graph now...
[INFO 2022-11-02 20:33:57 models] Writing model graph now to longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.graph.png ...

$ ls
longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dense_transition_matrix.pickle longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.graph.png
longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.dot longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.simple.dot
longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.emission_distributions.txt longbow_model-mas_15+sc_10x5p-Av3.0.0_Cv3.0.0.spec.json
```
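
The dumped files in the example above appear to follow a common naming pattern: `longbow_model-<model>-Av<array version>_Cv<cDNA version>.<suffix>`. The sketch below reproduces those names so a script can check that a dump completed. The pattern and suffix list are inferred from the example log only, and the function name is hypothetical.

```python
# Suffixes as they appear in the example log above.
DUMP_SUFFIXES = (
    "dot",
    "simple.dot",
    "spec.json",
    "dense_transition_matrix.pickle",
    "emission_distributions.txt",
    "graph.png",
)


def dump_file_names(model, array_version="3.0.0", cdna_version="3.0.0"):
    """Return the file names a `longbow models --dump` run is expected to create.

    The naming pattern is inferred from the example output, not from Longbow's source.
    """
    prefix = f"longbow_model-{model}-Av{array_version}_Cv{cdna_version}"
    return [f"{prefix}.{suffix}" for suffix in DUMP_SUFFIXES]


names = dump_file_names("mas_15+sc_10x5p")
for name in names:
    print(name)
```

Comparing this list against the contents of the output directory gives a simple completeness check for the dump.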


2 changes: 1 addition & 1 deletion docs/commands/pad.md
@@ -2,7 +2,7 @@
layout: default
title: pad
description: "Pad tag by specified number of adjacent bases from the read."
-nav_order: 9
+nav_order: 10
parent: Commands
---

2 changes: 1 addition & 1 deletion docs/commands/peek.md
@@ -2,7 +2,7 @@
layout: default
title: peek
description: "Guess the best pre-built array model to use for annotation."
-nav_order: 10
+nav_order: 11
parent: Commands
---
