Skip to content

Commit

Permalink
Added tagfix command to correct position tags post-alignment and ot…
Browse files Browse the repository at this point in the history
…her minor updates. (#144)

- Added `tagfix` command to update reads after alignment.
- Added documentation for `tagfix`.
- Changed Slide-seq `XQ` tag to `XR` to prevent `XQ` tag from being
  overriden.
- Other minor updates.
  • Loading branch information
jonn-smith authored Aug 2, 2022
1 parent b391da0 commit efae6da
Show file tree
Hide file tree
Showing 16 changed files with 838 additions and 49 deletions.
1 change: 1 addition & 0 deletions docs/commands.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ Commands:
scsplit Create files for use in `alevin` for single-cell analysis.
segment Segment pre-annotated reads from an input BAM file.
stats Calculate and produce stats on the given input bam file.
tagfix Update longbow read tags after alignment.
train Train transition and emission probabilities on real data.
version Print the version of longbow.
```
Expand Down
37 changes: 37 additions & 0 deletions docs/commands/tagfix.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
---
layout: default
title: extract
description: "Extract reads."
nav_order: 14
parent: Commands
---

# Tagfix

## Description

Update longbow read tags after alignment.

According to the SAM spec, aligners reverse-complement the read sequence in the event that a read maps to the reverse strand.
However, the aligners do not know about the tags that `Longbow` adds to the reads and therefore certain tags will be incorrect
for reverse-stranded reads after the alignment process (e.g. the `SG` tag).

`Tagfix` checks each read in the input bam file for its strandedness and in the event of a reverse strand read, the position-based
tags that have been added to the read by `Longbow` are corrected to reflect the sequence of bases in the BAM file itself.

## Command help

```shell
$ longbow tagfix --help
Usage: longbow tagfix [OPTIONS] INPUT_BAM

Update longbow read tags after alignment.

Options:
-v, --verbosity LVL Either CRITICAL, ERROR, WARNING, INFO or DEBUG
-t, --threads INTEGER number of threads to use (0 for all) [default: 7]
-o, --output-bam PATH annotated bam output [default: stdout]
-f, --force Force overwrite of the output files if they exist.
[default: False]
--help Show this message and exit.
```
2 changes: 1 addition & 1 deletion docs/commands/train.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
layout: default
title: train
description: "Train model."
nav_order: 14
nav_order: 15
parent: Commands
---

Expand Down
2 changes: 1 addition & 1 deletion docs/commands/version.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
layout: default
title: version
description: "Print version number."
nav_order: 15
nav_order: 16
parent: Commands
---

Expand Down
64 changes: 32 additions & 32 deletions docs/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,38 +13,38 @@ Longbow adds several tags to output reads to indicate various metadata features.
the read tags added by each tool and what data they contain:

| Read Tag | Type | Tool(s) | Description |
|:---|:---|:---|:---|
| YS | f | Annotate, Demultiplex | Longbow HMM model score (log probability) |
| SG | Z | Annotate, Demultiplex | Read segment information indicating the boundaries and labels of each segment in the read. For example: `SG:Z:random:0-53,Poly_T:54-85,cDNA:86-823,MARS:824-853,N:854-869,VENUS:870-894,CBC:895-910,UMI:911-920,Poly_T:921-951,cDNA:952-2030,MARS:2031-2060,O:2061-2076,VENUS:2077-2100,CBC:2101-2116,UMI:2117-2126,Poly_T:2127-2157,cDNA:2158-3747,MARS:3748-3777,P:3778-3794` |
| YN | Z | Annotate, Demultiplex | Name of the model used to annotate the reads |
| RC | i | Annotate, Demultiplex | 1 if the read has been reverse-complemented; 0 otherwise |
| XQ | Z | Annotate, Demultiplex | Comma-delimited quality scores ratios for each segment in a given read. Scores for random segments are set to 0/0. Scores for other segments are the optimal alignment score (Smith-Waterman) for the expected sequence over 2x the sequence length. For example: `XQ:Z:0/0,12/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,32/32` |
| YQ | f | Annotate, Demultiplex | Approximate read quality. Read quality is approximated by summing the individual segment scores and dividing that sum by the sum of the total possible (best) scores (as defined by the `XQ` tag). |
| ZS | i | Segment, Extract | 1 if the given read is segmented; 0 otherwise |
| YV | i | Filter | 1 if the given read is valid according to the expected order of the model segments; 0 otherwise |
| YK | Z | Filter | Name of the first valid adapter / named non-random region in the given read |
| YG | i | Filter | Number of valid adapters / named non-random regions in the given read |
| ZU | Z | Segment | Unique Molecular Identifier (UMI) sequence for the given segmented read |
| XU | i | Segment | Position (0-based) in the read of the first base in the annotated Unique Molecular Identifier (UMI) sequence for the given segmented read |
| CR | Z | Segment | Cell Barcode (CBC) sequence for the given segmented read |
| XB | i | Segment | Position (0-based) in the read of the first base in the Cell Barcode (CBC) sequence in the given segmented read |
| XF | f | Segment | Confidence factor for the Cell Barcode (CBC) in the given segmented read. The confidence factor is based on the quality of each base in the CBC and is given by the following: `scale_factor = 100 scale_factor * reduce(operator.mul, map(lambda q: 1. - 10 ** (-(ord(q) - 33.) / 10), qual_string))` |
| XA | Z | Segment | String containing the name of the tag containing the Cell Barcode (CBC) tag followed by the Unique Molecular Identifier (UMI) tag: `XC-XM` |
| X1 | Z | Segment | Sequence for Spatial Barcode 1 for the given segmented read |
| XP | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 1 sequence in the given segmented read |
| X2 | Z | Segment | Sequence for Spatial Barcode 2 for the given segmented read |
| XQ | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 2 sequence in the given segmented read |
| XM | Z | Segment | Raw Unique Molecular Identifier (UMI) sequence for the given segmented read (for IsoSeq3 compatibility) |
| XC | Z | Segment | Raw Cell Barcode (CBC) sequence for the given segmented read (for IsoSeq3 compatibility) |
| YC | i | Correct | True IFF barcode correction was able to be performed (including "correction" where the original barcode did not change). False otherwise. |
| YP | i | Correct | True IFF the barcode was able to be corrected AND the corrected barcode != the raw barcode. False otherwise. |
| ic | i | Segment | Sum of number of passes from all ZMWs used to create consensus. Always set to 1. (for IsoSeq3 compatibility) |
| im | Z | Segment | ZMW names associated with a given segmented read. Set to the name of the parent read for a segmented read. (e.g. `m64013e_211031_055434/1/ccs`). (for IsoSeq3 compatibility) |
| is | i | Segment | Number of ZMWs associated with a given segmented read. Always set to 1. (for IsoSeq3 compatibility) |
| it | Z | Segment | List of barcodes / UMIs tagged/clipped during segmentation (e.g. `it:Z:CATTAGGTCATCCCTA,AAATTTTGGA`) (for IsoSeq3 compatibility) |
| zm | i | Segment | ZMW number from which the given read originates. (for IsoSeq3 compatibility) |
| XN | Z | Segment, Extract | Altered read name given by Longbow to a segmented read (used for debugging). This name consists of the original read name followed by the start and end positions of the segment on the original read, then the names of the bounding adapters / known regions. For example: `XN:Z:m64013e_211029_235558/24/ccs/0_1960/START-MARS` |
| pz | i | Correct | Offset of the new, corrected, barcode/tag relative to the original, raw value (0-based). NOTE: if the tag could not be corrected, this tag is not present in the output read. |
|:---------|:---|:---|:---|
| YS | f | Annotate, Demultiplex | Longbow HMM model score (log probability) |
| SG | Z | Annotate, Demultiplex | Read segment information indicating the boundaries and labels of each segment in the read. For example: `SG:Z:random:0-53,Poly_T:54-85,cDNA:86-823,MARS:824-853,N:854-869,VENUS:870-894,CBC:895-910,UMI:911-920,Poly_T:921-951,cDNA:952-2030,MARS:2031-2060,O:2061-2076,VENUS:2077-2100,CBC:2101-2116,UMI:2117-2126,Poly_T:2127-2157,cDNA:2158-3747,MARS:3748-3777,P:3778-3794` |
| YN | Z | Annotate, Demultiplex | Name of the model used to annotate the reads |
| RC | i | Annotate, Demultiplex | 1 if the read has been reverse-complemented; 0 otherwise |
| XQ | Z | Annotate, Demultiplex | Comma-delimited quality scores ratios for each segment in a given read. Scores for random segments are set to 0/0. Scores for other segments are the optimal alignment score (Smith-Waterman) for the expected sequence over 2x the sequence length. For example: `XQ:Z:0/0,12/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,32/32` |
| YQ | f | Annotate, Demultiplex | Approximate read quality. Read quality is approximated by summing the individual segment scores and dividing that sum by the sum of the total possible (best) scores (as defined by the `XQ` tag). |
| ZS | i | Segment, Extract | 1 if the given read is segmented; 0 otherwise |
| YV | i | Filter | 1 if the given read is valid according to the expected order of the model segments; 0 otherwise |
| YK | Z | Filter | Name of the first valid adapter / named non-random region in the given read |
| YG | i | Filter | Number of valid adapters / named non-random regions in the given read |
| ZU | Z | Segment | Unique Molecular Identifier (UMI) sequence for the given segmented read |
| XU | i | Segment | Position (0-based) in the read of the first base in the annotated Unique Molecular Identifier (UMI) sequence for the given segmented read |
| CR | Z | Segment | Cell Barcode (CBC) sequence for the given segmented read |
| XB | i | Segment | Position (0-based) in the read of the first base in the Cell Barcode (CBC) sequence in the given segmented read |
| XF | f | Segment | Confidence factor for the Cell Barcode (CBC) in the given segmented read. The confidence factor is based on the quality of each base in the CBC and is given by the following: `scale_factor = 100 scale_factor * reduce(operator.mul, map(lambda q: 1. - 10 ** (-(ord(q) - 33.) / 10), qual_string))` |
| XA | Z | Segment | String containing the name of the tag containing the Cell Barcode (CBC) tag followed by the Unique Molecular Identifier (UMI) tag: `XC-XM` |
| X1 | Z | Segment | Sequence for Spatial Barcode 1 for the given segmented read |
| XP | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 1 sequence in the given segmented read |
| X2 | Z | Segment | Sequence for Spatial Barcode 2 for the given segmented read |
| XR | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 2 sequence in the given segmented read |
| XM | Z | Segment | Raw Unique Molecular Identifier (UMI) sequence for the given segmented read (for IsoSeq3 compatibility) |
| XC | Z | Segment | Raw Cell Barcode (CBC) sequence for the given segmented read (for IsoSeq3 compatibility) |
| YC | i | Correct | True IFF barcode correction was able to be performed (including "correction" where the original barcode did not change). False otherwise. |
| YP | i | Correct | True IFF the barcode was able to be corrected AND the corrected barcode != the raw barcode. False otherwise. |
| ic | i | Segment | Sum of number of passes from all ZMWs used to create consensus. Always set to 1. (for IsoSeq3 compatibility) |
| im | Z | Segment | ZMW names associated with a given segmented read. Set to the name of the parent read for a segmented read. (e.g. `m64013e_211031_055434/1/ccs`). (for IsoSeq3 compatibility) |
| is | i | Segment | Number of ZMWs associated with a given segmented read. Always set to 1. (for IsoSeq3 compatibility) |
| it | Z | Segment | List of barcodes / UMIs tagged/clipped during segmentation (e.g. `it:Z:CATTAGGTCATCCCTA,AAATTTTGGA`) (for IsoSeq3 compatibility) |
| zm | i | Segment | ZMW number from which the given read originates. (for IsoSeq3 compatibility) |
| XN | Z | Segment, Extract | Altered read name given by Longbow to a segmented read (used for debugging). This name consists of the original read name followed by the start and end positions of the segment on the original read, then the names of the bounding adapters / known regions. For example: `XN:Z:m64013e_211029_235558/24/ccs/0_1960/START-MARS` |
| pz | i | Correct | Offset of the new, corrected, barcode/tag relative to the original, raw value (0-based). NOTE: if the tag could not be corrected, this tag is not present in the output read. |

### Read Tag Types
The SAM spec defines the following as types for read tags:
Expand Down
2 changes: 1 addition & 1 deletion src/longbow/inspect/command.py
Original file line number Diff line number Diff line change
Expand Up @@ -258,7 +258,7 @@ def annotate_read(read, m, max_length, min_rq):
fppath = []

if read.has_tag(longbow.utils.constants.SEGMENTS_TAG):
tag = re.split(",", read.get_tag(longbow.utils.constants.SEGMENTS_TAG))
tag = re.split(longbow.utils.constants.SEGMENT_TAG_DELIMITER, read.get_tag(longbow.utils.constants.SEGMENTS_TAG))

for e in tag:
state, rrange = re.split(":", e)
Expand Down
2 changes: 1 addition & 1 deletion src/longbow/pad/command.py
Original file line number Diff line number Diff line change
Expand Up @@ -258,7 +258,7 @@ def _expand_tag_fn(out_queue, out_bam_header, out_bam_file_name, pbar, res, lb_m
# No need to continue here:
break

# Write our our read:
# Write out our read:
out_bam_file.write(read)

# Increment our counters:
Expand Down
Empty file added src/longbow/tagfix/__init__.py
Empty file.
212 changes: 212 additions & 0 deletions src/longbow/tagfix/command.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,212 @@
import logging
import re
import time
import sys
import itertools

import click
import click_log
import tqdm

import pysam
import multiprocessing as mp

from construct import *

import longbow.utils.constants
from ..utils import bam_utils
from ..utils.cli_utils import format_obnoxious_warning_message


PROG_NAME = "tagfix"

logging.basicConfig(stream=sys.stderr)
logger = logging.getLogger(PROG_NAME)
click_log.basic_config(logger)


@click.command(name=logger.name)
@click_log.simple_verbosity_option(logger)
@click.option(
"-t",
"--threads",
type=int,
default=mp.cpu_count() - 1,
show_default=True,
help="number of threads to use (0 for all)",
)
@click.option(
"-o",
"--output-bam",
default="-",
type=click.Path(exists=False),
help="annotated bam output [default: stdout]",
)
@click.option(
'-f',
'--force',
is_flag=True,
default=False,
show_default=True,
help="Force overwrite of the output files if they exist."
)
@click.argument("input-bam", default="-" if not sys.stdin.isatty() else None, type=click.File("rb"))
def main(threads, output_bam, force, input_bam):
"""Update longbow read tags after alignment."""

t_start = time.time()

logger.info("Invoked via: longbow %s", " ".join(sys.argv[1:]))

# Check to see if the output files exist:
bam_utils.check_for_preexisting_files(output_bam, exist_ok=force)

threads = mp.cpu_count() if threads <= 0 or threads > mp.cpu_count() else threads
logger.info(f"Running with {threads} worker subprocess(es)")

# Configure process manager:
# NOTE: We're using processes to overcome the Global Interpreter Lock.
manager = mp.Manager()
process_input_data_queue = manager.Queue(threads)
results = manager.Queue()

pysam.set_verbosity(0) # silence message about the .bai file not being found
with pysam.AlignmentFile(
input_bam, "rb", check_sq=False, require_index=False
) as bam_file, tqdm.tqdm(
desc="Progress",
unit=" read",
colour="green",
file=sys.stderr,
leave=False,
disable=not sys.stdin.isatty(),
) as pbar:

# Start worker sub-processes:
res = manager.dict({"num_tags_corrected": 0, "num_reads": 0})
worker_process_pool = []
for _ in range(threads):
p = mp.Process(
target=_correct_read_tags, args=(process_input_data_queue, results, bam_file.header, res)
)
p.start()
worker_process_pool.append(p)

# Check if we have run this command on a bam file already:
if bam_utils.bam_header_has_longbow_command_program_group(bam_file.header, PROG_NAME):
logger.error(f"{PROG_NAME} has already been run on input bam: {input_bam}")
sys.exit(1)

out_header = bam_utils.create_bam_header_with_program_group(logger.name, bam_file.header)

# Start output worker:
output_worker = mp.Process(
target=_output_writer_fn,
args=(
results,
out_header,
output_bam,
pbar,
),
)
output_worker.start()

# Add in a sentinel value at the end of the queue - one for each subprocess - so we guarantee
# that all subprocesses will exit:
iter_data = itertools.chain(bam_file, (None,) * threads)
for r in iter_data:
if r is not None:
process_input_data_queue.put(r.to_dict())
else:
process_input_data_queue.put(r)

# Wait for our input jobs to finish:
for p in worker_process_pool:
p.join()

# Now that our input processes are done, we can add our exit sentinel onto the output queue and
# wait for that process to end:
results.put(None)
output_worker.join()

logger.info(f"Corrected tags in {res['num_tags_corrected']} reads of {res['num_reads']} total.")

et = time.time()
logger.info(f"Done. Elapsed time: {et - t_start:2.2f}s. "
f"Overall processing rate: {res['num_reads']/(et - t_start):2.2f} reads/s.")

if res['num_tags_corrected'] == 0:
logger.warning(format_obnoxious_warning_message("No read tags were corrected. This is very unlikely. "
"You should check your data."))


def _output_writer_fn(out_queue, out_bam_header, out_bam_file_name, pbar):
"""Thread / process fn to write out all our data."""

with pysam.AlignmentFile(
out_bam_file_name, "wb", header=out_bam_header
) as out_bam_file:
while True:
# Wait for some output data:
raw_data = out_queue.get()

# Check for exit sentinel:
if raw_data is None:
break

# Unpack data:
read = raw_data
read = pysam.AlignedSegment.from_dict(read, out_bam_header)

# Write out our read:
out_bam_file.write(read)

pbar.update(1)


def _correct_read_tags(in_queue, out_queue, bam_header, res):
"""Function to run in each subprocess.
Do the tag correction here."""

while True:
# Wait until we get some data.
# Note: Because we have a sentinel value None inserted at the end of the input data for each
# subprocess, we don't have to add a timeout - we're guaranteed each process will always have
# at least one element.
raw_data = in_queue.get()

# Check for exit sentinel:
if raw_data is None:
return

# Unpack our data here:
read = pysam.AlignedSegment.from_dict(raw_data, bam_header)

if read.is_reverse:

# Reverse the segments and update the positions:
segments_tag = read.get_tag(longbow.utils.constants.SEGMENTS_TAG)
segments = segments_tag.split(longbow.utils.constants.SEGMENT_TAG_DELIMITER)
read_length = len(read.query_sequence)
new_segments = []
for seg in reversed(segments):
seg_name, start, end = re.split("[:-]", seg)
new_start = read_length - int(end)
new_end = read_length - int(start)

new_segments.append(f"{seg_name}:{new_start}-{new_end}")

read.set_tag(longbow.utils.constants.SEGMENTS_TAG,
longbow.utils.constants.SEGMENT_TAG_DELIMITER.join(new_segments))

# Reverse the Segment Qualities:
seg_quals = read.get_tag(longbow.utils.constants.SEGMENTS_QUAL_TAG)
new_seg_quals = seg_quals.split(longbow.utils.constants.SEGMENT_TAG_DELIMITER)[::-1]
read.set_tag(longbow.utils.constants.SEGMENTS_QUAL_TAG,
longbow.utils.constants.SEGMENT_TAG_DELIMITER.join(new_seg_quals))

# Increment our counter:
res["num_tags_corrected"] += 1

out_queue.put(read.to_dict())
res["num_reads"] += 1
Loading

0 comments on commit efae6da

Please sign in to comment.