Added tagfix command to correct position tags post-alignment and ot…

…her minor updates. (#144) - Added `tagfix` command to update reads after alignment. - Added documentation for `tagfix`. - Changed Slide-seq `XQ` tag to `XR` to prevent `XQ` tag from being overriden. - Other minor updates.
broadinstitute · Aug 2, 2022 · efae6da · efae6da
1 parent b391da0
commit efae6da
Show file tree

Hide file tree

Showing 16 changed files with 838 additions and 49 deletions.
diff --git a/docs/commands.md b/docs/commands.md
@@ -33,6 +33,7 @@ Commands:
   scsplit      Create files for use in `alevin` for single-cell analysis.
   segment      Segment pre-annotated reads from an input BAM file.
   stats        Calculate and produce stats on the given input bam file.
+  tagfix       Update longbow read tags after alignment.
   train        Train transition and emission probabilities on real data.
   version      Print the version of longbow.
 ```

diff --git a/docs/commands/tagfix.md b/docs/commands/tagfix.md
@@ -0,0 +1,37 @@
+---
+layout: default
+title: extract
+description: "Extract reads."
+nav_order: 14
+parent: Commands
+---
+
+# Tagfix
+
+## Description
+
+Update longbow read tags after alignment.
+
+According to the SAM spec, aligners reverse-complement the read sequence in the event that a read maps to the reverse strand.
+However, the aligners do not know about the tags that `Longbow` adds to the reads and therefore certain tags will be incorrect
+for reverse-stranded reads after the alignment process (e.g. the `SG` tag).
+
+`Tagfix` checks each read in the input bam file for its strandedness and in the event of a reverse strand read, the position-based
+tags that have been added to the read by `Longbow` are corrected to reflect the sequence of bases in the BAM file itself.
+
+## Command help
+
+```shell
+$ longbow tagfix --help
+Usage: longbow tagfix [OPTIONS] INPUT_BAM
+
+  Update longbow read tags after alignment.
+
+Options:
+  -v, --verbosity LVL    Either CRITICAL, ERROR, WARNING, INFO or DEBUG
+  -t, --threads INTEGER  number of threads to use (0 for all)  [default: 7]
+  -o, --output-bam PATH  annotated bam output  [default: stdout]
+  -f, --force            Force overwrite of the output files if they exist.
+                         [default: False]
+  --help                 Show this message and exit.
+```
diff --git a/docs/commands/train.md b/docs/commands/train.md
@@ -2,7 +2,7 @@
 layout: default
 title: train
 description: "Train model."
-nav_order: 14
+nav_order: 15
 parent: Commands
 ---
 

diff --git a/docs/commands/version.md b/docs/commands/version.md
@@ -2,7 +2,7 @@
 layout: default
 title: version
 description: "Print version number."
-nav_order: 15
+nav_order: 16
 parent: Commands
 ---
 

diff --git a/docs/faq.md b/docs/faq.md
@@ -13,38 +13,38 @@ Longbow adds several tags to output reads to indicate various metadata features.
 the read tags added by each tool and what data they contain:
 
 | Read Tag | Type | Tool(s) | Description |
-|:---|:---|:---|:---|
-| YS | f | Annotate, Demultiplex | Longbow HMM model score (log probability) |
-| SG | Z | Annotate, Demultiplex | Read segment information indicating the boundaries and labels of each segment in the read.   For example:   `SG:Z:random:0-53,Poly_T:54-85,cDNA:86-823,MARS:824-853,N:854-869,VENUS:870-894,CBC:895-910,UMI:911-920,Poly_T:921-951,cDNA:952-2030,MARS:2031-2060,O:2061-2076,VENUS:2077-2100,CBC:2101-2116,UMI:2117-2126,Poly_T:2127-2157,cDNA:2158-3747,MARS:3748-3777,P:3778-3794` |
-| YN | Z | Annotate, Demultiplex | Name of the model used to annotate the reads |
-| RC | i | Annotate, Demultiplex | 1 if the read has been reverse-complemented; 0 otherwise |
-| XQ | Z | Annotate, Demultiplex | Comma-delimited quality scores ratios for each segment in a given read.  Scores for random segments are set to 0/0.   Scores for other segments are the optimal alignment score (Smith-Waterman) for the expected sequence over 2x the sequence length.   For example:   `XQ:Z:0/0,12/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,32/32` |
-| YQ | f | Annotate, Demultiplex | Approximate read quality.  Read quality is approximated by summing the individual segment scores and dividing that sum by the sum of the total possible (best) scores (as defined by the `XQ` tag). |
-| ZS | i | Segment, Extract | 1 if the given read is segmented; 0 otherwise |
-| YV | i | Filter | 1 if the given read is valid according to the expected order of the model segments; 0 otherwise |
-| YK | Z | Filter | Name of the first valid adapter / named non-random region in the given read |
-| YG | i | Filter | Number of valid adapters / named non-random regions in the given read |
-| ZU | Z | Segment | Unique Molecular Identifier (UMI) sequence for the given segmented read |
-| XU | i | Segment | Position (0-based) in the read of the first base in the annotated Unique Molecular Identifier (UMI) sequence for the given segmented read |
-| CR | Z | Segment | Cell Barcode (CBC) sequence for the given segmented read |
-| XB | i | Segment | Position (0-based) in the read of the first base in the Cell Barcode (CBC) sequence in the given segmented read |
-| XF | f | Segment | Confidence factor for the Cell Barcode (CBC) in the given segmented read.  The confidence factor is based on the quality of each base in the CBC and is given by the following:  `scale_factor = 100 scale_factor * reduce(operator.mul, map(lambda q: 1. - 10 ** (-(ord(q) - 33.) / 10), qual_string))` |
-| XA | Z | Segment | String containing the name of the tag containing the Cell Barcode (CBC) tag followed by the Unique Molecular Identifier (UMI) tag:  `XC-XM` |
-| X1 | Z | Segment | Sequence for Spatial Barcode 1 for the given segmented read |
-| XP | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 1 sequence in the given segmented read |
-| X2 | Z | Segment | Sequence for Spatial Barcode 2 for the given segmented read |
-| XQ | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 2 sequence in the given segmented read |
-| XM | Z | Segment | Raw Unique Molecular Identifier (UMI) sequence for the given segmented read (for IsoSeq3 compatibility) |
-| XC | Z | Segment | Raw Cell Barcode (CBC) sequence for the given segmented read (for IsoSeq3 compatibility) |
-| YC | i | Correct | True IFF barcode correction was able to be performed (including "correction" where the original barcode did not change).  False otherwise. |
-| YP | i | Correct | True IFF the barcode was able to be corrected AND the corrected barcode != the raw barcode.  False otherwise.   |
-| ic | i | Segment | Sum of number of passes from all ZMWs used to create consensus.  Always set to 1.  (for IsoSeq3 compatibility) |
-| im | Z | Segment | ZMW names associated with a given segmented read.  Set to the name of the parent read for a segmented read. (e.g. `m64013e_211031_055434/1/ccs`).  (for IsoSeq3 compatibility) |
-| is | i | Segment | Number of ZMWs associated with a given segmented read.  Always set to 1.  (for IsoSeq3 compatibility) |
-| it | Z | Segment | List of barcodes / UMIs tagged/clipped during segmentation (e.g. `it:Z:CATTAGGTCATCCCTA,AAATTTTGGA`) (for IsoSeq3 compatibility) |
-| zm | i | Segment | ZMW number from which the given read originates.  (for IsoSeq3 compatibility) |
-| XN | Z | Segment, Extract | Altered read name given by Longbow to a segmented read (used for debugging).  This name consists of the original read name followed by the start and end positions of the segment on the original read, then the names of the bounding adapters / known regions.  For example:  `XN:Z:m64013e_211029_235558/24/ccs/0_1960/START-MARS` |
-| pz | i | Correct | Offset of the new, corrected, barcode/tag relative to the original, raw value (0-based).  NOTE: if the tag could not be corrected, this tag is not present in the output read. |
+|:---------|:---|:---|:---|
+| YS       | f | Annotate, Demultiplex | Longbow HMM model score (log probability) |
+| SG       | Z | Annotate, Demultiplex | Read segment information indicating the boundaries and labels of each segment in the read.   For example:   `SG:Z:random:0-53,Poly_T:54-85,cDNA:86-823,MARS:824-853,N:854-869,VENUS:870-894,CBC:895-910,UMI:911-920,Poly_T:921-951,cDNA:952-2030,MARS:2031-2060,O:2061-2076,VENUS:2077-2100,CBC:2101-2116,UMI:2117-2126,Poly_T:2127-2157,cDNA:2158-3747,MARS:3748-3777,P:3778-3794` |
+| YN       | Z | Annotate, Demultiplex | Name of the model used to annotate the reads |
+| RC       | i | Annotate, Demultiplex | 1 if the read has been reverse-complemented; 0 otherwise |
+| XQ       | Z | Annotate, Demultiplex | Comma-delimited quality scores ratios for each segment in a given read.  Scores for random segments are set to 0/0.   Scores for other segments are the optimal alignment score (Smith-Waterman) for the expected sequence over 2x the sequence length.   For example:   `XQ:Z:0/0,12/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,30/32,46/46,0/0,0/0,60/60,0/0,58/60,32/32` |
+| YQ       | f | Annotate, Demultiplex | Approximate read quality.  Read quality is approximated by summing the individual segment scores and dividing that sum by the sum of the total possible (best) scores (as defined by the `XQ` tag). |
+| ZS       | i | Segment, Extract | 1 if the given read is segmented; 0 otherwise |
+| YV       | i | Filter | 1 if the given read is valid according to the expected order of the model segments; 0 otherwise |
+| YK       | Z | Filter | Name of the first valid adapter / named non-random region in the given read |
+| YG       | i | Filter | Number of valid adapters / named non-random regions in the given read |
+| ZU       | Z | Segment | Unique Molecular Identifier (UMI) sequence for the given segmented read |
+| XU       | i | Segment | Position (0-based) in the read of the first base in the annotated Unique Molecular Identifier (UMI) sequence for the given segmented read |
+| CR       | Z | Segment | Cell Barcode (CBC) sequence for the given segmented read |
+| XB       | i | Segment | Position (0-based) in the read of the first base in the Cell Barcode (CBC) sequence in the given segmented read |
+| XF       | f | Segment | Confidence factor for the Cell Barcode (CBC) in the given segmented read.  The confidence factor is based on the quality of each base in the CBC and is given by the following:  `scale_factor = 100 scale_factor * reduce(operator.mul, map(lambda q: 1. - 10 ** (-(ord(q) - 33.) / 10), qual_string))` |
+| XA       | Z | Segment | String containing the name of the tag containing the Cell Barcode (CBC) tag followed by the Unique Molecular Identifier (UMI) tag:  `XC-XM` |
+| X1       | Z | Segment | Sequence for Spatial Barcode 1 for the given segmented read |
+| XP       | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 1 sequence in the given segmented read |
+| X2       | Z | Segment | Sequence for Spatial Barcode 2 for the given segmented read |
+| XR       | i | Segment | Position (0-based) in the read of the first base in the Spatial Barcode 2 sequence in the given segmented read |
+| XM       | Z | Segment | Raw Unique Molecular Identifier (UMI) sequence for the given segmented read (for IsoSeq3 compatibility) |
+| XC       | Z | Segment | Raw Cell Barcode (CBC) sequence for the given segmented read (for IsoSeq3 compatibility) |
+| YC       | i | Correct | True IFF barcode correction was able to be performed (including "correction" where the original barcode did not change).  False otherwise. |
+| YP       | i | Correct | True IFF the barcode was able to be corrected AND the corrected barcode != the raw barcode.  False otherwise.   |
+| ic       | i | Segment | Sum of number of passes from all ZMWs used to create consensus.  Always set to 1.  (for IsoSeq3 compatibility) |
+| im       | Z | Segment | ZMW names associated with a given segmented read.  Set to the name of the parent read for a segmented read. (e.g. `m64013e_211031_055434/1/ccs`).  (for IsoSeq3 compatibility) |
+| is       | i | Segment | Number of ZMWs associated with a given segmented read.  Always set to 1.  (for IsoSeq3 compatibility) |
+| it       | Z | Segment | List of barcodes / UMIs tagged/clipped during segmentation (e.g. `it:Z:CATTAGGTCATCCCTA,AAATTTTGGA`) (for IsoSeq3 compatibility) |
+| zm       | i | Segment | ZMW number from which the given read originates.  (for IsoSeq3 compatibility) |
+| XN       | Z | Segment, Extract | Altered read name given by Longbow to a segmented read (used for debugging).  This name consists of the original read name followed by the start and end positions of the segment on the original read, then the names of the bounding adapters / known regions.  For example:  `XN:Z:m64013e_211029_235558/24/ccs/0_1960/START-MARS` |
+| pz       | i | Correct | Offset of the new, corrected, barcode/tag relative to the original, raw value (0-based).  NOTE: if the tag could not be corrected, this tag is not present in the output read. |
 
 ### Read Tag Types
 The SAM spec defines the following as types for read tags:

diff --git a/src/longbow/inspect/command.py b/src/longbow/inspect/command.py
@@ -258,7 +258,7 @@ def annotate_read(read, m, max_length, min_rq):
     fppath = []
 
     if read.has_tag(longbow.utils.constants.SEGMENTS_TAG):
-        tag = re.split(",", read.get_tag(longbow.utils.constants.SEGMENTS_TAG))
+        tag = re.split(longbow.utils.constants.SEGMENT_TAG_DELIMITER, read.get_tag(longbow.utils.constants.SEGMENTS_TAG))
 
         for e in tag:
             state, rrange = re.split(":", e)

diff --git a/src/longbow/pad/command.py b/src/longbow/pad/command.py
@@ -258,7 +258,7 @@ def _expand_tag_fn(out_queue, out_bam_header, out_bam_file_name, pbar, res, lb_m
                                 # No need to continue here:
                                 break
 
-            # Write our our read:
+            # Write out our read:
             out_bam_file.write(read)
 
             # Increment our counters:

diff --git a/src/longbow/tagfix/__init__.py b/src/longbow/tagfix/__init__.py
diff --git a/src/longbow/tagfix/command.py b/src/longbow/tagfix/command.py
@@ -0,0 +1,212 @@
+import logging
+import re
+import time
+import sys
+import itertools
+
+import click
+import click_log
+import tqdm
+
+import pysam
+import multiprocessing as mp
+
+from construct import *
+
+import longbow.utils.constants
+from ..utils import bam_utils
+from ..utils.cli_utils import format_obnoxious_warning_message
+
+
+PROG_NAME = "tagfix"
+
+logging.basicConfig(stream=sys.stderr)
+logger = logging.getLogger(PROG_NAME)
+click_log.basic_config(logger)
+
+
+@click.command(name=logger.name)
+@click_log.simple_verbosity_option(logger)
+@click.option(
+    "-t",
+    "--threads",
+    type=int,
+    default=mp.cpu_count() - 1,
+    show_default=True,
+    help="number of threads to use (0 for all)",
+)
+@click.option(
+    "-o",
+    "--output-bam",
+    default="-",
+    type=click.Path(exists=False),
+    help="annotated bam output  [default: stdout]",
+)
+@click.option(
+    '-f',
+    '--force',
+    is_flag=True,
+    default=False,
+    show_default=True,
+    help="Force overwrite of the output files if they exist."
+)
+@click.argument("input-bam", default="-" if not sys.stdin.isatty() else None, type=click.File("rb"))
+def main(threads, output_bam, force, input_bam):
+    """Update longbow read tags after alignment."""
+
+    t_start = time.time()
+
+    logger.info("Invoked via: longbow %s", " ".join(sys.argv[1:]))
+
+    # Check to see if the output files exist:
+    bam_utils.check_for_preexisting_files(output_bam, exist_ok=force)
+
+    threads = mp.cpu_count() if threads <= 0 or threads > mp.cpu_count() else threads
+    logger.info(f"Running with {threads} worker subprocess(es)")
+
+    # Configure process manager:
+    # NOTE: We're using processes to overcome the Global Interpreter Lock.
+    manager = mp.Manager()
+    process_input_data_queue = manager.Queue(threads)
+    results = manager.Queue()
+
+    pysam.set_verbosity(0)  # silence message about the .bai file not being found
+    with pysam.AlignmentFile(
+        input_bam, "rb", check_sq=False, require_index=False
+    ) as bam_file, tqdm.tqdm(
+        desc="Progress",
+        unit=" read",
+        colour="green",
+        file=sys.stderr,
+        leave=False,
+        disable=not sys.stdin.isatty(),
+    ) as pbar:
+
+        # Start worker sub-processes:
+        res = manager.dict({"num_tags_corrected": 0, "num_reads": 0})
+        worker_process_pool = []
+        for _ in range(threads):
+            p = mp.Process(
+                target=_correct_read_tags, args=(process_input_data_queue, results, bam_file.header, res)
+            )
+            p.start()
+            worker_process_pool.append(p)
+
+        # Check if we have run this command on a bam file already:
+        if bam_utils.bam_header_has_longbow_command_program_group(bam_file.header, PROG_NAME):
+            logger.error(f"{PROG_NAME} has already been run on input bam: {input_bam}")
+            sys.exit(1)
+
+        out_header = bam_utils.create_bam_header_with_program_group(logger.name, bam_file.header)
+
+        # Start output worker:
+        output_worker = mp.Process(
+            target=_output_writer_fn,
+            args=(
+                results,
+                out_header,
+                output_bam,
+                pbar,
+            ),
+        )
+        output_worker.start()
+
+        # Add in a sentinel value at the end of the queue - one for each subprocess - so we guarantee
+        # that all subprocesses will exit:
+        iter_data = itertools.chain(bam_file, (None,) * threads)
+        for r in iter_data:
+            if r is not None:
+                process_input_data_queue.put(r.to_dict())
+            else:
+                process_input_data_queue.put(r)
+
+        # Wait for our input jobs to finish:
+        for p in worker_process_pool:
+            p.join()
+
+        # Now that our input processes are done, we can add our exit sentinel onto the output queue and
+        # wait for that process to end:
+        results.put(None)
+        output_worker.join()
+
+    logger.info(f"Corrected tags in {res['num_tags_corrected']} reads of {res['num_reads']} total.")
+
+    et = time.time()
+    logger.info(f"Done. Elapsed time: {et - t_start:2.2f}s. "
+                f"Overall processing rate: {res['num_reads']/(et - t_start):2.2f} reads/s.")
+
+    if res['num_tags_corrected'] == 0:
+        logger.warning(format_obnoxious_warning_message("No read tags were corrected.  This is very unlikely.  "
+                                                        "You should check your data."))
+
+
+def _output_writer_fn(out_queue, out_bam_header, out_bam_file_name, pbar):
+    """Thread / process fn to write out all our data."""
+
+    with pysam.AlignmentFile(
+        out_bam_file_name, "wb", header=out_bam_header
+    ) as out_bam_file:
+        while True:
+            # Wait for some output data:
+            raw_data = out_queue.get()
+
+            # Check for exit sentinel:
+            if raw_data is None:
+                break
+
+            # Unpack data:
+            read = raw_data
+            read = pysam.AlignedSegment.from_dict(read, out_bam_header)
+
+            # Write out our read:
+            out_bam_file.write(read)
+
+            pbar.update(1)
+
+
+def _correct_read_tags(in_queue, out_queue, bam_header, res):
+    """Function to run in each subprocess.
+    Do the tag correction here."""
+
+    while True:
+        # Wait until we get some data.
+        # Note: Because we have a sentinel value None inserted at the end of the input data for each
+        #       subprocess, we don't have to add a timeout - we're guaranteed each process will always have
+        #       at least one element.
+        raw_data = in_queue.get()
+
+        # Check for exit sentinel:
+        if raw_data is None:
+            return
+
+        # Unpack our data here:
+        read = pysam.AlignedSegment.from_dict(raw_data, bam_header)
+
+        if read.is_reverse:
+
+            # Reverse the segments and update the positions:
+            segments_tag = read.get_tag(longbow.utils.constants.SEGMENTS_TAG)
+            segments = segments_tag.split(longbow.utils.constants.SEGMENT_TAG_DELIMITER)
+            read_length = len(read.query_sequence)
+            new_segments = []
+            for seg in reversed(segments):
+                seg_name, start, end = re.split("[:-]", seg)
+                new_start = read_length - int(end)
+                new_end = read_length - int(start)
+
+                new_segments.append(f"{seg_name}:{new_start}-{new_end}")
+
+            read.set_tag(longbow.utils.constants.SEGMENTS_TAG,
+                         longbow.utils.constants.SEGMENT_TAG_DELIMITER.join(new_segments))
+
+            # Reverse the Segment Qualities:
+            seg_quals = read.get_tag(longbow.utils.constants.SEGMENTS_QUAL_TAG)
+            new_seg_quals = seg_quals.split(longbow.utils.constants.SEGMENT_TAG_DELIMITER)[::-1]
+            read.set_tag(longbow.utils.constants.SEGMENTS_QUAL_TAG,
+                         longbow.utils.constants.SEGMENT_TAG_DELIMITER.join(new_seg_quals))
+
+            # Increment our counter:
+            res["num_tags_corrected"] += 1
+
+        out_queue.put(read.to_dict())
+        res["num_reads"] += 1