-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
missing umi and/or cell tag from alligned reads after longbow processing #127
Comments
Hi Daud, I’ve looked into this error now and I understand what’s going on. Essentially, there are two issues:
Thankfully, these errors are relatively easy to fix. I’ve included an example script with some changes to yours at the bottom of this message. There are two major changes to your workflow. First, I run a relatively new Longbow command called “peek” that automatically detects the appropriate model to apply. The output on your test data looks as follows:
You can see what effect that has on the annotation in the images I’ve attached for one read, processed with the mas15v2 and mas15threeP models. In the first image, poly(T) sections are not properly annotated (given that the mas15v2 model looks for 5’ data instead). In the second image, the poly(T) sections (and other landmarks) are annotated much more appropriately. The second change was to add the “-T XC,ZU” argument to samtools fastq. That tells it to encode those particular BAM tags into the comment line of each fastq record. Minimap2 with the -y argument then copies those tags back into the resulting BAM file (this argument to minimap2 is not documented well - a reference to it can be found here: https://lh3.github.io/minimap2/minimap2.html). Now the umi_tools group command should succeed:
Hopefully that should fix the issue you were seeing. A few things to clean up on our end: we’ll update our documentation to reflect our new “peek” command and carrying over the CBC/UMI tag to BAM files for use with downstream tools. Finally, here's the modified script: #!/bin/bash
set -euxo pipefail
INPUT_BAM="daud.bam"
THREADS=8
ANNOTATE_OUTDIR="annonate"
ALIGNMENT_OUTDIR="allignedBam"
mkdir -p $ANNOTATE_OUTDIR
mkdir -p $ALIGNMENT_OUTDIR
longbow version
# The new Longbow peek command allows the user to detect the array structure automatically.
longbow peek -f -o $ANNOTATE_OUTDIR/model.txt $INPUT_BAM
# These are debugging tools, useful for seeing how the HMM applied the model to the data.
longbow inspect --seg-score -r "m64152e_220303_222034/14/ccs" -m mas15v2 -o images/mas15v2 daud.bam
longbow inspect --seg-score -r "m64152e_220303_222034/14/ccs" -m mas15threeP -o images/mas15threeP daud.bam
longbow annotate -m $(cat $ANNOTATE_OUTDIR/model.txt) -t $THREADS $INPUT_BAM | \
longbow filter | \
longbow segment | \
longbow extract -f -o $ANNOTATE_OUTDIR/extracted.bam
# Check that we have the ZU and XC tags in the extracted BAM file.
samtools view -c $ANNOTATE_OUTDIR/extracted.bam
samtools view $ANNOTATE_OUTDIR/extracted.bam | sed 's/\t/\n/g' | grep -c '^ZU'
samtools view $ANNOTATE_OUTDIR/extracted.bam | sed 's/\t/\n/g' | grep -c '^XC'
# Here, we need to tell samtools fastq to propogate the cell barcode and UMI tags to the aligned file.
samtools fastq -T XC,ZU $ANNOTATE_OUTDIR/extracted.bam | \
minimap2 -ayYL --MD -x splice:hq GCA_000001405.15_GRCh38_no_alt_analysis_set.fa - | \
samtools sort > $ALIGNMENT_OUTDIR/aligned.bam
samtools index $ALIGNMENT_OUTDIR/aligned.bam
# Recheck that we have the ZU and XC tags in the aligned BAM file.
samtools view -c $ALIGNMENT_OUTDIR/aligned.bam
samtools view $ALIGNMENT_OUTDIR/aligned.bam | sed 's/\t/\n/g' | grep -c '^ZU'
samtools view $ALIGNMENT_OUTDIR/aligned.bam | sed 's/\t/\n/g' | grep -c '^XC'
# Now with the proper array model and propagated tags, the umi_tools command should succeed.
# Note that because I'm aligning to the genome and skipping some of the gene annotation steps,
# and because I'm working with a tiny subset of the data, I'm omitting the arguments where
# we tell umi_tools to look at the gene name tag in the BAM file.
umi_tools group --stdin=$ALIGNMENT_OUTDIR/aligned.bam \
--buffer-whole-contig \
--no-sort-output \
--extract-umi-method tag \
--umi-tag ZU \
--group-out=out.tsv \
--output-bam \
--stdout=$ALIGNMENT_OUTDIR/aligned.grouped.bam Thanks, m64152e_220303_222034_14_ccs.mas15v2.pdf |
Hi,
we just received an MAS-Iso seq long read data for single cell 3'p for test which I am trying to process as per the Longbow pipeline , the bam file from the longbow output is giving an error when I am trying to group the detected UMIs and mark the molecules with the same UMIs using UMI-Tools , it's throwing an error
""" Read skipped, missing umi and/or cell tag: 61628044"""
here are the steps which I did
1- annotation, filtering, segmentation , extract
'longbow annotate -m mas15v2 -t 30 hifi_reads.bam | longbow filter | longbow segment | longbow extract -o filter_passed.bam'
2 , primary allignment of extracted bam with the hg38 genome using minimap2
'samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq Homo_sapiens.GRCh38.dna.primary_assembly.fa - |
samtools sort > align.bam &&
samtools index align.bam
3 - baseline transcript annotations via stringtie 2 to create new transcriptome annotations specific to our test data
'stringtie allign.bam -Lv -G gencode.v38.annotation2.gtf -o annotations.gtf -A gene_abund.out
4 - Based on the resulting transcript annotations we created a new transcript reference FASTA file using gffread
'gffread -w transcriptome.fa -g Homo_sapiens.GRCh38.dna.primary_assembly.fa annotations.gtf'
5 - aligned the extracted reads to the novel transcriptome reference using minimap2 v2.17-r941
''samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq transcriptome.fa - |
samtools sort > align_2.bam &&
samtools index align_2.bam
6 - group the detected UMIs and mark the molecules with the same UMI using UMI-Tools
'umi_tools group --stdin= align_2.bam --buffer-whole-contig --no-sort-output --per-gene --gene-tag XG --extract-umi-method tag --umi-tag ZU --group-out=out.tsv --output-bam
here it is throwing an error with ''missing umi and/or cell tag from alligned reads''
could you please tell, why it's missing UMI from the longbow output files,
Thanks
'
The text was updated successfully, but these errors were encountered: