Reads best mapped to decoys not documented in --writeUnmappedNames output file #748
Hi @taylorreiter, First thing first: this sat for way too long before I got to it, so apologies for that! The reason they don’t show up in the unmapped.txt file is that reads mapped to decoys occupy a sort of “no man’s land” with respect to their mapping status. That is, they do map to the index, but just not to a valid target within the index. In other words, if you write a BAM output from [...] However, the output BAM files are big, so I absolutely understand the desire to have them appear in the unmapped names list as well; it's a much smaller and easier thing to go through. I think the right way to handle this would be to add a specific code/category to the set of unmapped codes used in the [...] Best,
Ah, that makes sense, thank you for the explanation! Adding a specific code/category to the set of unmapped codes used in the unmapped.txt file would be absolutely wonderful, and would give me all of the information I need to pursue my downstream use cases.
In single-end mode, *all* unmapped reads were being reported with the code 'u'. This fixes the output so the proper code is reported (e.g. 'd') for decoys. This addresses #748.
Hi @taylorreiter, I was wrong: there was simply a bug where, in single-end mode, everything was being written out with the [...] --Rob
Exciting! Thank you @rob-p!!
Hi @taylorreiter, This should now be fixed in v1.9.0, which was just released 🎉. Let us know if it works for you. --Rob
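For anyone wanting to check the fix on their own run, here is a minimal sketch that tallies the codes in the unmapped names file. It assumes each line holds a read name followed by a whitespace-separated code (such as u or d); the function name and the example path are illustrative and not part of salmon.

```python
from collections import Counter


def tally_unmapped_codes(lines):
    """Count how often each unmapped code appears.

    Assumption: each non-empty line is '<read name> <code>', and the
    code is the last whitespace-separated field on the line.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[-1]] += 1
    return counts


# Hypothetical usage (the path depends on your own output directory):
# with open("salmon_out/aux_info/unmapped_names.txt") as handle:
#     print(tally_unmapped_codes(handle))
```

If the fix is in effect for a single-end run with decoys in the index, the tally should contain a d entry in addition to u whenever the log reports fragments discarded as best-mapped to decoys.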
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
Salmon
Describe the bug
Only u (unmapped) reads are labelled in the aux_info/unmapped_names.txt file, even when the log file indicates that many fragments were discarded because they are best-mapped to decoys.
To Reproduce
This GitHub repository details the motivation and full workflow of my pipeline: https://github.com/greenelab/2022-microberna/
To get the read files, I ran:
However, I think it would be fine to reproduce from untrimmed files:
I generated the transcriptome by annotating all publicly available genomes for my species (Faecalibacterium prausnitzii_C). Using these annotations, I cut out coding domain sequences and nc/r/tRNAs and clustered the sequences at 95% identity. Then, I took the complement of these sequences (all the leftover intergenic stuff) and designated these as decoys.
I indexed the transcriptome with:
I've attached my reference transcriptome and my file of decoy names at the bottom of this issue
Expected behavior
I expected the reads that were counted in the log file as "discarded because they are best-mapped to decoys" to be labelled in the aux_info/unmapped_names.txt file with d, but all reads were marked as u.
Screenshots
Desktop (please complete the following information):
Additional context
I intentionally mapped all three libraries as SE, even though two are PE. Because of the presence of polycistronic transcripts in microbes, many paired-end reads would be discordant, which causes counts to look very...odd. See this preprint for more details on that phenomenon.
I'm trying to use the decoys as a first step in identifying reads that map to intergenic sequences, where reads might span two coding domain sequences, or land in the intergenic sequences between two coding domain sequences. (#52)
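As a rough sketch of that first step, the snippet below pulls out the names of reads flagged with the decoy code so they can be followed up separately. It assumes each line of the unmapped names file is a read name followed by a whitespace-separated code, with d marking decoy-mapped reads; the function name and the example path are hypothetical.

```python
def decoy_read_names(lines, decoy_code="d"):
    """Return the names of reads whose code equals decoy_code.

    Assumption: each non-empty line is '<read name> <code>'; the name is
    the first field and the code is the last field.
    """
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[-1] == decoy_code:
            names.append(fields[0])
    return names


# Hypothetical usage (adjust the path to your own output directory):
# with open("salmon_out/aux_info/unmapped_names.txt") as handle:
#     decoy_names = decoy_read_names(handle)
```

The resulting list can then be used to subset the original read files with whatever tool is already part of the pipeline.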
Decoy names: s__Faecalibacterium_prausnitzii_C_clustered_intergenic_seq_names.txt.gz
Transcriptome: s__Faecalibacterium_prausnitzii_C.fa.gz