-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Salmon quant picks the wrong duplicate to keep #249
Comments
Hi @afkoeppel, Thanks for reporting this. While I completely understand that this is not the desired behavior, it is the expected behavior. That is, when collapsing duplicates during indexing, the invariant that is maintained is that all sequence-identical transcripts will be "collapsed" and a single representative maintained. In practice (i.e. in implementation), the transcript that is maintained is the first one encountered. In the short-term, your suggested solution is the best. That is you can simply remove the non-canonical transcripts from the input fasta file. Alternatively, you could re-order the entries in the file so that the canonical transcript occurs first. In the longer term, I'd be happy to implement a de-duplication procedure that is more aware of the semantics of the transcript names. However, I'd want to figure out how to do this that is agnostic to where the transcripts are coming from (i.e. that doesn't work only for Gencode, human etc., and that does something sensible when the transcripts are e.g. de novo assembled contigs). On that front, I'm completely open to suggestions. Best, |
Thanks so much for the prompt response. We will either remove the contigs from our reference, or re-order the file such that the contigs appear last. |
also related thelovelab/tximeta#7. @mikelove suggested using gencode which will solve this problem. some example code for generating index, tx2gene, etc. here https://f1000research.com/articles/7-952/v1 @afkoeppel |
Use Gencode is the simple solution. Also Gencode combines coding and non-coding while Ensembl has these as two FASTA files. What I'm thinking is that, |
I’m working on it, and there’s an issue on tximeta GH page |
I’ve taken a shot at the tximeta implementation: Use cleanDuplicateTxps=TRUE. It’s in the devel branch |
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
This issue is with salmon (bulk mode).
Describe the bug
The grch38 transcriptome has transcripts that are identical in sequence, but annotated differently. Salmon is collapsing the duplicate transcripts to one transcript (as it should).
The problem is that (at least for some transcripts) we're getting abundance estimates for the copy of a transcript that's on a contig, but not estimates for the copy on the regular chromosome.
In short, it's picking the wrong duplicate.
What we're after is a way to get it to prioritize the copies on chromosomes over the copies on contigs.
To Reproduce
Expected behavior
That the selected duplicate reference transcript would be the one on the chromosome, rather than the contig.
In particular, the LTA gene is of interest to us. In grch38, the two ids are ENSG00000226979 (on chr6) and ENSG00000231408 (on CHR_HSCHR6_MHC_MANN_CTG1). Each of the transcripts of these two genes has an identical copy in the other gene. The transcripts for both genes appear in the duplicate_clusters.tsv file in the salmon index directory.
When we quant, only transcripts of ENSG00000231408 appear in the output. We would rather that ENSG00000226979 be the one to appear.
What is the best way forward to accomplish this? Should we just remove the contig transcripts from the reference?
Desktop:
uname -a
: 3.10.0-862.2.3.el7.x86_64 Add a Gitter chat badge to README.md #1 SMP Wed May 9 18:05:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linuxlsb_release -a
: LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarchDistributor ID: CentOS
Description: CentOS Linux release 7.5.1804 (Core)
Release: 7.5.1804
Codename: Core
The text was updated successfully, but these errors were encountered: