Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Salmon quant picks the wrong duplicate to keep #249

Closed
afkoeppel opened this issue Jul 11, 2018 · 6 comments
Closed

Salmon quant picks the wrong duplicate to keep #249

afkoeppel opened this issue Jul 11, 2018 · 6 comments

Comments

@afkoeppel
Copy link

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?

This issue is with salmon (bulk mode).

Describe the bug

The grch38 transcriptome has transcripts that are identical in sequence, but annotated differently. Salmon is collapsing the duplicate transcripts to one transcript (as it should).
The problem is that (at least for some transcripts) we're getting abundance estimates for the copy of a transcript that's on a contig, but not estimates for the copy on the regular chromosome.
In short, it's picking the wrong duplicate.

What we're after is a way to get it to prioritize the copies on chromosomes over the copies on contigs.

To Reproduce

  • Salmon version 0.10.2
  • Salmon installed through bioconda.
  • Reference grch38. All cdnas and ncrnas.
  • Reads were single-end
  • Program options: salmon quant -p 48 --seqBias --gcBias --biasSpeedSamp 5 -l A

Expected behavior
That the selected duplicate reference transcript would be the one on the chromosome, rather than the contig.
In particular, the LTA gene is of interest to us. In grch38, the two ids are ENSG00000226979 (on chr6) and ENSG00000231408 (on CHR_HSCHR6_MHC_MANN_CTG1). Each of the transcripts of these two genes has an identical copy in the other gene. The transcripts for both genes appear in the duplicate_clusters.tsv file in the salmon index directory.
When we quant, only transcripts of ENSG00000231408 appear in the output. We would rather that ENSG00000226979 be the one to appear.

What is the best way forward to accomplish this? Should we just remove the contig transcripts from the reference?

Desktop:

  • OS: Centos Linux
  • uname -a: 3.10.0-862.2.3.el7.x86_64 Add a Gitter chat badge to README.md #1 SMP Wed May 9 18:05:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • lsb_release -a: LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
    Distributor ID: CentOS
    Description: CentOS Linux release 7.5.1804 (Core)
    Release: 7.5.1804
    Codename: Core
@rob-p
Copy link
Collaborator

rob-p commented Jul 11, 2018

Hi @afkoeppel,

Thanks for reporting this. While I completely understand that this is not the desired behavior, it is the expected behavior. That is, when collapsing duplicates during indexing, the invariant that is maintained is that all sequence-identical transcripts will be "collapsed" and a single representative maintained. In practice (i.e. in implementation), the transcript that is maintained is the first one encountered.

In the short-term, your suggested solution is the best. That is you can simply remove the non-canonical transcripts from the input fasta file. Alternatively, you could re-order the entries in the file so that the canonical transcript occurs first. In the longer term, I'd be happy to implement a de-duplication procedure that is more aware of the semantics of the transcript names. However, I'd want to figure out how to do this that is agnostic to where the transcripts are coming from (i.e. that doesn't work only for Gencode, human etc., and that does something sensible when the transcripts are e.g. de novo assembled contigs). On that front, I'm completely open to suggestions.

Best,
Rob

@afkoeppel
Copy link
Author

Thanks so much for the prompt response. We will either remove the contigs from our reference, or re-order the file such that the contigs appear last.
I have no suggestions off the top of my head for making this work via transcript name semantics, (at least not that would work for any version of any reference).

@stephenturner
Copy link

also related thelovelab/tximeta#7. @mikelove suggested using gencode which will solve this problem. some example code for generating index, tx2gene, etc. here https://f1000research.com/articles/7-952/v1 @afkoeppel

@mikelove
Copy link
Collaborator

Use Gencode is the simple solution. Also Gencode combines coding and non-coding while Ensembl has these as two FASTA files.

What I'm thinking is that, tximeta can solve this for Ensembl post-hoc by simply renaming the duplicate transcripts after looking up the name of the transcript from the standard chromosome. it won't disrupt the function of tximeta because we index the sequence of the txome which doesn't care about the names of the duplicate transcripts

@rob-p rob-p closed this as completed Dec 6, 2018
@mikelove
Copy link
Collaborator

mikelove commented Dec 6, 2018

I’m working on it, and there’s an issue on tximeta GH page

@mikelove
Copy link
Collaborator

mikelove commented Jan 6, 2019

I’ve taken a shot at the tximeta implementation:

thelovelab/tximeta@e29c460

Use cleanDuplicateTxps=TRUE. It’s in the devel branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants