Salmon quant picks the wrong duplicate to keep #249

afkoeppel · 2018-07-11T15:12:31Z

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?

This issue is with salmon (bulk mode).

Describe the bug

The grch38 transcriptome has transcripts that are identical in sequence, but annotated differently. Salmon is collapsing the duplicate transcripts to one transcript (as it should).
The problem is that (at least for some transcripts) we're getting abundance estimates for the copy of a transcript that's on a contig, but not estimates for the copy on the regular chromosome.
In short, it's picking the wrong duplicate.

What we're after is a way to get it to prioritize the copies on chromosomes over the copies on contigs.

To Reproduce

Salmon version 0.10.2
Salmon installed through bioconda.
Reference grch38. All cdnas and ncrnas.
Reads were single-end
Program options: salmon quant -p 48 --seqBias --gcBias --biasSpeedSamp 5 -l A

Expected behavior
That the selected duplicate reference transcript would be the one on the chromosome, rather than the contig.
In particular, the LTA gene is of interest to us. In grch38, the two ids are ENSG00000226979 (on chr6) and ENSG00000231408 (on CHR_HSCHR6_MHC_MANN_CTG1). Each of the transcripts of these two genes has an identical copy in the other gene. The transcripts for both genes appear in the duplicate_clusters.tsv file in the salmon index directory.
When we quant, only transcripts of ENSG00000231408 appear in the output. We would rather that ENSG00000226979 be the one to appear.

What is the best way forward to accomplish this? Should we just remove the contig transcripts from the reference?

Desktop:

OS: Centos Linux
uname -a: 3.10.0-862.2.3.el7.x86_64 Add a Gitter chat badge to README.md #1 SMP Wed May 9 18:05:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
lsb_release -a: LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.5.1804 (Core)
Release: 7.5.1804
Codename: Core

The text was updated successfully, but these errors were encountered:

rob-p · 2018-07-11T15:53:57Z

Hi @afkoeppel,

Thanks for reporting this. While I completely understand that this is not the desired behavior, it is the expected behavior. That is, when collapsing duplicates during indexing, the invariant that is maintained is that all sequence-identical transcripts will be "collapsed" and a single representative maintained. In practice (i.e. in implementation), the transcript that is maintained is the first one encountered.

In the short-term, your suggested solution is the best. That is you can simply remove the non-canonical transcripts from the input fasta file. Alternatively, you could re-order the entries in the file so that the canonical transcript occurs first. In the longer term, I'd be happy to implement a de-duplication procedure that is more aware of the semantics of the transcript names. However, I'd want to figure out how to do this that is agnostic to where the transcripts are coming from (i.e. that doesn't work only for Gencode, human etc., and that does something sensible when the transcripts are e.g. de novo assembled contigs). On that front, I'm completely open to suggestions.

Best,
Rob

afkoeppel · 2018-07-11T17:32:19Z

Thanks so much for the prompt response. We will either remove the contigs from our reference, or re-order the file such that the contigs appear last.
I have no suggestions off the top of my head for making this work via transcript name semantics, (at least not that would work for any version of any reference).

stephenturner · 2018-07-25T17:17:46Z

also related thelovelab/tximeta#7. @mikelove suggested using gencode which will solve this problem. some example code for generating index, tx2gene, etc. here https://f1000research.com/articles/7-952/v1 @afkoeppel

mikelove · 2018-07-25T17:23:25Z

Use Gencode is the simple solution. Also Gencode combines coding and non-coding while Ensembl has these as two FASTA files.

What I'm thinking is that, tximeta can solve this for Ensembl post-hoc by simply renaming the duplicate transcripts after looking up the name of the transcript from the standard chromosome. it won't disrupt the function of tximeta because we index the sequence of the txome which doesn't care about the names of the duplicate transcripts

mikelove · 2018-12-06T19:56:24Z

I’m working on it, and there’s an issue on tximeta GH page

mikelove · 2019-01-06T14:41:45Z

I’ve taken a shot at the tximeta implementation:

thelovelab/tximeta@e29c460

Use cleanDuplicateTxps=TRUE. It’s in the devel branch

stephenturner mentioned this issue Jul 25, 2018

Renaming duplicate transcripts for Ensembl thelovelab/tximeta#7

Closed

rob-p closed this as completed Dec 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Salmon quant picks the wrong duplicate to keep #249

Salmon quant picks the wrong duplicate to keep #249

afkoeppel commented Jul 11, 2018

rob-p commented Jul 11, 2018

afkoeppel commented Jul 11, 2018

stephenturner commented Jul 25, 2018

mikelove commented Jul 25, 2018

mikelove commented Dec 6, 2018

mikelove commented Jan 6, 2019

Salmon quant picks the wrong duplicate to keep #249

Salmon quant picks the wrong duplicate to keep #249

Comments

afkoeppel commented Jul 11, 2018

rob-p commented Jul 11, 2018

afkoeppel commented Jul 11, 2018

stephenturner commented Jul 25, 2018

mikelove commented Jul 25, 2018

mikelove commented Dec 6, 2018

mikelove commented Jan 6, 2019