Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alevin: incorrect indices in MTX output #431

Closed
pinin4fjords opened this issue Oct 1, 2019 · 18 comments
Closed

Alevin: incorrect indices in MTX output #431

pinin4fjords opened this issue Oct 1, 2019 · 18 comments
Assignees
Labels
alevin issue is primarily related to alevin bug fixed in develop this bug has been fixed in develop and the issue will be closed when merged into master

Comments

@pinin4fjords
Copy link

pinin4fjords commented Oct 1, 2019

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?

Alevin

Describe the bug

I have an Alevin run which indicates 17136 genes in the MTX output:

%%MatrixMarket	matrix	coordinate	real	general
667389	17136	13000897
1	4	0.000000
1	5	0.000000
1	9	0.000000
1	11	0.000000
1	17	0.000000
...

quants_mat_cols.txt also has 17136 rows.

However the maximum value in the second column of the MTX is out range at 17144, so the matrix is invalid. There are thousands of matrix cells like this- not just one dodgy one.

To Reproduce
Steps and data to reproduce the behavior:

  • Version 0.14.1
  • Bioconda
  • Gallus Gallus cDNA from Ensembl, file Gallus_gallus.GRCg6a.cdna.all.98.fa.gz
  • These files from ENA, analysed together:
SAMN11526602	SRR8985046	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/006/SRR8985046/SRR8985046_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/006/SRR8985046/SRR8985046_2.fastq.gz
SAMN11526602	SRR8985045	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/005/SRR8985045/SRR8985045_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/005/SRR8985045/SRR8985045_2.fastq.gz
SAMN11526602	SRR8985044	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/004/SRR8985044/SRR8985044_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/004/SRR8985044/SRR8985044_2.fastq.gz
SAMN11526602	SRR8985043	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/003/SRR8985043/SRR8985043_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/003/SRR8985043/SRR8985043_2.fastq.gz
SAMN11526602	SRR8985042	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/002/SRR8985042/SRR8985042_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/002/SRR8985042/SRR8985042_2.fastq.gz
SAMN11526602	SRR8985041	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/001/SRR8985041/SRR8985041_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/001/SRR8985041/SRR8985041_2.fastq.gz
SAMN11526602	SRR8985040	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/000/SRR8985040/SRR8985040_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/000/SRR8985040/SRR8985040_2.fastq.gz
SAMN11526602	SRR8985039	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/009/SRR8985039/SRR8985039_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR898/009/SRR8985039/SRR8985039_2.fastq.gz

(renamed to cdna*.fastq.gz and barcodes.fastq.gz internal to workflow)

  • Which which program options were used?
# Do a pre-run to derive a starting whitelist, see https://github.com/COMBINE-lab/salmon/issues/362

salmon alevin -l ISR -1 $(ls barcodes*.fastq.gz | tr '\n' ' ') -2 $(ls cdna*.fastq.gz | tr '\n' ' ')         --chromium -i salmon_index -p 12 -o SAMN11526602_pre --tgMap transcript_to_gene.txt --dumpFeatures --noQuant

# Derive a relaxed whitelist, removing only the most obvious junk 

if [ $? -eq 0 ]; then 
    awk '{ if ($2 > 10) { print $1} }' SAMN11526602_pre/alevin/raw_cb_frequency.txt > pre_whitelist.txt
fi

# Supply the whitelist to the main Alevin run

salmon alevin -l ISR -1 $(ls barcodes*.fastq.gz | tr '\n' ' ') -2 $(ls cdna*.fastq.gz | tr '\n' ' ')         --chromium -i salmon_index -p 12 -o SAMN11526602_tmp --tgMap transcript_to_gene.txt --whitelist pre_whitelist.txt         --forceCells $(cat pre_whitelist.txt | wc -l | tr -d '\n') --dumpMtx

Expected behavior
MTX files without references to out-of-range columns.

Screenshots
If applicable, add screenshots or terminal output to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. Ubuntu Linux, OSX]
  • Version [ If you are on OSX, the output of sw_vers. If you are on linux the output of uname -a and lsb_release -a]
Linux hx-noah-63-15.ebi.ac.uk 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 13 10:46:25 EDT 2017 x86_64 GNU/Linux

LSB Version:	:core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	RedHatEnterpriseServer
Description:	Red Hat Enterprise Linux Server release 7.4 (Maipo)
Release:	7.4
Codename:	Maipo

Additional context
Add any other context about the problem here.

@k3yavi k3yavi self-assigned this Oct 1, 2019
@k3yavi k3yavi added alevin issue is primarily related to alevin bug labels Oct 1, 2019
@pinin4fjords
Copy link
Author

Apologies- not quite true that I use the above reference. I actually process the above cDNAs and remove any (if any) that don't have mappings in the associated GTF (from which I derived the transcript-to-gene mappings). Here's the R code:

#!/usr/bin/env Rscript

suppressPackageStartupMessages(library(Biostrings))
suppressPackageStartupMessages(library(rtracklayer))

cl <- commandArgs(trailingOnly = TRUE)

cdna_file <- cl[1]
gtf_file <- cl[2]
transcript_to_gene_file <- cl[3]
filtered_cdna_file <- cl[4]

# Derive a transcript-to-gene mapping from the GTF

annotation <- elementMetadata(import(gtf_file))
annotation <- subset(annotation, ! (is.na(transcript_id) | is.na(gene_id)))

if ('transcript_version' %in% colnames(annotation)){
  mapping <- cbind(paste(annotation$transcript_id, annotation$transcript_version, sep='.'), annotation$gene_id)
}else{
  mapping <- annotation[,c('transcript_id', 'gene_id')]
}

# Read the cDNA

cdna <- readDNAStringSet(cdna_file)
cdna_transcript_names <- unlist(lapply(names(cdna), function(x) unlist(strsplit(x, ' '))[1]  ))

# Filter out cDNAs without matching transcript entries in the GTF

cdna <- cdna[which(cdna_transcript_names %in% mapping[,1])]

# Write transcript-to-gene mapping and filtered cDNA file

write.table(unique(mapping), file= transcript_to_gene_file, row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
writeXStringSet(x = cdna, filepath = filtered_cdna_file, compress = 'gzip')

@k3yavi k3yavi added the fixed in develop this bug has been fixed in develop and the issue will be closed when merged into master label Oct 1, 2019
@pinin4fjords
Copy link
Author

Already fixed @k3yavi ? Is there any chance of a 0.14.2 release containing the fix?

@k3yavi
Copy link
Member

k3yavi commented Oct 1, 2019

HI @pinin4fjords ,

Thanks for reporting the this. There was bug associated with binary format to mtx format conversion we fixed in the upcoming release. The problem was associated with the last index of the matrix which can be off by max 8 indices because we were using uint_8 for storing the bit vectors.
My apologies for the trouble, there are two ways to solve this issue:
1.) Rerun the pipeline with the latest beta release of 0.99, unfortunately this is not in conda yet, but we plan to release it soon with awesome new updates like consuming both genome and transcriptome for read mapping.
2.) If you wan't to avoid rerunning the full pipeline, try this. Try cloning this repo and do a cargo install --release for the code in src-rs folder. Note: You might have to install Rust for this, it's just one liner install. Once compiled the EDS code, you can just do the following to generate the correct mtx file.

./target/release/eds convert -i <Path to output/alevin/quants_mat.gz> --mtx -c <num_cells> -f <num_genes>

Please let me know if it works out for you.

@pinin4fjords
Copy link
Author

Thanks @k3yavi - do I need to assume that all of our existing matrices could be corrupted?

We've built Alevin into our production processes, so I'm loath to hack my way to a solution. When is the 0.99 release due?

@pinin4fjords
Copy link
Author

A back port of this fix to a released 0.14.2 would be the most awesome solution ;-)

@k3yavi
Copy link
Member

k3yavi commented Oct 1, 2019

Hi @pinin4fjords ,
The error is very subtle, It happens when the total number of genes are exactly a multiple of 8, which in your case it is. It'd be great that before moving forward, if you can quickly verify if 0.99 actually fixes the issue or is it something else. Unfortunately, there is a chance that the mtx format output is corrupted for the last 8 genes (in the order they present in the quants_mat_cols.txt), however, the baseline output of alevin quants_mat.gz would still be correct and already be in the output folder of your pipeline. The EDS repo I forwarded you just convert the binary output into mtx format correctly.

Due to these subtle issues making into production environment, we changed the release cycle to be like a beta to stable type. Currently 0.99 is a beta release, which has been out for some time. If I have to guess, we plan to release 1.0 (the stable release) in approximately 2 weeks.

@k3yavi
Copy link
Member

k3yavi commented Oct 1, 2019

A back port of this fix to a released 0.14.2 would be the most awesome solution ;-)

Let me take this to the salmon team and get back to you. My apologies again for the stupid bug.

@pinin4fjords
Copy link
Author

Don't apologise, it happens. Just looking for a fix I can apply neatly without a major software change.

@pinin4fjords
Copy link
Author

... but will try to test the 0.99 too

@k3yavi
Copy link
Member

k3yavi commented Oct 1, 2019

Sounds good, let me know if 0.99 fixes it for you. If it does, I'll check if we can do something about 14.2.

@pinin4fjords
Copy link
Author

Hi @k3yavi - confirmed, fixed in 0.99!

@pinin4fjords
Copy link
Author

Could you let me know if a minor release is on the cards (or not) @k3yavi ? Would be nice if at least one Bioconda-available release had the fix.

@k3yavi
Copy link
Member

k3yavi commented Oct 4, 2019

Glad to know, that 0.99 fixes it.
I have a meeting today with the salmon team, I'd definitely let you know before eod.

@k3yavi
Copy link
Member

k3yavi commented Oct 4, 2019

Hi @pinin4fjords ,

I am gonna make the hot-fix release for the 0.14.2 now.
We plan to release v1.0.0 mid October, so with the v0.14.2 I can make just the bioconda and github release from the master branch to fix the mtx issue. Just wanted to give you heads up, I am making 14.2 from master, so the only difference from 14.1 would be the hotfix i.e. there would be no Genome+txome capability as that's planned with 1.0.

I'll keep you updated once it's available through bioconda.

@pinin4fjords
Copy link
Author

Thanks very much @k3yavi that's great

@k3yavi
Copy link
Member

k3yavi commented Oct 4, 2019

Ok, pushed to bioconda, should be available in a couple of hours. Once it's available I'll make the official release too on the github. It'd be great if you can quickly test the new release for the bug once it's available. Thanks again !

@k3yavi
Copy link
Member

k3yavi commented Oct 4, 2019

All right, v0.14.2 is online now.

@pinin4fjords
Copy link
Author

Thanks very much @k3yavi ! 0.14.2 works in my example too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alevin issue is primarily related to alevin bug fixed in develop this bug has been fixed in develop and the issue will be closed when merged into master
Projects
None yet
Development

No branches or pull requests

2 participants