Reads best mapped to decoys not documented in `--writeUnmappedNames` output file #748

taylorreiter · 2022-02-04T20:13:58Z

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
Salmon

Describe the bug
Only u (unmapped) reads are labelled in the aux_info/unmapped.txt file, even when the log file indicates that many fragments were discarded because they are best-mapped to decoys.

To Reproduce

This GitHub repository details the motivation and full workflow of my pipeline: https://github.com/greenelab/2022-microberna/

To get the read files, I ran:

rule rnaseq_sample_download:
    output:
        reads="outputs/rnaseq_fastp/{sra}.fq.gz",
        json = "outputs/rnaseq_fastp/{sra}.fastp.json",
        html = "outputs/rnaseq_fastp/{sra}.fastp.html"
    params: tmp_base = lambda wildcards: "inputs/tmp_raw/" + wildcards.sra
    threads: 1
    resources:
        mem_mb=8000
    run:
        row = m.loc[m['experiment_accession'] == wildcards.sra]
        fastqs = row['fastq_ftp'].values[0]
        fastqs = fastqs.split(";")
        if len(fastqs) == 1:
            # single end data; download and stream directly to fastp for trimming.
            fastq = fastqs[0]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + ".fastq.gz"):
                shell("wget -O {params.tmp_base}.fastq.gz ftp://{fastq}")

            shell("fastp -i {params.tmp_base}.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + ".fastq.gz")

        else:
            # paired end data; download both files, interleave, and then remove files
            fastq_1 = fastqs[0]
            fastq_2 = fastqs[1]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + "_1.fastq.gz"):
                shell("wget -O {params.tmp_base}_1.fastq.gz ftp://{fastq_1}")

            if not os.path.exists(params.tmp_base + "_2.fastq.gz"):
                shell("wget -O {params.tmp_base}_2.fastq.gz ftp://{fastq_2}")

            shell("fastp -i {params.tmp_base}_1.fastq.gz -I {params.tmp_base}_2.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + "_1.fastq.gz")
                os.remove(params.tmp_base + "_2.fastq.gz")

However, I think it would be fine to reproduce from untrimmed files:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR436/001/ERR4363141/ERR4363141.fastq.gz

I generated the transcriptome by annotating all publicly available genomes for my species (Faecalibacterium prausnitzii_C). Using these annotations, I cut out coding domain sequences and nc/r/tRNAs and clustered the sequences at 95% identity. Then, I took the complement of these sequences (all the left over intergenic stuff) and designated these as decoys.

I indexed the transcriptome with:

rule index_transcriptome:
    input:
        seqs=ancient("outputs/gtdb_genomes_salmon_ref/{gtdb_species}.fa"),
        decoys=ancient("outputs/gtdb_genomes_intergenic_comb/{gtdb_species}_clustered_intergenic_seq_names.txt")
    output: "outputs/gtdb_genomes_salmon_index/{gtdb_species}/info.json"
    threads: 1
    params: index_dir = lambda wildcards: "outputs/gtdb_genomes_salmon_index/" + wildcards.gtdb_species
    resources:
        mem_mb=16000,
        tmpdir=TMPDIR
    conda: "envs/salmon.yml"
    benchmark:"benchmarks/salmon_index_{gtdb_species}.txt"
    shell:'''
    salmon index -t {input.seqs} -i {params.index_dir} --decoys {input.decoys} -k 31
    '''

I've attached my reference transcriptome and my file of decoy names at the bottom of this issue

Details --

Which version of salmon was used?: 1.6.0
How was salmon installed (compiled, downloaded executable, through bioconda)?: Miniconda
Which reference (e.g. transcriptome) was used?: Self-generated
Which read files were used?: ERX4307280, SRX10245671, SRX3847835
Which which program options were used?

salmon quant -i {params.index_dir} -l A -r {input.reads} -o {params.out_dir} --validateMappings --writeUnmappedNames

Expected behavior
I expected the reads that were counted in the log file as "discarded because they are best-mapped to decoys" to be labelled in the aux_info/unmapped_names.txt file with d, but all reads were marked as u.

Screenshots

$grep "Number of fragments discarded because they are best-mapped to decoys" */logs/*

ERX4307280_quant/logs/salmon_quant.log:[2022-02-02 15:51:25.854] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 948,850
SRX10245671_quant/logs/salmon_quant.log:[2022-02-02 15:57:10.321] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 961,758
SRX3847835_quant/logs/salmon_quant.log:[2022-02-02 15:33:54.185] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 220,351

Desktop (please complete the following information):

OS: Linux
Version:

Linux farm.cse.ucdavis.edu 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic

Additional context
I intentionally mapped all three libraries as SE, even though two are PE. Because of the presence of polycistronic transcripts in microbes, many paired-end reads would be discordant, which causes counts to look very...odd. See this preprint for more details on that phenomenon.

I'm trying to use the decoys as a first step in identifying reads that map to intergenic sequences, where reads might span two coding domain sequences, or land in the intergenic sequences between two coding domain sequenes. (#52)

Decoy names: s__Faecalibacterium_prausnitzii_C_clustered_intergenic_seq_names.txt.gz
Transcriptome: s__Faecalibacterium_prausnitzii_C.fa.gz

The text was updated successfully, but these errors were encountered:

rob-p · 2022-04-13T19:53:00Z

Hi @taylorreiter,

First thing first — this sat for way too long before I got to it, so apologies for that!

So, the reason that they don’t show up in the unmapped.txt file is that is that reads mapped to decoys occupy a sort of “no man’s land” with respect to their mapping status. That is, they do map to the index, but just not to a valid target within the index.

In other words, if you write a BAM output from salmon, the decoy aligned reads will actually show up there, with the information about the decoy to which they are aligned. This is because they are mapped to something in the index, it just happens to be a decoy rather than a “valid target”.

However, the output BAM files are big, so I absolutely understand the desire to have them appear in the unmapped names list as well — it's a much smaller and easier thing to go through. I think the right thing to handle this would be to add a specific code/category to the set of unmapped codes used in the unmapped.txt file, to designate this is read best mapped to a decoy (rather than that this read is completely unmapped to the index). This shouldn't be too hard to do — I will try to find a few cycles to implement and test it.

Best,
Rob

taylorreiter · 2022-04-13T19:56:49Z

Ah that makes sense, thank you for the explanation! Having a specific code/category to the set of unmapped codes used in the unmapped.txt file would be absolutely wonderful, and would give me all of the information I need to pursue my downstream use cases.

In single-end mode, *all* unmapped reads were being reported with the code 'u'. This fixes the output so the proper code is reported (e.g. 'd') for decoys. This addresses #748.

rob-p · 2022-06-03T17:33:34Z

Hi @taylorreiter,

I was wrong — there was simply a bug that, in single end mode, everything was being written out with the u flag. This is now fixed in develop. It will be in the next release. Sorry about that!

--Rob

taylorreiter · 2022-06-03T20:50:11Z

Exciting! Thank you @rob-p!!

rob-p · 2022-06-23T18:56:51Z

Hi @taylorreiter,

This should now be fixed in v1.9.0 which was just released 🎉 . Let us know if it works for you.

--Rob

rob-p added the enhancement label May 26, 2022

rob-p self-assigned this May 26, 2022

rob-p closed this as completed Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reads best mapped to decoys not documented in `--writeUnmappedNames` output file #748

Reads best mapped to decoys not documented in `--writeUnmappedNames` output file #748

taylorreiter commented Feb 4, 2022

rob-p commented Apr 13, 2022

taylorreiter commented Apr 13, 2022

rob-p commented Jun 3, 2022

taylorreiter commented Jun 3, 2022

rob-p commented Jun 23, 2022

Reads best mapped to decoys not documented in --writeUnmappedNames output file #748

Reads best mapped to decoys not documented in --writeUnmappedNames output file #748

Comments

taylorreiter commented Feb 4, 2022

rob-p commented Apr 13, 2022

taylorreiter commented Apr 13, 2022

rob-p commented Jun 3, 2022

taylorreiter commented Jun 3, 2022

rob-p commented Jun 23, 2022

Reads best mapped to decoys not documented in `--writeUnmappedNames` output file #748

Reads best mapped to decoys not documented in `--writeUnmappedNames` output file #748