Reads best mapped to decoys not documented in --writeUnmappedNames output file #748

Closed
taylorreiter opened this issue Feb 4, 2022 · 5 comments

@taylorreiter

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
Salmon

Describe the bug
Every read in the aux_info/unmapped_names.txt file is labelled u (unmapped), even when the log file indicates that many fragments were discarded because they are best-mapped to decoys.

To Reproduce

This GitHub repository details the motivation and full workflow of my pipeline: https://github.com/greenelab/2022-microberna/

To get the read files, I ran:

rule rnaseq_sample_download:
    output:
        reads="outputs/rnaseq_fastp/{sra}.fq.gz",
        json = "outputs/rnaseq_fastp/{sra}.fastp.json",
        html = "outputs/rnaseq_fastp/{sra}.fastp.html"
    params: tmp_base = lambda wildcards: "inputs/tmp_raw/" + wildcards.sra
    threads: 1
    resources:
        mem_mb=8000
    run:
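        # NOTE: 'm' is assumed to be a pandas DataFrame of run metadata (experiment_accession -> fastq_ftp)
        # loaded near the top of the Snakefile; it is not defined in this rule.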
        row = m.loc[m['experiment_accession'] == wildcards.sra]
        fastqs = row['fastq_ftp'].values[0]
        fastqs = fastqs.split(";")
        if len(fastqs) == 1:
            # single end data; download and stream directly to fastp for trimming.
            fastq = fastqs[0]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + ".fastq.gz"):
                shell("wget -O {params.tmp_base}.fastq.gz ftp://{fastq}")

            shell("fastp -i {params.tmp_base}.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + ".fastq.gz")

        else:
            # paired end data; download both files, interleave, and then remove files
            fastq_1 = fastqs[0]
            fastq_2 = fastqs[1]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + "_1.fastq.gz"):
                shell("wget -O {params.tmp_base}_1.fastq.gz ftp://{fastq_1}")

            if not os.path.exists(params.tmp_base + "_2.fastq.gz"):
                shell("wget -O {params.tmp_base}_2.fastq.gz ftp://{fastq_2}")

            shell("fastp -i {params.tmp_base}_1.fastq.gz -I {params.tmp_base}_2.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + "_1.fastq.gz")
                os.remove(params.tmp_base + "_2.fastq.gz")

However, I think it would be fine to reproduce from untrimmed files:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR436/001/ERR4363141/ERR4363141.fastq.gz

I generated the transcriptome by annotating all publicly available genomes for my species (Faecalibacterium prausnitzii_C). Using these annotations, I cut out coding domain sequences and nc/r/tRNAs and clustered the sequences at 95% identity. Then, I took the complement of these sequences (all the leftover intergenic stuff) and designated these as decoys.
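
Conceptually, the complement step is the sort of thing bedtools can do; here is a rough sketch with placeholder file names (not the exact commands from my pipeline, and it omits the clustering step), assuming a sorted BED of the annotated features for each genome:

samtools faidx genome.fa                                                             # index the genome to get contig lengths
cut -f1,2 genome.fa.fai > genome.chrom.sizes                                         # chromosome-size file for bedtools
bedtools complement -i features.sorted.bed -g genome.chrom.sizes > intergenic.bed    # everything not covered by an annotated feature
bedtools getfasta -fi genome.fa -bed intergenic.bed -fo intergenic.fa                # extract the intergenic (decoy) sequences
grep "^>" intergenic.fa | sed 's/^>//' > intergenic_seq_names.txt                    # decoy names for salmon index --decoys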

I indexed the transcriptome with:

rule index_transcriptome:
    input:
        seqs=ancient("outputs/gtdb_genomes_salmon_ref/{gtdb_species}.fa"),
        decoys=ancient("outputs/gtdb_genomes_intergenic_comb/{gtdb_species}_clustered_intergenic_seq_names.txt")
    output: "outputs/gtdb_genomes_salmon_index/{gtdb_species}/info.json"
    threads: 1
    params: index_dir = lambda wildcards: "outputs/gtdb_genomes_salmon_index/" + wildcards.gtdb_species
    resources:
        mem_mb=16000,
        tmpdir=TMPDIR
    conda: "envs/salmon.yml"
    benchmark:"benchmarks/salmon_index_{gtdb_species}.txt"
    shell:'''
    salmon index -t {input.seqs} -i {params.index_dir} --decoys {input.decoys} -k 31
    '''
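
The seqs/decoys pair here presumably follows salmon's standard decoy-aware layout: the fasta passed to -t contains the transcript sequences followed by the decoy (intergenic) sequences, and the --decoys file lists just the decoy names, one per line. A minimal sketch of how that pair is typically assembled (placeholder file names):

cat transcripts.fa clustered_intergenic.fa > gtdb_species.fa                                               # decoy sequences must come after the transcripts
grep "^>" clustered_intergenic.fa | sed 's/^>//' | cut -d " " -f 1 > clustered_intergenic_seq_names.txt    # decoy names only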

I've attached my reference transcriptome and my file of decoy names at the bottom of this issue.

Details --

  • Which version of salmon was used?: 1.6.0
  • How was salmon installed (compiled, downloaded executable, through bioconda)?: Miniconda
  • Which reference (e.g. transcriptome) was used?: Self-generated
  • Which read files were used?: ERX4307280, SRX10245671, SRX3847835
  • Which program options were used?
salmon quant -i {params.index_dir} -l A -r {input.reads} -o {params.out_dir} --validateMappings --writeUnmappedNames

Expected behavior
I expected the reads that were counted in the log file as "discarded because they are best-mapped to decoys" to be labelled in the aux_info/unmapped_names.txt file with d, but all reads were marked as u.
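
A quick way to tally the status codes (the last field of each line in unmapped_names.txt) is something like:

awk '{print $NF}' ERX4307280_quant/aux_info/unmapped_names.txt | sort | uniq -c    # count reads per unmapped code

In my case this only ever reports u.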

Screenshots

$ grep "Number of fragments discarded because they are best-mapped to decoys" */logs/*

ERX4307280_quant/logs/salmon_quant.log:[2022-02-02 15:51:25.854] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 948,850
SRX10245671_quant/logs/salmon_quant.log:[2022-02-02 15:57:10.321] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 961,758
SRX3847835_quant/logs/salmon_quant.log:[2022-02-02 15:33:54.185] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 220,351

Desktop (please complete the following information):

  • OS: Linux
  • Version:
Linux farm.cse.ucdavis.edu 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic

Additional context
I intentionally mapped all three libraries as SE, even though two are PE. Because of the presence of polycistronic transcripts in microbes, many paired-end reads would be discordant, which causes counts to look very...odd. See this preprint for more details on that phenomenon.

I'm trying to use the decoys as a first step in identifying reads that map to intergenic sequences, where reads might span two coding domain sequences, or land in the intergenic sequences between two coding domain sequences. (#52)
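
Concretely, once decoy-mapped reads are labelled with their own code (say d), the downstream step would be something along these lines (seqtk is just one option; file names are placeholders):

awk '$NF == "d" { print $1 }' aux_info/unmapped_names.txt > decoy_read_names.txt    # names of reads that best-mapped to a decoy
seqtk subseq reads.fq.gz decoy_read_names.txt > decoy_reads.fq                      # pull those reads out for a second pass over the intergenic sequences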

Decoy names: s__Faecalibacterium_prausnitzii_C_clustered_intergenic_seq_names.txt.gz
Transcriptome: s__Faecalibacterium_prausnitzii_C.fa.gz

@rob-p
Collaborator

rob-p commented Apr 13, 2022

Hi @taylorreiter,

First things first: this sat for way too long before I got to it, so apologies for that!

So, the reason these reads don't show up in the unmapped_names.txt file is that reads mapped to decoys occupy a sort of "no man's land" with respect to their mapping status. That is, they do map to the index, just not to a valid target within the index.

In practice, this means that if you write a BAM output from salmon, the decoy-aligned reads will actually show up there, with the information about the decoy to which they are aligned. This is because they are mapped to something in the index; it just happens to be a decoy rather than a "valid target".

However, the output BAM files are big, so I absolutely understand the desire to have these reads appear in the unmapped names list as well; it's a much smaller and easier file to go through. I think the right way to handle this would be to add a specific code/category to the set of codes used in unmapped_names.txt, designating that a read mapped best to a decoy (rather than being completely unmapped to the index). This shouldn't be too hard to do; I will try to find a few cycles to implement and test it.
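
In the meantime, one way to recover the decoy-mapped read names via the alignment route would be something along these lines (untested sketch; assumes the SAM output from --writeMappings and the same decoy-names file you passed to salmon index):

salmon quant -i index_dir -l A -r reads.fq.gz -o quant_dir --writeMappings=mappings.sam
# keep read names whose mapping target (SAM column 3) is one of the decoy names
awk 'NR == FNR { decoy[$1]; next } !/^@/ && ($3 in decoy) { print $1 }' decoy_names.txt mappings.sam | sort -u > decoy_mapped_read_names.txt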

Best,
Rob

@taylorreiter
Author

Ah, that makes sense, thank you for the explanation! Having a specific code/category in the set of unmapped codes used in the unmapped_names.txt file would be absolutely wonderful, and it would give me all of the information I need to pursue my downstream use cases.

@rob-p rob-p self-assigned this May 26, 2022
rob-p pushed a commit that referenced this issue Jun 3, 2022
In single-end mode, *all* unmapped reads were
being reported with the code 'u'. This fixes
the output so the proper code is reported (e.g. 'd')
for decoys. This addresses #748.
@rob-p
Collaborator

rob-p commented Jun 3, 2022

Hi @taylorreiter,

I was wrong: there was simply a bug where, in single-end mode, everything was being written out with the u code. This is now fixed in develop and will be in the next release. Sorry about that!
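
Once you re-run with the fixed build, a quick sanity check is that the number of d entries in unmapped_names.txt should roughly match the decoy-discard count reported in the log, e.g.:

awk '$NF == "d"' ERX4307280_quant/aux_info/unmapped_names.txt | wc -l    # should be close to the 948,850 reported in the log for this library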

--Rob

@taylorreiter
Author

Exciting! Thank you @rob-p!!

@rob-p
Collaborator

rob-p commented Jun 23, 2022

Hi @taylorreiter,

This should now be fixed in v1.9.0, which was just released 🎉. Let us know if it works for you.

--Rob

@rob-p rob-p closed this as completed Jun 23, 2022