Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scientific notation for large numbers of reads #371

Closed
map2085 opened this issue Feb 24, 2016 · 20 comments · Fixed by #1031
Closed

scientific notation for large numbers of reads #371

map2085 opened this issue Feb 24, 2016 · 20 comments · Fixed by #1031

Comments

@map2085
Copy link

map2085 commented Feb 24, 2016

command: bedtools genomecov -d -g genome.size -ibam in.bam > out.cov

genome.size consists of a single RNA "myRNA", length = 8 kb nucleotides.

in.bam contains > 15 million reads aligned to "myRNA".

For some positions on the RNA, there is coverage > 1 million. But in the coverage file out.cov, BEDTOOLS uses scientific notation for the coverage at position with > 1million depth.

ex:

...
myRNA 1238 1.06464e+06
myRNA 1239 1.06471e+06
myRNA 1240 1.06482e+06
myRNA 1241 1.06494e+06
...

Can BEDTOOLS please be corrected to output the full, exact integer, NOT in exponent notation?

Thank you

@arq5x
Copy link
Owner

arq5x commented Feb 29, 2016

Yep, I will look into it.

@JYLeeBioinfo
Copy link

JYLeeBioinfo commented Jun 16, 2022

Any update on this issue?

I also think writing large numbers as exact integers is needed, at least as an option.

@camelest
Copy link

camelest commented Oct 21, 2022

@map2085 @hd00ljy Hi, I'm having the same problem with you. Did you find any way around?

@JYLeeBioinfo
Copy link

JYLeeBioinfo commented Oct 21, 2022

@camelest I tried bamCoverage in deepTools.
It generates binned bigwigs, which were sufficient for my purpose.
But if you need single bp resolution, the deepTools would not be the option

@camelest
Copy link

@hd00ljy Thank you so much for the information. I will give it a try!

@brentp
Copy link
Collaborator

brentp commented Nov 28, 2022

@arq5x . what precision do we want for default here? 4, so like 123.4343 ?

@camelest
Copy link

@brentp @arq5x I think we want something like 1,064,643 (instead of 1.06464e+06).
(I also had similar problems mentioned here

By the way in a different situation when I perform
bedtools merge -c 5 -o sum -I input.bam
even if the sum exceeds > 1million, I don't find any scientific notation. I'm wondering where the difference comes.

@brentp
Copy link
Collaborator

brentp commented Nov 29, 2022

oh. right. I was thinking this was a float. I will make a PR. Thanks for following up.

@camelest
Copy link

Thank you so much for your response!

brentp added a commit to brentp/bedtools2 that referenced this issue Nov 29, 2022
@brentp
Copy link
Collaborator

brentp commented Nov 29, 2022

bedtools.static.gz

Hi, care to try out this binary to see if it resolves your issue? Just gunzip, chmod +x and then use as ./bedtools.static genomecov ...

@camelest
Copy link

@brentp Thank you so much for your prompt response. I checked it but somehow still found "e"... Is it because I'm applying -5 options?

./bedtools.static genomecov -ibam input.bam -5 -bg -strand + > output.bedGraph
grep "e" output.bedGraph
chr11 65499044 65499045 1.0946e+06
chr19 48965308 48965309 1.24953e+07

@brentp
Copy link
Collaborator

brentp commented Nov 30, 2022

Thanks for following up. Indeed, I see why the fix doesn't work. (The ternary must have a fixed type and it still chooses float).
I'll come up with a test-case and see how to resolve this.

@brentp
Copy link
Collaborator

brentp commented Nov 30, 2022

so we can adjust this by messing with the fmtflags. Here is the test code using the number above from @camelest
output is shown after // comment.

#include <cstdint>
#include <iostream>

int main() {
    double f = 12495300.0;
    std::cout << f << std::endl; // 1.24953e+07

    std::cout.setf(std::ios_base::fixed);
    std::cout << f << std::endl; // 12495300.000000

    std::cout.precision(0);
    std::cout << f << std::endl; // 123000000
}

So, we can safely set to std::cout.setf(std::ios_base::fixed); which avoids the scientific notation.
And when scale == 1.0, I think we can't universally set precison(0), but I can add that (and restore default) in functions that should write integers.

@arq5x you ok with this approach?

I would use std::cout.setf(std::ios_base::fixed); in genomeCoverageMain.cpp
and the apply precision(0) and restore default in the appropriate functions.

Or, I could apply fixed and restore default in each function as well.

@brentp
Copy link
Collaborator

brentp commented Nov 30, 2022

There are sections like this in ReportGenomeCoverage:

   // loop through the depths for the entire genome
    // and report the number and fraction of bases in
    // the entire genome that are at said depth.
    for (histMap::iterator genomeDepthIt = genomeHist.begin(); genomeDepthIt != genomeHist.end(); ++genomeDepthIt) {
        int depth = genomeDepthIt->first;
        CHRPOS numBasesAtDepth = genomeDepthIt->second;

        cout << "genome" << "\t" << depth << "\t" << numBasesAtDepth << "\t"
            << genomeSize << "\t" << (float) ((float)numBasesAtDepth / (float)genomeSize) << endl;
    }

where we want the floating (or scientfic format) retained. For what I proposed above, we'd always get the float (with apparently 6 decimals of precision) but depending on how we do it, we could get a scientific format in some cases.

@arq5x
Copy link
Owner

arq5x commented Nov 30, 2022

I wonder if moving to printf or separate court calls would be better so that we can have fine-scale control over the precision for integers (such as depth) and floats (such as fraction of bases at depth)

@brentp
Copy link
Collaborator

brentp commented Nov 30, 2022

Integers are printed as expected. The problem is the scale parameter which turns ints into floats even with it is 1.0.

This program:

#include <cstdint>
#include <iostream>

int main() {
        double f = 12495300.0;
        int64_t i = 880000000000;
        std::cout << f << " i:" << i << std::endl; // 1.24953e+07

        std::cout.setf(std::ios_base::fixed);
        std::cout << f << " i:" << i << std::endl; // 12495300.000000

    std::cout.precision(0);
        std::cout << f << " i:" << i << std::endl; // 123000000
}

produces this output:

1.24953e+07 i:880000000000
12495300.000000 i:880000000000
12495300 i:880000000000

@camelest
Copy link

camelest commented Dec 3, 2022

@brentp @arq5x Thank you so much for following up. It would be really nice if we can choose to output integer as in "precision(0)".

@brentp
Copy link
Collaborator

brentp commented Dec 5, 2022

I will make a PR that sets precision(0) where appropriate and when scale == 0. It will reset precision to the default after each function call.

brentp added a commit to brentp/bedtools2 that referenced this issue Dec 5, 2022
brentp added a commit to brentp/bedtools2 that referenced this issue Dec 5, 2022
@brentp
Copy link
Collaborator

brentp commented Dec 5, 2022

@camelest . I added a test this time for the latest fix, but if you could verify that this works on your data, it would be much appreciated.

bedtools.static.gz

@camelest
Copy link

camelest commented Dec 6, 2022

@brentp I tested it and it worked perfect! Thank you so much for your quick solution.

./bedtools.static genomecov -ibam input.bam -5 -bg -strand + > output.bedGraph
grep "65499044" output.bedGraph
chr11 65499044 65499045 1094599

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants