minimizer/kmer string compression #107

Closed
jianshu93 opened this issue Oct 17, 2022 · 6 comments

Comments

@jianshu93

Hello Chirag,

Does FastANI compress k-mer/minimizer strings by default? I could not find this when checking the code. I noticed that the k-mer counting code in Heng Li's repo (based on kseq.h) (https://github.com/lh3/kmer-cnt/blob/master/kc-c1.c) encodes A, G, C, T as 0, 1, 2, 3. We could actually do better and represent each base using only 2 bits of memory (00, 01, 10, 11). Since FastANI consumes a lot of memory when running all-versus-all, I am wondering whether this could save a lot of memory. There are several Rust libraries that compress k-mers into 2 bits per base and save a lot of memory (https://github.com/jean-pierreBoth/kmerutils/blob/master/src/base/alphabet.rs). I noticed there is also one for C++: https://github.com/dassencio/dna-compression

Thanks,

Jianshu

@cjain7
Member

cjain7 commented Oct 18, 2022

Hi, ATCG is already represented using only 2 bits (00 is 0, 01 is 1, 10 is 2 and 11 is 3):
https://github.com/lh3/kmer-cnt/blob/e2574719cfb784915d80eb5828e78dfae4cfdd7b/kc-c1.c#L36
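For readers unfamiliar with the idea, here is a minimal Python sketch of 2-bit k-mer packing. It is not the code from kc-c1.c or FastANI; the names `pack_kmer` and `unpack_kmer` and the A/C/G/T to 0/1/2/3 ordering are assumptions made for illustration.

```python
# 2-bit codes for the four bases. The A/C/G/T -> 0/1/2/3 order is an
# assumption for this sketch; any fixed one-to-one assignment works.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer using 2 bits per base.

    A k-mer with k <= 32 fits into 64 bits this way.
    """
    value = 0
    for base in kmer.upper():
        # Shift the bases seen so far left by 2 bits and append the new code.
        value = (value << 2) | BASE_TO_CODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string of length k from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(CODE_TO_BASE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


packed = pack_kmer("ACGT")
print(packed, unpack_kmer(packed, 4))
```

With this layout a 16-mer occupies 32 bits instead of 16 one-byte characters, which is the kind of saving the question is asking about.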

@jianshu93
Author

Thanks Chirag, I also noticed this in kc-c1.c.

Why does all-versus-all consume so much memory? Is there any way to reduce it, given that the DNA strings are already compressed?

Thanks

Jianshu

@jianshu93
Author

Or do we need to implement compression for FastANI?

Thanks.

Jianshu

@jianshu93
Author

Hello Chirag,

If there is no need to do string compression for fastANI, I will close this issue.

Thanks,

Jianshu

@cjain7
Member

cjain7 commented Oct 27, 2022

Sorry Jianshu, I am not clear on what string compression means in this context. FastANI maintains a k-mer database extracted from all genomes, which is subsequently queried during the mapping stage.
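As a rough illustration of what a k-mer database that is built once and then queried can look like, here is a toy Python sketch. It is not FastANI's data structure: `build_index`, `lookup`, `count_entries`, and the choice to index every k-mer (rather than a sampled subset such as minimizers) are assumptions made for this example.

```python
from collections import defaultdict


def build_index(genomes: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map each k-mer to the (genome name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in genomes.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return dict(index)


def lookup(index: dict[str, list[tuple[str, int]]], kmer: str) -> list[tuple[str, int]]:
    """Return every indexed occurrence of a query k-mer (empty list if absent)."""
    return index.get(kmer, [])


def count_entries(index: dict[str, list[tuple[str, int]]]) -> int:
    """Total number of stored (genome, position) entries."""
    return sum(len(hits) for hits in index.values())


# Hypothetical toy genomes, far shorter than real ones.
genomes = {"g1": "ACGTAC", "g2": "CGTA"}
index = build_index(genomes, k=3)
print(lookup(index, "CGT"))
print(count_entries(index))
```

In a sketch like this, the number of stored entries grows with the total length of the indexed genomes, so the number of references loaded at once affects memory in addition to how compactly each k-mer is encoded.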

@peterjc

peterjc commented Nov 25, 2024

You may be able to improve the memory usage this way, but I'd focus on how multiple references are handled first (see #76). With many bacterial genomes vs a single bacterial reference, 100 MB of RAM should be plenty.
