Large number of genomes #76

Open
larssnip opened this issue Sep 28, 2020 · 3 comments

Comments

@larssnip

First, a suggestion: it would be very helpful to be able to turn off the screen output. We run fastANI with a single query genome against a long list (thousands) of reference genomes (--refList option), and printing thousands of filenames on every run is annoying and rather useless.

But the main problem is this: when we list 30 000+ files and provide the list as input using --refList, fastANI produces no output at all! There is no error message. It starts as usual, but then it appears to just give up and finishes without producing output. By experimenting, I have found that 10 000 files works fine. I know several UNIX programs have a limit on how long a command line may be. Is this the reason? I run this on an HPC cluster and allocate 99 GB for the job, so it doesn't look like a memory problem to me...?

@cjain7
Member

cjain7 commented Sep 29, 2020

For the first, you can easily turn off screen output by redirecting the stderr log. Just append 2>/dev/null to the end of your FastANI command.
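As a small illustration of the redirection described above, here is a sketch of a wrapper that sends stderr either to /dev/null or to a log file and passes the command's exit status through, so a run that ends without output can still be detected. The function name `run_logged` and the file names in the usage comment are made up for this example; `-q` and `-o` are assumed to be fastANI's query and output options.

```shell
#!/usr/bin/env bash
# run_logged LOGFILE CMD [ARGS...]
# Runs CMD with its stderr written to LOGFILE (use /dev/null to discard it).
# The function returns CMD's own exit status, so a failed run is still visible
# even though nothing is printed to the screen.
run_logged() {
    local logfile=$1
    shift
    "$@" 2>"$logfile"
}

# Example invocation (hypothetical file names):
#   run_logged /dev/null fastANI -q query.fna --refList refs.txt -o out.txt
#   echo "exit status: $?"
```

Pointing the log at a real file instead of /dev/null keeps the messages available for later inspection without cluttering the terminal.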

For the second, the memory usage is proportional to the total size of references provided. Your run could be running out of memory. To resolve this, you can run the job in batches (of say 5000 reference genomes) by using a bash script. You can use this script if you want. If you have a cluster, you could also parallelise these batches across multiple compute nodes. I don't think this is happening due to any UNIX limits.
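The batching approach could look roughly like the sketch below. This is not the script linked in the comment above, only a minimal illustration under some assumptions: `batch_fastani` and the `FASTANI_BIN` override are hypothetical names, `-q` and `-o` are assumed to be fastANI's query and output options, and the per-batch results are assumed to be plain text files that can simply be concatenated.

```shell
#!/usr/bin/env bash
# batch_fastani QUERY REFLIST BATCH_SIZE OUTFILE
# Splits REFLIST into chunks of BATCH_SIZE lines, runs one job per chunk,
# and concatenates the per-chunk results into OUTFILE.
# FASTANI_BIN is a hypothetical override for the executable name.
batch_fastani() {
    local query=$1 reflist=$2 batch_size=$3 outfile=$4
    local bin=${FASTANI_BIN:-fastANI}
    local workdir chunk
    workdir=$(mktemp -d) || return 1

    # split(1) writes chunk files named batch_aaaa, batch_aaab, ...
    split -a 4 -l "$batch_size" "$reflist" "$workdir/batch_"

    : > "$outfile"                      # start with an empty result file
    for chunk in "$workdir"/batch_*; do
        [ -e "$chunk" ] || continue     # empty reference list: nothing to do
        # stderr is discarded here; point it at a log file to keep it
        if ! "$bin" -q "$query" --refList "$chunk" -o "$chunk.out" 2>/dev/null; then
            echo "batch $chunk failed" >&2
            rm -rf "$workdir"
            return 1
        fi
        if [ -f "$chunk.out" ]; then
            cat "$chunk.out" >> "$outfile"
        fi
    done
    rm -rf "$workdir"
}

# Example invocation (hypothetical file names):
#   batch_fastani query.fna refs.txt 5000 all_results.txt
```

Because each batch is an independent run, the loop body could also be submitted as separate cluster jobs instead of being executed one after another.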

@larssnip
Author

Thanks for this. I did the batching myself, actually, and it works. The reason I did not think of this as a memory problem is that there was no "out of memory" message when the job terminated, which is what we usually get on the cluster.

@peterjc

peterjc commented Nov 14, 2024

I can understand keeping all the references in memory when using a list of queries and a list of references (it limits scaling, but it does enable your threading support, see #140). But why does fastANI appear to keep all the references in memory even when used with a single query? This feels like a bug, or at the very least undesirable behavior.
