Large number of genomes #76

larssnip · 2020-09-28T11:02:33Z

First, a suggestion: It would be very helpful to be able to turn off the screen output. We use fastANI with a single query genome against a long list (thousands) of reference genomes (--refList option) and listing thousands of filenames each time is annoying and rather useless.

But the main problem is that listing 30 000+ files and providing the list as input via --refList results in fastANI not producing any output! There is no error message; it starts as before, but it looks like it just gives up and finishes without producing output. By experimenting, I have found that 10 000 files works fine. I know several UNIX programs have a limit on how long a command line may be. Is this the reason? I run this on an HPC and allocate 99GB for this job. It doesn't look to me like a memory problem...?

cjain7 · 2020-09-29T07:11:01Z

For the first, you can easily turn off the screen output by redirecting the stderr log. Just append 2>/dev/null to the end of your FastANI command.

For the second, the memory usage is proportional to the total size of references provided. Your run could be running out of memory. To resolve this, you can run the job in batches (of say 5000 reference genomes) by using a bash script. You can use this script if you want. If you have a cluster, you could also parallelise these batches across multiple compute nodes. I don't think this is happening due to any UNIX limits.
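The batching idea could look roughly like the sketch below. This is not the script linked above, only an illustration under stated assumptions: the function name `run_batched`, the `FASTANI` override variable, and the `.log` file are hypothetical, while `-q`, `--rl` (short for `--refList`) and `-o` are the usual fastANI options.

```shell
#!/usr/bin/env bash
# Sketch: compare one query against a long reference list in batches.
# Usage: run_batched QUERY REF_LIST OUT_FILE [BATCH_SIZE]
run_batched() {
    local query=$1 ref_list=$2 out=$3 batch_size=${4:-5000}
    local fastani=${FASTANI:-fastANI}   # hypothetical override of the binary name
    local tmp chunk
    tmp=$(mktemp -d) || return 1

    # Split the reference list into files of at most batch_size lines each.
    # (Default split suffixes allow up to 676 chunks.)
    split -l "$batch_size" "$ref_list" "$tmp/batch_" || return 1

    : > "$out"        # start with an empty combined output file
    : > "$out.log"    # stderr of every batch is collected here, not on screen

    for chunk in "$tmp"/batch_*; do
        # One fastANI run per chunk, so only that chunk's references
        # have to be held in memory at a time.
        "$fastani" -q "$query" --rl "$chunk" -o "$chunk.out" 2>> "$out.log" || {
            echo "batch $chunk failed, see $out.log" >&2
            return 1    # temporary directory is left in place for inspection
        }
        # Append this batch's results, if the run wrote any.
        if [ -f "$chunk.out" ]; then
            cat "$chunk.out" >> "$out"
        fi
    done
    rm -rf "$tmp"
}
```

For example, `run_batched query.fna refs.txt ani.tsv 5000` would write the combined results to `ani.tsv`. Keeping the stderr log in a file rather than sending it to /dev/null makes it easier to see which batch stopped, should one fail silently.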

larssnip · 2020-09-29T07:44:20Z

Thanks for this. I did the batching myself, actually, and it works. The reason I did not think of this as a memory problem is that there was no "out of memory" message related to this termination. On the cluster, such a message usually appears when a job runs out of memory.
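When a program stops without any message, its exit status can still give a hint. A minimal sketch, assuming a bash shell: statuses above 128 conventionally mean the process was terminated by a signal (status minus 128), and 137 corresponds to SIGKILL, which is what an out-of-memory kill typically looks like. The helper name `describe_exit` is hypothetical, and nothing in this thread confirms which status fastANI returned here.

```shell
# describe_exit STATUS: print a rough interpretation of a command's exit status.
describe_exit() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        # Shell convention: 128 + N means "terminated by signal N".
        echo "killed by signal $((status - 128))"
    else
        echo "failed with exit status $status"
    fi
}
```

It would be called right after the fastANI command, for example `describe_exit $?`.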

peterjc · 2024-11-14T12:22:56Z

I can understand keeping all the references in memory when using a list of queries and a list of references (although it will limit scaling, it does enable your threading support, see #140), but why does fastANI appear to keep all the references in memory even when used with a single query? This feels like a bug, or at the very least undesirable behavior.

peterjc mentioned this issue Nov 14, 2024

How to run large scale fastANI comparisons quicker? pyani-plus/pyani-plus#180

Closed

This was referenced Nov 25, 2024

The genome reference limit. #115

Open

minimizer/kmer string compression #107

Closed
