simpleaf index: runtime expectations #166

Open · kevinrue opened this issue Nov 19, 2024 · 6 comments

@kevinrue

Cross-posting from https://www.reddit.com/r/bioinformatics/comments/1g6zfu6/simpleaf_index_long_runtime/

Is there some guidance about the expected runtime of simpleaf index anywhere?

The post above reports 20 min runtime for human using 16 CPUs.

In my current situation, I'm indexing Drosophila, whose genome is approx. 180 Mb, and my HPC job with 16 CPUs timed out after an hour.

  • Is there a rule of thumb that can help users guesstimate runtime based on genome size and/or annotated features?
  • Is there guidance on a reasonable range of values for the number of CPUs (a maximum beyond which more CPUs don't help much)?
  • Any other guidance on sanity checks and steps users can take to optimise performance and runtime?

PS: my command is simpleaf index --output resources/genome/index/alevin --fasta tmp_alevin_index.fa --gtf resources/genome/genome.gtf.gz --rlen 150 --threads 16 --use-piscem

In particular, I've set --rlen 150 based on the length of my scRNA-seq reads. Is that alright?

Thanks!

@rob-p commented Nov 19, 2024

Cc @DongzeHE @jamshed. In general, if you are seeing long runtimes on your HPC, it's likely related to filesystem interactions with networked disks. Specifically, you should make sure that you are executing the indexing command on, and writing the resulting index to, a local disk (e.g. scratch or tmp), and then copying the index over to a result directory. The indexing procedure creates many small intermediate files (which we are looking to address), and this really messes with networked file systems, so index construction should neither run on nor write to networked disks.
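
Roughly, the pattern looks like this (a minimal sketch; the paths and the $TMPDIR variable are placeholders for whatever node-local scratch your scheduler provides, not anything simpleaf-specific):

    # build the index on node-local scratch, then copy the finished index back
    cd "$TMPDIR"
    simpleaf index \
        --output ./alevin_index \
        --fasta /path/to/genome.fa \
        --gtf /path/to/genome.gtf.gz \
        --rlen 150 \
        --threads 16 \
        --use-piscem
    cp -r ./alevin_index /path/to/results/alevin_index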

@kevinrue

Thanks! I'll check with our IT team how I might be able to optimise this. Feel free to close the issue; I might report back here with an update on the improvement and, if applicable, advice for others.

@kevinrue

Actually, a quick follow-up:

When you mention intermediate files, do you refer to those under the workdir.noindex directory?

If so, does this directory automatically show up in the working directory? Is there a way to make it appear elsewhere? I don't see any related argument in https://simpleaf.readthedocs.io/en/latest/index-command.html

You only mention writing the resulting index to a local disk, but it sounds like those temporary files are good candidates too.

Thanks!

@rob-p commented Nov 19, 2024

Yes, those temporary files are the main offenders (more than the index itself). They appear in the execution directory. We do have a flag to set the work directory, but it is not yet exposed in simpleaf (it's on the dev branch and will be in the next release; we are just waiting on one or two other features before cutting that release).
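
In the meantime, since those files are created in the directory you run from, simply launching the command from a local temp directory sidesteps the issue. A quick sketch (assuming a POSIX shell with mktemp available, and that /tmp is local on your nodes):

    # run in a subshell so the directory change doesn't leak into the calling script;
    # mktemp -d gives a fresh directory, typically on local /tmp
    ( cd "$(mktemp -d)" && simpleaf index --output /path/to/index --fasta genome.fa --gtf genome.gtf.gz --rlen 150 --threads 16 --use-piscem )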

@kevinrue

Right, so rather than just the index file, I could make my job change directory to $TMPDIR and run the command from there. I'll try that, but I'm definitely looking forward to future versions taking care of this automagically :)

@kevinrue

Pardon the convoluted Snakemake, but here goes:

rule alevin_build_reference_index:
    input:
        genome="resources/genome/genome.fa.gz",
        gtf="resources/genome/genome.gtf.gz",
    output:
        index=directory("resources/genome/index/alevin"),
    log:
        out="logs/alevin/build_reference_index.out",
        err="logs/alevin/build_reference_index.err",
    threads: 16
    resources:
        mem="8G",
        runtime="1h",
    shell:
        "jobdir=$(pwd) &&"
        " cd $TMPDIR &&"
        " export ALEVIN_FRY_HOME=af_home &&"
        " simpleaf set-paths &&"
        " gunzip -c $jobdir/{input.genome} > tmp_alevin_index.fa  &&"
        " simpleaf index"
        " --output $jobdir/{output.index}"
        " --fasta tmp_alevin_index.fa"
        " --gtf $jobdir/{input.gtf}"
        " --rlen 150"
        " --threads 16"
        " --use-piscem"
        " > $jobdir/{log.out} 2> $jobdir/{log.err}"

In short: the job changes directory to $TMPDIR (a local folder given to each Slurm job on our HPC) and runs the simpleaf index command from there, so that workdir.noindex/ and all those small intermediate files get created in a local tempdir.

Seems to have shortened the job from 3 hours down to 50 min.

Ah, and another attempt just completed where I edited the rule above to also produce the output directory in the local tempdir before copying it back to the network drive: still 50 min.
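
For reference, the edited shell block looks roughly like this (only the --output value and the final copy-back differ from the rule above; cp -r is just my choice of copy command):

    shell:
        "jobdir=$(pwd) &&"
        " cd $TMPDIR &&"
        " export ALEVIN_FRY_HOME=af_home &&"
        " simpleaf set-paths &&"
        " gunzip -c $jobdir/{input.genome} > tmp_alevin_index.fa &&"
        " simpleaf index"
        " --output alevin_index"
        " --fasta tmp_alevin_index.fa"
        " --gtf $jobdir/{input.gtf}"
        " --rlen 150"
        " --threads {threads}"
        " --use-piscem"
        " > $jobdir/{log.out} 2> $jobdir/{log.err} &&"
        " cp -r alevin_index $jobdir/{output.index}"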

Not sure if there's anything else going on that makes Drosophila somehow slower to process than human. I hear there are some overlapping transcripts that might be a source of trouble.
