Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSPS-269 Speed up CountVariantsInChunksBeagle by using bedtools #1335

Merged
merged 9 commits into from
Jul 19, 2024

Conversation

mmorgantaylor
Copy link
Member

@mmorgantaylor mmorgantaylor commented Jul 17, 2024

Description

Stats for this task before and after the change:

1000 sample input, 1000G ref panel:

  • before - using interval lists:
    • chr1, shard 0 var_in_original=6917, var_also_in_reference=6447
    • avg task duration 1:04:14 (n=126)
    • 48 of 74 total retries (65%)
  • after - using bed files:
    • chr1, shard 0 var_in_original=6917, var_also_in_reference=6447 (identical as expected)
    • avg task duration 0:06:36 (n=126) - likely lower than 42 sample input because this input has 66% fewer sites
    • 6 of 31 total retries (19%)

This PR also includes an update to the Ref Panel generation wdl to include creating the ref panel bed files used in the new task.

Also tested that the QC check for overlapping sites between input and reference still works: https://app.terra.bio/#workspaces/morgan-fieldeng/Imputation_pipeline_testing/job_history/d116884a-d918-44d1-900e-b91691a2b3b1

old branch: 500 sample input, 1000G ref panel. shard 0 / ... / CountVariantsInChunksBeagle / shard 0
var_in_original: 7574
var_also_in_reference: 7076

this branch: 500 sample input, sim 10k ref panel. shard 0 / .../ CountVariantsInChunksBeagle / shard 0
var_in_original: 7574
var_also_in_reference: 426


Checklist

If you can answer "yes" to the following items, please add a checkmark next to the appropriate checklist item(s) and notify our WARP documentation team by tagging either @ekiernan or @kayleemathews in a comment on this PR.

  • Did you add inputs, outputs, or tasks to a workflow?
  • Did you modify, delete or move: file paths, file names, input names, output names, or task names?
  • If you made a changelog update, did you update the pipeline version number?

@dsde-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

@pullapprove pullapprove bot requested a review from kachulis July 17, 2024 13:58
@mmorgantaylor mmorgantaylor requested a review from jsotobroad July 17, 2024 14:00
Copy link
Contributor

@jsotobroad jsotobroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really curious as to why the 1000 sample task with the updates ran almost twice as fast as the 42 sample task with the updates

@@ -161,7 +161,7 @@ task CountVariantsInChunks {
set -e -o pipefail

echo $(gatk --java-options "-Xms~{command_mem}m -Xmx~{max_heap}m" CountVariants -V ~{vcf} | sed 's/Tool returned://') > var_in_original
echo $(gatk --java-options "-Xms~{command_mem}m -Xmx~{max_heap}m" CountVariants -V ~{vcf} -L ~{panel_vcf} | sed 's/Tool returned://') > var_in_reference
echo $(gatk --java-options "-Xms~{command_mem}m -Xmx~{max_heap}m" CountVariants -V ~{vcf} -L ~{panel_vcf} | sed 's/Tool returned://') > var_in_reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also update this task to be more efficient too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think so, this is used in the minimac wdl and i don't really want to touch that one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

input {
File ref_panel_interval_list

Int disk_size_gb = ceil(2*size(ref_panel_interval_list, "GiB")) + 50 # not sure how big the disk size needs to be since we aren't downloading the entire VCF here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is probably copy pasta'd?

@jsotobroad
Copy link
Contributor

oh also just double checking that workflows that used to fail the QC still continue to fail the QC check (if you use the simulated data as input it should fail the QC check)

@@ -161,7 +161,7 @@ task CountVariantsInChunks {
set -e -o pipefail

echo $(gatk --java-options "-Xms~{command_mem}m -Xmx~{max_heap}m" CountVariants -V ~{vcf} | sed 's/Tool returned://') > var_in_original
echo $(gatk --java-options "-Xms~{command_mem}m -Xmx~{max_heap}m" CountVariants -V ~{vcf} -L ~{panel_vcf} | sed 's/Tool returned://') > var_in_reference
echo $(gatk --java-options "-Xms~{command_mem}m -Xmx~{max_heap}m" CountVariants -V ~{vcf} -L ~{panel_vcf} | sed 's/Tool returned://') > var_in_reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mmorgantaylor mmorgantaylor merged commit 95c3941 into TSPS-183_mma_beagle_imputation_hg38 Jul 19, 2024
1 of 2 checks passed
@mmorgantaylor mmorgantaylor deleted the TSPS-269_bedtools branch July 19, 2024 17:18
mmorgantaylor added a commit that referenced this pull request Feb 13, 2025
* try creating bed files

* try again

* try again again

* a different thing

* use bedtools and bed ref panel files

* oops update the correct task

* fix

* use the right freaking file name

* remove comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants