Plot SV counts at the end of the ClusterBatch module #389

Closed
VJalili wants to merge 2 commits from the remove-outliers branch


Member

@VJalili VJalili commented Aug 18, 2022

The plots will be used for removing outliers before training the random forest model. This is the first step in resolving issue #44.

@VJalili VJalili requested a review from mwalker174 August 18, 2022 16:44
wdl/ClusterBatch.wdl (outdated, resolved)
Co-authored-by: Mark Walker <[email protected]>
Collaborator

@mwalker174 mwalker174 left a comment

I think we should try to get this in, since it is useful QC at this stage of the pipeline. Can you take it further and integrate it with GATKSVPipelinePhase1 and GATKSVPipelineBatch, including updating the JSON templates?

-"ClusterBatch.ped_file": {{ test_batch.ped_file | tojson }}
+"ClusterBatch.ped_file": {{ test_batch.ped_file | tojson }},
+"ClusterBatch.outlier_cutoff_nIQR": "10000"
Collaborator

Suggested change
-"ClusterBatch.outlier_cutoff_nIQR": "10000"
+"ClusterBatch.outlier_cutoff_nIQR": "8"

10000 is used in FilterBatch to effectively disable sample filtering. Here we should use something statistically reasonable; otherwise the cutoffs at +/- 10000 IQR make the plots unreadable.
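
For context, the effect of the cutoff value can be sketched with Tukey-style fences. This assumes the plotting task places its cutoffs a multiple n of the interquartile range beyond the first and third quartiles. The exact definition used by the pipeline is not shown in this thread and may differ (for example, it could be centered on the median).

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1,\\
\text{lower cutoff} &= Q_1 - n \cdot \mathrm{IQR},\\
\text{upper cutoff} &= Q_3 + n \cdot \mathrm{IQR}.
\end{aligned}
```

With hypothetical per-sample counts where Q1 = 900 and Q3 = 1100 (IQR = 200), n = 8 gives cutoffs at -700 and 2700, while n = 10000 gives cutoffs at -1,999,100 and 2,001,100. If the cutoffs are drawn on the plot, the second case stretches the axis far beyond the data.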

Collaborator

I've been using 6 for plotting

@@ -65,6 +66,8 @@ workflow ClusterBatch {

Float? java_mem_fraction

Int outlier_cutoff_nIQR
Collaborator

I'd suggest making this an optional. If it's provided, we run PlotSVCountsPerSample. If not, it skips the task (we don't need to run this for the single sample pipeline for example). Would need to add this input and the outputs to GATKSVPipelinePhase1 and GATKSVPipelineBatch.
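
The optional-input pattern suggested above can be sketched in WDL as follows. This is a minimal illustration, not the change made in this PR or in #567: the workflow name `ClusterBatchSketch`, the task `PlotCounts`, its inputs, the placeholder command, and the Docker image are all hypothetical. Only the input name `outlier_cutoff_nIQR` comes from the diff.

```wdl
version 1.0

# Hypothetical, stripped-down workflow showing only the optional-plotting pattern.
workflow ClusterBatchSketch {
  input {
    File vcf
    # Optional: when this is not provided, the plotting task is skipped.
    Int? outlier_cutoff_nIQR
  }

  # Run the plotting task only if a cutoff was supplied.
  if (defined(outlier_cutoff_nIQR)) {
    call PlotCounts {
      input:
        vcf = vcf,
        # select_first coerces the optional Int? to a plain Int.
        n_iqr = select_first([outlier_cutoff_nIQR])
    }
  }

  output {
    # Outputs of a call inside an `if` block are optional outside it.
    File? sv_counts_plot = PlotCounts.plot
  }
}

# Hypothetical stand-in for the real plotting task.
task PlotCounts {
  input {
    File vcf
    Int n_iqr
  }
  command <<<
    echo "placeholder: would plot counts from ~{vcf} with cutoff ~{n_iqr}" > plot.txt
  >>>
  output {
    File plot = "plot.txt"
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```

A wrapping workflow such as GATKSVPipelinePhase1 would presumably pass the value through as its own optional input and expose the plot as an optional output. Leaving the input out skips the task, as suggested for the single-sample pipeline.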

@epiercehoffman
Collaborator

This was completed as part of #567, so I would recommend closing this PR.

@VJalili VJalili closed this Aug 18, 2023
@VJalili VJalili deleted the remove-outliers branch August 18, 2023 14:55