Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TrainGCNV input specifying subset list of samples for training #294

Merged
merged 2 commits into from
Feb 11, 2022

Conversation

epiercehoffman
Copy link
Collaborator

Updates

Add sample_ids_training_subset input to TrainGCNV to allow Terra users to specify a subsetted list of sample IDs on which to train the gCNV model, while keeping the same batching they will use for GatherBatchEvidence. Intended as an alternative to the random downsampling option if the user wants a greater level of control over the samples chosen.

Testing

  • Validated all WDLs and JSONs
  • Tested TrainGCNV on test_small test set and specified four sample IDs for training, out of order. Verified that the four were sorted according to the order of the samples input (and other sample-level inputs) and the rest of the tasks were performed on the subset of samples.

@@ -111,7 +125,8 @@ workflow TrainGCNV {
}
}

Array[Int] sample_indices = select_first([RandomSubsampleStringArray.subsample_indices_array, range(length(samples))])
Array[Int] sample_indices = select_first([GetSubsampledIndices.subsample_indices_array, RandomSubsampleStringArray.subsample_indices_array, range(length(samples))])
Array[String] sample_ids = select_first([GetSubsampledIndices.subsampled_strings_array, RandomSubsampleStringArray.subsampled_strings_array, samples])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be simpler/clearer to move this into the scatter below:

scatter (i in sample_indices) {
    String sample_ids = samples[i]
    call cov.CondenseReadCounts as CondenseReadCounts {
      input:
        counts = count_files[i],
        sample = samples[i],
        num_bins = condense_num_bins,
        expected_bin_size = condense_bin_size,
        condense_counts_docker = condense_counts_docker,
        runtime_attr_override=condense_counts_runtime_attr
    }
  }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also minor style comment, I might rename this to something making it clear it isn't an input. I usually add an extra underscore to the end, or you could call it maybe_subsetted_sample_ids for example.

wdl/Utils.wdl Outdated

RuntimeAttr default_attr = object {
cpu_cores: 1,
mem_gb: 3.75,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mem_gb: 3.75,
mem_gb: 1,

wdl/Utils.wdl Outdated
Comment on lines 247 to 248
all_strings = ['~{sep="','" all_strings}']
subset_strings = {'~{sep="','" subset_strings}'}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more robust/scalable to use write_lines() to write the strings to a file and read them in, rather than defining them inline (who knows what would happen if we fed this 10k samples for example).

@epiercehoffman
Copy link
Collaborator Author

epiercehoffman commented Feb 10, 2022

Updates since review:

  • Reduce memory in both RandomSubsampleStringArray and GetSubsampledIndices
  • Use write_lines() and then read string array from file in both RandomSubsampleStringArray and GetSubsampledIndices
  • Build sample_ids_ (renamed) array in scatter

Testing since review:

  • Re-validate all WDLs and JSONs
  • Submit TrainGCNV with default inputs, n_samples_subsample, and sample_ids_training_subset and verify expected behavior and correct arrays being passed around through the start of CNVGermlineCohortWorkflow. All three workflows succeeded.

Copy link
Collaborator

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@epiercehoffman epiercehoffman merged commit a770cc7 into master Feb 11, 2022
@epiercehoffman epiercehoffman deleted the eph_traingcnv_sample_list branch April 22, 2022 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants