Add TrainGCNV input specifying subset list of samples for training #294

epiercehoffman · 2022-02-10T17:01:38Z

Updates

Add sample_ids_training_subset input to TrainGCNV to allow Terra users to specify a subsetted list of sample IDs on which to train the gCNV model, while keeping the same batching they will use for GatherBatchEvidence. Intended as an alternative to the random downsampling option if the user wants a greater level of control over the samples chosen.

Testing

Validated all WDLs and JSONs
Tested TrainGCNV on test_small test set and specified four sample IDs for training, out of order. Verified that the four were sorted according to the order of the samples input (and other sample-level inputs) and the rest of the tasks were performed on the subset of samples.
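
The ordering behavior verified above can be sketched in Python. This is a minimal illustration under stated assumptions, not the task's actual code: the function name `get_subset_indices`, the sample IDs, and the error raised on unknown IDs are all hypothetical.

```python
def get_subset_indices(all_ids, subset_ids):
    """Return indices into all_ids for the members of subset_ids, in all_ids order."""
    wanted = set(subset_ids)
    # Assumption: an ID missing from the full list is treated as an error here;
    # the PR does not say how the real task handles this case.
    missing = wanted - set(all_ids)
    if missing:
        raise ValueError(f"IDs not found in full sample list: {sorted(missing)}")
    # Iterating over all_ids (not subset_ids) is what restores the original order.
    return [i for i, s in enumerate(all_ids) if s in wanted]


samples = ["s1", "s2", "s3", "s4", "s5", "s6"]  # hypothetical sample IDs
subset = ["s5", "s2", "s6", "s1"]               # given out of order

indices = get_subset_indices(samples, subset)
subset_in_order = [samples[i] for i in indices]
print(indices)          # [0, 1, 4, 5]
print(subset_in_order)  # ['s1', 's2', 's5', 's6']
```

Because the indices follow the order of the full sample list, they can also be used to pick the matching entries from any other sample-level input array.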

mwalker174 · 2022-02-10T17:23:36Z

wdl/TrainGCNV.wdl

@@ -111,7 +125,8 @@ workflow TrainGCNV {
    }
  }

-  Array[Int] sample_indices = select_first([RandomSubsampleStringArray.subsample_indices_array, range(length(samples))])
+  Array[Int] sample_indices = select_first([GetSubsampledIndices.subsample_indices_array, RandomSubsampleStringArray.subsample_indices_array, range(length(samples))])
+  Array[String] sample_ids = select_first([GetSubsampledIndices.subsampled_strings_array, RandomSubsampleStringArray.subsampled_strings_array, samples])


I think it would be simpler/clearer to move this into the scatter below:

scatter (i in sample_indices) {
  String sample_ids = samples[i]
  call cov.CondenseReadCounts as CondenseReadCounts {
    input:
      counts = count_files[i],
      sample = samples[i],
      num_bins = condense_num_bins,
      expected_bin_size = condense_bin_size,
      condense_counts_docker = condense_counts_docker,
      runtime_attr_override=condense_counts_runtime_attr
  }
}

Also minor style comment, I might rename this to something making it clear it isn't an input. I usually add an extra underscore to the end, or you could call it maybe_subsetted_sample_ids for example.
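
A rough Python analogue of this suggestion: once `sample_indices` is resolved, the ID array can be derived from it, so only one fallback chain is needed. The `select_first` helper below mimics the WDL function of the same name. The `explicit_subset_indices` and `random_subset_indices` variables are hypothetical stand-ins for the optional task outputs, and the sample IDs are made up.

```python
def select_first(candidates):
    """Mimic WDL's select_first: return the first defined (non-None) value."""
    for value in candidates:
        if value is not None:
            return value
    raise ValueError("select_first: no defined value in array")


samples = ["s1", "s2", "s3", "s4"]  # hypothetical sample IDs

# Stand-ins for optional task outputs; None means the task was not called.
explicit_subset_indices = [1, 3]  # e.g. from GetSubsampledIndices
random_subset_indices = None      # e.g. from RandomSubsampleStringArray

# One fallback chain: explicit subset, then random subset, then all samples.
sample_indices = select_first(
    [explicit_subset_indices, random_subset_indices, list(range(len(samples)))]
)

# Equivalent of declaring `String sample_ids_ = samples[i]` inside the scatter:
# the per-iteration values are gathered into an array afterwards.
sample_ids_ = [samples[i] for i in sample_indices]
print(sample_indices)  # [1, 3]
print(sample_ids_)     # ['s2', 's4']
```

Deriving the IDs from the indices keeps the two arrays consistent by construction, instead of relying on two parallel `select_first` calls choosing the same source.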

mwalker174 · 2022-02-10T17:26:15Z

wdl/Utils.wdl

+
+  RuntimeAttr default_attr = object {
+    cpu_cores: 1,
+    mem_gb: 3.75,


Suggested change

-    mem_gb: 3.75,
+    mem_gb: 1,

mwalker174 · 2022-02-10T17:27:04Z

wdl/Utils.wdl

+    all_strings = ['~{sep="','" all_strings}']
+    subset_strings = {'~{sep="','" subset_strings}'}


It would be more robust/scalable to use write_lines() to write the strings to a file and read them in, rather than defining them inline (who knows what would happen if we fed this 10k samples for example).
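
A sketch of what the file-based version of the Python logic might look like, assuming the WDL side passes two files produced by `write_lines()` (one string per line). The function names, file names, and sample IDs here are hypothetical, and the temporary files only stand in for the localized inputs.

```python
import os
import tempfile


def read_lines(path):
    """Read one string per line, skipping blank lines."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def indices_from_files(all_path, subset_path):
    """Return indices of the subset strings within the full list, in full-list order."""
    all_strings = read_lines(all_path)
    subset_strings = set(read_lines(subset_path))
    return [i for i, s in enumerate(all_strings) if s in subset_strings]


# Demo: temporary files stand in for the files write_lines() would produce.
with tempfile.TemporaryDirectory() as tmp:
    all_path = os.path.join(tmp, "all_strings.txt")
    subset_path = os.path.join(tmp, "subset_strings.txt")
    with open(all_path, "w") as f:
        # "it's" contains a quote that would break an inline '...' literal.
        f.write("\n".join(["s1", "it's", "s3"]) + "\n")
    with open(subset_path, "w") as f:
        f.write("s3\nit's\n")
    demo_indices = indices_from_files(all_path, subset_path)

print(demo_indices)  # [1, 2]
```

Reading from files keeps the command text the same size regardless of the number of samples, and it avoids generating Python literals from sample IDs, so an ID containing a quote character no longer produces a syntax error.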

epiercehoffman · 2022-02-10T20:39:22Z

Updates since review:

Reduce memory in both RandomSubsampleStringArray and GetSubsampledIndices
Use write_lines() and then read string array from file in both RandomSubsampleStringArray and GetSubsampledIndices
Build sample_ids_ (renamed) array in scatter

Testing since review:

Re-validate all WDLs and JSONs
Submit TrainGCNV three ways (with default inputs, with n_samples_subsample, and with sample_ids_training_subset) and verify expected behavior and that the correct arrays are passed through to the start of CNVGermlineCohortWorkflow. All three workflows succeeded.

mwalker174

Looks good. Thanks!

add TrainGCNV input specifying subset list of samples for training

1645053

mwalker174 requested changes Feb 10, 2022

address review comments

ff64f42

mwalker174 approved these changes Feb 11, 2022

epiercehoffman merged commit a770cc7 into master Feb 11, 2022

epiercehoffman deleted the eph_traingcnv_sample_list branch April 22, 2022 19:32
