Integrate DRAGEN-SV into pipeline #749

Draft: wants to merge 68 commits into base: main.

Commits (68)
e823d20  Initial commit to test dockstore sync (kjaisingh, Nov 14, 2024)
6127ef1  Initial work - WIP (kjaisingh, Nov 14, 2024)
c3df02c  Merge branch 'main' into kj_dragensv_benchmarking (kjaisingh, Nov 19, 2024)
4f24d0f  Initial implementation of DragenStandardizer (kjaisingh, Nov 20, 2024)
9a7953d  Added automated sync (kjaisingh, Nov 20, 2024)
c05f593  Circumvented linting errors (kjaisingh, Nov 20, 2024)
5792480  Initialized new std_dragen file (kjaisingh, Nov 21, 2024)
06abb34  Updated WDL & standardizer to output std_dragen_vcf (kjaisingh, Nov 22, 2024)
40b7fb5  Resolved linting errors (kjaisingh, Nov 22, 2024)
9c457d1  Modified WDLs across workflows to integrate dragen (kjaisingh, Nov 25, 2024)
cce70ae  Updated WDL input params (kjaisingh, Nov 25, 2024)
9897ed0  Modified dragen_std to print (kjaisingh, Nov 26, 2024)
2fb4b7d  Modified standardizer to align with manta (kjaisingh, Nov 26, 2024)
37a3bc7  Python linting errors (kjaisingh, Nov 26, 2024)
dd0d1a5  Added MATEID indexing to drop paired mates (kjaisingh, Nov 27, 2024)
3d3133f  Added indexing for vcf's without it (kjaisingh, Nov 27, 2024)
6b4c425  Modified vapor wdl to remove unnecessary ref inputs (kjaisingh, Dec 2, 2024)
284f18f  Initial commit for PreprocessDragenVcf (kjaisingh, Dec 3, 2024)
53d0c67  Removed irrelevant inputs from vapor WDLs (kjaisingh, Dec 3, 2024)
311048a  Added OAUTH_TOKEN to localize files (kjaisingh, Dec 3, 2024)
4fa2ba2  Initial commit for CombineVcfs (kjaisingh, Dec 3, 2024)
d7f6ada  Minor differences (kjaisingh, Dec 3, 2024)
8134dc6  Modified passing of arguments to SVCluster (kjaisingh, Dec 3, 2024)
16c6641  Further formatting & naming changes (kjaisingh, Dec 3, 2024)
0327cc8  Removed /src/ from script path (kjaisingh, Dec 3, 2024)
e595de0  Modified to take in fai as well (kjaisingh, Dec 3, 2024)
5876c27  Added ref_dict to SVCluster call (kjaisingh, Dec 3, 2024)
9135045  Added index files (kjaisingh, Dec 3, 2024)
0d2c9fb  Updated combinevcf WDL syntactically (kjaisingh, Dec 4, 2024)
1ff0ad2  Added index file to output of combinevcfs (kjaisingh, Dec 4, 2024)
f2f1da4  Updated bgzip operation to explicitly use bgzip (kjaisingh, Dec 4, 2024)
9d6cbd4  Reverted previous commit - found root of problem (kjaisingh, Dec 4, 2024)
30e1cb6  Removed tabix from svconcordance (kjaisingh, Dec 4, 2024)
4a27320  Reverted changes to vapor inputs (kjaisingh, Dec 4, 2024)
06d5f1c  Removed tabix step from CombineVcfs (kjaisingh, Dec 6, 2024)
d6a13a2  Modified pesr vcfs to dynamically use all defined but depth (kjaisingh, Dec 6, 2024)
5c778d4  Updated batchsamples workflow to choose pesr vcfs dynamically (kjaisingh, Dec 6, 2024)
c6cdf40  Reverted to old version - doesn't support list comprehension (kjaisingh, Dec 6, 2024)
ed60fc0  Merge branch 'main' into kj_dragensv_benchmarking (kjaisingh, Jan 6, 2025)
aba4ba4  Minor formatting update (kjaisingh, Jan 6, 2025)
a61a342  Added inversion detection (kjaisingh, Jan 7, 2025)
9079a3b  Linting errors (kjaisingh, Jan 7, 2025)
395d519  Updated standardizer to mark inversions (kjaisingh, Jan 9, 2025)
416d459  Added INV to dragen metrics (kjaisingh, Jan 13, 2025)
ba12de1  Updated dragen standardizer (kjaisingh, Jan 13, 2025)
266734e  Temp updates - WIP (kjaisingh, Jan 15, 2025)
f4b502f  Final dragen standardizer cleanup (kjaisingh, Jan 21, 2025)
ab505fd  Fixed linting issues (kjaisingh, Jan 21, 2025)
3fb7a1b  Minor update to SVCluster to use variant prefix (kjaisingh, Jan 21, 2025)
3b77ee1  Removed sorting from preprocessing wdl (kjaisingh, Jan 24, 2025)
55bb2b6  Init commit of preprocess vcf for makegq wdl (kjaisingh, Jan 27, 2025)
6fb1393  Updated output fields in preprocessformakegq (kjaisingh, Jan 27, 2025)
06e3a6c  Minor update to file path (kjaisingh, Jan 27, 2025)
52ef8a1  Made localization an optional input (kjaisingh, Jan 28, 2025)
ea59707  Registered branch to dockstore (kjaisingh, Jan 28, 2025)
902b4e9  Additional commit for dockstore.yml (kjaisingh, Jan 28, 2025)
5ee1d60  Resolved syntax issues (kjaisingh, Jan 28, 2025)
7d4266b  Added branch to dockstore (kjaisingh, Jan 28, 2025)
fc933ac  Updated the preprocessformakegq wdl (kjaisingh, Jan 28, 2025)
1ec076b  Updated vapor wdl to localize crams and ignore readfilters (kjaisingh, Jan 28, 2025)
16e89fb  Updated standardizer (kjaisingh, Jan 28, 2025)
727634a  Removed .bam from local_bai file (kjaisingh, Jan 28, 2025)
f5ddfdd  Removed variant prefix from standardization WDL (kjaisingh, Jan 31, 2025)
9c598ad  Resolved merge conflicts (kjaisingh, Feb 4, 2025)
bdb3bcc  Removed redundant WDLs (kjaisingh, Feb 4, 2025)
5a48f28  Readding preprocess vcf for vapor (kjaisingh, Feb 4, 2025)
1959295  Added project ID for vapor WDL (kjaisingh, Feb 4, 2025)
d004742  Modified other files that reference manta to include dragen (kjaisingh, Feb 5, 2025)
8 changes: 8 additions & 0 deletions .github/.dockstore.yml
@@ -15,6 +15,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

@@ -33,6 +34,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

@@ -42,6 +44,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

@@ -51,6 +54,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

@@ -60,6 +64,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

@@ -78,6 +83,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

@@ -159,6 +165,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

@@ -204,6 +211,7 @@ workflows:
filters:
branches:
- main
- kj_dragensv_benchmarking
tags:
- /.*/

12 changes: 6 additions & 6 deletions scripts/notebooks/SampleQC.ipynb
@@ -581,8 +581,8 @@
" if (not mad_cutoff):\n",
" print('[WARNING] Setting MAD_CUTOFF to None results in no lower cutoff being applied.')\n",
"\n",
" if (caller and caller not in ['overall', 'manta', 'melt', 'scramble', 'scramble', 'wham']):\n",
" raise Exception(f'The value {caller} for category is invalid - it must be one of \"overall\", \"manta\", \"melt\", \"scramble\" or \"wham\".')\n",
" if (caller and caller not in ['overall', 'dragen', 'manta', 'melt', 'scramble', 'wham']):\n",
" raise Exception(f'The value {caller} for category is invalid - it must be one of \"overall\", \"dragen\", \"manta\", \"melt\", \"scramble\" or \"wham\".')\n",
"\n",
" if (caller_type and caller_type not in ['high', 'low']):\n",
" raise Exception(f'The value {caller_type} for caller type is invalid - it must be one of \"high\" or \"low\".')\n",
@@ -2281,11 +2281,11 @@
"metadata": {},
"source": [
"## Raw Caller Outliers\n",
"This series of metrics look for samples with an abnormally high or low number of raw SV calls from the three initial algorithms: Manta, Wham, and Scramble (or MELT). Higher than typical SV counts may indicate technical artifacts, while extremely low SV counts may indicate that an algorithm failed to complete. The values represent the number of times the sample was an outlier for SV counts across categories defined by algorithm, SV type, and chromosome. \n",
"This series of metrics looks for samples with an abnormally high or low number of raw SV calls from the initial callers: Dragen, Manta, Wham, and Scramble (or MELT). Higher than typical SV counts may indicate technical artifacts, while extremely low SV counts may indicate that an algorithm failed to complete. The values represent the number of times the sample was an outlier for SV counts across categories defined by algorithm, SV type, and chromosome. \n",
"\n",
"**Note**: \n",
"In the sections below, there are two additional parameters that have not yet been covered.\n",
"- `CALLER`: The caller for which to analyze results. This must be one of `['overall', 'manta', 'melt', 'scramble', 'wham']`, where 'overall' corresponds to the sum of outlier occurrences across the individual callers.\n",
"- `CALLER`: The caller for which to analyze results. This must be one of `['overall', 'dragen', 'manta', 'melt', 'scramble', 'wham']`, where 'overall' corresponds to the sum of outlier occurrences across the individual callers.\n",
"- `TYPE`: The type of outliers for which to analyze results. This must be one of `['high', 'low']`, where 'high' indicates the number of cases in which the sample had more SVs than typical, while 'low' indicates the number of cases in which the sample had fewer SVs than typical. \n",
"\n",
"We recommend checking the overall high and low outliers (i.e. `CALLER = 'overall'` and `TYPE = 'high'/'low'`), but you may also examine results for individual algorithms."
@@ -2315,7 +2315,7 @@
"LINE_DEVIATIONS = None # List of integers that defines the MAD cutoff lines to draw on each histogram plot\n",
"LINE_STYLES = None # List of strings that defines the line styles of each MAD cutoff line passed above\n",
"\n",
"CALLER = 'overall' # String value that defines the caller - either 'overall', 'manta', 'melt', 'wham' or 'dragen'\n",
"CALLER = 'overall' # String value that defines the caller - either 'overall', 'dragen', 'manta', 'melt', 'scramble' or 'wham'\n",
"TYPE = 'high' # String value that defines the outlier direction - either 'high' or 'low'\n",
"\n",
"validate_qc_inputs(samples_qc_table, f\"{CALLER}_{TYPE}_outlier\", line_deviations=LINE_DEVIATIONS, \n",
@@ -2354,7 +2354,7 @@
"LOG_SCALE = False # Boolean value that defines whether to log-scale the plot\n",
"METHOD = 'hard' # String value that defines the cutoff method to use - either 'MAD' or 'hard'\n",
"\n",
"CALLER = 'overall' # String value that defines the caller - either 'overall', 'manta', 'melt', 'wham' or 'dragen'\n",
"CALLER = 'overall' # String value that defines the caller - either 'overall', 'dragen', 'manta', 'melt', 'scramble' or 'wham'\n",
"TYPE = 'high' # String value that defines the outlier direction - either 'high' or 'low'\n",
"\n",
"UPPER_CUTOFF = None # Numeric value that defines the upper threshold if METHOD = 'hard'\n",
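The MAD-based high/low outlier counting that these notebook cells parameterize (`MAD_CUTOFF`, `TYPE = 'high'/'low'`) can be sketched as below. This is an illustrative sketch only, not the notebook's actual implementation; the function name, sample IDs, and counts are all hypothetical.

```python
import statistics

def flag_outliers(sv_counts, mad_cutoff=5.0):
    """Illustrative MAD-based outlier flagging for per-sample SV counts.

    Returns (high, low): samples whose count deviates from the cohort
    median by more than `mad_cutoff` median absolute deviations (MAD).
    """
    values = list(sv_counts.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    high = [s for s, v in sv_counts.items() if v > med + mad_cutoff * mad]
    low = [s for s, v in sv_counts.items() if v < med - mad_cutoff * mad]
    return high, low

# Hypothetical per-sample raw SV counts for one caller/SVTYPE/chromosome bin
counts = {"s1": 100, "s2": 105, "s3": 98, "s4": 102, "s5": 400, "s6": 101}
high, low = flag_outliers(counts)  # s5 is far above the cohort median
```

In the notebook, each time a sample lands in a high or low list for some algorithm/SV type/chromosome category, its per-caller outlier count is incremented.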
1 change: 1 addition & 0 deletions src/denovo/denovo_svs.py
@@ -772,6 +772,7 @@ def main():
print("Took %f seconds to process" % delta)

# Filter out INS that are manta or melt only and are SR only, have GQ=0, and FILTER contains 'HIGH_SR_BACKGROUND'
# TODO: Do I also update this to reference Dragen?
verbose_print('Filtering out INS that are manta or melt only and SR only, with GQ=0 and FILTER contains HIGH_SR_BACKGROUND', verbose)
start = time.time()
remove_ins = bed_child[(bed_child['SVTYPE'] == 'INS') & ((bed_child['ALGORITHMS'] == 'manta') | (bed_child['ALGORITHMS'] == 'melt')) & (bed_child['EVIDENCE_FIX'] == 'SR') & ((bed_child['GQ'] == '0') | (bed_child.FILTER.str.contains('HIGH_SR_BACKGROUND')))]['name_famid'].to_list()
@@ -64,6 +64,7 @@ def __init__(self, record):
self.length = record.info['SVLEN']
self.cnv_gt_5kbp = (record.info['SVTYPE'] == 'DEL' or record.info['SVTYPE'] == 'DUP') and self.length >= 5000
self.gt_50bp = self.length >= 50
self.is_dragen = 'dragen' in record.info['ALGORITHMS']
self.is_melt = 'melt' in record.info['ALGORITHMS']
self.is_scramble = 'scramble' in record.info['ALGORITHMS']
self.is_manta = 'manta' in record.info['ALGORITHMS']
@@ -164,10 +165,10 @@ def __str__(self):
if len(sample_intersection) < 0.50 * max_freq:
continue
# Determine which to filter
# Special case if one is a Manta insertion and the other is MEI, keep the MEI
if first.is_manta and first.svtype == "INS" and second.is_mei:
# Special case if one is a Dragen/Manta insertion and the other is MEI, keep the MEI
if (first.is_dragen or first.is_manta) and first.svtype == "INS" and second.is_mei:
sorted_data_list = [second, first]
elif second.is_manta and second.svtype == "INS" and first.is_mei:
elif (second.is_dragen or second.is_manta) and second.svtype == "INS" and first.is_mei:
sorted_data_list = [first, second]
else:
# Otherwise use sorting spec
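The Dragen/Manta-INS-versus-MEI preference added in this hunk can be sketched with plain stand-in objects. The class and function names here are illustrative, not the pipeline's actual record wrapper or sorting spec.

```python
from dataclasses import dataclass

@dataclass
class Call:
    # Minimal stand-in for a clustered SV record's caller/type flags
    svtype: str
    is_dragen: bool = False
    is_manta: bool = False
    is_mei: bool = False

def order_duplicates(first, second):
    """Return [keep, drop]: prefer the MEI call over a Dragen/Manta INS."""
    if (first.is_dragen or first.is_manta) and first.svtype == "INS" and second.is_mei:
        return [second, first]
    if (second.is_dragen or second.is_manta) and second.svtype == "INS" and first.is_mei:
        return [first, second]
    return None  # neither special case applies; fall back to the sorting spec

ins_call = Call(svtype="INS", is_dragen=True)
mei_call = Call(svtype="INS", is_mei=True)
kept, dropped = order_duplicates(ins_call, mei_call)  # MEI is kept
```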
@@ -65,7 +65,7 @@ def overlap_pass(phase1, pilot, fout, dist=300, frac=0.1, prefix="SSC_merged"):
sources = get_sources(fout.header)

# Helper for testing if SVRecord has pe/sr support
pesr_sources = set('delly lumpy manta wham'.split())
pesr_sources = set('delly dragen lumpy manta wham'.split())

def _has_pesr(record):
sources = set(record.record.info['SOURCES'])
43 changes: 33 additions & 10 deletions src/sv-pipeline/scripts/make_evidence_qc_table.py
@@ -130,13 +130,15 @@ def read_outlier(filename: str, outlier_col_label: str) -> pd.DataFrame:
return outlier_df


def read_all_outlier(outlier_manta_df: pd.DataFrame, outlier_melt_df: pd.DataFrame, outlier_wham_df: pd.DataFrame, outlier_scramble_df: pd.DataFrame, outlier_type: str) -> pd.DataFrame:
def read_all_outlier(outlier_manta_df: pd.DataFrame, outlier_melt_df: pd.DataFrame, outlier_wham_df: pd.DataFrame,
outlier_scramble_df: pd.DataFrame, outlier_dragen_df: pd.DataFrame, outlier_type: str) -> pd.DataFrame:
"""
Args:
outlier_manta_df: Outliers determined in EvidenceQC for Manta.
outlier_melt_df: Outliers determined in EvidenceQC for MELT.
outlier_wham_df: Outliers determined in EvidenceQC for Wham.
outlier_scramble_df: Outliers determined in EvidenceQC for Scramble
outlier_scramble_df: Outliers determined in EvidenceQC for Scramble.
outlier_dragen_df: Outliers determined in EvidenceQC for Dragen.
outlier_type: high or low. Determined in EvidenceQC for each caller.
Returns:
The total number of times that a sample appears as an outlier
@@ -158,8 +160,12 @@ def read_all_outlier(outlier_manta_df: pd.DataFrame, outlier_melt_df: pd.DataFra
col_name = get_col_name("scramble", outlier_type)
dict_scramble = dict(zip(outlier_scramble_df[ID_COL], outlier_scramble_df[col_name]))

# Dragen:
col_name = get_col_name("dragen", outlier_type)
dict_dragen = dict(zip(outlier_dragen_df[ID_COL], outlier_dragen_df[col_name]))

# merging all the dictionaries
outlier_dicts = [dict_manta, dict_melt, dict_wham, dict_scramble]
outlier_dicts = [dict_manta, dict_melt, dict_wham, dict_scramble, dict_dragen]
merged_dicts = Counter()
for counted in outlier_dicts:
merged_dicts.update(counted)
@@ -182,10 +188,12 @@ def merge_evidence_qc_table(
filename_high_melt: str,
filename_high_wham: str,
filename_high_scramble: str,
filename_high_dragen: str,
filename_low_manta: str,
filename_low_melt: str,
filename_low_wham: str,
filename_low_scramble: str,
filename_low_dragen: str,
filename_melt_insert_size: str,
output_prefix: str) -> None:
"""
@@ -201,23 +209,28 @@
df_melt_high_outlier = read_outlier(filename_high_melt, get_col_name("melt", "high"))
df_wham_high_outlier = read_outlier(filename_high_wham, get_col_name("wham", "high"))
df_scramble_high_outlier = read_outlier(filename_high_scramble, get_col_name("scramble", "high"))
df_total_high_outliers = read_all_outlier(df_manta_high_outlier, df_melt_high_outlier, df_wham_high_outlier, df_scramble_high_outlier, "high")
df_dragen_high_outlier = read_outlier(filename_high_dragen, get_col_name("dragen", "high"))
df_total_high_outliers = read_all_outlier(df_manta_high_outlier, df_melt_high_outlier, df_wham_high_outlier,
df_scramble_high_outlier, df_dragen_high_outlier, "high")
df_manta_low_outlier = read_outlier(filename_low_manta, get_col_name("manta", "low"))
df_melt_low_outlier = read_outlier(filename_low_melt, get_col_name("melt", "low"))
df_wham_low_outlier = read_outlier(filename_low_wham, get_col_name("wham", "low"))
df_scramble_low_outlier = read_outlier(filename_low_scramble, get_col_name("scramble", "low"))
df_total_low_outliers = read_all_outlier(df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_scramble_low_outlier, "low")
df_dragen_low_outlier = read_outlier(filename_low_dragen, get_col_name("dragen", "low"))
df_total_low_outliers = read_all_outlier(df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier,
df_scramble_low_outlier, df_dragen_low_outlier, "low")
df_melt_insert_size = read_melt_insert_size(filename_melt_insert_size)

# outlier column names
callers = ["wham", "melt", "manta", "scramble", "overall"]
callers = ["wham", "melt", "manta", "scramble", "dragen", "overall"]
types = ["high", "low"]
outlier_cols = [get_col_name(caller, type) for caller in callers for type in types]

# all data frames
dfs = [df_ploidy, df_sex_assignments, df_bincov_median, df_wgd_scores, df_non_diploid,
df_manta_high_outlier, df_melt_high_outlier, df_wham_high_outlier, df_scramble_high_outlier, df_total_high_outliers,
df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_scramble_low_outlier, df_total_low_outliers,
df_manta_high_outlier, df_melt_high_outlier, df_wham_high_outlier, df_scramble_high_outlier,
df_dragen_high_outlier, df_total_high_outliers, df_manta_low_outlier, df_melt_low_outlier,
df_wham_low_outlier, df_scramble_low_outlier, df_dragen_low_outlier, df_total_low_outliers,
df_melt_insert_size]
for df in dfs:
df[ID_COL] = df[ID_COL].astype(object)
@@ -263,6 +276,14 @@ def main():
"-w", "--wham-qc-outlier-high-filename",
help="Sets the filename containing Wham QC outlier high.")

parser.add_argument(
"-t", "--scramble-qc-outlier-high-filename",
help="Sets the filename containing Scramble QC outlier high.")

parser.add_argument(
"-i", "--dragen-qc-outlier-high-filename",
help="Sets the filename containing Dragen QC outlier high.")

parser.add_argument(
"-a", "--manta-qc-outlier-low-filename",
help="Sets the filename containing Manta QC outlier low.")
@@ -280,8 +301,8 @@
help="Sets the filename containing Scramble QC outlier low.")

parser.add_argument(
"-t", "--scramble-qc-outlier-high-filename",
help="Sets the filename containing Scramble QC outlier high.")
"-j", "--dragen-qc-outlier-low-filename",
help="Sets the filename containing Dragen QC outlier low.")

parser.add_argument(
"-m", "--melt-insert-size-filename",
@@ -307,10 +328,12 @@
args.melt_qc_outlier_high_filename,
args.wham_qc_outlier_high_filename,
args.scramble_qc_outlier_high_filename,
args.dragen_qc_outlier_high_filename,
args.manta_qc_outlier_low_filename,
args.melt_qc_outlier_low_filename,
args.wham_qc_outlier_low_filename,
args.scramble_qc_outlier_low_filename,
args.dragen_qc_outlier_low_filename,
args.melt_insert_size_filename,
args.output_prefix)

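The "overall" outlier counts in `read_all_outlier` are built by summing the per-caller dictionaries with `collections.Counter`, as the hunk above shows. A minimal sketch of that merge, with made-up sample IDs and counts:

```python
from collections import Counter

# Hypothetical per-caller outlier counts keyed by sample ID
dict_manta = {"sample_A": 2, "sample_B": 1}
dict_wham = {"sample_A": 1}
dict_dragen = {"sample_A": 3, "sample_C": 2}

# Counter.update adds values for keys shared across dictionaries,
# so the merged counter is the per-sample sum across callers.
merged = Counter()
for per_caller in (dict_manta, dict_wham, dict_dragen):
    merged.update(per_caller)

# sample_A was an outlier 6 times across the callers
```

Samples missing from a caller's dictionary simply contribute nothing for that caller, which is why `Counter` is a good fit here.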
1 change: 1 addition & 0 deletions src/sv-pipeline/scripts/make_scramble_vcf.py
@@ -493,6 +493,7 @@ def main(argv: Optional[List[Text]] = None):
l1_size=arguments.l1_size)
logging.info("Loading MEI bed...")
mei_trees = create_trees_from_bed_records(arguments.mei_bed, padding=arguments.mei_padding)
# TODO: Do I also update this to reference Dragen?
logging.info("Loading Manta deletions...")
with pysam.VariantFile(arguments.manta_vcf) as f_manta:
del_filter_trees = dict()
2 changes: 1 addition & 1 deletion src/svtk/svtk/cli/standardize_vcf.py
@@ -30,7 +30,7 @@ def main(argv):
parser.add_argument('vcf', help='Raw VCF.')
parser.add_argument('fout', help='Standardized VCF.')
parser.add_argument('source', help='Source algorithm. '
'[delly,lumpy,manta,wham,melt,scramble]')
'[delly,lumpy,manta,wham,melt,scramble,dragen]')
parser.add_argument('-p', '--prefix', help='If provided, variant names '
'will be overwritten with this prefix.')
parser.add_argument('--include-reference-sites', action='store_true',
Expand Down
1 change: 1 addition & 0 deletions src/svtk/svtk/standardize/__init__.py
@@ -5,4 +5,5 @@
from .std_manta import MantaStandardizer
from .std_melt import MeltStandardizer
from .std_scramble import ScrambleStandardizer
from .std_dragen import DragenStandardizer
from .std_smoove import SmooveStandardizer