Ploidy for Foxtrot VDS [VS-1418] #9082

Merged 57 commits into ah_var_store on Jan 28, 2025

Conversation

@mcovarr (Collaborator) commented Jan 21, 2025

$ python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> hl.init()
/opt/conda/lib/python3.10/site-packages/hailtop/aiocloud/aiogoogle/user_config.py:43: UserWarning: Reading spark-defaults.conf to determine GCS requester pays configuration. This is deprecated. Please use `hailctl config set gcs_requester_pays/project` and `hailctl config set gcs_requester_pays/buckets`.
  warnings.warn(
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://saturn-bfc0786f-af2f-4a56-8ecc-d5b615682edc-m.us-central1-c.c.terra-18848130.internal:46199
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.130.post1-c69cd67afb8b
LOGGING: writing to /home/jupyter/hail-20250127-1852-0.2.130.post1-c69cd67afb8b.log
>>> diploid_vds_path = "gs://.../avro/gvs_export_diploid.vds"
>>> haploid_vds_path = "gs://.../avro/gvs_export_haploid.vds"
>>> diploid_vds = hl.vds.read_vds(diploid_vds_path)
>>> haploid_vds = hl.vds.read_vds(haploid_vds_path)
>>> diploid_rd = diploid_vds.reference_data
>>> haploid_rd = haploid_vds.reference_data
>>> diploid_rd = diploid_rd.filter_rows(diploid_rd.locus.contig == 'chrY', keep=True)
>>> diploid_rd = diploid_rd.filter_rows(diploid_rd.locus.in_autosome_or_par(), keep=False)
>>> haploid_rd = haploid_rd.filter_rows(haploid_rd.locus.contig == 'chrY', keep=True)
>>> haploid_rd = haploid_rd.filter_rows(haploid_rd.locus.in_autosome_or_par(), keep=False)
>>> haploid_rd = haploid_rd.filter_cols(haploid_rd.s == 'ERS4367797')
>>> diploid_rd = diploid_rd.filter_cols(diploid_rd.s == 'ERS4367797')
>>> diploid_rd.show(3)
+---------------+-----------------+------------------+-----------------+
| locus         | 'ERS4367797'.GQ | 'ERS4367797'.END | 'ERS4367797'.GT |
+---------------+-----------------+------------------+-----------------+
| locus<GRCh38> |           int32 |            int32 | call            |
+---------------+-----------------+------------------+-----------------+
| chrY:2781480  |              20 |          2781489 | 0/0             |
| chrY:2781490  |              30 |          2781501 | 0/0             |
| chrY:2781503  |              30 |          2781511 | 0/0             |
+---------------+-----------------+------------------+-----------------+
showing top 3 rows

>>> haploid_rd.show(3)
+---------------+-----------------+------------------+-----------------+
| locus         | 'ERS4367797'.GQ | 'ERS4367797'.END | 'ERS4367797'.GT |
+---------------+-----------------+------------------+-----------------+
| locus<GRCh38> |           int32 |            int32 | call            |
+---------------+-----------------+------------------+-----------------+
| chrY:2781480  |              20 |          2781489 | 0               |
| chrY:2781490  |              30 |          2781501 | 0               |
| chrY:2781503  |              30 |          2781511 | 0               |
+---------------+-----------------+------------------+-----------------+
showing top 3 rows

>>>

@mcovarr mcovarr marked this pull request as ready for review January 27, 2025 19:17
Comment on lines +49 to +55
# hg38 = hl.get_reference("GRCh38")
# xy_contigs = set(hg38.x_contigs + hg38.y_contigs)
# ploidy_table = {
#     contig: ploidy_table[key]
#     for contig, key in zip(hg38.contigs, sorted(ploidy_table))
#     if contig in xy_contigs
# }
Collaborator Author
These are the lines from the original PR that were giving me trouble, in particular the zip when I was supplying Avro data with more than just the X and Y contigs.
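For illustration, here is a minimal standalone sketch (plain Python, no Hail; the contig list and ploidy values are made up) of why that `zip` misbehaves once the ploidy table covers contigs beyond X and Y, and a name-keyed alternative that carries no positional assumption:

```python
# Hypothetical stand-ins for hl.get_reference("GRCh38").contigs and the
# ploidy table built from the Avro data. The values encode which contig
# each entry actually belongs to, so misalignment is visible.
contigs = ["chr1", "chrX", "chrY", "chrM"]   # reference (genome) order
xy_contigs = {"chrX", "chrY"}
ploidy_table = {c: f"ploidy-for-{c}" for c in contigs}

# sorted(ploidy_table) is ['chr1', 'chrM', 'chrX', 'chrY'], which no
# longer lines up positionally with the genome-ordered contig list, so
# zip silently pairs chrX with chrM's entry and chrY with chrX's.
zipped = {
    c: ploidy_table[k]
    for c, k in zip(contigs, sorted(ploidy_table))
    if c in xy_contigs
}

# Keying directly by contig name avoids any ordering assumption.
by_name = {c: ploidy_table[c] for c in contigs if c in xy_contigs}
```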

Collaborator

Do we understand why Chris did this? Should we check in with him about its removal?

Contributor
+1 to George's comment

@gbggrant (Collaborator) left a comment

Looks good to me. Just wondering if we should check with Chris V about that removed code snippet.

@@ -64,7 +65,9 @@ def run_in_cluster(cluster_name, account, worker_machine_type, master_machine_ty
)

# prepare custom arguments
secondary_script_path_arg = f'--py-files {" ".join(secondary_script_path_list)}' if secondary_script_path_list else ''
# the following says `--py-files` is supposed to be a comma separated list
Collaborator

How did this work before? Did we never pass multiple py-files?

@RoriCremer (Contributor) commented Jan 27, 2025

oooooh it probably didn't work before; we probably only ever gave it a single secondary script at a time

Collaborator Author
yes exactly, I was the lucky first person to supply more than one
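A tiny sketch of the difference being discussed (paths are hypothetical, only the joining logic matters): `spark-submit` expects `--py-files` to be a single comma-separated argument, so a space-joined list only attaches the first file.

```python
# Hypothetical GCS paths for the secondary scripts.
secondary_script_path_list = [
    "gs://bucket/helpers.py",
    "gs://bucket/ploidy.py",
]

# Space-joined (the old behavior): spark-submit would treat the second
# path as a separate positional argument, not part of --py-files.
space_joined = (
    f'--py-files {" ".join(secondary_script_path_list)}'
    if secondary_script_path_list else ''
)

# Comma-joined: one --py-files value, as spark-submit expects.
comma_joined = (
    f'--py-files {",".join(secondary_script_path_list)}'
    if secondary_script_path_list else ''
)
```

With a single-element list the two forms produce identical strings, which is consistent with the bug going unnoticed until more than one secondary script was supplied.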

@@ -1,6 +1,6 @@
version 1.0

import "GvsUtils.wdl" as Utils
import "../GvsUtils.wdl" as Utils
Collaborator
Oops

FROM \`~{project_id}.~{dataset_name}.~{ploidy_table_name}\` p
JOIN \`~{project_id}.~{dataset_name}.sample_info\` s ON p.sample_id = s.sample_id
WHERE (p.chromosome / 1000000000000 = 23 or p.chromosome / 1000000000000 = 24)
" --call_set_identifier ~{call_set_identifier} --dataset_name ~{dataset_name} --table_name ~{ploidy_table_name} --project_id=~{project_id}
Contributor
didn't we make a change where we only got the avro files for one BQ partition/one vet_x table/group of 4k samples at a time? We did that to make Hail faster when we passed in the vet and ref data. Do we not need to do this with ploidy data because it is added to the VDS at the end?

Collaborator Author
That and ploidy data is tiny: two rows per sample.
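As an aside, a hedged sketch of the arithmetic the WHERE clause above relies on. The 10^12 multiplier and the contig indices 23/24 for chrX/chrY are inferred from the query itself, not confirmed elsewhere in this thread:

```python
# Assumption inferred from the SQL: p.chromosome stores the contig index
# scaled by 10**12 (23 -> chrX, 24 -> chrY), so dividing by the
# multiplier recovers the index, mirroring
# `p.chromosome / 1000000000000 = 23 or ... = 24`.
CHROM_MULTIPLIER = 10**12

def contig_index(chromosome: int) -> int:
    # Integer division recovers the contig index from the scaled value.
    return chromosome // CHROM_MULTIPLIER

chr_x = 23 * CHROM_MULTIPLIER
chr_y = 24 * CHROM_MULTIPLIER
```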

@mcovarr mcovarr merged commit c8feb1b into ah_var_store Jan 28, 2025
20 of 21 checks passed
@mcovarr mcovarr deleted the vs_1418_ploidy_for_foxtrot_vds branch January 28, 2025 19:38