Ploidy for Foxtrot VDS [VS-1418] #9082

Merged 57 commits into ah_var_store on Jan 28, 2025

Conversation

@mcovarr (Collaborator) commented Jan 21, 2025

$ python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> hl.init()
/opt/conda/lib/python3.10/site-packages/hailtop/aiocloud/aiogoogle/user_config.py:43: UserWarning: Reading spark-defaults.conf to determine GCS requester pays configuration. This is deprecated. Please use `hailctl config set gcs_requester_pays/project` and `hailctl config set gcs_requester_pays/buckets`.
  warnings.warn(
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://saturn-bfc0786f-af2f-4a56-8ecc-d5b615682edc-m.us-central1-c.c.terra-18848130.internal:46199
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.130.post1-c69cd67afb8b
LOGGING: writing to /home/jupyter/hail-20250127-1852-0.2.130.post1-c69cd67afb8b.log
>>> diploid_vds_path = "gs://.../avro/gvs_export_diploid.vds"
>>> haploid_vds_path = "gs://.../avro/gvs_export_haploid.vds"
>>> diploid_vds = hl.vds.read_vds(diploid_vds_path)
>>> haploid_vds = hl.vds.read_vds(haploid_vds_path)
>>> diploid_rd = diploid_vds.reference_data
>>> haploid_rd = haploid_vds.reference_data
>>> diploid_rd = diploid_rd.filter_rows(diploid_rd.locus.contig == 'chrY', keep=True)
>>> diploid_rd = diploid_rd.filter_rows(diploid_rd.locus.in_autosome_or_par(), keep=False)
>>> haploid_rd = haploid_rd.filter_rows(haploid_rd.locus.contig == 'chrY', keep=True)
>>> haploid_rd = haploid_rd.filter_rows(haploid_rd.locus.in_autosome_or_par(), keep=False)
>>> haploid_rd = haploid_rd.filter_cols(haploid_rd.s == 'ERS4367797')
>>> diploid_rd = diploid_rd.filter_cols(diploid_rd.s == 'ERS4367797')
>>> diploid_rd.show(3)
+---------------+-----------------+------------------+-----------------+
| locus         | 'ERS4367797'.GQ | 'ERS4367797'.END | 'ERS4367797'.GT |
+---------------+-----------------+------------------+-----------------+
| locus<GRCh38> |           int32 |            int32 | call            |
+---------------+-----------------+------------------+-----------------+
| chrY:2781480  |              20 |          2781489 | 0/0             |
| chrY:2781490  |              30 |          2781501 | 0/0             |
| chrY:2781503  |              30 |          2781511 | 0/0             |
+---------------+-----------------+------------------+-----------------+
showing top 3 rows

>>> haploid_rd.show(3)
+---------------+-----------------+------------------+-----------------+
| locus         | 'ERS4367797'.GQ | 'ERS4367797'.END | 'ERS4367797'.GT |
+---------------+-----------------+------------------+-----------------+
| locus<GRCh38> |           int32 |            int32 | call            |
+---------------+-----------------+------------------+-----------------+
| chrY:2781480  |              20 |          2781489 | 0               |
| chrY:2781490  |              30 |          2781501 | 0               |
| chrY:2781503  |              30 |          2781511 | 0               |
+---------------+-----------------+------------------+-----------------+
showing top 3 rows

>>>

@mcovarr mcovarr marked this pull request as ready for review January 27, 2025 19:17
Comment on lines +49 to +55
# hg38 = hl.get_reference("GRCh38")
# xy_contigs = set(hg38.x_contigs + hg38.y_contigs)
# ploidy_table = {
#     contig: ploidy_table[key]
#     for contig, key in zip(hg38.contigs, sorted(ploidy_table))
#     if contig in xy_contigs
# }
Collaborator Author
These are the lines from the original PR that were giving me trouble, in particular the zip when I was supplying Avro data with more than just the X and Y contigs.
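For illustration, here is a minimal standalone sketch (plain Python, no Hail; the contig list and ploidy values are made up) of why that `zip` misbehaves once the ploidy table covers contigs beyond X and Y, and a name-keyed alternative that carries no positional assumption:

```python
# Hypothetical stand-ins for hl.get_reference("GRCh38").contigs and the
# ploidy table built from the Avro data. The values encode which contig
# each entry actually belongs to, so misalignment is visible.
contigs = ["chr1", "chrX", "chrY", "chrM"]   # reference (genome) order
xy_contigs = {"chrX", "chrY"}
ploidy_table = {c: f"ploidy-for-{c}" for c in contigs}

# sorted(ploidy_table) is ['chr1', 'chrM', 'chrX', 'chrY'], which no
# longer lines up positionally with the genome-ordered contig list, so
# zip silently pairs chrX with chrM's entry and chrY with chrX's.
zipped = {
    c: ploidy_table[k]
    for c, k in zip(contigs, sorted(ploidy_table))
    if c in xy_contigs
}

# Keying directly by contig name avoids any ordering assumption.
by_name = {c: ploidy_table[c] for c in contigs if c in xy_contigs}
```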

Collaborator

Do we understand why Chris did this? Should we check in with him about its removal?

Contributor
+1 to George's comment

@gbggrant (Collaborator) left a comment

Looks good to me. Just wondering if we should check with Chris V about that removed code snippet.

@@ -64,7 +65,9 @@ def run_in_cluster(cluster_name, account, worker_machine_type, master_machine_ty
)

# prepare custom arguments
secondary_script_path_arg = f'--py-files {" ".join(secondary_script_path_list)}' if secondary_script_path_list else ''
# the following says `--py-files` is supposed to be a comma separated list
Collaborator

How did this work before? Did we never pass multiple py-files?

@RoriCremer (Contributor) commented Jan 27, 2025

oooooh it probably didn't work before; we probably only ever gave it a single secondary script at a time

Collaborator Author
yes exactly, I was the lucky first person to supply more than one
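A tiny sketch of the difference being discussed (paths are hypothetical, only the joining logic matters): `spark-submit` expects `--py-files` to be a single comma-separated argument, so a space-joined list only attaches the first file.

```python
# Hypothetical GCS paths for the secondary scripts.
secondary_script_path_list = [
    "gs://bucket/helpers.py",
    "gs://bucket/ploidy.py",
]

# Space-joined (the old behavior): spark-submit would treat the second
# path as a separate positional argument, not part of --py-files.
space_joined = (
    f'--py-files {" ".join(secondary_script_path_list)}'
    if secondary_script_path_list else ''
)

# Comma-joined: one --py-files value, as spark-submit expects.
comma_joined = (
    f'--py-files {",".join(secondary_script_path_list)}'
    if secondary_script_path_list else ''
)
```

With a single-element list the two forms produce identical strings, which is consistent with the bug going unnoticed until more than one secondary script was supplied.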

@@ -1,6 +1,6 @@
version 1.0

import "GvsUtils.wdl" as Utils
import "../GvsUtils.wdl" as Utils
Collaborator
Oops

FROM \`~{project_id}.~{dataset_name}.~{ploidy_table_name}\` p
JOIN \`~{project_id}.~{dataset_name}.sample_info\` s ON p.sample_id = s.sample_id
WHERE (p.chromosome / 1000000000000 = 23 or p.chromosome / 1000000000000 = 24)
" --call_set_identifier ~{call_set_identifier} --dataset_name ~{dataset_name} --table_name ~{ploidy_table_name} --project_id=~{project_id}
Contributor
didn't we make a change where we only got the avro files for one BQ partition/one vet_x table/group of 4k samples at a time? We did that to make Hail faster when we passed in the vet and ref data. Do we not need to do this with ploidy data because it is added to the VDS at the end?

Collaborator Author
That and ploidy data is tiny: two rows per sample.
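As an aside, a hedged sketch of the arithmetic the WHERE clause above relies on. The 10^12 multiplier and the contig indices 23/24 for chrX/chrY are inferred from the query itself, not confirmed elsewhere in this thread:

```python
# Assumption inferred from the SQL: p.chromosome stores the contig index
# scaled by 10**12 (23 -> chrX, 24 -> chrY), so dividing by the
# multiplier recovers the index, mirroring
# `p.chromosome / 1000000000000 = 23 or ... = 24`.
CHROM_MULTIPLIER = 10**12

def contig_index(chromosome: int) -> int:
    # Integer division recovers the contig index from the scaled value.
    return chromosome // CHROM_MULTIPLIER

chr_x = 23 * CHROM_MULTIPLIER
chr_y = 24 * CHROM_MULTIPLIER
```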

@mcovarr mcovarr merged commit c8feb1b into ah_var_store Jan 28, 2025
20 of 21 checks passed
@mcovarr mcovarr deleted the vs_1418_ploidy_for_foxtrot_vds branch January 28, 2025 19:38