Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refactor suspension type validation logic to be simpler and more performant #1155

Merged
merged 5 commits into from
Dec 16, 2024

Conversation

nayib-jose-gloria
Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria commented Dec 13, 2024

Reason for Change

  • Anecdotally, during testing for other tickets, suspension_type was taking unreasonably long for validate (>1 hr for dataset with 4M obs rows x 31 columns).
    • This is likely due to executing 1 query per rule in the schema definition rather than batching together queries that evaluate to the same suspension_type value.
  • The code for defining + evaluating dependency rules is quite complex and can be simplified given our use cases
  • The trade-off for this complexity is slightly more specific error messages; I believe we can achieve similarly helpful error messages with slightly less specificity traded off for more performant, simpler code.

Changes

  • refactor schema_definition.yaml to always use the (formerly known as) "complex_rule" format of defining your pandas queries in YAML rather than directly writing pandas queries as part of the definition.
    - therefore, rename "complex_rule" --> "rule" and rewrite existing "rule" entries to use this format
  • refactor assay_ontology_term_id.dependencies definition to group together assays that are validating for the same suspension_type (i.e. instead of 3 separate entries for "EFO:1", "EFO:2", and "descendants of EFO:3" must evaluate to suspension_type "cell", group them all into one dependency entry).
    - the groups are: suspension_type should be "cell" vs. "nucleus" vs. "na" vs. "cell or nucleus".
  • refactor _validate_column_dependencies to simplify the query builder--build set of terms to check (including descendants, if applicable), and run a single "isin" query per rule in schema_definition.yaml rather than a custom vectorized function. This is more performant for our use cases.
  • deleted _generate_match_ancestors_query_fn, no longer used
  • add logic to build an error_message_suffix for dependencies in cases where we do not define one in the schema_definition.yaml. This will log which column values were matched and did not fit the rule definition (i.e. which assays in the df contain a suspension_type value validation error)

Testing

  • unit tests all pass
  • locally, ran tests, and suspension_type validation is much faster (in 4M row case, it took 3 min rather than 1 hr)

Notes for Reviewer

@nayib-jose-gloria nayib-jose-gloria marked this pull request as ready for review December 13, 2024 16:00
Copy link

codecov bot commented Dec 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.65%. Comparing base (c2f4979) to head (17df03a).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1155      +/-   ##
==========================================
- Coverage   90.68%   90.65%   -0.04%     
==========================================
  Files          18       18              
  Lines        2062     2054       -8     
==========================================
- Hits         1870     1862       -8     
  Misses        192      192              
Components Coverage Δ
cellxgene_schema_cli 92.04% <100.00%> (-0.05%) ⬇️
migration_assistant 91.26% <ø> (ø)
schema_bump_dry_run_genes 79.80% <ø> (ø)
schema_bump_dry_run_ontologies 99.53% <ø> (ø)

@ejmolinelli
Copy link
Contributor

This looks great! Performs as advertised

@nayib-jose-gloria nayib-jose-gloria merged commit ee393cd into main Dec 16, 2024
14 checks passed
@nayib-jose-gloria nayib-jose-gloria deleted the nayib/suspension-type-refactor-2 branch December 16, 2024 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants