Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix file naming in GenotypeBatch.SplitVariants #712

Merged
merged 6 commits into from
Aug 19, 2024

Conversation

epiercehoffman
Copy link
Collaborator

Updates

The SplitVariants task in GenotypeBatch was not successfully incrementing file name suffixes; after reaching aaaaaz it looped back to aaaaaa. This resulted in variants being dropped from sufficiently large VCFs in the output.

This PR updates SplitVariants to use numeric file naming for simplicity, with a default of 9 characters in the suffix for a maximum of 1,000,000,000 files (to ensure that there can be at least as many files as alphabetical naming with 6 characters in the suffix, which allows for 308,915,776 files).

Testing

  • Tested manually on data that was large enough to trigger the issue with the previous SplitVariants version and verified that 52 files were produced instead of the previous and erroneous 26 files.
  • Tested manually on the same data with -d 1 and verified that the error message was produced after 10 files were written.
  • Validated all WDLs and JSONs with womtool and Terra validation script
  • Tested with an updated docker image to verify a successful run on the reference panel. The depth VCF was identical to the one produced by v0.28.5-beta (this test set is too small to observe the issue of overwriting files), although there were small differences in the PESR VCF, likely due to the updates to the way MEIs are treated (the number of records was identical)

Copy link
Collaborator

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just one suggestion on checking for overflow

next_char = alphabet[(index + 1) % 26]
return next_char + suffix[1:]
def get_file_name(prefix, suffix, digits):
if suffix >= 10 ** digits:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you're checking this but the unbounded exponent scares me a little. A safer check would be to get the actual suffix string with zfill, check if it's all 0's and then confirm that prefix == 0. If prefix > 0 we've overflowed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as discussed offline, and repeated local tests (with -d 9 and -d 1)

@epiercehoffman epiercehoffman merged commit 84a0627 into main Aug 19, 2024
8 checks passed
@epiercehoffman epiercehoffman deleted the eph_splitvariants_drop_recs branch August 19, 2024 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants