Refactor code for index directory #850

Merged
zhuchcn merged 4 commits into main from czhu-fix-index
Mar 1, 2024

zhuchcn
Member

@zhuchcn zhuchcn commented Feb 29, 2024

Description

I don't like the way we currently handle the index directory and its files. Right now we store the versions of moPepGen, Python, and Biopython in genome.pkl, annotation_gene.idx, annotation_tx.idx, and proteome.pkl. So here I created a metadata.json file to store all essential parameters, including versions, cleavage parameters (enzyme, miscleavages, etc.), and the genome/annotation source (GENCODE or ENSEMBL).

I also created an IndexDir class to handle the index directory. It is responsible for loading and dumping the different index files, which makes the code more organized.
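
For readers who have not looked at the diff, here is a minimal sketch of how a metadata.json file and an IndexDir-style wrapper might fit together. It is not the code from this PR: the `IndexMetadata` dataclass, its field names (`versions`, `cleavage_params`, `source`), and the `save_metadata` / `load_metadata` methods are all assumptions made for illustration.

```python
import json
import sys
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class IndexMetadata:
    """Hypothetical schema for metadata.json; the field names are assumptions."""
    versions: dict = field(default_factory=dict)
    cleavage_params: dict = field(default_factory=dict)
    source: str = ""  # e.g. "GENCODE" or "ENSEMBL"


class IndexDir:
    """Illustrative wrapper around an index directory (not the PR's class)."""

    def __init__(self, path):
        self.path = Path(path)
        # Create the directory up front so later writes cannot fail on it
        self.path.mkdir(parents=True, exist_ok=True)
        self.metadata_file = self.path / "metadata.json"

    def save_metadata(self, metadata: IndexMetadata) -> None:
        """Write all parameters to a single JSON file."""
        with open(self.metadata_file, "w") as handle:
            json.dump(asdict(metadata), handle, indent=2)

    def load_metadata(self) -> IndexMetadata:
        """Read metadata.json back into an IndexMetadata object."""
        with open(self.metadata_file) as handle:
            return IndexMetadata(**json.load(handle))


def current_python_version() -> str:
    """Return the running interpreter version as 'major.minor.micro'."""
    return ".".join(str(part) for part in sys.version_info[:3])
```

Keeping the versions and parameters in one JSON file means they can be inspected without unpickling any of the index files.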

Closes #818

Checklist

  • This PR does NOT contain PHI or germline genetic data. A repo may need to be deleted if such data is uploaded. Disclosing PHI is a major problem.
  • This PR does NOT contain molecular files, compressed files, output files such as images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
  • I have read the code review guidelines and the code review best practice on GitHub check-list.
  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
  • I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
  • All test cases passed locally.

@zhuchcn zhuchcn requested a review from lydiayliu February 29, 2024 17:16
@zhuchcn zhuchcn marked this pull request as ready for review February 29, 2024 17:19
Collaborator

@lydiayliu lydiayliu left a comment

I've always wondered: what happens if you use an index whose cleavage parameters differ from those passed to callVariant? Where are the cleavage parameters used in generateIndex (I assume just for the canonical proteome FASTA)? If they do differ, will things be reprocessed on the fly, or would the results be erroneous? (For example, when indexing I used peptide lengths 7-25 with no miscleavages, but for callVariant I use peptide lengths 4-35 with 2 miscleavages.)

Also, this is a pretty large change, so we should run some real data as a sanity check.

moPepGen/index.py (review thread resolved)
@zhuchcn
Member Author

zhuchcn commented Mar 1, 2024

The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.

It's a good idea to run a dataset as a sanity check. I have several PRs on the way, so I'll just run the small collaborator dataset (1 cell line) after all of these PRs are merged and before we release 1.3.
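
One way the idea of several canonical peptide pools sharing one index directory could work is to derive each pool's file name from its cleavage parameters. The sketch below is only an assumption about how that might look: the `peptide_pool_filename` helper, the `canonical_peptides_` prefix, and the parameter keys are hypothetical and not part of moPepGen.

```python
import hashlib
import json


def peptide_pool_filename(cleavage_params: dict) -> str:
    """Derive a stable file name from cleavage parameters (hypothetical scheme)."""
    # sort_keys makes the serialization independent of dict insertion order,
    # so the same parameters always map to the same file name.
    canonical = json.dumps(cleavage_params, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"canonical_peptides_{digest}.pkl"
```

With a scheme like this, the genome, annotation, and proteome files stay shared, and only the peptide pool is generated once per distinct parameter set.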

@zhuchcn zhuchcn merged commit c4ee3a9 into main Mar 1, 2024
2 checks passed
@zhuchcn zhuchcn deleted the czhu-fix-index branch March 1, 2024 01:05
@lydiayliu
Collaborator

> The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.

So with the current approach, would the results be wrong if the cleavage parameters differ between generateIndex and callVariant? If so, I think updating that would be a good idea. Could we even allow a change of enzyme?

@zhuchcn
Member Author

zhuchcn commented Mar 1, 2024

With the current approach, callVariant will actually fail if the parameters don't match those in metadata.json. Yeah, we can allow changing the enzyme too. I'll open an issue for this.
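
To make the fail-on-mismatch behaviour concrete, here is a minimal sketch of such a check. It is not moPepGen's actual implementation: the `check_cleavage_params` function and the parameter keys used in the usage note are hypothetical, and the only thing taken from this thread is that a mismatch with metadata.json should raise an error instead of silently producing results.

```python
def check_cleavage_params(index_params: dict, call_params: dict) -> None:
    """Raise ValueError if the two parameter sets differ (illustrative only)."""
    # Compare over the union of keys so a parameter missing on one side
    # is also reported as a mismatch.
    mismatched = {
        key: (index_params.get(key), call_params.get(key))
        for key in sorted(set(index_params) | set(call_params))
        if index_params.get(key) != call_params.get(key)
    }
    if mismatched:
        details = ", ".join(
            f"{key}: index={idx!r}, callVariant={call!r}"
            for key, (idx, call) in mismatched.items()
        )
        raise ValueError(
            f"Cleavage parameters do not match metadata.json: {details}"
        )
```

Using the example from the review, an index built with `{"min_length": 7, "max_length": 25, "miscleavage": 0}` and a call made with `{"min_length": 4, "max_length": 35, "miscleavage": 2}` would raise an error that names all three parameters.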

Development

Successfully merging this pull request may close these issues.

Complex validation for index files
2 participants