Refactor code for index directory #850

zhuchcn · 2024-02-29T17:16:46Z

Description

I don't like how we currently handle the index directory and its files. Right now we store the versions of moPepGen, Python, and Biopython in genome.pkl, annotation_gene.idx, annotation_tx.idx, and proteome.pkl. So I created a metadata.json file to store all essential parameters, including versions, cleavage parameters (enzyme, miscleavages, etc.), and the genome/annotation source (GENCODE or ENSEMBL).
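For illustration only, here is a minimal sketch of how such a metadata.json could be assembled and written. The key names, `build_metadata`, and `dump_metadata` are assumptions made up for this example and are not the actual moPepGen schema or API; the 2-space indent follows the commit message in this PR.

```python
import json
import platform
from pathlib import Path


def build_metadata(enzyme, miscleavage, min_length, max_length, source):
    """Collect index parameters into one JSON-serializable dict.

    All key names are hypothetical; the real metadata.json may differ.
    """
    return {
        "version": {
            "python": platform.python_version(),
            # The moPepGen and Biopython versions would be recorded here too.
        },
        "cleavage_params": {
            "enzyme": enzyme,
            "miscleavage": miscleavage,
            "min_length": min_length,
            "max_length": max_length,
        },
        "source": source,  # e.g. "GENCODE" or "ENSEMBL"
    }


def dump_metadata(index_dir, metadata):
    """Write metadata.json into the index directory with a 2-space indent."""
    path = Path(index_dir) / "metadata.json"
    with open(path, "w") as handle:
        json.dump(metadata, handle, indent=2)
    return path
```

Keeping everything in one plain-text file means the parameters can be inspected without unpickling any of the index files.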

I also created an IndexDir class to handle the index directory. It is responsible for loading and dumping the different index files, which makes the code more organized.
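As a rough sketch of the idea (not the actual IndexDir implementation), a class like this can own the file layout so that callers never build paths themselves. The class name `IndexDirSketch`, its method names, and the subset of files it covers are assumptions for this example.

```python
import json
import pickle
from pathlib import Path


class IndexDirSketch:
    """Hypothetical stand-in for an IndexDir-style class (not moPepGen's API)."""

    # A subset of the files named in the description, plus the new metadata.json.
    FILES = {
        "genome": "genome.pkl",
        "proteome": "proteome.pkl",
        "metadata": "metadata.json",
    }

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def _file(self, key):
        """Resolve a logical file key to its path inside the index directory."""
        return self.path / self.FILES[key]

    def save_pickle(self, key, obj):
        with open(self._file(key), "wb") as handle:
            pickle.dump(obj, handle)

    def load_pickle(self, key):
        with open(self._file(key), "rb") as handle:
            return pickle.load(handle)

    def save_metadata(self, metadata):
        with open(self._file("metadata"), "w") as handle:
            json.dump(metadata, handle, indent=2)

    def load_metadata(self):
        with open(self._file("metadata")) as handle:
            return json.load(handle)
```

With this layout, a caller would write `IndexDirSketch("index").load_pickle("genome")` instead of hard-coding file names in each command.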

Closes #818

Checklist

This PR does NOT contain PHI or germline genetic data. A repo may need to be deleted if such data is uploaded. Disclosing PHI is a major problem.
This PR does NOT contain molecular files, compressed files, output files such as images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
I have read the code review guidelines and the code review best practice on GitHub check-list.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
All test cases passed locally.

lydiayliu

I've always wondered: what happens if you use an index whose cleavage parameters differ from those passed to callVariant? Where are the cleavage parameters used in generateIndex (I assume just for the canonical proteome FASTA)? Will it be reprocessed on the fly if they differ, or would the results be erroneous? (For example, when indexing I used peptide length 7-25 with no miscleavages, but for callVariant I use peptide length 4-35 with 2 miscleavages.)

Also, this is a pretty large change, so we should run some real data as a sanity check.

moPepGen/index.py

zhuchcn · 2024-03-01T01:04:17Z

The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.
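One possible way to keep several canonical peptide pools side by side in the same index directory is to derive each pool's file name from its cleavage parameters. This is only a sketch of that idea: `pool_file_name`, the parameter keys, and the `canonical_peptides_` prefix are hypothetical and nothing here is implemented in this PR.

```python
import hashlib
import json


def pool_file_name(cleavage_params):
    """Return a deterministic file name for one set of cleavage parameters."""
    # sort_keys makes the serialization, and thus the digest, independent of
    # the order in which the parameters were supplied.
    encoded = json.dumps(cleavage_params, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(encoded).hexdigest()[:8]
    return f"canonical_peptides_{digest}.pkl"
```

The genome, annotation, and proteome files would stay shared, and only the peptide pool would be generated once per distinct parameter set.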

Running a dataset as a sanity check is a good idea. I have several PRs on the way, so I'll just run the collaborator's small dataset (1 cell line) after all these PRs and before we release 1.3.

lydiayliu · 2024-03-01T01:06:25Z

The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.

So with the current approach, the results would be wrong if there is a mismatch of cleavage parameters between generateIndex and callVariant? If so, then I think updating that would be a good idea. Could we even allow a change in enzyme?

zhuchcn · 2024-03-01T02:51:46Z

With the current approach, callVariant will actually fail if the parameters don't match those in metadata.json. Yeah, we can change the enzyme, too. I'll open an issue for this.
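A fail-fast check of this kind could look roughly like the sketch below. It is not the code in this PR: `check_cleavage_params`, the exception class, and the parameter keys are assumptions for illustration.

```python
class CleavageParamsMismatch(ValueError):
    """Raised when requested parameters differ from those stored in the index."""


def check_cleavage_params(stored, requested):
    """Compare requested cleavage parameters against the stored ones.

    Both arguments are plain dicts; the key names are hypothetical.
    """
    # Keep every requested key whose value differs from the stored value.
    mismatches = {
        key: (stored.get(key), value)
        for key, value in requested.items()
        if stored.get(key) != value
    }
    if mismatches:
        details = ", ".join(
            f"{key}: index={old!r}, requested={new!r}"
            for key, (old, new) in sorted(mismatches.items())
        )
        raise CleavageParamsMismatch(
            f"Cleavage parameters do not match the index ({details})"
        )
```

With the example from the review (index built with peptide length 7-25 and no miscleavages, callVariant run with 4-35 and 2 miscleavages), this sketch would raise and list the three differing values instead of silently producing results.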

zhuchcn added 2 commits February 29, 2024 09:07

add (IndexDir): refactored code for index directory

10081e7

fix (generateIndex): set metadata.json indent to 2

eb37d4f

zhuchcn requested a review from lydiayliu February 29, 2024 17:16

zhuchcn marked this pull request as ready for review February 29, 2024 17:19

zhuchcn added 2 commits February 29, 2024 09:20

doc: changelog updated

970caa1

fix (moPepGen): remove GTFIndexMetadata

4c425d5

lydiayliu approved these changes Mar 1, 2024

zhuchcn merged commit c4ee3a9 into main Mar 1, 2024
2 checks passed

zhuchcn deleted the czhu-fix-index branch March 1, 2024 01:05
