Refactor code for index directory #850
Conversation
I've always wondered: what happens if you use an index whose cleavage parameters differ from those in callVariant? Where are the cleavage parameters used in generateIndex (I assume just for the canonical proteome FASTA)? Will the index be reprocessed on the fly if they differ, or would the results be erroneous? (For example, when indexing I used peptide length 7-25 with no miscleavages, but for callVariant I use peptide length 4-35 with 2 miscleavages.)
Also, this is a pretty large change; we should run some real data as a sanity check.
The only place where the cleavage parameters are used is in generating the canonical peptide pool. So technically we could let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same. Running a dataset as a sanity check is a good idea. I have several PRs on the way, so I'll just run the small collaborator's data (1 cell line) after all these PRs and before we release 1.3.
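One way the "several canonical peptide pools in one index dir" idea could look is to derive each pool's file name from its cleavage parameters. This is only a sketch: `CleavageParams`, its field names, and the `canonical_peptides_<digest>.pkl` naming scheme are hypothetical and are not taken from moPepGen.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CleavageParams:
    """Hypothetical container for the parameters that shape a peptide pool."""
    enzyme: str = "trypsin"
    miscleavage: int = 2
    min_length: int = 7
    max_length: int = 25


def pool_filename(params: CleavageParams) -> str:
    """Return a deterministic file name for the pool built with `params`."""
    # Sorting the keys makes the serialized form, and so the digest, stable.
    payload = json.dumps(asdict(params), sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"canonical_peptides_{digest}.pkl"


indexed = CleavageParams(miscleavage=0, min_length=7, max_length=25)
requested = CleavageParams(miscleavage=2, min_length=4, max_length=35)
same_as_indexed = CleavageParams(miscleavage=0, min_length=7, max_length=25)
```

With such a scheme, the shared genome, annotation, and proteome files would be written once, and each distinct parameter set would map to its own pool file.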
So with the current approach, the results would be wrong if there is a mismatch of cleavage parameters between |
With the current approach, callVariant will actually fail if the parameters don't match with that from |
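The fail-on-mismatch behaviour described here could be sketched roughly as follows. The function name, the dictionary keys, and the error message are assumptions made for illustration; they are not moPepGen's actual implementation.

```python
def check_cleavage_params(indexed: dict, requested: dict) -> None:
    """Raise ValueError if the requested parameters differ from the indexed ones."""
    keys = sorted(set(indexed) | set(requested))
    # Collect every key whose value differs (or is missing on one side).
    mismatched = {
        key: (indexed.get(key), requested.get(key))
        for key in keys
        if indexed.get(key) != requested.get(key)
    }
    if mismatched:
        details = ", ".join(
            f"{key}: index={old!r}, requested={new!r}"
            for key, (old, new) in mismatched.items()
        )
        raise ValueError(f"Cleavage parameters do not match the index ({details})")


# The example from the question above: index built with 7-25 and no
# miscleavages, variant calling requested with 4-35 and 2 miscleavages.
index_params = {"min_length": 7, "max_length": 25, "miscleavage": 0}
call_params = {"min_length": 4, "max_length": 35, "miscleavage": 2}
```

Failing early like this trades convenience for safety: the user has to rebuild the canonical peptide pool, but a mismatched pool is never used silently.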
Description
I don't like the way we currently handle the index directory and its files. Right now we store the versions of moPepGen, Python, and Biopython in genome.pkl, annotation_gene.idx, annotation_tx.idx, and proteome.pkl. So here I created a metadata.json file to store all essential parameters, including versions, cleavage parameters (enzyme, miscleavages, etc.), and the genome/annotation source (GENCODE or ENSEMBL).
I also created an IndexDir class to handle the index directory. It is responsible for loading and dumping the different index files, which makes the code more organized.
Closes #818
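As a rough illustration of the design, a class that owns the directory and reads and writes metadata.json might look like the sketch below. The method names, the metadata layout, and the key names are assumptions made for this example and do not reflect the actual `IndexDir` in this PR.

```python
import json
from pathlib import Path


class IndexDir:
    """Minimal sketch of a wrapper around an index directory (hypothetical API)."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)
        self.metadata_file = self.path / "metadata.json"

    def save_metadata(self, versions: dict, cleavage_params: dict, source: str) -> None:
        """Write versions, cleavage parameters, and source to metadata.json."""
        metadata = {
            "versions": versions,
            "cleavage_params": cleavage_params,
            "source": source,
        }
        with open(self.metadata_file, "w", encoding="utf-8") as handle:
            json.dump(metadata, handle, indent=2)

    def load_metadata(self) -> dict:
        """Read metadata.json back into a dictionary."""
        with open(self.metadata_file, "r", encoding="utf-8") as handle:
            return json.load(handle)
```

Keeping versions and parameters in one plain JSON file means they can be inspected without unpickling any of the index files.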
Checklist
… .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
… CHANGELOG.md under the next release version or unreleased, and updated the date.