Change data structure used when calling misc peptides to improve performance #736
Conversation
…dict instead of set to avoid hashing the entire object again and again
This is not fixed yet.
Will hold for updates!
…or too short to significantly improve efficiency. #736
For this issue, creating the cleavage graph was instant, but calling peptides with miscleavages from it took forever. I did the following two things to make it fast:

1. Skip peptides early if too long or too short. When calling peptides with misc, we traverse the graph, and for each node we first put all three lists of nodes into a "stage" area, with each list being the nodes with 0..k miscleavages. For each node list, we then join them and check whether the sequence fits the criteria (length, mw, # variants). So here, if the sequence joined from a node list is too long or too short, I avoid staging it to reduce the number of nodes being processed. This improved the performance significantly.

2. Call W2F later, at the sequence level rather than the node level. W2F used to be called at the node level, so when a list of misc nodes was "staged", the versions of the sequences with W2F would also be generated. If there are equivalent sequences in the graph, the same operation gets repeated. So here I moved this operation later, to the variant peptide pool: after all peptides are called from the graph, each peptide in the pool is iterated through and a copy with the W2F reassignments is generated.

I'll run some fuzz tests now to see if anything goes crazy.
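The early length check in step 1 can be sketched as follows. This is a minimal illustration, not moPepGen's actual API: `should_stage`, the length bounds, and the list-of-sequences shape are all assumptions.

```python
# Hypothetical peptide-length bounds; the real criteria also cover
# molecular weight and variant count.
MIN_LEN, MAX_LEN = 7, 25

def should_stage(node_seqs, min_len=MIN_LEN, max_len=MAX_LEN):
    """Cheap check before joining a node list and fully validating it.

    If the joined sequence can never satisfy the length filter, skip
    staging the node list entirely instead of processing it downstream.
    """
    total = sum(len(s) for s in node_seqs)
    return min_len <= total <= max_len

# Only lists whose joined length falls within bounds get staged.
candidates = [["MKTAYR", "IAK"], ["MK"], ["MKTAYR" * 10]]
staged = [seqs for seqs in candidates if should_stage(seqs)]
```

The point is that the length sum is far cheaper than joining, hashing, and filtering the full sequence, so rejecting early shrinks the set of nodes the expensive path ever sees.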
Why is it k-1? Should I run this on the 2 samples and see if they go through?
Not k-1, just k.
Yup, go ahead and run the 2 samples. Are these the last 2 samples left?
The other sample CPCG0397 is stuck because of a different issue. I'll open a new issue for that and I think we can merge this PR. |
```diff
 upstream_cleave_alts = [v.variant for v in self.variants
-                        if v.location.end == len(self.seq.seq)]
+                        if v.location.end == seq_len]
```
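The change above hoists the loop-invariant `len(self.seq.seq)` out of the comprehension into a precomputed `seq_len`. A standalone sketch of the same micro-optimization (names here are illustrative, not the project's):

```python
seq = "ACDEFGHIK" * 1000
locations = list(range(len(seq) + 1))

# Before: len(seq) is re-evaluated on every iteration of the comprehension.
alts_slow = [loc for loc in locations if loc == len(seq)]

# After: compute the invariant once and reuse it.
seq_len = len(seq)
alts_fast = [loc for loc in locations if loc == seq_len]
```

Since `len()` on a Python string is O(1), the saving is only the repeated call overhead per iteration, which is consistent with the comment below that it "speeds it up a little bit" rather than dramatically.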
lol I don't think this changed anything? XD
Believe it or not, this actually speeds it up a little bit.
Fun! XD
Description
The issue with the two stalling samples seems to be at the step where misc peptides are called. Profiling showed that a lot of time was spent hashing the variant records, so I changed the data structure to `Dict[str, VariantRecord]`, where the key is the variant ID. This avoids the same variant being hashed over and over. I also found a lot of time was spent checking whether a sequence equals `'*'`, and updated that to make it faster. Now the transcript finishes on my local laptop in 5 min. Could you try this on the two stalling samples?
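The switch from a set of full records to an ID-keyed dict might look like the sketch below. The class shape and function name are hypothetical stand-ins, assuming only that each variant carries a short string ID.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

@dataclass
class VariantRecord:
    """Hypothetical stand-in for the project's variant record."""
    variant_id: str
    ref: str
    alt: str

def collect_variants(nodes: Iterable[List[VariantRecord]]) -> Dict[str, VariantRecord]:
    """Dedupe variants by ID instead of inserting records into a set.

    With a set, every insert hashes the entire (potentially large) record;
    with a dict keyed by variant ID, only the short ID string is hashed.
    """
    pool: Dict[str, VariantRecord] = {}
    for node_variants in nodes:
        for variant in node_variants:
            pool.setdefault(variant.variant_id, variant)
    return pool

nodes = [
    [VariantRecord("v1", "A", "T")],
    [VariantRecord("v1", "A", "T"), VariantRecord("v2", "C", "G")],
]
pool = collect_variants(nodes)
```

The dedup semantics are unchanged (assuming IDs uniquely identify variants); only the per-lookup hashing cost drops.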
Closes #...
Checklist
- This PR does not include any `.png`, `.jpeg`, `.pdf`, `.RData`, `.xlsx`, `.doc`, `.ppt`, or other non-plain-text files. To automatically exclude such files using a `.gitignore` file, see here for example.
- I have updated `CHANGELOG.md` under the next release version or unreleased, and updated the date.