Change data structure used when calling misc peptides to improve performance #736

zhuchcn · 2023-05-06T16:15:55Z

Description

The two stalling samples seem to stall at the step where misc peptides are called. Profiling showed that a lot of time was spent hashing the variant record. So here I changed the data structure to Dict[str,VariantRecord], where the key is the variant ID. This avoids the same variant being hashed over and over.
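
To illustrate the idea, here is a minimal sketch, not the actual moPepGen code. The `Variant` class and both helper functions are hypothetical stand-ins; the class counts how often its `__hash__` runs so the difference between a set of records and a dict keyed by ID is visible.

```python
from typing import Dict, Iterable, Set


class Variant:
    """Hypothetical stand-in for a variant record (not the real VariantRecord)."""
    hash_calls = 0  # counts how many times __hash__ is invoked

    def __init__(self, id: str, start: int, end: int, ref: str, alt: str):
        self.id = id
        self.start = start
        self.end = end
        self.ref = ref
        self.alt = alt

    def _key(self):
        return (self.id, self.start, self.end, self.ref, self.alt)

    def __eq__(self, other):
        return isinstance(other, Variant) and self._key() == other._key()

    def __hash__(self):
        # Hashing touches every field, which is the cost being avoided.
        Variant.hash_calls += 1
        return hash(self._key())


def collect_as_set(variants: Iterable[Variant]) -> Set[Variant]:
    """Each add() hashes the whole record."""
    seen: Set[Variant] = set()
    for v in variants:
        seen.add(v)
    return seen


def collect_as_dict(variants: Iterable[Variant]) -> Dict[str, Variant]:
    """Only the string ID is hashed; the record itself never is."""
    seen: Dict[str, Variant] = {}
    for v in variants:
        seen[v.id] = v
    return seen
```

This assumes the variant ID uniquely identifies a record, which is what keying by ID relies on. In CPython, `str` objects also cache their hash, so repeated lookups with the same ID string stay cheap.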

I also found that a lot of time was spent checking whether the sequence equals '*', so I updated that check to make it faster. Now the transcript finishes on my local laptop in 5 min. Could you try this on the two stalling samples?

Closes #...

Checklist

This PR does NOT contain PHI or germline genetic data. A repo may need to be deleted if such data is uploaded. Disclosing PHI is a major problem.
This PR does NOT contain molecular files, compressed files, output files such as images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
I have read the code review guidelines and the code review best practice on GitHub check-list.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
All test cases passed locally.


zhuchcn · 2023-05-07T13:52:22Z

This is not fixed yet.

lydiayliu · 2023-05-08T17:11:20Z

Will hold for updates!


zhuchcn · 2023-05-14T08:16:07Z

For this issue, creating the cleavage graph was almost instant, but calling peptides with misc from it took forever. I did the following two things to make it fast:

Skip peptides earlier if too long or too short

When calling peptides with misc, we traverse the graph, and for each node, we first put all three lists of nodes into a "stage" area, with each list being the nodes with 0..k miscleavages. For each node list, we then join the nodes and check whether the sequence fits the criteria (length, mw, # variants). So here, if the sequence joined from a node list is too long or too short, I avoid staging it, which reduces the number of nodes being processed. This improved the performance significantly.
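
The early length filter can be sketched roughly as below. This is a simplified illustration only: it uses a linear list of cleaved fragments instead of a graph, and `stage_candidates` and its parameters are hypothetical names, not moPepGen functions.

```python
from typing import List


def stage_candidates(fragments: List[str], start: int, max_misc: int,
                     min_len: int, max_len: int) -> List[List[str]]:
    """Return the fragment lists starting at `start` with 0..max_misc
    miscleavages whose joined length is within [min_len, max_len].

    Lists that are too short or too long are never staged, so later,
    more expensive checks (mw, number of variants) do not see them.
    """
    staged: List[List[str]] = []
    length = 0
    for misc in range(max_misc + 1):
        end = start + misc + 1
        if end > len(fragments):
            break  # ran out of downstream fragments
        length += len(fragments[end - 1])
        if length > max_len:
            break  # adding more fragments only makes it longer
        if length < min_len:
            continue  # too short now, but a longer list may still fit
        staged.append(fragments[start:end])
    return staged
```

For example, with fragments `["MK", "AAAAR", "CCCCCCK", "DDDDDDDDDDDDR"]`, `max_misc=2`, `min_len=7` and `max_len=14`, starting at index 0 stages the two-fragment and three-fragment lists but not `["MK"]` alone.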

Call W2F later and at the sequence level rather than the node level

W2F used to be called at the node level. So when a list of misc nodes was "staged", the W2F versions of the sequences were also generated. If there are equivalent sequences in the graph, the same operation gets repeated. So here I moved this operation later, to the variant peptide pool. After all peptides are called from the graph, each peptide in the pool is iterated through to generate its copies with W2F reassignments.
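
A rough sketch of doing the W2F expansion once per unique sequence in the pool is shown below. The function names are hypothetical, and it assumes each W can independently be reassigned to F; the actual rule in the code base may differ.

```python
from itertools import combinations
from typing import Iterable, Set


def w2f_variants(seq: str) -> Set[str]:
    """All copies of `seq` with one or more W residues replaced by F.

    Assumption for illustration: every W may be reassigned independently.
    """
    positions = [i for i, aa in enumerate(seq) if aa == "W"]
    out: Set[str] = set()
    for n in range(1, len(positions) + 1):
        for combo in combinations(positions, n):
            chars = list(seq)
            for i in combo:
                chars[i] = "F"
            out.add("".join(chars))
    return out


def add_w2f_to_pool(pool: Iterable[str]) -> Set[str]:
    """Expand W2F once per unique sequence, after all peptides are called."""
    unique = set(pool)  # equivalent sequences collapse before expansion
    result = set(unique)
    for seq in unique:
        result |= w2f_variants(seq)
    return result
```

Because the pool is deduplicated first, a sequence that appears many times in the graph is expanded only once, which is the saving described above.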

I'll run some fuzz tests now to see if anything goes crazy.

lydiayliu · 2023-05-14T21:39:05Z

for each node, we first put all three lists of nodes into "stage" area, with each list being the nodes with 0..k-1 miscleavages

Why is it k-1?

Should I run this on the 2 samples and see if they go through?

zhuchcn · 2023-05-15T04:38:08Z

Why is it k-1?

Not k-1, just k.

Should I run this on the 2 samples and see if they go through?

Yup, go ahead and run the 2 samples. Are these the last 2 samples left?

zhuchcn · 2023-05-16T10:48:17Z

The other sample CPCG0397 is stuck because of a different issue. I'll open a new issue for that and I think we can merge this PR.

lydiayliu · 2023-05-16T15:14:02Z

moPepGen/svgraph/PVGNode.py

        upstream_cleave_alts = [v.variant for v in self.variants
-            if v.location.end == len(self.seq.seq)]
+            if v.location.end == seq_len]


lol I don't think this changed anything? XD

Believe it or not, this actually speeds it up a little bit.
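
The change in the diff amounts to hoisting a loop-invariant expression out of a comprehension. A minimal sketch with hypothetical stand-in types (not the real PVGNode classes):

```python
from typing import List, NamedTuple


class Location(NamedTuple):
    start: int
    end: int


class Variant(NamedTuple):
    name: str
    location: Location


def variants_at_end_inline(seq: str, variants: List[Variant]) -> List[str]:
    # len(seq) is re-evaluated for every variant in the list.
    return [v.name for v in variants if v.location.end == len(seq)]


def variants_at_end_hoisted(seq: str, variants: List[Variant]) -> List[str]:
    # The length is computed once and reused as a local variable.
    seq_len = len(seq)
    return [v.name for v in variants if v.location.end == seq_len]
```

Both functions return the same result. In the original line the expression was `len(self.seq.seq)`, so each iteration also paid for two attribute lookups, which may be why the saving is noticeable when the comprehension runs very many times.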

fix (svgraph): update data structure to improve performance by using …

81ace30

…dict instead of set to avoid hashing the entire object again and again

zhuchcn requested a review from lydiayliu May 6, 2023 16:16

fix (VariantPeptideDict): turn more set into dict

0e9b09d

fix (callVariant): skip peptides earlier if they are either too long …

d76caf6

…or too short to significantly improve efficiency. #736

zhuchcn added 2 commits May 14, 2023 16:20

doc (CHANGELOG): changelog updated

2d08a51

fix (bruteForce): sect peptides + w2f not considered

8c538af

lydiayliu approved these changes May 16, 2023


zhuchcn merged commit 3e6e7b9 into main May 16, 2023
