Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change data structure used when calling misc peptides to improve performance #736

Merged
merged 5 commits into from
May 16, 2023

Conversation

zhuchcn
Copy link
Member

@zhuchcn zhuchcn commented May 6, 2023

Description

The issue of the two stalling samples seem to be at the step that misc peptides are called. Profiling results showed that a lot of time spent on hashing the variant record. So here I changed the data structure to Dict[str,VariantRecord] that the key is the variant ID. This can then avoid the same variant being hashed over and over.

Also I found it spent a lot of time checking whether the sequence equals to '*'. Also updated it to make it faster. Now the transcript finishes on my local laptop in 5 min. Could you try this on the two stalling samples?

Closes #...

Checklist

  • This PR does NOT contain PHI or germline genetic data. A repo may need to be deleted if such data is uploaded. Disclosing PHI is a major problem.
  • This PR does NOT contain molecular files, compressed files, output files such as images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
  • I have read the code review guidelines and the code review best practice on GitHub check-list.
  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
  • I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
  • All test cases passed locally.

…dict instead of set to avoid hashing the entire object again and again
@zhuchcn zhuchcn requested a review from lydiayliu May 6, 2023 16:16
@zhuchcn
Copy link
Member Author

zhuchcn commented May 7, 2023

This is not fixed yet..

@lydiayliu
Copy link
Collaborator

Will hold for updates!

…or too short to significantly improve efficiency. #736
@zhuchcn
Copy link
Member Author

zhuchcn commented May 14, 2023

For this issue, it took instant time to create the cleavage graph, but forever when calling peptides with misc from it. I did the following two things to make it fast:

Skip peptides earlier if too long or too short

When calling peptide4s with misc, we traverse the graph, and for each node, we first put all three lists of nodes into "stage" area, with each list being the nodes with 0..k miscleavages. For each node list, we then join then and check whether the sequence fit the criteria (length, mw, # variants). So here if the sequence joined from a node list is too long or too short, I'll avoid them being staged to reduce the number of nodes being processed. This improved the performance significantly.

Call W2F later and sequence level rather than node level

W2F was used to be called at the node level. So when a list of misc nodes are "staged", the version of the sequences with W2F will also be generated. So if there are equivalent sequences in the graph, the same operation gets repeated. So here I moved this operation later to the variant peptide pool. So after all peptides are called from the graph, each peptides from the pool is iterated thru and generate the copy with W2F reassignments.

I'll run some fuzz test now to see if anything goes crazy.

@lydiayliu
Copy link
Collaborator

for each node, we first put all three lists of nodes into "stage" area, with each list being the nodes with 0..k-1 miscleavages

Why is it k-1?

Should I run this on the 2 samples and see if they go through?

@zhuchcn
Copy link
Member Author

zhuchcn commented May 15, 2023

Why is it k-1?

Not k-1, just k.

Should I run this on the 2 samples and see if they go through?

Yup, go ahead and run the 2 samples. Are the the last 2 samples left?

@zhuchcn
Copy link
Member Author

zhuchcn commented May 16, 2023

The other sample CPCG0397 is stuck because of a different issue. I'll open a new issue for that and I think we can merge this PR.

upstream_cleave_alts = [v.variant for v in self.variants
if v.location.end == len(self.seq.seq)]
if v.location.end == seq_len]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol I don't thnk this changed anything? XD

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beleive it or not, this actually speeds it up a little bit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun! XD

@zhuchcn zhuchcn merged commit 3e6e7b9 into main May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants