Make dedup parallel #69
Comments
So on one test file (from a previous issue), dedup took 4 times longer with 4 CPUs. This presumably means that the overhead of setting up 4 processes is higher than the gain from parallelising the edit_distance calculations when there are only a few UMIs at each position.
Yeah, that's a shame, but I think your approach of checking first to see how many UMIs there are at the position makes a lot of sense. This does mean we probably won't see anything like a linear improvement in time taken for additional threads, though. Have you had a look at pinning down how many UMIs you need for the additional threads to be useful? If not, I can look at this. We could then derive a sensible threshold for using the multi-threading for each call to dedup, depending on the UMI length (and number of threads?).
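As a rough illustration of what such a threshold could look like, here is a minimal Python sketch that only hands the pairwise comparisons to a worker pool once a position exceeds an assumed UMI count. The `edit_distance` helper, `pairwise_distances`, and the 1000-UMI cut-off are illustrative stand-ins, not the UMI-tools implementation.

```python
# Hypothetical threshold-gated parallelism: only use a worker pool when a
# position has enough UMIs to outweigh the cost of spawning processes and
# pickling arguments. The threshold value is an assumption to be benchmarked.
from itertools import combinations
from multiprocessing import Pool

UMI_THRESHOLD = 1000  # assumed break-even point; would need benchmarking


def edit_distance(pair):
    """Hamming distance between two equal-length UMIs."""
    a, b = pair
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(umis, pool=None):
    pairs = list(combinations(umis, 2))
    if pool is not None and len(umis) > UMI_THRESHOLD:
        return pool.map(edit_distance, pairs)   # parallel path for very deep positions
    return [edit_distance(p) for p in pairs]    # serial path avoids the overhead


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pairwise_distances(["ATGCA", "ATGCC", "TTGCA"], pool))
```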
I don't really have a good file to test it on - I need something that has at least some really high-depth positions. We probably could get linear improvements if we could manage to parallelise the whole top-level loop. The main issue with multiprocessing the top-level loop is that the pysam read objects can't be pickled to pass between processes.
I guess the easiest way to get around any potential issue with pickling the reads would be to parallelise at the contig level. What do you think @AndreasHeger? Have you any experience with parallelising pysam?
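For reference, one common pattern for parallelising over pysam without pickling reads is to have each worker open the BAM itself and fetch one region, so only contig names (plain strings) cross the process boundary. The sketch below is a hypothetical illustration of that pattern, assuming an indexed BAM; `process_contig` is a placeholder for whatever per-contig work dedup would do, not code from any UMI-tools branch.

```python
# Contig-level parallelism that sidesteps pickling: each worker opens the BAM
# itself; only the contig name is sent between processes.
from multiprocessing import Pool

import pysam

BAM_PATH = "example.bam"  # assumed, indexed input


def process_contig(contig):
    with pysam.AlignmentFile(BAM_PATH, "rb") as bam:
        n = sum(1 for _ in bam.fetch(contig))  # placeholder for real dedup work
    return contig, n


if __name__ == "__main__":
    with pysam.AlignmentFile(BAM_PATH, "rb") as bam:
        contigs = list(bam.references)
    with Pool(processes=4) as pool:
        for contig, n in pool.imap_unordered(process_contig, contigs):
            print(contig, n)
```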
I have not tried parallelism on the AlignedSegment level. However, it does seem to be an embarrassingly parallel problem and I don't see why the parallelism needs to be on the contig level; can you not do 10Mb windows or similar?
The problem with this is that each 10Mb block is not independent of all the other 10Mb blocks, because we use an implied mapping start position rather than the actual mapping start position supplied in the BAM.

The chromosome level might be better, but we still have the case of paired reads where the mates are on different chromosomes (probably a rare problem when mapping to the genome, but could be more of an issue if mapping to the transcriptome).

One solution I was thinking of is that most of the real work in the main loop is done by ClusterReducer, and that doesn't actually need the reads themselves, only the UMIs and counts. So it might be possible to peel off a umi_counts dictionary, pass those to ClusterReducer in parallel, and then recombine them with the reads on return. That's going to need some clever imap/izip-fu, but should be possible.

I've also glanced at the world of proxy objects, but that seems very complex.
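A very rough sketch of how that peel-off could look, assuming a per-position bundle of `{umi: {"read": ..., "count": ...}}` as dedup builds, and with `pick_best_umis` standing in for the real ClusterReducer logic. Only the picklable counts dictionaries go to the workers; the reads stay in the parent process and are recombined on return.

```python
# Hypothetical umi_counts peel-off: ship only plain dicts to the pool,
# keep the pysam reads in the parent, and rejoin by position afterwards.
from multiprocessing import Pool


def pick_best_umis(umi_counts):
    # placeholder reduction: keep the most frequent UMI
    return [max(umi_counts, key=umi_counts.get)]


def dedup_positions(bundles, processes=4):
    # bundles: {position: {umi: {"read": AlignedSegment, "count": int}}}
    positions = list(bundles)
    counts = [{u: v["count"] for u, v in bundles[p].items()} for p in positions]
    with Pool(processes) as pool:
        # imap preserves input order, so results line up with `positions`
        for position, kept in zip(positions, pool.imap(pick_best_umis, counts)):
            for umi in kept:
                yield bundles[position][umi]["read"]
```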
Looks like we are barking up the wrong tree!! Line profiling this, it seems like the time is going into operations on the found list rather than into edit_distance. I feel like making found a set rather than a list might make things better...
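For anyone wanting to reproduce this kind of measurement, a minimal line-profiling sketch using the third-party line_profiler package might look like the following; `build_adj_list` and `edit_distance` here are stand-in hot functions, not the UMI-tools code.

```python
# Line-level profiling sketch with line_profiler (pip install line_profiler).
import random

from line_profiler import LineProfiler


def edit_distance(a, b):
    return sum(x != y for x, y in zip(a, b))


def build_adj_list(umis, threshold=1):
    # stand-in for the pairwise comparison step
    return {u: [v for v in umis if v != u and edit_distance(u, v) <= threshold]
            for u in umis}


if __name__ == "__main__":
    umis = ["".join(random.choice("ACGT") for _ in range(10)) for _ in range(500)]
    profiler = LineProfiler()
    profiler.add_function(edit_distance)        # also collect lines inside edit_distance
    profiled_build = profiler(build_adj_list)   # wrap the hot function
    profiled_build(umis)
    profiler.print_stats()                      # per-line hit counts and timings
```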
Yeah, found really should be a set there. Was this from applying UMI-tools dedup to a very high-depth sample? I'm surprised to see the edit_distance function being called relatively few times.
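To make the point concrete, here is a simplified breadth-first search with found as a set; this is an illustrative stand-in for the network search, not the actual UMI-tools function. Membership tests on a list are O(n) and run once per edge, whereas on a set they are effectively O(1).

```python
# Simplified BFS over the UMI network showing why `found` should be a set.
from collections import deque


def breadth_first_search(start, adj_list):
    found = {start}                 # set: constant-time membership checks
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for next_node in adj_list[node]:
            if next_node not in found:   # this check is the slow line when found is a list
                found.add(next_node)
                queue.append(next_node)
    return found
```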
With a set instead of a list!
This is with the file you suggested in your email.
Wow! That's great. And a bit embarrassing!
BTW, thus far my calculations are suggesting that multiprocessing only becomes efficient at the edit_distance level when there are more than 1000 UMIs per site.
Hi Ian. We can use a recursive function to find the network components for a small decrease in run time (~15%). See the TS-recursiveComponents branch. Note: I haven't checked the additional memory usage of the recursive function.
Using the non-recursive function:
Using the recursive function:
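For illustration, the two styles of component search being compared might look roughly like the sketch below. This is a hedged approximation of the idea, not the code in the TS-recursiveComponents branch; note that the recursive version needs the recursion limit raised for deep components, which also touches on memory use.

```python
# Iterative (stack-based) versus recursive connected-component search over
# a {node: [neighbours]} adjacency mapping.
import sys


def connected_components_iterative(adj_list):
    found, components = set(), []
    for node in adj_list:
        if node in found:
            continue
        component, stack = [], [node]
        found.add(node)
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in adj_list[current]:
                if neighbour not in found:
                    found.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def connected_components_recursive(adj_list):
    # deep components can exceed Python's default recursion limit
    sys.setrecursionlimit(max(10000, len(adj_list) + 100))

    def visit(node, found, component):
        found.add(node)
        component.append(node)
        for neighbour in adj_list[node]:
            if neighbour not in found:
                visit(neighbour, found, component)

    found, components = set(), []
    for node in adj_list:
        if node not in found:
            component = []
            visit(node, found, component)
            components.append(component)
    return components
```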
That's a shame about the parallel processing at the edit_distance level. Given that the edit distance is not the major bottleneck, as you indicate above, this suggests to me that it's not really worth including parallel processing at this level.
I have combined three optimisations into a new branch. I ran tests with two different test files. The first used 5bp UMIs and has 3,320,209 reads at 38,779 positions, which is de-duplicated down to 43,100 reads. The second file used 10bp UMIs and has 482,686 reads at 11,427 positions, which is de-duplicated down to 30,938 reads. The first optimisation is using a map rather than a list comprehension. The second optimisation is to use a set rather than a list for found. Finally, I merged in your recursion algo for the connected component search. The results are below.
For the first file there is not much difference. But for the second file we have an almost 10-fold improvement with all three optimisations in place. In File 1, the bottleneck is elsewhere. For File 2, the bottleneck now is neither the finding of connected components nor the calculating of the edit distances. This might be as good as such optimisations get.
Thanks for properly performing the comparisons. That's great news about the improvements for the second file, which I think is probably more representative of the type of samples that have been taking a long time. A 10-fold improvement is much more than I thought we'd get just from tinkering with the code. I think we can squeeze out a further slight improvement by replacing the list comprehension with a call to itertools.combinations - see the TS-replaceComprehension branch. This reduces the number of times two UMIs are compared ~2-fold. Would you mind merging in this branch too and checking the run times on those two files? This branch also has a commit (25da7f4) to decode the UMIs to strings for outputting to flat files - previously tests were failing because b'ATATA' != 'ATATA'.
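As a rough illustration of where the ~2-fold saving comes from: a comprehension over the full cross product compares every ordered pair (and each UMI against itself), while itertools.combinations visits each unordered pair once. The `edit_distance` helper and threshold of 1 below are illustrative stand-ins, not the UMI-tools code.

```python
# Full cross-product comparison versus itertools.combinations.
from itertools import combinations


def edit_distance(a, b):
    return sum(x != y for x, y in zip(a, b))


def adjacency_full_product(umis, threshold=1):
    # n * n comparisons, including self-comparisons
    return {u: [v for v in umis if v != u and edit_distance(u, v) <= threshold]
            for u in umis}


def adjacency_combinations(umis, threshold=1):
    # n * (n - 1) / 2 comparisons; fill both directions from each unordered pair
    adj = {u: [] for u in umis}
    for u, v in combinations(umis, 2):
        if edit_distance(u, v) <= threshold:
            adj[u].append(v)
            adj[v].append(u)
    return adj
```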
So this brings it down to 12s. Pretty good going. Looking at this has been instructive: my Opt1 above was replacing one of the list comprehensions with a map function, which halved the time spent in that step.
Definitely some valuable lessons regarding profiling and code optimisation. I'm amazed that we can shave off so much of the run-time. Hopefully this will prevent any further issues relating to the time taken to run dedup. If there are still issues, I guess we can try to implement parallelisation at the ClusterReducer level. Are you good to merge the changes into master? I'll update the version on PyPI and conda afterwards.
Yep, go ahead and merge.
There's an issue with the recursive search. It produces a slightly different order of output with the group_cluster test (not an issue in itself) and outputs a slightly different set of reads when I run dedup on the high-depth sample I sent you (File 2 above?). I've not got to the bottom of the second issue, so I thought it best just to leave the recursive search out for now; I'll return to this when I have time.
Closing this for now as we came to the conclusion that the overhead of parallelisation was greater than the reward.
Hi, I've been using umi_tools extract and umi_tools dedup in combination with GNU parallel for some time in my projects, which has significantly sped up processing. I finally managed to consolidate everything into a GitHub repository, UMI_parallel. I'd be delighted if it proves helpful to others. Please feel free to provide any feedback or suggestions. Cheers,
Make dedup use more than one CPU.
First attempt at an implementation: see branch IS_parallel.
Need a large data file to benchmark.
cc: @TomSmithCGAT