Limited Use of Multithreading during SyN Registration (b-spline syn) #1017
Comments
All multi-threading is handled by ITK. The SyN and B-spline SyN classes are derived from the same parent class; the only difference is the use of B-spline smoothing, and that B-spline smoothing class has been there for ~10 years. Did you try compiling a fresh checkout directly on your machine? |
Yes, I was using the small installation script that is offered on the ANTs website. I suppose the git command there will check out the latest working version. For the Mac a few hours ago… |
It's really hard to say what's going on because, as I said, multi-threading is handled by the ITK foundation. You might want to ask over at the ITK discussion forum. |
I've not come across similar problems since that old thread. Happy to test again with a reproducible example. But my knowledge of the underlying ITK behavior is limited, so we'd probably have to ask over there. |
Example in the sense of "data" or "parameters" for an antsRegistration-call? |
A reproducible example would include data and code. |
Ok, I've assembled a simple example. I just used the famous "anatomy of an antsRegistration call" call and changed the SyN stage to BSplineSyN:
antsRegistration --dimensionality 3 --float 0 --verbose 1 \
As can be seen, I'm trying to coregister a T1-weighted image with a T2 FLAIR image. For the linear stages the CPU workload always went to the allotted load, but during BSplineSyN it goes to one core on average, spiking in between to 2 or 3 cores. Thank you very much, best regards, |
Thanks for providing this. I just ran it on my machine using multithreading and the number of threads stayed the same through all 3 stages. How are you monitoring thread use? |
However, I just noticed that it's running unexpectedly slower than typical and I noticed that you're using a spline distance of 1.5mm for human brains. Where did you get this parameter choice? |
Well, this is one possible difference that might point towards an explanation. As I mentioned, all the threading is handled by ITK, so if this is indeed the culprit, there's really nothing we can do on the ANTs side. |
Looks like many threads are being created, but they can't all run in parallel for some reason. |
I take that back. If that were the case, then N4 should see a similar throttling and I haven't noticed anything but should probably check. @gdevenyi --- if you get a chance, can you just verify N4 isn't seeing something similar? And, then, can you run the following on just some random displacement field and see what the thread usage is like?
|
I adapted the script slightly. I found a similar result, multiple threads are spawned but only one is active. The CPU utilization goes up as I increase the spline spacing. I also turned on
|
Just so people have an idea of the pieces involved: I wrote much of the B-spline code in ITK except for the interpolator and original FFD-style image registration. All of the higher-level B-spline applications (e.g., N4, B-spline smoothing, B-splinev4/BsplineSyN image registration) use this filter as the underlying fitting/smoothing workhorse. I'm pretty sure that this is the only filter that has explicit multithreading code. With SyN vs. B-spline SyN, although there are some differences in point-set handling, the only difference for the displacement field-only case is the actual smoothing. SyN uses Gaussian smoothing and BSplineSyN calls this analogous function which calls this filter which simply uses the B-spline workhorse mentioned above. As I mentioned above, N4 uses the same multi-threaded workhorse, so if that's working properly, it's really puzzling to me that B-spline SyN wouldn't also work. |
Both seem to alternate between all cores loaded and a single thread loaded. The second does more than once, but I think that's because of the "4" levels, whereas the other call only has one. |
I've tried to reproduce for a different number of threads. Attached are two screenshots during the affine stage and bsplinesyn stage where I've specified 4 threads and it looks as expected. So, @cookpa , since you use a Mac as well, I'm guessing you see "n" threads activated in the Threads column of the Activity Monitor but only see a single bar active on your CPU usage monitor, correct? |
Looks like you hit the comment button before the picture uploads completed :) |
I'm remoting to my mac, using
There's multiple threads but only one is running, and top shows 100% CPU usage. |
Limiting the threading with ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 does impact affine and SyN, but BSplineSyN is still one thread. |
@gdevenyi I don't understand what you're saying. The second image is of B-spline SyN. Actually both images are of B-spline SyN (my mistake). |
When I change ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 on my system, affine and SyN are properly adjusted, but BSplineSyN is still single-threaded. |
@gdevenyi Okay, I wasn't disputing that. I was simply pointing out that I can't reproduce on my machine. |
Okay, so I'm wondering if this is a platform/build system detail here. I'm on Linux 18.04, with gcc-10.1. Are you using gcc or clang-as-gcc on OSX? I've also tested ITK_GLOBAL_DEFAULT_THREADER=Platform and ITK_GLOBAL_DEFAULT_THREADER=Pool with the same results. I'll have to rebuild with TBB to see if it's different; however, TBB with ANTs still segfaults randomly, so I won't bother hacking it in right now. |
@ntustison using the script I posted I see CPU usage similar to yours. With the spacing of 1.5mm as in the original example, it hovers around 100%. |
(base) [ntustison@Infinite-Resignation Thu Jun 11 11:13:06] $ gcc --version
Okay, @cookpa , interesting. So with four threads it doesn't drop to a single CPU but you do see a disparity |
Will try a build with clang-11 here |
|
@chrisadamsonmcri what data and spline spacing are you using to observe the slowdown? If I recall these experiments correctly, the appropriate number of threads were created but they did not execute in parallel as the spline resolution increased. Once the spline spacing got sufficiently small, the CPU usage approached 100% - making it appear to be single-threaded, but really all the threads were there, they were just running one at a time. |
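One way to tell "n threads exist" apart from "n threads are actually running" is to compare the CPU time a process consumed with its wall-clock time. The sketch below is only an illustration and was not used in this thread: the function name `cpu_utilization` is made up, the command passed in is whatever you want to measure, and it relies on Python's Unix-only `resource` module. The only thing taken from the discussion is the `ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS` environment variable.

```python
import os
import resource
import subprocess
import sys
import time


def cpu_utilization(cmd, n_threads):
    """Run cmd and return (cpu_seconds / wall_seconds, wall_seconds).

    Hypothetical helper: a ratio near n_threads suggests the threads ran in
    parallel; a ratio near 1 suggests they mostly ran one at a time.
    """
    # Same variable discussed in the thread; ITK-based tools read it at startup.
    env = dict(os.environ, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=str(n_threads))
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User plus system CPU time accumulated by the finished child process.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return cpu / wall, wall


if __name__ == "__main__":
    # Stand-in command; replace with the registration or smoothing call to test.
    ratio, wall = cpu_utilization([sys.executable, "-c", "sum(range(10**6))"], 4)
    print(f"average busy cores: {ratio:.2f} over {wall:.2f} s")
```

A ratio that stays close to 1 while the thread count in the activity monitor shows 4 would match the "threads exist but run one at a time" description above.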
Here is more information about the call I used
The moving image,
The fixed image
I can't send the specific images used for this test but I do get consistent behaviour across different images. Hope this helps. |
@ntustison I ran
I used the following call:
I took Using fewer control points was much faster and remained more multithreaded:
|
Okay, I'd like to focus on the multi-threading (elapsed time is correlated with the number of control points). What do you mean it "remained more multi-threaded"? The number of threads isn't explicitly coded to change during the execution of the filter. |
A bit more information on what I mean by that. At the very start of the smoothing step all threads are active, for about 0.1 seconds:
But after that there is only one active thread
It isn't the case that multiple threads are active with each one taking up very little CPU. Perhaps there is a loop that is running in serial? I'll try to put some timing information in the filter itself and will report back. |
Nothing that should take significant time. The displacement field B-spline smoothing filter is basically a wrapper for the more generalized B-spline approximation filter. A for loop in the former iterates through the input field and collects the data into a point-set data structure which calls the latter for fitting and sampling of the B-spline object. In the latter, both the approximation part and reconstruction part should be explicitly multi-threaded. |
It is this loop that is slow
This runs in serial. I ran this command
The output is all the timings I added in. The final 28935ms is the total elapsed time of the loop. |
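A serial loop that dominates the run time is enough to explain CPU usage hovering near 100% even when the remaining code is multi-threaded. The sketch below applies Amdahl's law to make that concrete. The serial fractions are hypothetical values chosen for illustration, not measurements from this filter, and the function name is made up.

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Ideal speedup when only (1 - serial_fraction) of the work is parallel.

    With a perfectly parallel remainder this also equals the average number
    of busy cores, i.e. roughly the CPU percentage shown by top divided by 100.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


if __name__ == "__main__":
    # Hypothetical serial fractions, 4 threads as in the tests above.
    for serial_fraction in (0.1, 0.5, 0.9):
        busy = amdahl_speedup(serial_fraction, 4)
        print(f"serial fraction {serial_fraction:.1f}: about {busy:.2f} busy cores")
```

Under these assumptions a loop that takes 90% of the single-threaded run time leaves barely more than one core busy on average, regardless of how many threads were created.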
This is interesting. I have a pretty good guess as to what's happening (although not the "why"). But first can you do me a favor and send me the screen dump from the following (with this image):
using |
As a follow-up, I'll explain what I think is happening here. In this section of code, the B-spline object is being evaluated at each input parametric point. This is a relatively computationally intense process as one has to find the weighted average based on the local neighborhood control point values and corresponding B-spline basis functions evaluated at that parametric point. For this situation involving 3-D B-spline objects and third order B-splines, evaluation at each point involves the sum of 4 * 4 * 4 = 64 control point * basis function products. (Cf. for the corresponding function I wrote in ITK). One can imagine that doing this at each voxel in an image can be quite time consuming. However, for computational purposes in the BSplineScatteredDataApproximation filter, we took advantage of the fact that the B-spline parametric domain (i.e., ITK image) is a rectilinear grid and iterating along the same row or column maintains constant parametric values in all but one of the parametric directions, such that evaluation now requires only the weighted average of 4 control point * basis function products, with some minor overhead when we change slices or rows during our iterating. That overhead is what the CollapsePhiLattice function is doing. This check is what is used to determine if we've gone to the next rows/columns or slices, necessitating recalculation of the basis functions. If this check is erroneously met at every single point, then evaluation will reduce to the naive (i.e., more intense) computation. I'd like to see what running N4 produces given that the same issue would apply. |
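The 64-versus-4 argument above can be sketched in a few lines. This is a simplified illustration and not the ITK implementation: it assumes a single 4 x 4 x 4 block of control points and uniform cubic B-splines with local parameters in [0, 1), and the function names are invented for the example.

```python
def cubic_basis(u):
    """Uniform cubic B-spline basis values at local parameter u in [0, 1)."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )


def evaluate_naive(phi, u, v, w):
    """Full tensor-product sum: 4 * 4 * 4 = 64 terms per point."""
    bu, bv, bw = cubic_basis(u), cubic_basis(v), cubic_basis(w)
    return sum(
        bu[i] * bv[j] * bw[k] * phi[i][j][k]
        for i in range(4)
        for j in range(4)
        for k in range(4)
    )


def collapse(phi, v, w):
    """Fold the v and w directions into 4 values that depend only on (v, w)."""
    bv, bw = cubic_basis(v), cubic_basis(w)
    return [
        sum(bv[j] * bw[k] * phi[i][j][k] for j in range(4) for k in range(4))
        for i in range(4)
    ]


def evaluate_row(phi, us, v, w):
    """Evaluate points sharing (v, w): collapse once, then 4 terms per point."""
    phi_u = collapse(phi, v, w)
    return [sum(b * c for b, c in zip(cubic_basis(u), phi_u)) for u in us]


if __name__ == "__main__":
    # Arbitrary control point values, chosen only for the demonstration.
    phi = [[[float(i + 2 * j + 3 * k) for k in range(4)] for j in range(4)]
           for i in range(4)]
    us = [0.0, 0.25, 0.5, 0.75]
    fast = evaluate_row(phi, us, 0.3, 0.7)
    slow = [evaluate_naive(phi, u, 0.3, 0.7) for u in us]
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

Both routes give the same values; the row version pays for the collapse once per row instead of once per point, which is why a check that fires at every point removes the benefit.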
Will do. Is there any code that you would like timed or debug information to be printed? |
Yeah, just let me know the elapsed time given at the end of the screen dump from the command posted above. |
Elapsed time: 16.0856 |
I also ran the N4 command with the above parameters, and it uses about 350% CPU with 4 threads on my Intel Mac (2.3 GHz Quad-Core Intel Core i7).
If I reduce the spline distance to 5mm, as used in the registration call from @chrisadamsonmcri above, the CPU utilization drops slightly, but isn't bad
But if I drop the spline resolution to 1.5mm (as in the OP's command at the top of this issue), the utilization is reduced more substantially.
|
Ok so I think I get what the optimisation is @ntustison
In a lattice organisation of control points most of the CollapsePhiLattice evaluations will be skipped since when you move one row or one column 2 of the 3 U[j]'s will be unmodified so you skip it. I counted the times that happened and you get this:
|
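To get a feel for how rarely the expensive collapses are needed on a regular grid, here is a toy model of the caching idea described above. It is not the ITK code and the numbers are not the counts from the comment above: it simply walks a small hypothetical grid with x as the fastest axis and counts how often the cached value for each axis has to be recomputed.

```python
def count_collapses(nx, ny, nz):
    """Count per-axis recomputations while walking a grid with x fastest.

    Toy model: a change along a slower axis invalidates every faster axis,
    so those are recomputed as well.
    """
    counts = [0, 0, 0]  # recomputations for the x, y and z axes
    last = [None, None, None]
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                current = (x, y, z)
                dirty = False
                for axis in (2, 1, 0):  # slowest axis first
                    if dirty or current[axis] != last[axis]:
                        counts[axis] += 1
                        last[axis] = current[axis]
                        dirty = True
    return counts


if __name__ == "__main__":
    # Hypothetical 4 x 3 x 2 grid.
    print(count_collapses(4, 3, 2))
```

In this model the cheapest step runs at every point, the next one once per row, and the most expensive one once per slice. If the comparison were to fail at every point, all three counters would equal the number of points.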
Hey, thanks for the additional information. Let me mentally process this info a bit more before I take a deeper dive. Right now, one obvious question might be why |
Can you check this minor change on your end?
On my machine
|
Okay, here are the links to a proposed update of the B-spline filter (file.h and file.hxx). There are two main improvements:
I stand corrected in that I apparently wrote this code back in 2005. It's hard to remember what I was thinking back then but my guess is that I had in mind that the approximated point set values would also be an output and the inefficiencies are related to that. However, one doesn't need those values if only a single approximation level is being performed and that's the case when we run both N4 and BSplineSyN registration. And I have no idea why UpdatePointSet() was never multi-threaded. I'll do some more testing and submit this to ITK hopefully early next week but would certainly welcome any feedback. |
Success @ntustison . Here are benchmark results for the Original: After update: |
Awesome. Glad it seems to be working now. |
Thanks to @chrisadamsonmcri for running down this optimization! |
Yeah, nice work @chrisadamsonmcri! Thank you all for all that effort. |
Thank you all! Thanks to @ntustison for implementing the whole class and the speedup. Implementing and speeding up BSplineSyN and N4 is a big deal. |
Okay, I'm closing this issue as the fix was merged and I updated the ANTs ITK git tag. |
I ran a regression test with |
Awesome. Thanks @cookpa . |
Hi!
When using B-spline SyN during ANTs registration, only one CPU core is working, while normal SyN behaves as it should when the ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS variable is set properly. In the resource monitor I can see a short CPU load spike right at the beginning of each convergence step, but 99% of the time it's single-core processing.
This behavior is reproducible on a Mac running El Capitan (10.11.6) or an Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-53-generic x86_64).
The Mac is running the latest ANTs version (2.3.3.dev168-g29bdf), compiled this morning; the Ubuntu machine is running a slightly older one (2.3.1.dev120-g2f09e), compiled in March.
I've found an older similar thread which is already closed because the problem was solved by compiling against a newer ITK, which here obviously doesn't work. [https://github.com//issues/604]
As B-spline SyN takes a lot more time, multi-threading would be highly appreciated! :)
Thanks and best regards,
Martin