Differentiable GPU kernels #60
Trying out Enzyme will depend on EnzymeAD/Enzyme.jl#230, which I hope to get to by mid-April.
Hey, I kinda left this in the dark as issues with other projects cropped up.

Regardless, here are my immediate plans: currently I'm hitting a few small issues with Molly in terms of memory usage and the like, but I am slowly figuring things out. All in all, this process will take some time, but I wanted to check in to make sure none of this will step on your toes for summer development of Molly.
Thanks for the update. Your plan sounds good and I am looking forward to seeing what you come up with. A lot of the current design is the way it is due to Zygote compatibility, so if CUDA kernels with Enzyme were used then hopefully we can get rid of a lot of the memory issues. As you have found out, GPU utilisation is very poor and broadcasting across the whole neighbour list uses a huge amount of memory.

It makes sense to start with the force function and think about neighbours after. I'm not an expert on neighbour lists, though I did learn a lot from the Gromacs papers from 2015 and 2020 (and a related talk). I am keen for the simulator definitions to be high-level and not implemented in CUDA/KernelAbstractions if possible; if that becomes the bottleneck later it can be reconsidered. Anything within the force calculation itself is fair game though.

I would be interested to hear how the speed changes with single vs double precision; I don't know which you are working in currently. Numerical accuracy is something to think about, with most simulation software using mixed precision. I am currently working to replace the boundary handling.
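To make the memory point concrete, the Zygote-compatible broadcasting style is roughly the following (a minimal sketch with illustrative names and random indices, not the exact package code): every step over the neighbour list materialises arrays the size of the whole pair list.

```julia
using CUDA

# Illustrative sizes: 10,000 atoms, 500,000 neighbour pairs
# (indices are random purely for illustration)
coords = CUDA.rand(Float32, 3, 10_000)
is = CuArray(rand(Int32(1):Int32(10_000), 500_000))
js = CuArray(rand(Int32(1):Int32(10_000), 500_000))

# Placeholder Lennard-Jones scalar force over r²
lj(r2) = (invr6 = inv(r2^3); 24f0 * (2f0 * invr6^2 - invr6) / r2)

# Each broadcasted line allocates an O(n_pairs) temporary on the GPU
dr = coords[:, is] .- coords[:, js]   # 3 × n_pairs temporary
r2 = sum(abs2, dr; dims=1)            # 1 × n_pairs temporary
fs = lj.(r2) .* dr                    # 3 × n_pairs temporary again
```

A kernel that loops over pairs and accumulates in place avoids all of these temporaries, which is where the hoped-for memory savings come from.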
I had no intention of touching simulator definitions directly. In fact, I don't really think a switch to KA would speed things up much outside of the force kernels themselves.

For multi-GPU, we might need to add some functionality to communicate across devices, but I am not sure where that should live right now and I am putting it off until we get single-GPU performance a little better.

For single vs double precision, I wrote my other package to do either, but not both. Where do the benefits come in for mixed precision? Here are quick tests (again, my own code with 100,000 particles for 10 timesteps):
This kinda makes sense as my GTX 970 doesn't have good double precision support... My guess is that the Float32 timings are similar because I made sure to use shared memory for all essential operations and the P100 is also quite old by now. If you look at the GPU utilisation trace (blue bar near the top), you see that each step is basically using the GPU as efficiently as possible.

If you are interested, here is the nbody (no neighborlist) code: https://github.com/leios/Galily.jl/blob/cde36f32ae5d6e650f3bbec511cb40b6dc5b269e/src/forces.jl#L68

The (big) caveat is that this ignores the neighborlist calculation and AD support. If we run Molly (10 timesteps, 5000 particles, Float32, GTX 970), it runs in 5.5 s and floods my available memory, and the GPU utilisation is also a lot bumpier. I realize a 970 isn't a target for Molly, but the same card can run my simple set-up with 5000 particles in either 0.37 seconds for Float64 or 0.12 seconds for Float32 (~45x faster than Molly, with a way smaller memory footprint).

My goal is to get Molly as close as possible to the numbers I am getting in the other code while also supporting AD. Anyway, this might take some time, but I'll try to keep you in the loop! Also, thanks for the papers / looking forward to the boundary changes!
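The heart of that kernel is the standard shared-memory tiling pattern; in simplified form it looks like this (a sketch in the spirit of the linked code, with `pair_force` as a placeholder scalar force over r², launched with `threads = TILE`):

```julia
using CUDA

const TILE = 256  # must match the launch blockDim

# Each block stages TILE positions in shared memory, so every global
# coordinate is read once per tile rather than once per pair.
function nbody_forces!(forces, coords, n, pair_force::F) where {F}
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    tile = CuStaticSharedArray(Float32, (3, TILE))
    xi = yi = zi = 0f0
    if i <= n
        xi = coords[1, i]; yi = coords[2, i]; zi = coords[3, i]
    end
    fx = fy = fz = 0f0
    for start in 1:TILE:n
        j = start + threadIdx().x - 1
        if j <= n
            tile[1, threadIdx().x] = coords[1, j]
            tile[2, threadIdx().x] = coords[2, j]
            tile[3, threadIdx().x] = coords[3, j]
        end
        sync_threads()
        if i <= n
            for k in 1:min(TILE, n - start + 1)
                dx = xi - tile[1, k]
                dy = yi - tile[2, k]
                dz = zi - tile[3, k]
                r2 = dx * dx + dy * dy + dz * dz
                if r2 > 0f0  # skip self-interaction
                    f = pair_force(r2)
                    fx += f * dx; fy += f * dy; fz += f * dz
                end
            end
        end
        sync_threads()
    end
    if i <= n
        forces[1, i] = fx; forces[2, i] = fy; forces[3, i] = fz
    end
    return nothing
end

# Launch: @cuda threads=TILE blocks=cld(n, TILE) nbody_forces!(forces, coords, n, pair_force)
```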
Thanks for the benchmarks and the link to the code; it looks interesting.
Yes, I think that's a longer-term goal. I think anything confined to the force calculation can be single precision, and hence benefit from the speed available there. My understanding is that double precision is used for certain values, e.g. the virial, where accumulation error is relevant (https://manual.gromacs.org/current/reference-manual/definitions.html#mixed-or-double-precision).
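As a toy illustration of where mixed precision pays off (hypothetical names):

```julia
# Pair maths in Float32 for throughput; the long-lived accumulator is
# Float64 so error does not grow with the number of pairs.
function accumulate_virial(coords::Matrix{Float32}, is, js)
    virial = 0.0  # Float64 accumulator
    for k in eachindex(is)
        i, j = is[k], js[k]
        dx = coords[1, i] - coords[1, j]
        dy = coords[2, i] - coords[2, j]
        dz = coords[3, i] - coords[3, j]
        r2 = dx * dx + dy * dy + dz * dz                    # Float32 math
        f = 24f0 * (2f0 * inv(r2^3)^2 - inv(r2^3)) / r2     # Float32 math
        virial += Float64(f * r2)  # widen only when accumulating
    end
    return virial
end
```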
I just pushed the kernels branch. Ideally we would use KernelAbstractions.jl as per @leios' work, but this could prove a partial step and/or reference point.

The next step is to look at getting Enzyme working for CPU force summation, then for GPU force summation. I think the GPU step will need CUDA atomic support to be added to Enzyme (EnzymeAD/Enzyme.jl#421) and also a fix for EnzymeAD/Enzyme.jl#428. @leios, I know you did some work on atomics for KernelAbstractions.jl; do you have an idea of whether CUDA atomic support will be added to Enzyme.jl?
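For reference, the CPU force summation case has roughly this shape (a sketch against a recent Enzyme.jl API with toy names, not the branch code):

```julia
using Enzyme

# Toy Lennard-Jones energy (σ = ε = 1) over a neighbour list;
# forces come out as the negative gradient.
function total_energy(coords, is, js)
    E = 0.0
    for k in eachindex(is)
        i, j = is[k], js[k]
        dx = coords[1, i] - coords[1, j]
        dy = coords[2, i] - coords[2, j]
        dz = coords[3, i] - coords[3, j]
        invr6 = inv((dx^2 + dy^2 + dz^2)^3)
        E += 4 * (invr6^2 - invr6)
    end
    return E
end

n = 20
coords = rand(3, n)
dcoords = zeros(3, n)
pairs = [(i, j) for i in 1:n for j in (i + 1):n]
is = first.(pairs); js = last.(pairs)

# Reverse-mode AD: dcoords accumulates dE/dcoords in place
Enzyme.autodiff(Reverse, total_energy, Active,
                Duplicated(coords, dcoords), Const(is), Const(js))
forces = -dcoords
```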
The kernels branch looks good and is easily extendable to KA for AMDGPU / CPU support as well (it shouldn't change CUDA performance). I have a KA branch as well that gets similar performance, and I also saw a notable improvement to CPU performance in my branch, which I was somewhat surprised by. Seeing as the kernels branch is so far developed, maybe it would be best for me to stop developing my own branch and instead just port what you have to KA when the time comes (for CPU / AMDGPU support). So my priorities are:
What do you think? Could we maybe target mid-October for 1-3 (depending on how complicated 2 is)? The way I see it, you are doing a great job with the kernels branch already, but are blocked on key features elsewhere in Julia. Maybe I should focus on making sure those features get done while you are working on the kernels. I'm actually quite behind on my branch right now, but a few comments there:
Yes, sounds good.
This would be really useful, and is something I probably won't have the skills for myself.
I think that is possible; as you say, it will depend on how difficult the atomic issue is and whether I can work Enzyme in more broadly without issues.
I agree for the case with no neighbour list - the first two links in the first post in this issue use that approach. I wouldn't be against implementing that kernel for the non-neighbour-list case. I couldn't work out how to get that scheme to work with a neighbour list though, bearing in mind that the proportion of neighbours to possible interactions goes down as N goes up. Perhaps there is a way. In OpenMM they use a complex scheme to avoid atomic adds (https://github.com/openmm/openmm/blob/master/platforms/cuda/src/kernels/nonbonded.cu#L42-L103) and also use multiple force buffers (http://docs.openmm.org/7.2.0/developerguide/developer.html#computing-forces). I'm okay with leaving that speed on the table for now though, to get everything working.
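For concreteness, the simple atomic-add scheme under discussion looks something like this in CUDA.jl (a sketch, with `pair_force` a placeholder scalar force over r², not code from the branch):

```julia
using CUDA

# One thread per neighbour pair; atomic adds resolve the write conflicts
# that arise when several pairs update the same atom concurrently.
function nl_forces!(forces, coords, is, js, pair_force::F) where {F}
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if k <= length(is)
        i = is[k]; j = js[k]
        dx = coords[1, i] - coords[1, j]
        dy = coords[2, i] - coords[2, j]
        dz = coords[3, i] - coords[3, j]
        f = pair_force(dx * dx + dy * dy + dz * dz)
        CUDA.@atomic forces[1, i] += f * dx
        CUDA.@atomic forces[2, i] += f * dy
        CUDA.@atomic forces[3, i] += f * dz
        CUDA.@atomic forces[1, j] -= f * dx
        CUDA.@atomic forces[2, j] -= f * dy
        CUDA.@atomic forces[3, j] -= f * dz
    end
    return nothing
end

# Launch: @cuda threads=256 blocks=cld(length(is), 256) nl_forces!(forces, coords, is, js, lj)
```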
On the branch currently I just store a
Great! Thanks for the info / I'll see if I can work on the Enzyme issues. Apparently the bottleneck is proving the correctness of the adjoint; I'll dig into it more next week.

As for the neighborlists and atomic avoidance, I agree. The most important thing right now is the switch to a kernel-based design; optimizations can come later.
EnzymeAD/Enzyme#849 (I didn't write the PR) looks promising! I'll chat with a few other folks tomorrow about how to easily test things in Julia. I think we need to wait for the next jll bump first.
Great! Enzyme seems to be improving fast at the moment.
After a lot of work and help from the Enzyme team, v0.15.0 adds differentiable CUDA.jl kernels and much-improved performance across the board. The kernels are currently rather simple and leave a lot of performance on the table, but going from simple to complex kernels is a smaller leap than going from broadcasting to kernels. Hopefully there will be a GSoC student looking at that this summer. As Enzyme.jl begins to work on more of Julia, it will be interesting to see if we can replace Zygote.jl for the simulators too, improving performance by reducing memory allocations through mutation.
I'm happy to see the kernels branch merged! Great work! I'd also be happy to extend #99 to the kernels branch. I don't think it has to be merged immediately, as long as it is kept up to date for people who need it.
It would certainly be good to have support for different GPU array types. I don't know how well the current CUDA.jl kernels play with other array types though. Maybe a job for KernelAbstractions.jl in the long run. Personally I am leaning towards optimising the CUDA.jl kernels first and then investigating a switch to KernelAbstractions.jl, but I am happy to support anyone who wants to try the switch on the current code.
Ok, in that case, let me mock up the KernelAbstractions branch and you can look at it. It should work just fine and have the same performance. Anyone who needs AMD / Metal / Intel GPU support can swap to it. If you are fine with the KernelAbstractions syntax, you can merge it; otherwise, we can just keep it updated and mention it in the documentation.
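For a flavour of the port, the pairwise kernel maps to KernelAbstractions roughly like this (a sketch against the KA 0.9-era API; atomics here go through Atomix.jl, and `pair_force` is again a placeholder):

```julia
using KernelAbstractions
using Atomix: @atomic

# Same one-thread-per-pair scheme as the CUDA.jl sketch above, but
# backend-agnostic: the kernel compiles for whichever device runs it.
@kernel function nl_forces_ka!(forces, coords, is, js, pair_force)
    k = @index(Global)
    i = is[k]; j = js[k]
    dx = coords[1, i] - coords[1, j]
    dy = coords[2, i] - coords[2, j]
    dz = coords[3, i] - coords[3, j]
    f = pair_force(dx * dx + dy * dy + dz * dz)
    @atomic forces[1, i] += f * dx
    @atomic forces[2, i] += f * dy
    @atomic forces[3, i] += f * dz
    @atomic forces[1, j] -= f * dx
    @atomic forces[2, j] -= f * dy
    @atomic forces[3, j] -= f * dz
end

# The same code runs on CUDA, AMDGPU or the CPU depending on where the
# arrays live:
# backend = get_backend(forces)
# nl_forces_ka!(backend, 256)(forces, coords, is, js, lj, ndrange = length(is))
# KernelAbstractions.synchronize(backend)
```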
Great. I would be inclined to merge it, even if it lived alongside more complex CUDA.jl kernels later on. It will be interesting to see how nicely Enzyme plays with KernelAbstractions too.
At the minute the GPU/differentiable path is Zygote-compatible and hence uses non-mutating broadcasted operations. This works, but is rather slow and very GPU memory-intensive.
Long term the plan is to switch to Enzyme-compatible GPU kernels to calculate and sum the forces using the neighbour list. This will be much faster both with and without gradients, and should help us move towards the speeds of existing MD software. These kernels could be used as part of the general interaction interface as is, or another interface could emerge to use them. Enzyme and Zygote can be used together, so it should be possible to replace the force summation alone and retain the functionality of the package.
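One possible shape for that mix is a ChainRules pullback that hands the force/energy summation to Enzyme, so Zygote differentiates everything around it (a sketch reusing the toy `total_energy` from earlier in the thread; illustrative rather than a settled design):

```julia
using ChainRulesCore, Enzyme

# Zygote consults ChainRules, so this rrule splices Enzyme in for the
# energy summation while Zygote handles the rest of the simulator.
function ChainRulesCore.rrule(::typeof(total_energy), coords, is, js)
    E = total_energy(coords, is, js)
    function total_energy_pullback(dE)
        dcoords = zero(coords)
        Enzyme.autodiff(Reverse, total_energy, Active,
                        Duplicated(coords, dcoords), Const(is), Const(js))
        return NoTangent(), dE .* dcoords, NoTangent(), NoTangent()
    end
    return E, total_energy_pullback
end
```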
One consideration is how general such kernels should be. A general pairwise force summation kernel for user-defined force functions would be useful for Lennard-Jones and Coulomb interactions, and hence would be sufficient for macromolecular simulation. Other more specialised multi-body kernels could live in Molly or elsewhere depending on how generic they are.
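Because the kernels sketched earlier in the thread take the pair force function as an argument, Julia's specialisation gives much of this generality for free: each force function compiles to its own kernel. Hypothetical examples (parameters are illustrative, with σ = ε = qᵢqⱼ = 1):

```julia
# Scalar force magnitudes over r², divided through by r so that f * dx
# gives a force component.
lj(r2) = (invr6 = inv(r2^3); 24 * (2 * invr6^2 - invr6) / r2)
coulomb(r2) = 1 / (r2 * sqrt(r2))

# Each launch compiles a kernel specialised on its force function:
# @cuda threads=256 blocks=cld(n_pairs, 256) nl_forces!(forces, coords, is, js, lj)
# @cuda threads=256 blocks=cld(n_pairs, 256) nl_forces!(forces, coords, is, js, coulomb)
```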
Another concern is how the neighbour list is best stored (calculation of the neighbour list can also be GPU accelerated but that is a somewhat separate issue).
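One candidate layout that matches the one-thread-per-pair kernels above (an illustration, not a settled choice):

```julia
using CUDA

# Structure-of-arrays neighbour list: two parallel Int32 device vectors,
# one entry per pair, so a kernel thread can look up its pair directly.
neighbour_is = CuArray(Int32[1, 1, 2, 2, 4])
neighbour_js = CuArray(Int32[2, 3, 3, 4, 5])
n_pairs = length(neighbour_is)
```

Rebuilding the list then amounts to refilling two device vectors, which keeps the kernel-facing interface simple.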
Something to bear in mind is the extension from using one to multiple GPUs for the same simulation. It is probably best to start with one GPU and go from there.
This issue is to track and discuss this development. @leios
Useful links: