-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nonbonded kernel #194
base: master
Are you sure you want to change the base?
Nonbonded kernel #194
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. I have left most comments inline. I can have a look at the performance too. Also:
- Does it work for 2D systems? You can extract
D
as a type parameter from theSVector
s and use things likezero(SVector{D, T})
rather thanSVector{3, T}(zero(T), zero(T), zero(T))
to make it generic in this sense. You could add a GPU test to the 2D test at the top of test/simulation.jl. - Remove the empty file with a strange name.
- You will need to rebase/merge again from master due to recent changes, sorry about that.
|
||
coords = place_atoms(n_atoms, boundary; min_dist=0.1u"nm") | ||
|
||
if gpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth having nl
as a Bool
argument to the function and testing both with and without the neighbour list.
@@ -67,6 +67,34 @@ end | |||
@test 3.6f0 < s.boundary.side_lengths[1] < 3.8f0 | |||
end | |||
|
|||
@testset "Peptide Float32 no units on GPU" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need this test, every test adds to the test time.
@@ -262,7 +290,7 @@ end | |||
) | |||
neighbors = find_neighbors(sys) | |||
|
|||
@test_throws ErrorException forces(sys, nothing) | |||
#@test_throws ErrorException forces(sys, nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove these lines.
@@ -221,8 +221,75 @@ end | |||
) | |||
random_velocities!(s, temp) | |||
|
|||
s_gpu = System( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new tests run similar things, so it might just be worth keeping one for speed.
@@ -72,7 +72,10 @@ Custom simulators should implement this function. | |||
coords_copy = similar(sys.coords) | |||
F_nounits = ustrip_vec.(similar(sys.coords)) | |||
F = F_nounits .* sys.force_units | |||
forces_buffer = init_forces_buffer(F_nounits, n_threads) | |||
forces_buffer = init_forces_buffer(F_nounits, sys.coords, n_threads) | |||
F_nounits = forces_nounits!(F_nounits, sys, neighbors, forces_buffer, 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This F_units
line is not required as it will run on entering the loop.
# ********************************************* TO BE TESTED ************************************************************ | ||
function sorted_Morton_seq(positions, w, bits::Int) | ||
N = length(positions) | ||
Morton_sequence = Vector{Int32}(undef, N) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation a bit off in these functions.
function boxes_dist(x1_min::D, x1_max::D, x2_min::D, x2_max::D, Lx::D) where D | ||
|
||
a = abs(Molly.vector_1D(x2_max, x1_min, Lx)) | ||
b = abs(Molly.vector_1D(x1_max, x2_min, Lx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for Molly.
since you are in Molly.
atoms_j_shuffle = Atom(atype_j, aindex_j, amass_j, acharge_j, aσ_j, aϵ_j) | ||
dr = vector(coords_j, coords_i, boundary) | ||
r2 = dr[1]^2 + dr[2]^2 + dr[3]^2 | ||
condition = eligible[laneid(), shuffle_idx] == true && Bool_excl == true && r2 <= r_cut * r_cut |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the == true
on Bool
s here (and elsewhere).
special[laneid(), m] = special_matrix[s_idx_i, s_idx_j] | ||
end | ||
|
||
for m in laneid() + a : warpsize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add brackets for clarity: (laneid() + a)
.
pe_tuple = ntuple(length(inters)) do inter_type_i | ||
SVector(potential_energy_gpu(inters[inter_type_i], dr, atom_i, atom_j, E, special, coord_i, coord_j, boundary, | ||
vel_i, vel_j, step_n)) | ||
# why? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say that using a SVector
was required to avoid a GPU error.
Nonbonded force calculation using GPU block-based algorithm
This PR introduces a GPU-optimized algorithm for the calculation of nonbonded forces. The algorithm subdivides the system into blocks of 32 particles, evaluates distances between blocks, and computes forces only for interacting blocks within a cutoff radius. The algorithm is incorporated into OpenMM and it is described in a paper by Eastman and Pande (https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21413).
This new implementation significantly accelerates nonbonded force calculations for large systems as its performance scales linearly with the number of atoms.
Implementation details
w
slightly smaller than the cutoff radius. The sequence obtained is used in the functionskernel_min_max!
,force_kernel!
andenergy_kernel
.kernel_min_max!
: computes the arrays containing minima and maxima of the bounding boxesforce_kernel!
: computes the nonbonded forces between particles within interacting bounding boxesenergy_kernel
: computes the nonbonded potential energyLet
i
andj
be the indices of the bounding boxes and letn_blocks
be the allocated number of blocks of GPU threads, which matches exactly the number of bounding boxes. Each couple of bounding boxes forms a tile and the 32 threads allocated in each block compute the interactions within each tile. The structure offorce_kernel!
andenergy_kernel
is divided in 4 main parts:j < n_blocks && i < j
: computes the forces within 32x32 tiles by shuffling the coordinates, velocities indices of the particles andAtom
objectsj == n_blocks && i < n_blocks
: computes the forces between ther=N%32
particles in the last box and the particles in the other boxes of 32.i == j && i < n_blocks
: computes interaction within boxes of 32i == n_blocks && j == n_blocks
: computes interactions within the last box ofr
particles.In cases 1 and 2, for each tile that has been identified as interacting, the distance of each atom in the first bounding box is computed with respect to the second bounding box and if not within the cutoff a flag is set for that atoms to skip the computation.
Observations and new neighbor finder (GPUNeighborFinder)
n_steps_reorder
ofGPUNeighborFinder
.force_kernel!
andenergy_kernel
make explicit use of the matriceseligible
andspecial
defined inGPUNeighborFinder
GPUNeighborFinder
: new type of neighbor finder for systems on GPU. ForGPUNeighborFinder
systems, the functionfind_neighbors
returnsnothing
as a neighbor list. This because the new calculation does not require previous knowledge of the neighbor list. The new default choices for the setup of the system are displayed below (lines 890-914 ofsetup.jl
):A new struct is introduced to store the forces matrix, the arrays of minima and maxima together with the sequence of atom indices ordered according to the Morton code.
The function
forces_nounits!
takes now aForcesBuffer
object as an input. While the forces matrix is updated at each step of dynamics, the minima and maxima arrays and the Morton sequence are only updated everyn_steps_reorder
steps.Some tests for this GPU implementation are added in the
test/simulation.jl
andtest/protein.jl
.Performance evaluation
Benchmark of
forces(sys)
performed on6mrr_equil.pdb
(protein system with 15954 atoms) for the proposed implementation and the previous implementation in Molly:Float32
Float64
The old implementation still outperforms the new one in real systems simulations where the neighbor list is not computed at every step. However, I believe that small fixes can adjust the behaviour of the kernel in long simulations as the benchmark with the reordering seems encouraging.
The command
potential_energy(sys)
always performs the reordering step in the new implementation since the potential energy will be computed much less frequently than the forces during a simulation.Float32
Float64
A possible setup for a benchmark with
Float64
precision: