
Nonbonded kernel #194

Open · wants to merge 7 commits into master

Conversation

Peppone98 (Collaborator)
Nonbonded force calculation using a GPU block-based algorithm

This PR introduces a GPU-optimized algorithm for the calculation of nonbonded forces. The algorithm subdivides the system into blocks of 32 particles, evaluates distances between blocks, and computes forces only for interacting blocks within a cutoff radius. The same algorithm is used in OpenMM and is described in the paper by Eastman and Pande (https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21413).

This new implementation significantly accelerates nonbonded force calculations for large systems, since its cost scales linearly with the number of atoms.

Implementation details

  • The algorithm divides the N atoms into bounding boxes of 32 atoms each. A bounding box is the smallest rectangular, axis-aligned box that contains its 32 atoms. Identifying these boxes is an O(N) problem.
  • To improve spatial coherence, a simple Morton code-based reordering algorithm is employed to ensure that atoms close in space are also close to each other in a reordered array of indices (see the sketch after this list). This is done by dividing the 3D space into small cubic voxels with a side length w slightly smaller than the cutoff radius. The resulting sequence is used in the functions kernel_min_max!, force_kernel! and energy_kernel.
  • kernel_min_max!: computes the arrays containing the minima and maxima of the bounding boxes
  • force_kernel!: computes the nonbonded forces between particles within interacting bounding boxes
  • energy_kernel: computes the nonbonded potential energy
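
A minimal sketch of the Morton (Z-order) encoding idea in plain Julia, for illustration only: this is not the PR's sorted_Morton_seq, and the 10-bit depth, the names, and the assumption that coordinates are already wrapped into the box are mine:

function interleave_bits(x::UInt32)
    # Spread the low 10 bits of x with two zero bits between consecutive bits
    x &= 0x000003ff
    x = (x ⊻ (x << 16)) & 0xff0000ff
    x = (x ⊻ (x << 8))  & 0x0300f00f
    x = (x ⊻ (x << 4))  & 0x030c30c3
    x = (x ⊻ (x << 2))  & 0x09249249
    return x
end

# Interleave the voxel indices along x, y and z into a single Morton code
function morton_code(coord, w)
    ix = UInt32(floor(Int, coord[1] / w))
    iy = UInt32(floor(Int, coord[2] / w))
    iz = UInt32(floor(Int, coord[3] / w))
    return interleave_bits(ix) | (interleave_bits(iy) << 1) | (interleave_bits(iz) << 2)
end

# Sorting by Morton code gives the reordered sequence of atom indices:
# atoms in the same or nearby voxels end up adjacent in the sequence
morton_seq = sortperm([morton_code(c, w) for c in coords])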

Let i and j be the indices of the bounding boxes and let n_blocks be the allocated number of blocks of GPU threads, which matches exactly the number of bounding boxes. Each pair of bounding boxes forms a tile, and the 32 threads allocated in each block compute the interactions within one tile. The structure of force_kernel! and energy_kernel is divided into 4 main parts (a plain-Julia sketch of this case split follows the list):

  1. j < n_blocks && i < j: computes the forces within full 32x32 tiles by warp-shuffling the coordinates, velocities, particle indices and Atom objects
  2. j == n_blocks && i < n_blocks: computes the forces between the r = N % 32 particles in the last (partial) box and the particles in the full boxes of 32
  3. i == j && i < n_blocks: computes interactions within a full box of 32
  4. i == n_blocks && j == n_blocks: computes interactions within the last box of r particles
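
As an illustration, the same case split written as a serial loop over tiles (names are hypothetical; the PR maps each (i, j) pair to a CUDA thread block instead of looping):

n_blocks = cld(n_atoms, 32)   # the last box holds r = n_atoms % 32 atoms
for i in 1:n_blocks, j in i:n_blocks
    if j < n_blocks && i < j
        # case 1: full 32x32 off-diagonal tile
    elseif j == n_blocks && i < n_blocks
        # case 2: partial last box against a full box
    elseif i == j && i < n_blocks
        # case 3: interactions within one full box (diagonal tile)
    else   # i == n_blocks && j == n_blocks
        # case 4: interactions within the partial last box
    end
end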

In cases 1 and 2, for each tile that has been identified as interacting, the distance of each atom in the first bounding box to the second bounding box is computed; if the atom is not within the cutoff, a flag is set for that atom so that the computation is skipped.
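
A hedged sketch of the box-level distance test behind this, built on the PR's boxes_dist (quoted in the review below); blocks_interact is a hypothetical helper, and the per-dimension minima/maxima layout and the assumption that boxes_dist returns zero for overlapping intervals are mine:

function blocks_interact(mins_i, maxs_i, mins_j, maxs_j, side_lengths, r_cut)
    # Sum the squared per-dimension gaps between the two bounding boxes
    d2 = sum(k -> boxes_dist(mins_i[k], maxs_i[k], mins_j[k], maxs_j[k], side_lengths[k])^2, 1:3)
    return d2 <= r_cut^2
end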

Observations and new neighbor finder (GPUNeighborFinder)

  • The algorithm does not require a neighbor list to be passed as an input
  • The computation of the minima and maxima is performed (by default) every 10 steps of dynamics. The user can modify this value by explicitly setting the field n_steps_reorder of GPUNeighborFinder.
  • force_kernel! and energy_kernel make explicit use of the matrices eligible and special defined in GPUNeighborFinder

GPUNeighborFinder: a new type of neighbor finder for systems on GPU. For GPUNeighborFinder systems, the function find_neighbors returns nothing as the neighbor list. This is because the new calculation does not require prior knowledge of the neighbor list. The new default choices for the setup of the system are displayed below (lines 890-914 of setup.jl):

if gpu
    neighbor_finder = GPUNeighborFinder(
        eligible=CuArray(eligible),
        dist_cutoff=T(dist_neighbors),
        special=CuArray(special),
        n_steps_reorder=10,
        initialized=false,
    )
elseif use_cell_list
    neighbor_finder = CellListMapNeighborFinder(
        eligible=eligible,
        special=special,
        n_steps=10,
        x0=coords,
        unit_cell=boundary_used,
        dist_cutoff=T(dist_neighbors),
    )
else
    neighbor_finder = DistanceNeighborFinder(
        eligible=eligible,
        special=special,
        n_steps=10,
        dist_cutoff=T(dist_neighbors),
    )
end
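
Consequently, calling find_neighbors on such a system simply returns nothing (s_gpu here stands for any system built with a GPUNeighborFinder, such as the one in the benchmark setup further down):

neighbors = find_neighbors(s_gpu)   # nothing for GPUNeighborFinder systems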

A new struct is introduced to store the forces matrix and the arrays of minima and maxima, together with the sequence of atom indices ordered according to the Morton code.

struct ForcesBuffer{F, C}
    fs_mat::F
    box_mins::C
    box_maxs::C
    Morton_seq::CuArray{Int32}
end
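
For reference, a hedged sketch of how the fields could be allocated by hand; the element types and shapes below are assumptions, and in the PR this is handled by init_forces_buffer:

using CUDA, StaticArrays

n_atoms = 15954
n_blocks = cld(n_atoms, 32)
buffer = ForcesBuffer(
    CUDA.zeros(Float32, 3, n_atoms),                     # fs_mat: per-atom force components
    CuArray(fill(zero(SVector{3, Float32}), n_blocks)),  # box_mins: one minimum corner per box
    CuArray(fill(zero(SVector{3, Float32}), n_blocks)),  # box_maxs: one maximum corner per box
    CuArray(zeros(Int32, n_atoms)),                      # Morton_seq: reordered atom indices
)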

The function forces_nounits! now takes a ForcesBuffer object as an input. While the forces matrix is updated at each step of dynamics, the minima and maxima arrays and the Morton sequence are only updated every n_steps_reorder steps.
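
The resulting call pattern, as quoted from the simulator hunk in the review below (trailing keyword arguments in that hunk are truncated and therefore omitted here too):

forces_buffer = init_forces_buffer(F_nounits, sys.coords, n_threads)
F_nounits = forces_nounits!(F_nounits, sys, neighbors, forces_buffer, step_n)   # step_n is 0 in the quoted hunk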

Tests for this GPU implementation are added in test/simulation.jl and test/protein.jl.

Performance evaluation

Benchmark of forces(sys) performed on 6mrr_equil.pdb (a protein system with 15954 atoms) for the proposed implementation and the previous implementation in Molly:

                                                        Float32                   Float64
with reordering                                         2.724 ms ± 340.828 μs     8.018 ms ± 325.429 μs
without reordering                                      1.983 ms ± 26.878 μs      7.134 ms ± 43.339 μs
previous implementation (with neighbor calculation)     15.084 ms ± 399.510 μs    20.113 ms ± 435.548 μs
previous implementation (without neighbor calculation)  1.257 ms ± 115.581 μs     1.585 ms ± 80.793 μs

The old implementation still outperforms the new one in real simulations, where the neighbor list is not computed at every step. However, I believe that small fixes can improve the behaviour of the kernel in long simulations, as the benchmark with reordering seems encouraging.

potential_energy(sys) always performs the reordering step in the new implementation, since the potential energy is typically computed much less frequently than the forces during a simulation.

                          Float32                   Float64
with reordering           2.028 ms ± 254.345 μs     7.783 ms ± 371.269 μs
previous implementation   20.036 ms ± 523.120 μs    24.720 ms ± 488.714 μs

A possible setup for a benchmark with Float64 precision:

n_atoms = 16_000
n_steps = 500
temp = 298.0u"K"
boundary = CubicBoundary(6.0u"nm")
coords = place_atoms(n_atoms, boundary; min_dist=0.1u"nm")
velocities = [random_velocity(10.0u"g/mol", temp) for i in 1:n_atoms]
simulator = VelocityVerlet(dt=0.002u"ps")

s = System(
    atoms=[Atom(mass=10.0u"g/mol", charge=0.0, σ=0.05u"nm", ϵ=0.2u"kJ * mol^-1") for i in 1:n_atoms],
    coords=coords,
    velocities=velocities,
    boundary=boundary,
    pairwise_inters=(LennardJones(cutoff=DistanceCutoff(1.0u"nm"), use_neighbors=true),),
    neighbor_finder=DistanceNeighborFinder(
        eligible=trues(n_atoms, n_atoms),
        n_steps=10,
        dist_cutoff=1.2u"nm",
    ),
    loggers=(coords=CoordinatesLogger(100),),
)

s_gpu = System(
    atoms=CuArray([Atom(mass=10.0u"g/mol", charge=0.0, σ=0.05u"nm", ϵ=0.2u"kJ * mol^-1") for i in 1:n_atoms]),
    coords=CuArray(coords),
    velocities=CuArray(velocities),
    boundary=boundary,
    pairwise_inters=(LennardJones(cutoff=DistanceCutoff(1.0u"nm"), use_neighbors=true),),
    neighbor_finder=GPUNeighborFinder(
        eligible=CuArray(trues(n_atoms, n_atoms)),
        n_steps_reorder=10,
        dist_cutoff=1.0u"nm",
    ),
    loggers=(coords=CoordinatesLogger(100),),
)

# CPU 
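# short warm-up run so compilation does not affect the timing below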
simulate!(deepcopy(s), simulator, 20; n_threads=1)
@time simulate!(s, simulator, n_steps; n_threads=1)

# GPU
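# same warm-up pattern for the GPU system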
simulate!(deepcopy(s_gpu), simulator, 20; n_threads=1)
@time simulate!(s_gpu, simulator, n_steps; n_threads=1)

@jgreener64 (Collaborator) left a comment:

Great work. I have left most comments inline. I can have a look at the performance too. Also:

  • Does it work for 2D systems? You can extract D as a type parameter from the SVectors and use things like zero(SVector{D, T}) rather than SVector{3, T}(zero(T), zero(T), zero(T)) to make it generic in this sense. You could add a GPU test to the 2D test at the top of test/simulation.jl.
  • Remove the empty file with a strange name.
  • You will need to rebase/merge again from master due to recent changes, sorry about that.


coords = place_atoms(n_atoms, boundary; min_dist=0.1u"nm")

if gpu

Might be worth having nl as a Bool argument to the function and testing both with and without the neighbour list.

@@ -67,6 +67,34 @@ end
@test 3.6f0 < s.boundary.side_lengths[1] < 3.8f0
end

@testset "Peptide Float32 no units on GPU" begin

I'm not sure we need this test, every test adds to the test time.

@@ -262,7 +290,7 @@ end
)
neighbors = find_neighbors(sys)

@test_throws ErrorException forces(sys, nothing)
#@test_throws ErrorException forces(sys, nothing)

Remove these lines.

@@ -221,8 +221,75 @@ end
)
random_velocities!(s, temp)

s_gpu = System(

These new tests run similar things, so it might just be worth keeping one for speed.

@@ -72,7 +72,10 @@ Custom simulators should implement this function.
coords_copy = similar(sys.coords)
F_nounits = ustrip_vec.(similar(sys.coords))
F = F_nounits .* sys.force_units
forces_buffer = init_forces_buffer(F_nounits, n_threads)
forces_buffer = init_forces_buffer(F_nounits, sys.coords, n_threads)
F_nounits = forces_nounits!(F_nounits, sys, neighbors, forces_buffer, 0;

This F_nounits line is not required as it will run on entering the loop.

# ********************************************* TO BE TESTED ************************************************************
function sorted_Morton_seq(positions, w, bits::Int)
N = length(positions)
Morton_sequence = Vector{Int32}(undef, N)

Indentation a bit off in these functions.

function boxes_dist(x1_min::D, x1_max::D, x2_min::D, x2_max::D, Lx::D) where D

a = abs(Molly.vector_1D(x2_max, x1_min, Lx))
b = abs(Molly.vector_1D(x1_max, x2_min, Lx))

No need for Molly. since you are in Molly.

atoms_j_shuffle = Atom(atype_j, aindex_j, amass_j, acharge_j, aσ_j, aϵ_j)
dr = vector(coords_j, coords_i, boundary)
r2 = dr[1]^2 + dr[2]^2 + dr[3]^2
condition = eligible[laneid(), shuffle_idx] == true && Bool_excl == true && r2 <= r_cut * r_cut

No need for the == true on Bools here (and elsewhere).

special[laneid(), m] = special_matrix[s_idx_i, s_idx_j]
end

for m in laneid() + a : warpsize()

Add brackets for clarity: (laneid() + a).

pe_tuple = ntuple(length(inters)) do inter_type_i
SVector(potential_energy_gpu(inters[inter_type_i], dr, atom_i, atom_j, E, special, coord_i, coord_j, boundary,
vel_i, vel_j, step_n))
# why?

Say that using a SVector was required to avoid a GPU error.
