Is it possible to store some historical information of pairs? #81
This is something that should be handled as a parameter passed by the closure, that is, in the command:

(x,y,i,j,d2,damage_map) -> cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map)

You would need to pass a vector of previous lambdas as well to the cal_damage! function. The only concern there is that if the lambdas are updated within that function, you would need to take care of it being thread-safe, which complicates everything quite a bit. But if the lambdas are not updated within the function, then it is a parameter like the others.

Something not related, but that caught my attention. In this line:

d_new = my_norm(x + u_points[i] - y - u_points[j])

you are using the internal representation of the coordinates x and y. If you need to compute something that is dependent on the original positions of the particles, you will again have to pass the original coordinates in the closure that is mapped. For example, first, the function to be mapped would not need the internal coordinates:

# changed following line: x,y => points
function cal_damage!(points,i,j,d2,u_points,points_vol,γ,λᵦ,damage)
# changed following line: x => points[i] ; y => points[j]
d_new = my_norm(points[i]+ u_points[i] - points[j] -u_points[j])
λ = d_new - sqrt(d2)
pair_dam = 0.0
if λ > λᵦ
pair_dam = 1.0
end
damage[i] += pair_dam * points_vol[j]
damage[j] += pair_dam * points_vol[i]
return damage
end

And then you can close over the coordinates as well:

# changed following line: add points as a parameter
function map_damage!(damage_map,box,cl,u_points,points_vol,γ,λᵦ,points)
damage_map .= 0.0
# changed following line x,y => points
map_pairwise!((x,y,i,j,d2,damage_map) -> cal_damage!(points,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map),damage_map,box,cl)
end
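For completeness, a rough sketch of how this could be set up and called (the sizes and parameter values here are only placeholders, and my_norm is your helper):

using CellListMap, StaticArrays

# hypothetical setup, only to illustrate the call pattern
N = 10_000
points = rand(SVector{2,Float64}, N)
u_points = [1e-4 * rand(SVector{2,Float64}) for _ in 1:N]
points_vol = rand(N)
γ, λᵦ, cutoff = 1.0e4, 1.0e-6, 0.05

box = Box(limits(points), cutoff)
cl = CellList(points, box)
damage_map = zeros(N)

map_damage!(damage_map, box, cl, u_points, points_vol, γ, λᵦ, points)
|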
Thanks for your reply!
That's actually what I am concerned. The lambda is calculated inside the
So the historical maximum quantity of lambda should also be updated inside the 'map_pairwise!' function. I think it is something like the
Thanks for pointing out it. The
|
"So the historical maximum quantity of lambda should also be updated inside the 'map_pairwise!' function."

But you need to update the maximum lambda for each iteration, I guess? Because there you are computing a new lambda for each pair. Is there a different lambda for each pair, or is it a global one? If there is only one global lambda for all pairs, does it have to be updated every time a new greater lambda is found?

If it has to be updated every time a new greater lambda is found (not only once per iteration), then it is impossible to write a parallel version of the algorithm. Or, actually, you could use a lock, but that will greatly reduce the performance of parallel runs. The greater problem, however, is that the result would depend on the order in which the pairs are evaluated, such that every parallel run would give a different result, and it is also unclear what a serial version of the algorithm would be. Depending on those variables, the optimal solution is one or another. In the simplest case (there is one global lambda at each iteration, and you update it at the end of the iteration with the greatest lambda found so far), one option is to use as output not only the damage map but also a running maximum of lambda, reduced only at the end of the iteration. Anyway, we need more details to discuss the optimal way to deal with that.

"Thanks for pointing out it. The u_points here denotes the displacement of points, and d_new here is the new length of a pair after deformation"

You're right, since you are using the difference of coordinates in that operation, what you are doing there is correct. Sorry for the noise. |
One way to visualize all this is to have an implementation of the same thing with the naive loop:

# data preparation, etc
# The following loop is replaced by the CellListMap machinery:
for i in 1:N-1
for j in i+1:N
d = norm(x[i] - x[j])
if d < cutoff
# do stuff
end
end
end

It is actually good to have such a naive implementation, not only to have everything clear, but also to compare the results after the optimizations. With a correct naive implementation in hand it will be much easier to understand the complete calculation and to help.
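For example, a self-contained sketch of what such a naive reference could look like for your damage calculation (LinearAlgebra.norm is used in place of my_norm, and the parameter names follow the code above):

using LinearAlgebra: norm

# O(N²) reference implementation, only for checking the optimized versions
function naive_damage(points, u_points, points_vol, λᵦ, cutoff)
    N = length(points)
    damage = zeros(N)
    for i in 1:N-1
        for j in i+1:N
            d = norm(points[i] - points[j])
            d < cutoff || continue
            d_new = norm(points[i] + u_points[i] - points[j] - u_points[j])
            λ = d_new - d
            pair_dam = λ > λᵦ ? 1.0 : 0.0
            damage[i] += pair_dam * points_vol[j]
            damage[j] += pair_dam * points_vol[i]
        end
    end
    return damage
end
|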
Yes, this is clear and easy to implement and the result is absolutely right. We should always compare the optimization result with this one. |
Ok, so if the lambda is specific for each pair, that is fairly easy as well, because its update will be thread-safe anyway.

# changed following line: add lambda_pairs as a parameter
function map_damage!(damage_map,box,cl,u_points,points_vol,γ,λᵦ,lambda_pairs)
damage_map .= 0.0
# changed following line: pass lambda_pairs as well
map_pairwise!((x,y,i,j,d2,damage_map) -> cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage_map,lambda_pairs),damage_map,box,cl)
end

and your cal_damage! function:

function cal_damage!(x,y,i,j,d2,u_points,points_vol,γ,λᵦ,damage,lambda_pairs)
d_new = my_norm(x + u_points[i] - y -u_points[j])
λ = d_new - sqrt(d2)
pair_dam = 0.0
# update lambda_pair here somewhere and use it
if λ > λᵦ
pair_dam = 1.0
end
damage[i] += pair_dam * points_vol[j]
damage[j] += pair_dam * points_vol[i]
return damage
end

The tricky part is to extract the index of the pair in the lambda_pairs array. The easiest alternative is to just store a full matrix of size N×N and update it as:

if i > j
lambda_pair[i,j] = ...
else
lambda_pair[j,i] = ...
end

If you want to save half of the memory (if that is relevant at all to you; it may not be), you need to linear-index the list of pairs somehow. Considering that the storage includes the cases where i == j (the diagonal), something like this works:

julia> function index_in_lambda_pairs(N,i,j)
if i > j
tup = (i,j)
else
tup = (j,i)
end
return LinearIndices((N,N))[tup...]
end
index_in_lambda_pairs (generic function with 1 method)
julia> N = 3
3
julia> lambda_pairs = zeros(div(N*(N-1),2) + N); # triangular, including diagonal;
julia> length(lambda_pairs)
6
julia> index_in_lambda_pairs(3,1,2)
2
julia> index_in_lambda_pairs(3,2,1)
2
julia> index_in_lambda_pairs(3,3,2)
6
julia> index_in_lambda_pairs(3,2,3)
6

(Indexing excluding the diagonal is much more cumbersome.) |
I don't think you need to go into any of the internals of CellListMap for this. You can handle all of that from the interface. In the specific case where you want to update many different properties at the same time, you need to store all output variables inside a struct, as in this example: https://m3g.github.io/CellListMap.jl/stable/PeriodicSystems/#Energy-and-forces

You could use something similar to have a struct storing the damage and the per-pair lambdas.
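Just to illustrate the shape such an output could take (the names here are only illustrative, and how this custom output is copied, reset, and reduced across threads has to be defined following the linked documentation example):

# hypothetical output container bundling everything that is updated pairwise
struct DamageOutput{T}
    damage::Vector{T}    # per-point damage
    pairhis::Vector{T}   # historical maximum λ for each pair
end

# hypothetical mapped function updating both fields; the pair index is obtained
# with the index_in_lambda_pairs helper sketched above
function cal_damage_hist!(points, i, j, d2, u_points, points_vol, λᵦ, out::DamageOutput)
    d_new = my_norm(points[i] + u_points[i] - points[j] - u_points[j])
    λ = d_new - sqrt(d2)
    ipair = index_in_lambda_pairs(length(points), i, j)
    out.pairhis[ipair] = max(out.pairhis[ipair], λ)
    pair_dam = out.pairhis[ipair] > λᵦ ? 1.0 : 0.0
    out.damage[i] += pair_dam * points_vol[j]
    out.damage[j] += pair_dam * points_vol[i]
    return out
end
|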
Thanks for your reply!
Actually I have written a simple code with neighborlist and Threads.@threads.
The above code works well when the number of threads is fewer than 8. When the number of threads is 10 or something larger, it runs slower. I noticed that in the
Another thing that is somewhat strange to me is that both the
Yes, this is what I am confused about. In the past, I used to store the
And these are really good suggestions. As there is a cutoff, this matrix should be sparse, I think. |
I see. What you did is mostly ok (there is a small caveat related to using Threads.threadid(), more on that below). However, the main point is that you do not need (as far as I understand) to compute the neighbor list at all. You can compute the damage and update the lambdas directly while mapping the pairwise function.

All threaded versions have some overhead, which should be small if the number of particles is large. For small numbers of particles, simple serial loops will certainly be faster than any other machinery.

Concerning the indexes: if you cannot store a list of lambdas with size equal to the number of all possible pairs (not considering the cutoff), then your situation is more complicated, because you need to find the index of the corresponding lambda in a list that is not necessarily ordered and may not even contain the pair in question. That will slow these calculations down a lot. If you can store a vector of lambdas for all pairs, go for it, no doubt.

Note that if you compute the neighbor list beforehand, you have a list of lambdas. In the next iteration, if you compute a new neighbor list for the new positions, that list will not correspond anymore to the same lambda list. You need to store, for each lambda, the indexes of the particles involved, and find in that list the correct lambda for each pair of the new neighbor list. This is very costly.
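To illustrate what that lookup would involve, a rough sketch that keys the history on the particle indexes of each pair (the dictionary access is exactly the costly part; points and my_norm are the ones from your code):

# pairhis maps an (i, j) pair, with i < j, to its historical maximum λ
pairhis = Dict{Tuple{Int,Int},Float64}()
for (i, j, d) in neighborlist(points, cutoff)
    key = i < j ? (i, j) : (j, i)
    d_new = my_norm(points[i] + u_points[i] - points[j] - u_points[j])
    pairhis[key] = max(get(pairhis, key, -Inf), d_new - d)
end
|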
Thanks for your reply!
I don't quite understand it. As long as there is a cutoff, if I want to loop through all pairs, I have to build up a neighbor list before the computation starts.
I don't get this point either.
For now I just run over all pairs once each time.
Thanks for pointing it out. Currently I have a laptop with 32GB of memory, and the largest number of particles I can run on it is 300,000 with a cutoff of 0.05.
Does this mean the number of particles is too small?
That's right. The time for finding the index of a pair could become large when there are too many pairs.
You are right. Currently only small-deformation cases are considered, which means that the neighbor list is not updated during the simulation. Btw, is there something I can do to improve the performance of the multi-threaded version when the number of threads is greater than 10? |
There are two possible strategies here:
# 1
nb = neighborlist(points, cutoff)
# 2
for pairs in nb
# do stuff
end

Here you are essentially running over the pairs (within the cutoff) twice: first to compute the neighbor list, second to "compute stuff". If you have to compute many different properties from the same neighbor list, that's fine. This is also common when doing multiple-time stepping, in which the neighbor list is computed up to a cutoff greater than the actual cutoff and only updated once in a while during the "simulation". I'm not sure if that is your case.
The second strategy is the one that maps to the simple naive loop:

for i in 1:N
for j in i+1:N
# compute stuff for pair (i,j)
end
end

Note that for doing this we don't need the explicit neighbor list, nor to store the (perhaps very large) list of neighbors. Your damage property can be computed like this. This is what is mapped, in CellListMap, to:

box = Box(limits(x), cutoff)
cl = CellList(x,box)
map_pairwise!(
    (x,y,i,j,d2,output) -> begin
        # compute stuff for pair (i,j) and update output
        output
    end,
    output, box, cl
)

Here we also do not build the explicit neighbor list.
No, 300k is large. But you are presenting here a different problem, which is parallelizing only the computation for the pairs, given the pair list. One possible problem with a loop like

Threads.@threads for i_pair in eachindex(point_pairs)
    i_thread::Int64 = Threads.threadid()

is that you are distributing the pairs on individual threads. This can be costly if the inner calculation is cheap. A better strategy is to split the workload into chunks, and launch a number of threads corresponding to the number of chunks. This is what the https://github.com/m3g/ChunkSplitters.jl package helps you to do. You would write that loop as:

using ChunkSplitters
...
nchunks = Threads.nthreads() # it can be larger, or smaller, and sometimes better performance can be obtained
Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
for i_pair in pair_range
# do stuff for i_pair, identical as before
....
# note that now damage_threaded is indexed by ichunk (and has to have length nchunks)
damage_threaded[ichunk][i_point] += pair_dam * points_vol[j_point]
damage_threaded[ichunk][j_point] += pair_dam * points_vol[i_point]
...

This is usually faster than simply threading over the individual pairs. |
I rewrote your loop as:

using ChunkSplitters
...
function thread_damage_chunks!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,point_pairs)
nchunks = Threads.nthreads()
Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
for i_pair in pair_range
i_point = point_pairs[i_pair][1]
j_point = point_pairs[i_pair][2]
d_new = my_norm(points[i_point] + u_points[i_point] - points[j_point] - u_points[j_point])
λ = d_new - point_pairs[i_pair][3]
if λ > pairhis[i_pair]
pairhis[i_pair] = λ
end
pair_dam = 0.0
if pairhis[i_pair] > λᵦ
pair_dam = 1.0
end
damage_threaded[ichunk][i_point] += pair_dam * points_vol[j_point]
damage_threaded[ichunk][j_point] += pair_dam * points_vol[i_point]
end
end
end

But that didn't bring any further speedup (I get a ~2x speedup here relative to the serial version, with 12 threads). It is possible that you are memory-limited. I could get a small additional speedup by using the low-level Polyester.@batch macro for the inner loop:

using Polyester

function thread_damage_chunks!(damage_threaded,points,u_points,points_vol,λᵦ,pairhis,point_pairs)
nchunks = Threads.nthreads()
Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
Polyester.@batch for i_pair in pair_range
i_point = point_pairs[i_pair][1]
j_point = point_pairs[i_pair][2]
d_new = my_norm(points[i_point] + u_points[i_point] - points[j_point] - u_points[j_point])
λ = d_new - point_pairs[i_pair][3]
pairhis[i_pair] = ifelse(λ > pairhis[i_pair], λ, pairhis[i_pair])
pair_dam = ifelse(pairhis[i_pair] > λᵦ, 1.0, 0.0)
damage_threaded[ichunk][i_point] += pair_dam * points_vol[j_point]
damage_threaded[ichunk][j_point] += pair_dam * points_vol[i_point]
end
end
end

but it went from 60 to 55 ms, so I'm not sure if it is worth the trouble. |
I got it. The main reason that I chose to loop through pairs is that there is a cutoff and the neighbor list will be used throughout the simulation, maybe thousands of times. Also, if you use the naive double for-loop, there will be too much unnecessary searching in my case because of the cutoff. When a cell list is adopted, the unnecessary searching is reduced, but some remains. But for a really big number of pairs, where memory is the main problem, what you mentioned should be applied.
The same question also arises when using the
Thanks for telling me that. I tried the version with
The hardware of my laptop is
with 32GB of RAM. Maybe it is due to the memory problem. I will try it on some workstations later. Thanks again for your kind help! Actually you have helped me several times on the Julia Discourse :). |
I'm curious: if the neighbors are fixed, what do you compute thousands of times for the same coordinates? In any case, you are probably bound by the speed of memory access in your computer. |
No, the points will move, so there is the displacement u_points, which changes at every step.
Do you mean the frequency of the memory or the amount of memory? Currently there is DDR5 4800MHz memory in the laptop. |
Be careful: generally in that case one uses a cutoff larger than the actual cutoff to build the lists, and tests for the true cutoff again for every pair (see the sketch below). Otherwise you can have boundary problems. Concerning the memory, it is just that the property you compute there for each pair depends on data from both particles, which is scattered in memory.
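A rough sketch of that idea (skin is just an illustrative safety margin; points, u_points, and my_norm are the ones from your code):

skin = 0.2 * cutoff                        # hypothetical safety margin
nb = neighborlist(points, cutoff + skin)   # rebuilt only once in a while
for (i, j, _) in nb
    d_now = my_norm(points[i] + u_points[i] - points[j] - u_points[j])
    d_now <= cutoff || continue            # re-test the true cutoff at the current positions
    # do stuff for pair (i, j)
end
|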
Thanks for your reply.
That's right for MD and peridynamics, as in these models the pairwise force is calculated. In our model only the damage variable is calculated, and the force is calculated within continuum mechanics, so the surface effect will not occur.
Thanks for telling me this. I just tried the same code on an old workstation with 56GB of memory and an Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz (16 threads). The same issue arises: when the number of threads is larger than 10, the performance is worse than with 8 threads. I guess it is the splitting of the tasks into different threads which costs more time? Maybe we can post it on the Discourse to see if there are some improvements. |
For reference, this is a computation and its scaling (I have 6 cores/12 threads) without any memory access (just summing up the squared distances):

julia> points = rand(SVector{2,Float64},100_000)
box = Box(limits(points), 0.05)
for nb in (1,2,4,6,8,10,12)
cl = CellList(points,box; nbatches=(0,nb))
@btime map_pairwise!((x,y,i,j,d2,u) -> u += d2, 0.0, $box, $cl)
end
112.736 ms (20 allocations: 1.72 KiB)
57.675 ms (28 allocations: 2.91 KiB)
31.245 ms (44 allocations: 5.27 KiB)
26.422 ms (60 allocations: 7.62 KiB)
23.321 ms (76 allocations: 9.98 KiB)
19.235 ms (93 allocations: 12.67 KiB)
17.088 ms (109 allocations: 15.03 KiB)

Thus, here, I get a ~6.5x speedup with 12 threads, which is ok given the fact that I have 6 real cores.

I was playing a little with your code. I think there is no escape from the fact that you have one hard problem in terms of memory access. Since the list of pairs is "random" (at each position of the list you can have the indexes of essentially any two particles), the accesses to all the properties of those particles (coordinates, displacements, volumes) are essentially random in memory.

One thing that may be worth trying is to pack all the point properties in a struct, like this:

struct PointProperties{T}
coor::SVector{2,T}
u::SVector{2,T}
vol::T
end

and build an array of point properties. Then the function would become something like this:

function thread_damage_chunks!(damage_threaded,point_properties,λᵦ,pairhis,point_pairs)
nchunks = Threads.nthreads()
Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
@inbounds for i_pair in pair_range
i_point = point_pairs[i_pair][1]
j_point = point_pairs[i_pair][2]
x = point_properties[i_point] # 1 memory access
y = point_properties[j_point] # 1 memory access
d_new = my_norm(x.coor + x.u - y.coor - y.u)
λ = d_new - point_pairs[i_pair][3]
if λ > pairhis[i_pair]
pairhis[i_pair] = λ
end
pair_dam = pairhis[i_pair] > λᵦ
damage_threaded[ichunk][i_point] += pair_dam * x.vol
damage_threaded[ichunk][j_point] += pair_dam * y.vol
end
end
end

in which we have minimized the number of fetches from memory a bit. That kind of thing can help. (You can of course ask on Discourse; I think you'll get similar answers.)
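For completeness, the packed array could be built from your separate arrays like this (field order follows the struct above):

point_properties = [ PointProperties(points[i], u_points[i], points_vol[i]) for i in eachindex(points) ]
|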
I tested the above idea, but even then the access to memory is the limiting factor. As you can see in the profile (or not... the image is not great), most of the time is spent on the memory accesses. You can get that profile by running

@profview cal_damage_chunks!(damage_para,damage_threaded,points,λᵦ,pairhis_para,point_pairs)

from within VSCode. The code is below. If you need large speedups on that, you probably have to rethink how to access the data. Maybe just copy all the data for each thread, if that is possible. Another (not easy) possibility is trying to port the calculation to the GPU. That may be feasible, although it will also require some major data structure rearrangements.

using CellListMap, ChunkSplitters, StaticArrays
struct Points{T}
coor::SVector{2,T}
u::SVector{2,T}
vol::T
end
function thread_damage_chunks!(damage_threaded,points,λᵦ,pairhis,point_pairs)
nchunks = length(damage_threaded)
Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs), nchunks)
@inbounds for i_pair in pair_range
i_point, j_point, d = point_pairs[i_pair]
ph = pairhis[i_pair]
x = points[i_point]
y = points[j_point]
d_new = my_norm(x.coor + x.u - y.coor - y.u)
λ = d_new - d
if λ > ph
ph = λ
pairhis[i_pair] = λ
end
pair_dam = ph > λᵦ
damage_threaded[ichunk][i_point] += pair_dam * x.vol
damage_threaded[ichunk][j_point] += pair_dam * y.vol
end
end
end
function cal_damage_chunks!(damage,damage_threaded,points,λᵦ,pairhis,pairs)
thread_damage_chunks!(damage_threaded,points,λᵦ,pairhis,pairs)
reduce!(damage,damage_threaded)
end
function run3()
N = 100000
ℓ = 0.05
γ = 1.0e4
λᵦ = 1.0e-6
points = [ Points(rand(SVector{2,Float64}), 1e-4*rand(SVector{2,Float64}), rand()) for _ in 1:N ]
coor = [p.coor for p in points]
damage_para = zeros(Float64,N)
damage_threaded = [deepcopy(damage_para) for i in 1:Threads.nthreads()]
point_pairs = neighborlist(coor,ℓ)
num_pairs = length(point_pairs)
pairhis_para = zeros(Float64,num_pairs)
nchunks = 1
damage_threaded = [deepcopy(damage_para) for i in 1:nchunks]
cal_damage_chunks!(damage_para,damage_threaded,points,λᵦ,pairhis_para,point_pairs)
end
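(For the listing above to run standalone, my_norm — your helper — is assumed here to be just the 2D Euclidean norm:)

# assumed definition of my_norm (2D Euclidean norm)
my_norm(v) = sqrt(v[1]^2 + v[2]^2)

run3()
|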
Thanks for your reply! I remember your post on Discourse about accelerating the pairwise L-J potential calculation, where the solution seemed to be sorting the pairs first to optimize the access to memory, after which the performance improved. But later I found that it depends largely on your code: if you comment out one special line, then sorting does not matter. I commented on that post but no one replied. Actually I implemented what you said last night, but it was too late, so I didn't post it here.
After optimizing the access to memory, there is a 10ms improvement for the parallel version on my laptop, and a 2-3ms improvement for the serial version. That really helps a lot. One strange thing here is that, if you run the above code with
Is this caused by the so-called "overhead"? The other strange thing is that I supposed these lines would allocate new memory:
But actually they don't. Is it because the struct is too small?
That's really a question for me. Yesterday you told me that random access to memory will slow down the program. Then I also tried to modify the storage of the pairs, from a Vector of Tuples to a Vector of structs, like
But when I implemented it, the program got slower. Attached is my code:
Generally it is 10ms slower than just optimizing the access to the point properties, even though there is actually less
Another strange thing is that usually if the |
Well, hard to say. It may be the call to
No, they do not allocate because the objects are immutable. "Allocation" means allocating a position in heap memory, with an address. There you are only creating an object that lives in stack memory, or in a register, or may even be optimized out. You would only allocate if the object was mutable and you were copying it explicitly (see, for instance: https://m3g.github.io/JuliaNotes.jl/stable/immutable/).
That happens if the loop is loop-vectorized (in the
I also played with the idea, and the performance didn't improve either. I don't really have any other suggestion; I also tried a bunch of things and failed to accelerate that. You have correct code there from what I can tell, and the performance bottleneck is already the speed of memory access.

If you need that to be much faster, or to scale much better, you will probably have to find a way to parallelize it at a higher level, to depend less on shared data between threads, something like that. One trivial way to do that is, as I mentioned, to copy the data for each thread (like you do with the output), and let the threads work completely independently. That, of course, may be complicated in your case if you have to store the complete neighbor list. |
ps: I tried copying all the variables, but it didn't help either. I'm out of ideas, I don't know how to scale that better.

# run6: same packages and my_norm helper as above, plus BenchmarkTools for @btime
using CellListMap, ChunkSplitters, StaticArrays, BenchmarkTools
struct Points6{T}
coor::SVector{2,T}
u::SVector{2,T}
vol::T
end
function thread_damage_chunks6!(damage_threaded,points_threaded,λᵦ,pairhis_threaded,point_pairs_threaded)
nchunks = length(damage_threaded)
Threads.@threads for (pair_range, ichunk) in chunks(eachindex(point_pairs_threaded[begin]), nchunks)
@inbounds for i_pair in pair_range
i_point, j_point, d = point_pairs_threaded[ichunk][i_pair]
ph = pairhis_threaded[ichunk][i_pair]
x = points_threaded[ichunk][i_point]
y = points_threaded[ichunk][j_point]
d_new = my_norm(x.coor + x.u - y.coor - y.u)
λ = d_new - d
if λ > ph
ph = λ
pairhis_threaded[ichunk][i_pair] = λ
end
pair_dam = ph > λᵦ
damage_threaded[ichunk][i_point] += pair_dam * x.vol
damage_threaded[ichunk][j_point] += pair_dam * y.vol
end
end
end
function cal_damage_chunks6!(damage,damage_threaded,points_threaded,λᵦ,pairhis_threaded,pairs_threaded)
thread_damage_chunks6!(damage_threaded,points_threaded,λᵦ,pairhis_threaded,pairs_threaded)
reduce!(damage,damage_threaded)
return nothing
end
function run6(;nchunks=Threads.nthreads())
N = 100_000
ℓ = 0.05
γ = 1.0e4
λᵦ = 1.0e-6
points = [ Points6(rand(SVector{2,Float64}), 1e-4*rand(SVector{2,Float64}), rand()) for _ in 1:N ]
coor = [p.coor for p in points]
damage_para = zeros(Float64,N)
point_pairs = neighborlist(coor,ℓ)
num_pairs = length(point_pairs)
pairhis_para = zeros(Float64,num_pairs)
damage_threaded = [deepcopy(damage_para) for _ in 1:nchunks]
points_threaded = [copy(points) for _ in 1:nchunks]
pairhis_para_threaded = [ copy(pairhis_para) for _ in 1:nchunks ]
point_pairs_threaded = [ copy(point_pairs) for _ in 1:nchunks ]
# cal_damage_chunks6!(damage_para,damage_threaded,points_threaded,λᵦ,pairhis_para_threaded,point_pairs_threaded)
@btime cal_damage_chunks6!($damage_para,$damage_threaded,$points_threaded,$λᵦ,$pairhis_para_threaded,$point_pairs_threaded)
end
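(Just to connect with the earlier remark that the number of chunks does not have to match the number of threads, a possible usage; the values are only illustrative:)

run6(nchunks = Threads.nthreads())        # one chunk per thread
run6(nchunks = 4 * Threads.nthreads())    # more, smaller chunks; sometimes balances the load better
|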
Thanks for your reply and kind help!
The same situation for me! As long as there is
A very big thanks to you. I really learned something from the communication with you.
Not too much, haha. I used to think the speedup was related to the number of threads, so I was confused that I have 20 threads but only a 6-7x speedup. Today I learned that the speedup relates mostly to the number of physical cores, not the threads. As I have only 8 big physical cores (plus 4 small cores, which are not useful here), I think the speedup is satisfactory :).
That is what confuses me now. I will post it on the Discourse later. Anyway, thanks a loooooot for your help and the discussions. |
Sort of. Most computers have N cores and 2N threads, because of hyperthreading. But hyperthreading is a trick to take advantage of idle processing capacity of the cores; if your calculation already fills up the cores to their maximum, hyperthreading is not useful at all. That is very much dependent on each application and the load on each core. For codes that saturate the cores with floating-point operations, hyperthreading usually does not help. |
Yes, I think it is the reason why the code performs worse with 10 threads than with 8 threads. |
I want to calculate a damage variable, which is somehow similar to the force in an MD simulation, with CellListMap.jl. In my problem, the pairwise damage is related to the deformation history of the pair, and I would like to store a historical value for each pair during the calculation. Is there a possible way to do this with CellListMap.jl?
I will explain it with code. Attached is my code. In the function cal_damage!, I would like to replace the λᵦ with a historical quantity, e.g., the maximum of λ in previous calculations (see the attached code). Is it possible with CellListMap.jl? Thanks for your help.