Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster implementation for get_points_in_sphere() using ogrid #196

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

uellue
Copy link

@uellue uellue commented Sep 22, 2023

When profiling the performance of diffraction pattern simulation, most time was spent in this function.

This implementation runs more than three times faster since it reduces the calculation effort through broadcasting and re-uses coordinate and distance calculations.

Instead of using diffpy, it performs the equivalent transformation to cartesian and calculation of the norm directly to allow these optimizations.

Description of the change

Equivalent calculation using numpy.ogrid() and broadcasting to calculate hkl map, cartesian vectors, norm and selection to radius.

Progress of the PR

Code style is dead link:
image

Minimal example of the bug fix or new feature

image

For reviewers

  • The PR title is short, concise, and will make sense 1 year later.
  • New functions are imported in corresponding __init__.py.
  • New features, API changes, and deprecations are mentioned in the
    unreleased section in CHANGELOG.rst.
  • Contributor(s) are listed correctly in credits in diffsims/release_info.py and
    in .zenodo.json.

When profiling the performance of diffraction pattern simulation, most time was
spent in this function.

This implementation runs more than three times faster since it reduces the
calculation effort through broadcasting and re-uses coordinate and
distance calculations.

Instead of using diffpy, it performs the equivalent transformation to cartesian
and calculation of the norm directly to allow these optimizations.
@uellue
Copy link
Author

uellue commented Sep 22, 2023

The previous tests were run with parameters that turned out unusual, i.e. way too many reflections. Here's a more typical result, twice as fast for the whole simulation:

image

@hakonanes hakonanes added the enhancement New feature, request, or improvement label Sep 23, 2023
@hakonanes hakonanes added this to the v0.6.0 milestone Sep 23, 2023
@hakonanes
Copy link
Member

Thank you for this improvement, @uellue! The changes look good and the tests pass, so I'm happy to merge this PR.

Is it important for you to have this released soon? We don't have a release planned in the near future, as far as I'm aware, as the only current unreleased changes are maintenance updates. I plan to add some functionality to downstream kikuchipy in the next months, though. If I find anything to add or update to diffsims, I'll probably make a diffsims v0.6.0 release in connection with that.

A thought regarding speed of computations. While it's convenient to use class methods from diffpy.structure, they are not optimized for larger arrays, as you show here. If we replace more of these calls with our own computations, we could consider dropping diffpy.structure as a dependency.

@uellue
Copy link
Author

uellue commented Sep 25, 2023

Releasing can wait until it is convenient, no worries!

About replacing diffpy, their implementations didn't look so bad. This method was just a particular case because it still is THE hotspot of the entire simulation, intermediate values can be used for the return values instead of being recalculated, and the structure of the problem was perfect for broadcasting. The implementation here is pretty much equivalent to inlining the diffpy implementation. I also tried a Numba, but that didn't help. I guess that appending to a list of selected reflections depending on the norm is hard to vectorize for the compiler, so creating and then filtering the large intermediate arrays works well, too, in particular if they fit into the CPU caches. Broadcasting allows to only perform a minimum of operations on the full-size arrays.

If one wanted to speed this up further, I'd probably look into the structure of the whole simulation. I was wondering, shouldn't be the set of points in the sphere be the same for each rotation of the lattice? It feels so symmetric... In that case one could calculate a base version, select once, and then generate the rotated versions by rotating the selected part of the base? This sounds like a perfect job for GPUs, by the way.

@uellue
Copy link
Author

uellue commented Sep 25, 2023

This sounds like a perfect job for GPUs, by the way.

...resp. for GEMM on whatever platform -- should perform very well since it is a straight dot product with a rotation matrix.

@CSSFrancis
Copy link
Member

CSSFrancis commented Oct 17, 2023

If one wanted to speed this up further, I'd probably look into the structure of the whole simulation. I was wondering, shouldn't be the set of points in the sphere be the same for each rotation of the lattice? It feels so symmetric... In that case one could calculate a base version, select once, and then generate the rotated versions by rotating the selected part of the base? This sounds like a perfect job for GPUs, by the way.

Is there even a good reason that this is slow? I've been looking at this and I think @uellue is correct.

If you actually look at the code:

First the orientations are iterated through

for i, orientation in enumerate(tqdm(orientations, leave=False)):
simulation = diffractor.calculate_ed_data(
structure,
reciprocal_radius,
rotation=orientation,
with_direct_beam=with_direct_beam,
max_excitation_error=max_excitation_error,
shape_factor_width=shape_factor_width,
debye_waller_factors=debye_waller_factors,
)

Then the function get_points_in_sphere is called a bunch of times

recip_latt = latt.reciprocal()
g_indices, cartesian_coordinates, g_hkls = get_points_in_sphere(
recip_latt, reciprocal_radius
)
ai, aj, ak = (
np.deg2rad(rotation[0]),
np.deg2rad(rotation[1]),
np.deg2rad(rotation[2]),
)
R = euler2mat(ai, aj, ak, axes="rzxz")
cartesian_coordinates = np.matmul(R, cartesian_coordinates.T).T
# Identify the excitation errors of candidate points
r_sphere = 1 / wavelength
r_spot = np.sqrt(np.sum(np.square(cartesian_coordinates[:, :2]), axis=1))
z_spot = cartesian_coordinates[:, 2]

and then it is rotated....

So we can just call get_points_in_sphere once? ....

@hakonanes
Copy link
Member

@CSSFrancis, regardless, should we merge this and move on?

@CSSFrancis
Copy link
Member

@hakonanes The get_points_in_sphere method is going to be removed and the bigger problem is related to the fact that in this method there are repeated calls to get_points_in_sphere when there don't need to be.

It might be worth looking at the ReciporicalLatticeVector.from_min_dspacing method to see if there are changes there that could be sped up.

@hakonanes hakonanes removed this from the v0.6.0 milestone Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature, request, or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants