
Adding the ability to use dask arrays with chunks along spatial axes #280

Merged
merged 18 commits into from
Jul 28, 2023

Conversation

charlesgauthier-udm
Contributor

This solves issue #222. I made modifications so that dask arrays with chunks along the spatial axes (e.g. lat/lon) can be used. Here's the gist of it: the weight matrix is converted to a dask array and chunked to match the chunk size of the input data along the inner axes (i.e. the axes that are collapsed by the dot product), so that chunk borders align. Along the outer axes, the default behavior is also to match the chunk size of the input data, resulting in square chunks on the weight matrix, which offers a good trade-off between speed and memory usage as described here. However, I added an argument to the regridder that lets users specify the chunks of the output data.
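A minimal standalone sketch of that chunk alignment (not xESMF's actual code; the grid sizes, chunk sizes, and variable names are illustrative):

```python
import numpy as np
import dask.array as da

lat_in = lon_in = 8    # illustrative input grid
lat_out = lon_out = 4  # illustrative output grid

rng = np.random.default_rng(0)

# Input data chunked along the spatial axes
indata = da.from_array(rng.random((lat_in, lon_in)), chunks=(4, 4))

# 4D weights (lat_out, lon_out, lat_in, lon_in): chunk the inner axes to
# match the input chunks so chunk borders align, and mirror the same chunk
# size on the outer axes (the default "square chunks" behavior)
spatial_chunk = indata.chunks[0][0]
w = da.from_array(
    rng.random((lat_out, lon_out, lat_in, lon_in)),
    chunks=(spatial_chunk,) * 4,
)

# Contract the input's spatial axes against the weights' inner axes
out = da.tensordot(indata, w, axes=[(0, 1), (2, 3)])
print(out.compute().shape)  # (4, 4)
```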

Key points:

  • In apply_weights, the input data was previously flattened along the spatial axes so it could be multiplied by the 2D weight matrix, and the output then had to be reshaped back. I changed it to use the 4D w property (added in #276, "Added w property to Regridder and SpatialAverager") together with np.tensordot instead.
  • Output data inherits the same chunks as indata on the non-spatial axes, as was already the case. The output_chunks kwarg is a tuple indicating the desired chunks on the spatial axes of the output data.
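To illustrate the first point, here is a small standalone sketch (not xESMF code; shapes are made up) showing that contracting the spatial axes with np.tensordot matches the old flatten/matmul/reshape route:

```python
import numpy as np

lat_in, lon_in = 6, 8    # illustrative input grid
lat_out, lon_out = 3, 4  # illustrative output grid
ntime = 5                # an extra, non-spatial dim

rng = np.random.default_rng(0)
w4d = rng.random((lat_out, lon_out, lat_in, lon_in))  # 4D weights
indata = rng.random((ntime, lat_in, lon_in))

# New route: contract the two spatial axes directly, no reshaping
out_4d = np.tensordot(indata, w4d, axes=[(-2, -1), (-2, -1)])

# Old route: flatten spatial axes, matmul with 2D weights, reshape back
w2d = w4d.reshape(lat_out * lon_out, lat_in * lon_in)
out_2d = (indata.reshape(ntime, -1) @ w2d.T).reshape(ntime, lat_out, lon_out)

print(np.allclose(out_4d, out_2d))  # True
```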

Examples

Regridding from a subset (lat: 5,400, lon: 3,000) of the Gridded Population of the World (gpw) dataset at 0.01° resolution to CORDEX WRF in Lambert conformal at 0.22° resolution (y: 281, x: 297). The original dataset has shape (lat: 21,600, lon: 43,200), but the memory usage of computing the weights limits the size of the gpw subset that can be used. Here is what we get when we regrid from gpw (lat: 5,400, lon: 3,000) to WRF (y: 281, x: 297):

2D weights Numpy (Pure numpy arrays, no dask arrays, as it was before the changes): 0.010s±0.07 (100 trials)

4D weights Numpy (now using np.tensordot without flattening indata): 0.011s±0.07 (100 trials)

Dask arrays gpw (lat:5,400, lon:3,000, chunks=(1000,1000)): 0.28s ± 0.2 (100 trials)

After testing and playing around with different chunk sizes, my understanding is that if using numpy is possible, it is difficult to beat speed-wise, since np.tensordot relies on BLAS operations.

For the sake of the example, I can also bypass the weight computation and just generate a random sparse array with the same density as the subset weights: sps.random((lat_out,lon_out,lat_in,lon_in),density=6e-9). If we then perform the regridding from gpw (lat: 21,600, lon: 43,200, chunks=(1000,1000)), it takes 8.7s ± 0.3 (100 trials). At that size, numpy cannot be used because it requires too much memory.

Also, there does not seem to be a lot of difference between the 4D and 2D weight matrices when the input data has no extra dims. However, playing around with random sparse weights, a difference does appear. For example, regridding (lat_in: 600, lon_in: 600) --> (lat_out: 200, lon_out: 200) with random weights sps.random((lat_out,lon_out,lat_in,lon_in),density=0.00001):

2D weights: 0.24 ± 0.03 (100 trials)
4D weights: 0.13 ± 0.01 (100 trials)

Using 4D weights seems to outperform 2D when there are extra dims, particularly in cases where the input grid is larger than the output grid.

@aulemahal aulemahal requested review from aulemahal and huard July 5, 2023 19:58
@aulemahal aulemahal linked an issue Jul 5, 2023 that may be closed by this pull request
@huard
Contributor

huard commented Jul 6, 2023

Also mention the changes in CHANGES.md.

charlesgauthier-udm added 2 commits July 6, 2023 14:13
@charlesgauthier-udm
Contributor Author

I added the output_dims arg to the BaseRegridder. I still need to update the Dask notebook in the docs and mention the changes in CHANGES.md; working on both right now.


@charlesgauthier-udm
Contributor Author

Updated the Dask notebook in the docs and CHANGES.rst.

@aulemahal
Collaborator

For some reason, your merge commit that followed the "output_dims" one erased that previous work. I re-added it through magical git commands.

Collaborator

@aulemahal aulemahal left a comment


This and its doc look good to go!

@huard huard requested a review from raphaeldussin July 27, 2023 12:33

Successfully merging this pull request may close these issues.

Regridding xarray dataset with chunked dask-backed arrays
3 participants