
Adding the ability to use dask arrays with chunks along spatial axes #280

Merged
merged 18 commits into from
Jul 28, 2023

Conversation

charlesgauthier-udm
Contributor

This solves issue #222. I made modifications so that dask arrays with chunks along the spatial axes (e.g. lat/lon) can be used. Here's the gist of it: the weight matrix is converted to a dask array and chunked to match the chunk size of the input data along the inner axes (i.e. the axes that are collapsed by the dot product), so that chunk borders align. Along the outer axes, the default behavior is also to match the chunk size of the input data, resulting in square chunks on the weight matrix, which offers a good trade-off between speed and memory usage as described here. However, I added an argument to the regridder that lets users specify the chunks of the output data.
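A minimal standalone sketch of that chunk alignment (not xESMF's actual code; the grid sizes, chunk sizes, and variable names are illustrative):

```python
import numpy as np
import dask.array as da

lat_in = lon_in = 8    # illustrative input grid
lat_out = lon_out = 4  # illustrative output grid

rng = np.random.default_rng(0)

# Input data chunked along the spatial axes
indata = da.from_array(rng.random((lat_in, lon_in)), chunks=(4, 4))

# 4D weights (lat_out, lon_out, lat_in, lon_in): chunk the inner axes to
# match the input chunks so chunk borders align, and mirror the same chunk
# size on the outer axes (the default "square chunks" behavior)
spatial_chunk = indata.chunks[0][0]
w = da.from_array(
    rng.random((lat_out, lon_out, lat_in, lon_in)),
    chunks=(spatial_chunk,) * 4,
)

# Contract the input's spatial axes against the weights' inner axes
out = da.tensordot(indata, w, axes=[(0, 1), (2, 3)])
print(out.compute().shape)  # (4, 4)
```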

Key points:

  • In apply_weights, the input data was previously flattened along the spatial axes so it could be multiplied by the 2D weight matrix, and the output then had to be reshaped back. I changed it to use the 4D w property (added in #276, "Added w property to Regridder and SpatialAverager") together with np.tensordot instead.
  • Output data inherits the same chunks as indata on the non-spatial axes, as was already the case. The output_chunks kwarg is a tuple indicating the desired chunks on the spatial axes of the output data.
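To illustrate the first point, here is a small standalone sketch (not xESMF code; shapes are made up) showing that contracting the spatial axes with np.tensordot matches the old flatten/matmul/reshape route:

```python
import numpy as np

lat_in, lon_in = 6, 8    # illustrative input grid
lat_out, lon_out = 3, 4  # illustrative output grid
ntime = 5                # an extra, non-spatial dim

rng = np.random.default_rng(0)
w4d = rng.random((lat_out, lon_out, lat_in, lon_in))  # 4D weights
indata = rng.random((ntime, lat_in, lon_in))

# New route: contract the two spatial axes directly, no reshaping
out_4d = np.tensordot(indata, w4d, axes=[(-2, -1), (-2, -1)])

# Old route: flatten spatial axes, matmul with 2D weights, reshape back
w2d = w4d.reshape(lat_out * lon_out, lat_in * lon_in)
out_2d = (indata.reshape(ntime, -1) @ w2d.T).reshape(ntime, lat_out, lon_out)

print(np.allclose(out_4d, out_2d))  # True
```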

Examples

Regridding from a subset (lat: 5,400, lon: 3,000) of the Gridded Population of the World (gpw) dataset at 0.01° resolution to CORDEX WRF in Lambert conformal at 0.22° resolution (y: 281, x: 297). The original dataset has shape (lat: 21,600, lon: 43,200), but the memory usage of computing the weights limits the size of the gpw subset that can be used. Here is what we get when we regrid from gpw (lat: 5,400, lon: 3,000) to WRF (y: 281, x: 297):

2D weights Numpy (Pure numpy arrays, no dask arrays, as it was before the changes): 0.010s±0.07 (100 trials)

4D weights Numpy (now using np.tensordot without flattening indata): 0.011s±0.07 (100 trials)

Dask arrays gpw (lat:5,400, lon:3,000, chunks=(1000,1000)): 0.28s ± 0.2 (100 trials)

After testing and playing around with different chunk sizes, my understanding is that if using numpy is possible, it is difficult to beat speed-wise, since np.tensordot relies on BLAS operations.

For the sake of the example, I can also bypass the weight computation and just generate a random sparse array with the same density as the subset weights: sps.random((lat_out,lon_out,lat_in,lon_in),density=6e-9). If we then perform the regridding from gpw (lat: 21,600, lon: 43,200, chunks=(1000,1000)), it takes 8.7s ± 0.3 (100 trials). At that size, numpy cannot be used because it requires too much memory.

Also, there does not seem to be a lot of difference between the 4D and 2D weight matrices when the input data has no extra dims. However, playing around with random sparse weights, a difference does appear. For example, regridding (lat_in: 600, lon_in: 600) --> (lat_out: 200, lon_out: 200) with random weights sps.random((lat_out,lon_out,lat_in,lon_in),density=0.00001):

2D weights: 0.24 ± 0.03 (100 trials)
4D weights: 0.13 ± 0.01 (100 trials)

Using 4D weights seems to outperform 2D when there are extra dims, particularly in cases where the input grid is larger than the output grid.

@aulemahal aulemahal requested review from aulemahal and huard July 5, 2023 19:58
@aulemahal aulemahal linked an issue Jul 5, 2023 that may be closed by this pull request
@huard
Contributor

huard commented Jul 6, 2023

Also mention the changes in CHANGES.md.

charlesgauthier-udm added 2 commits July 6, 2023 14:13
@charlesgauthier-udm
Contributor Author

I added the output_dims arg to the BaseRegridder. I still need to update the Dask notebook in the docs and mention the changes in CHANGES.md; working on both right now.


@charlesgauthier-udm
Contributor Author

Updated the Dask notebook in the docs and CHANGES.rst.

@aulemahal
Collaborator

For some reason, your merge commit that followed the "output_dims" one erased that previous work. I re-added it through magical git commands.

Collaborator

@aulemahal aulemahal left a comment


This and its doc look good to go!

@huard huard requested a review from raphaeldussin July 27, 2023 12:33

Successfully merging this pull request may close these issues.

Regridding xarray dataset with chunked dask-backed arrays
3 participants