Optionally skip spatial bounds in read_parquet #203

TomAugspurger · 2022-06-20T14:09:22Z

Adds a new gather_spatial_partitions keyword to read_parquet
to disable opening each file to get its spatial bounds. The name was
chosen to mimic dask's gather_statistics keyword.
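
As a rough usage sketch of the keyword described above: the wrapper name `open_dataset` and the path in the comment are placeholders for illustration, not part of this PR. Only `read_parquet` and `gather_spatial_partitions` come from the description.

```python
def open_dataset(path, gather_spatial_partitions=True):
    """Hypothetical thin wrapper around dask_geopandas.read_parquet."""
    # Imported lazily so the sketch can be read without dask-geopandas installed.
    import dask_geopandas

    # Passing gather_spatial_partitions=False asks read_parquet not to open
    # each file to get its spatial bounds, which is what this PR adds.
    return dask_geopandas.read_parquet(
        path, gather_spatial_partitions=gather_spatial_partitions
    )


# Example call (placeholder path, assumed to hold many parquet files):
# ddf = open_dataset("data/parcels.parquet", gather_spatial_partitions=False)
```

Skipping the bounds is presumably most useful for datasets with many files, where the per-file open dominates the time spent in `read_parquet`.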

Also adds a small docs section (I didn't see an easy way to
insert a snippet in the docstring).

Closes #194.

One note of hesitation: I think Dask is mid-transition in how it handles reading metadata. I'm wondering whether we should just rely on the behavior of dask's gather_statistics keyword. IIUC, both it and this new gather_spatial_partitions control whether there's a per-file operation in read_parquet.

Maybe @jcrist or @rjzamora have a recommendation on whether adding a new keyword here is going against where Dask is headed.

TomAugspurger · 2022-06-20T14:15:31Z

cc @jorisvandenbossche.

martinfleis

This looks nice! Thanks!

jorisvandenbossche · 2022-06-20T14:54:30Z

The gather_statistics keyword is now deprecated in dask, in favor of keywords for the explicit end behaviour you want (e.g. calculate_divisions=True or split_row_groups=True; in both cases the statistics, or more generally the parquet file metadata, need to be read).
So I think adding a keyword like this specifically for controlling the spatial partitions seems in line with the latest changes in dask.

TomAugspurger · 2022-06-20T17:28:20Z

OK, thanks. In that case, I think a calculate_spatial_divisions or calculate_spatial_partitions keyword is appropriate, mirroring calculate_divisions (https://docs.dask.org/en/stable/dataframe-parquet.html#calculating-divisions).

For now I'll go with calculate_spatial_partitions.

martinfleis · 2022-06-20T20:58:35Z

We talked about that a bit, and "calculate" is not necessarily the right word, as dask is not calculating but gathering bounds that are stored in the parquet metadata. "Calculate" implies that dask will read all geometries and get the total_bounds of those for each partition, which is not the case.
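
To make the gather/calculate distinction concrete, here is a toy sketch in plain Python. The `FILES` structure and both helper names are made up for illustration; real parquet metadata is richer, and this is not how dask-geopandas is implemented.

```python
# Hypothetical stand-in for two parquet files: each carries a small metadata
# entry with a precomputed bbox, plus the (much larger) geometry data.
FILES = [
    {"meta": {"bbox": (0.0, 0.0, 2.0, 3.0)},
     "points": [(0.0, 0.0), (2.0, 3.0), (1.0, 1.5)]},
    {"meta": {"bbox": (5.0, 1.0, 9.0, 4.0)},
     "points": [(5.0, 4.0), (9.0, 1.0)]},
]


def gather_bounds(files):
    """Gather: look up the bbox already stored in each file's metadata."""
    return [f["meta"]["bbox"] for f in files]


def calculate_bounds(files):
    """Calculate: scan every geometry and derive (minx, miny, maxx, maxy)."""
    bounds = []
    for f in files:
        xs = [x for x, _ in f["points"]]
        ys = [y for _, y in f["points"]]
        bounds.append((min(xs), min(ys), max(xs), max(ys)))
    return bounds
```

Both helpers return the same boxes for this toy data; the difference is that gathering only touches the metadata, while calculating has to read every geometry.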

TomAugspurger · 2022-06-20T21:38:10Z

OK, reverted to go back to gather_spatial_partitions.

TomAugspurger · 2022-06-28T16:24:26Z

@jorisvandenbossche or @martinfleis could you merge this when you get a chance?

And how hard are releases for dask-geopandas to do? We'll have a new dataset later this week / early next week that would benefit from this :)

martinfleis · 2022-06-28T16:51:42Z

Hey, I'll have a look later tonight and we can even cut 0.2.0. We already talked about that last week with @jorisvandenbossche.

martinfleis · 2022-06-28T20:19:28Z

I'll go ahead and merge this, then we should ideally get #205 in and then can cut 0.2.0.

martinfleis approved these changes Jun 20, 2022

Use calculate_spatial_partitions

5efcc17

Revert "Use calculate_spatial_partitions"

b6d1547

This reverts commit 5efcc17.

martinfleis merged commit 91b5de7 into geopandas:main Jun 28, 2022
