Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally skip spatial bounds in read_parquet #203

Merged

Conversation

TomAugspurger
Copy link
Contributor

Adds a new gather_spatial_partitions keyword to read_parquet
to disable opening each file to get its spatial bounds. The name was
chosen to mimic dask's gather_statistics keyword.

Also adds a small docs section (I didn't see an easy way to
insert a snippet in the docstring).

Closes #194.


One note of hesitation: I think Dask mid-transition for handling how it reads metadata. I'm wondering whether we should just rely on the behavior of dask's gather_statistics keyword. IIUC, both it and this new gather_spatial_partitions control whether there's a per-file operation in read_parquet.

Maybe @jcrist or @rjzamora have a recommendation on whether adding a new keyword here is going against where Dask is headed.

Adds a new `gather_spatial_partitions` keyword to `read_parquet`
to disable opening each file to get its spatial bounds. The name was
chosen to mimic dask's `gather_statistics` keyword.

Also adds a small docs section (I didn't see an easy way to
insert a snippet in the docstring).

Closes geopandas#194.
@TomAugspurger
Copy link
Contributor Author

cc @jorisvandenbossche.

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice! Thanks!

@jorisvandenbossche
Copy link
Member

The gather_statistics keyword is deprecated now in dask, in favor of keywords for the explicit end behaviour you want (eg calculate_divisions=True, or split_row_groups=True, for which in both cases the statistics (or more generally parquet file metadata) needs to be read).
So I think adding a keyword like this specifically for controlling the spatial partitions seems in line with the latest changes in dask.

@TomAugspurger
Copy link
Contributor Author

OK, thanks. In that case, I think a calculate_spatial_divisions or calculate_spatial_partitions keyword is appropriate, mirroring calculate_divisions (https://docs.dask.org/en/stable/dataframe-parquet.html#calculating-divisions).

For now I'll go with calculate_spatial_partitions.

@martinfleis
Copy link
Member

We talked about that a bit and calculate is not necessarily a right word as dask is not calculating but gathering bounds that are stored in the parquet meta. Calculate imposes that dask will read all geometries and get total_bounds of those for each partition, which is not the case.

@TomAugspurger
Copy link
Contributor Author

OK, reverted to go back to gather_spatial_partitions.

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche or @martinfleis any chance you could merge this when you get a chance?

And how hard are releases for dask-geopandas to do? We'll have a new dataset later this week / early next week that would benefit from this :)

@martinfleis
Copy link
Member

Hey, I'll have a look later tonight and we can even cut 0.2.0. We already talked about that last week with @jorisvandenbossche.

@martinfleis
Copy link
Member

I'll go ahead and merge this, then we should ideally get #205 in and then can cut 0.2.0.

@martinfleis martinfleis merged commit 91b5de7 into geopandas:main Jun 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow read_parquet with many files.
3 participants