Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix read_parquet with dask>=2022.12.0 #230

Merged

Conversation

the-matt-morris
Copy link
Contributor

closes #225

Updates method signature of the following so that they are compatible with all versions of dask, including dask>=2022.12.0:

  • dask_geopandas.io.arrow.ArrowDatasetEngine._arrow_table_to_pandas
  • dask_geopandas.io.arrow.GeoDatasetEngine._arrow_table_to_pandas
  • dask_geopandas.io.parquet.GeoArrowEngine._create_dd_meta

It's been a little while since I played around with this, but I had written previously that these lines in dask are problematic, as it calls _arrow_table_to_pandas, where it wasn't called prior to dask==2022.12.0.

At the time, I thought we may need an update to GeoDatasetEngine._arrow_table_to_pandas to handle this, but I feel like someone with a better grasp on this library's internals may have a better idea. Either way, it is known that this PR in its current state is an attempt to solve the problem, but it doesn't succeed entirely in doing that, as you'll see with one of the unit tests failing.

@jorisvandenbossche
Copy link
Member

Either way, it is known that this PR in its current state is an attempt to solve the problem, but it doesn't succeed entirely in doing that, as you'll see with one of the unit tests failing.

There are two failing tests at the moment. One will be fixed by #229 (and is unrelated to the parquet IO), and the other is a case that should error but is giving a wrong error message. But that indeed seems to be related to the lines you link to (_arrow_table_to_pandas is being called with an empty table, and that is not working with our implementation). I will take a look at that.

@jorisvandenbossche
Copy link
Member

And thanks for opening the PR!

@the-matt-morris
Copy link
Contributor Author

You're right, I was confused as to why that other test was failing, glad there's a PR for that one already. Thanks for checking it out!

@jorisvandenbossche jorisvandenbossche changed the title fix: read_parquet with dask>=2022.12.0 BUG: fix read_parquet with dask>=2022.12.0 Jan 22, 2023
@jorisvandenbossche
Copy link
Member

We can probably further improve this and make the integration more robust, but let's already merge this so we can have green CI and have this running with latest dask.

Thanks again @the-matt-morris!

@jorisvandenbossche jorisvandenbossche merged commit 13ee2f2 into geopandas:main Jan 22, 2023
@raybellwaves raybellwaves mentioned this pull request Jan 23, 2023
@the-matt-morris the-matt-morris deleted the dask-2022.12.0-read-parquet branch January 23, 2023 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug reading parquet files with dask==2022.12.0
2 participants