read parquet from s3 failing with 'GeoArrowEngine' has no attribute 'extract_filesystem' #250

Open
raybellwaves opened this issue Apr 17, 2023 · 3 comments

@raybellwaves
Contributor

raybellwaves commented Apr 17, 2023

We run nightly tests that read GeoParquet files from our S3 buckets (using intake-geopandas). This started failing with the release of dask 2023.4.0 three days ago (cc @jrbourbeau).

I'll try to update this if I can find a GeoParquet file hosted on a public S3 bucket.

Create new environment:
mamba create -n test_env python=3.10 --y && conda activate test_env

Install dask-geopandas and s3fs:
pip install dask-geopandas s3fs

Open a (geo)parquet file:

import dask_geopandas as dgpd
dgpd.read_parquet("s3://BUCKET/FILE.parquet")
Traceback (most recent call last):
  File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/backends.py", line 135, in wrapper
    return func(*args, **kwargs)
  File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py", line 519, in read_parquet
    fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
AttributeError: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask_geopandas/io/parquet.py", line 111, in read_parquet
    result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
  File "/opt/userenvs/ray.bell/test_env/lib/python3.10/site-packages/dask/backends.py", line 137, in wrapper
    raise type(e)(
AttributeError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: type object 'GeoArrowEngine' has no attribute 'extract_filesystem'

See packages installed:

pip freeze

aiobotocore==2.5.0
aiohttp==3.8.4
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.2
attrs==23.1.0
botocore==1.29.76
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
click-plugins==1.1.1
cligj==0.7.2
cloudpickle==2.2.1
dask==2023.4.0
dask-geopandas==0.3.0
distributed==2023.4.0
Fiona==1.9.3
frozenlist==1.3.3
fsspec==2023.4.0
geopandas==0.12.2
HeapDict==1.0.1
idna==3.4
importlib-metadata==6.4.1
Jinja2==3.1.2
jmespath==1.0.1
locket==1.0.0
MarkupSafe==2.1.2
msgpack==1.0.5
multidict==6.0.4
munch==2.5.0
numpy==1.24.2
packaging==23.1
pandas==2.0.0
partd==1.4.0
psutil==5.9.5
pyproj==3.5.0
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
s3fs==2023.4.0
shapely==2.0.1
six==1.16.0
sortedcontainers==2.4.0
tblib==1.7.0
toolz==0.12.0
tornado==6.2
tzdata==2023.3
urllib3==1.26.15
wrapt==1.15.0
yarl==1.8.2
zict==2.2.0
zipp==3.15.0
@jrbourbeau
Contributor

Thanks @raybellwaves. I wonder if this is a duplicate of #241?

This started failing with the release of dask 2023.4.0 three days ago

I'm not aware of any related changes in this release. The extract_filesystem method in the traceback was added several releases ago (xref dask/dask#9699).

Also, as Joris mentioned in #241 (comment), I would expect GeoArrowEngine to have an extract_filesystem method regardless, since it subclasses the arrow parquet engine in dask.
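One way to test that expectation in a given environment is to look at the engine class's method resolution order. The following is a minimal sketch, not dask or dask-geopandas code: `describe_engine` and the three stand-in classes are hypothetical names used only for illustration.

```python
def describe_engine(engine_cls, method_name="extract_filesystem"):
    """Return the names in the class's MRO and whether it exposes `method_name`."""
    mro_names = [klass.__name__ for klass in engine_cls.__mro__]
    return mro_names, hasattr(engine_cls, method_name)


# Stand-in for the dask arrow engine (hypothetical, illustration only).
class FakeArrowEngine:
    @classmethod
    def extract_filesystem(cls, urlpath):
        return urlpath


# Engine whose base class was imported successfully.
class HealthyEngine(FakeArrowEngine):
    pass


# Engine whose base class silently became `object`.
class DegradedEngine(object):
    pass


healthy = describe_engine(HealthyEngine)
degraded = describe_engine(DegradedEngine)
print(healthy)
print(degraded)
```

In an affected environment, passing the real class (the traceback places `GeoArrowEngine` in `dask_geopandas/io/parquet.py`) to a helper like this should show whether the dask arrow engine is actually among its bases.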

@jtmiclat
Contributor

Cross posting from the other thread #241 (comment)

=======
Hi! I was able to look into this. If pyarrow is not installed, the inheritance falls apart because of the fallback import:

try:
    # pyarrow is imported here, but is an optional dependency
    from dask.dataframe.io.parquet.arrow import (
        ArrowDatasetEngine as DaskArrowDatasetEngine,
    )
except ImportError:
    DaskArrowDatasetEngine = object

I think some environments ship with pyarrow by default, so you really need a clean environment to reproduce this. One solution is to raise an import error (or emit a warning) when GeoArrowEngine is used and pyarrow was not imported successfully.
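A rough sketch of that idea follows. It is not the dask-geopandas implementation: `import_base_or_error`, `require_import`, `SketchEngine`, and `read_parquet_sketch` are hypothetical names, and a deliberately missing module name stands in for an environment without pyarrow. Because the traceback shows the engine being passed as a class (`engine=GeoArrowEngine`) rather than instantiated, the check here sits in the reading function instead of `__init__`.

```python
import importlib


def import_base_or_error(module_name, class_name):
    """Try to import a base class; return (base, None) or (object, error)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        # Keep the original error so it can be reported later.
        return object, err
    return getattr(module, class_name), None


def require_import(import_error, feature):
    """Raise a descriptive ImportError if the optional import failed."""
    if import_error is not None:
        raise ImportError(
            f"{feature} requires pyarrow; install it with `pip install pyarrow`."
        ) from import_error


# Hypothetical module name that does not exist, mimicking a missing pyarrow.
Base, BASE_IMPORT_ERROR = import_base_or_error(
    "not_a_real_module_xyz", "ArrowDatasetEngine"
)


class SketchEngine(Base):
    """Stand-in for an engine that subclasses an optional base."""


def read_parquet_sketch(path):
    # Fail early with a clear message instead of a later AttributeError.
    require_import(BASE_IMPORT_ERROR, "read_parquet")
    return path
```

With this shape, a user without pyarrow would see an ImportError naming the missing dependency rather than the `extract_filesystem` AttributeError from the report.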

To reiterate, this fails:

pip install dask dask-geopandas

This works:

pip install dask dask-geopandas pyarrow
# or (quoted so that shells such as zsh do not expand the brackets)
pip install "dask[complete]" dask-geopandas
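To confirm which of these packages an environment actually has, a small check along the following lines may help. It is a sketch using only the standard library; `installed_versions` is a hypothetical helper name. Note that the `pip freeze` output in the original report lists no pyarrow entry, which is consistent with the failing case.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


report = installed_versions(["dask", "dask-geopandas", "pyarrow"])
for name, found in report.items():
    print(f"{name}: {found or 'NOT INSTALLED'}")
```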

@jtmiclat
Contributor

We might want to do something similar to how geopandas checks whether pygeos is installed:

https://github.com/geopandas/geopandas/blob/04c2dee547777d9e87f9df4c85cc372a03b29f93/geopandas/_compat.py#L51-L67
