Arrow map type in parquet files unsupported #5612

Open
TevenLeScao opened this issue Mar 6, 2023 · 4 comments

Comments

@TevenLeScao
Contributor

Describe the bug

When I try to load Parquet files that were processed with Spark, I get the following error:

ValueError: Arrow type map<string, string ('warc_headers')> does not have a datasets dtype equivalent.

Strangely, loading the dataset with streaming=True solves the issue.

Steps to reproduce the bug

The dataset is private, but this can be reproduced with any dataset that has Arrow maps.

Expected behavior

The dataset should load whether streaming is True or not.

Environment info

  • datasets version: 2.10.1
  • Platform: Linux-5.15.0-1029-gcp-x86_64-with-glibc2.31
  • Python version: 3.10.7
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2
@lhoestq lhoestq removed their assignment Mar 6, 2023
@mariosasko
Collaborator

I'm attaching a minimal reproducible example:

from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq

table_with_map = pa.Table.from_pydict(
    {"a": [1, 2], "b": [[("a", 2)], [("b", 4)]]},
    schema=pa.schema({"a": pa.int32(), "b": pa.map_(pa.string(), pa.int32())})
)
pq.write_table(table_with_map, "parquet_with_map.parquet")
dset = load_dataset("parquet", data_files="parquet_with_map.parquet", split="train") # error unless streaming=True

For a dataset generated with the packaged loaders (CSV, JSON, Parquet), streaming=True sets the dataset's features to None unless they are explicitly provided in load_dataset. Hence no error is thrown as long as the features stay "unresolved"; resolving them with _resolve_features will lead to an error.

@eware-godaddy

I've also been wondering about datasets support for Arrow Map datatypes. I had a pandas series of dict[str, float] with hundreds of different possible keys (i.e. not bounded), and it got converted to a sequence of structs where every single struct had the entire set of keys.

I worked around it by explicitly creating a sequence of [str, float] pairs, but given that pyarrow has an explicit Map datatype, it would be good to be able to explicitly cast to or force this data type.

@severo
Collaborator

severo commented Mar 15, 2024

(feel free to ignore) polars will not support this type: pola-rs/polars#3942 (comment)

Polars will not add the map dtype. Its benefits do not outweigh the extra complexity. Maybe we can investigate conversion of maps to struct. But I will have to explore that.

@metasj

metasj commented Mar 15, 2024

Looks like they chose to convert every instance with pola-rs/polars#4226

6 participants