Arrow map type in parquet files unsupported #5612
Comments
I'm attaching a minimal reproducible example:

from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq

table_with_map = pa.Table.from_pydict(
    {"a": [1, 2], "b": [[("a", 2)], [("b", 4)]]},
    schema=pa.schema({"a": pa.int32(), "b": pa.map_(pa.string(), pa.int32())})
)
pq.write_table(table_with_map, "parquet_with_map.parquet")
dset = load_dataset("parquet", data_files="parquet_with_map.parquet", split="train")  # error unless streaming=True

For a dataset generated with the packaged loaders (CSV, JSON, Parquet),
I've also been wondering about datasets support for Arrow Map datatypes. I had a pandas Series of dict[str, float] with hundreds of different possible keys (i.e. not bounded), and it was converted to a sequence of structs in which every struct carried the entire set of keys. I worked around it by explicitly creating a sequence of [str, float] pairs, but given that pyarrow has an explicit Map datatype, it would be good to be able to cast to or force that data type explicitly.
(feel free to ignore) polars will not support this type: pola-rs/polars#3942 (comment)
Looks like they chose to convert every instance with pola-rs/polars#4226
Describe the bug
When I try to load parquet files that were processed with Spark, I get the following issue:
ValueError: Arrow type map<string, string ('warc_headers')> does not have a datasets dtype equivalent.
Strangely, loading the dataset with streaming=True solves the issue.
Steps to reproduce the bug
The dataset is private, but this can be reproduced with any dataset that has Arrow maps.
Expected behavior
The dataset should load whether or not streaming=True is set.
Environment info
datasets version: 2.10.1