Arrow map type in parquet files unsupported #5612

TevenLeScao · 2023-03-06T12:03:24Z

Describe the bug

When I try to load Parquet files that were processed with Spark, I get the following error:

ValueError: Arrow type map<string, string ('warc_headers')> does not have a datasets dtype equivalent.

Strangely, loading the dataset with streaming=True solves the issue.

Steps to reproduce the bug

The dataset is private, but this can be reproduced with any dataset that has Arrow maps.

Expected behavior

The dataset should load regardless of whether streaming=True is set.

Environment info

datasets version: 2.10.1
Platform: Linux-5.15.0-1029-gcp-x86_64-with-glibc2.31
Python version: 3.10.7
PyArrow version: 8.0.0
Pandas version: 1.4.2


mariosasko · 2023-03-14T17:20:25Z

I'm attaching a minimal reproducible example:

from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq

table_with_map = pa.Table.from_pydict(
    {"a": [1, 2], "b": [[("a", 2)], [("b", 4)]]},
    schema=pa.schema({"a": pa.int32(), "b": pa.map_(pa.string(), pa.int32())})
)
pq.write_table(table_with_map, "parquet_with_map.parquet")
dset = load_dataset("parquet", data_files="parquet_with_map.parquet", split="train") # error unless streaming=True

For a dataset generated with the packaged loaders (CSV, JSON, Parquet), streaming=True sets the dataset's features to None (unless explicitly provided in load_dataset), hence no error will be thrown as long as the features stay "unresolved" (resolving the features with _resolve_features will lead to an error).

eware-godaddy · 2023-11-26T21:43:19Z

I've also been wondering about datasets support for Arrow Map datatypes. I had a situation where I had a pandas Series of dict[str, float] with hundreds of different possible keys (i.e. an unbounded key set), and this got converted to a sequence of structs in which every single struct carried the entire set of keys.

I worked around it by explicitly creating a sequence of [str, float] pairs, but given that pyarrow has an explicit Map datatype, it would be good to be able to explicitly cast to (or force) this data type.
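To make the difference concrete, here is a small pure-Python sketch of the two representations. It is only an illustration: the function names and the `key`/`value` field names are made up for this example, and the actual conversion code used in the workaround was not posted.

```python
def as_struct_rows(rows):
    """Struct-style encoding: every row carries every key seen in any row."""
    all_keys = sorted({key for row in rows for key in row})
    return [{key: row.get(key) for key in all_keys} for row in rows]


def as_pair_rows(rows):
    """Pair-style encoding: every row lists only its own (key, value) entries."""
    return [[{"key": k, "value": v} for k, v in row.items()] for row in rows]


rows = [{"x": 1.0}, {"y": 2.0, "z": 3.0}]

struct_rows = as_struct_rows(rows)  # each row has x, y and z, mostly None
pair_rows = as_pair_rows(rows)      # each row has only its own entries
```

With hundreds of possible keys, the struct-style rows grow with the size of the whole key set, while the pair-style rows grow only with the number of entries actually present.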

severo · 2024-03-15T12:26:03Z

(feel free to ignore) polars will not support this type: pola-rs/polars#3942 (comment)

Polars will not add the map dtype. Its benefits do not outweigh the extra complexity. Maybe we can investigate conversion of maps to struct. But I will have to explore that.

metasj · 2024-03-15T18:56:11Z

Looks like they chose to convert every instance with pola-rs/polars#4226

TevenLeScao assigned lhoestq Mar 6, 2023

lhoestq removed their assignment Mar 6, 2023

jaygala24 mentioned this issue May 1, 2024

add xsim++ task under retrieval category embeddings-benchmark/mteb#609

