This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Support Map type from parquet #996

Closed
cjermain opened this issue May 20, 2022 · 3 comments
Labels
feature (A new feature), no-changelog (Issues whose changes are covered by a PR and thus should not be shown in the changelog)

Comments

@cjermain
Contributor

The Map type is reasonably common in Parquet files, so being able to deserialize it would be of significant value. At first glance it looks like arrow2::array::MapArray already provides the Arrow-side support for this. A simple Parquet example can be generated with the following code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Each row of col1 is a list of (key, value) tuples, which pyarrow converts to a map.
df = pd.DataFrame({
    'col1': pd.Series([
        [('key', 'foo'), ('value', 'bar')],
        [('key', 'foo'), ('value', 'biz')],
    ]),
    'col2': pd.Series(['fiz', 'buz']),
})

# Map with string keys and string values.
udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, 'test.parquet')
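
For context on what a reader has to produce for col1: Arrow stores a map column as a list of key/value entries, i.e. one offsets buffer plus flat keys and values arrays. The sketch below models that layout in plain Python for the two rows above. The helper names `to_map_layout` and `from_map_layout` are made up for illustration; this is not arrow2 or pyarrow code.

```python
# Illustrative only: a plain-Python model of Arrow's map layout,
# not arrow2 or pyarrow code. Helper names are hypothetical.
rows = [
    [("key", "foo"), ("value", "bar")],
    [("key", "foo"), ("value", "biz")],
]


def to_map_layout(rows):
    """Flatten rows of (key, value) pairs into offsets plus flat keys/values."""
    offsets = [0]
    keys, values = [], []
    for row in rows:
        for k, v in row:
            keys.append(k)
            values.append(v)
        # Each row ends where the flat arrays currently end.
        offsets.append(len(keys))
    return offsets, keys, values


def from_map_layout(offsets, keys, values):
    """Rebuild the rows by slicing the flat arrays with consecutive offsets."""
    return [
        list(zip(keys[start:end], values[start:end]))
        for start, end in zip(offsets, offsets[1:])
    ]


offsets, keys, values = to_map_layout(rows)
print(offsets)  # [0, 2, 4]
assert from_map_layout(offsets, keys, values) == rows
```

Row i of the map column is the slice `offsets[i]:offsets[i + 1]` of the flat entries, which is the same idea as a list array whose items are key/value structs.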

Loading using arrow2 fails on the match statement in the parquet deserialization:

@jorgecarleitao added the feature label May 21, 2022
@jorgecarleitao
Owner

Agree - we should support this.

The steps I usually follow to implement this, in no particular order, are:

  • Add cases in parquet_integration/write_parquet.py
  • Add the corresponding data in tests/it/io/parquet/mod.rs
  • Add tests in tests/it/io/parquet/read.rs
  • Implement
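
As a rough picture of what the "Implement" step involves, reading a nested column such as a map means turning Parquet's repetition and definition levels into Arrow offsets and validity. The sketch below is a simplified illustration in Python and makes no claim about how arrow2 does it. It assumes a nullable map column whose key column has a maximum definition level of 2 (0 = null map, 1 = empty map, 2 = entry present); the function name `levels_to_offsets` is hypothetical.

```python
# Simplified illustration; not arrow2's implementation.
# Assumption: nullable map column, so for the key column
#   def level 0 = map is null, 1 = map is empty, 2 = an entry is present.
def levels_to_offsets(rep_levels, def_levels, max_def=2):
    """Derive list-style offsets and row validity from Parquet levels."""
    offsets = [0]
    validity = []
    count = 0  # number of entries seen so far
    for rep, d in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 starts a new row; close the previous one.
            if validity:
                offsets.append(count)
            validity.append(d > 0)
        if d == max_def:
            # Only fully defined slots carry an actual key/value entry.
            count += 1
    if validity:
        offsets.append(count)  # close the last row
    return offsets, validity


# Two rows with two entries each, like col1 in the example file.
print(levels_to_offsets([0, 1, 0, 1], [2, 2, 2, 2]))
# ([0, 2, 4], [True, True])
```

Null and empty maps both contribute zero entries, so they repeat the previous offset and differ only in the validity flag.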

I can try to take a stab at this in 2 weeks or so - I am focused on writing nested parquet atm

@jorgecarleitao
Owner

Addressed in #1045 :)

@cjermain
Contributor Author

cjermain commented Jun 4, 2022

Awesome, thanks!

@cjermain closed this as completed Jun 4, 2022
@jorgecarleitao added the no-changelog label Jun 5, 2022