Python dict to map<string, string> #14116

oleksandr-yatsuk · 2022-09-14T08:09:18Z

Hello guys,

We are trying to write Python dictionaries to Parquet as a map<string, string> column, and we are facing an issue converting them to a pyarrow.Table.

The simple code snippet

def test_map_type(self):
        from pyarrow import Table
        from pyarrow import int64, map_, schema, string, field

        tags_updated = {
            "id": 1,
            "tags": {
                "tag1": "value1",
                "tag2": "value2"
            }
        }
        pyarrow_schema = schema([
            field("id", int64(), False),
            field("tags", map_(string(), string()), False)
        ])

        table = Table.from_pylist(mapping=[tags_updated], schema=pyarrow_schema)

        print(table.to_pydict())

fails with an error

pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
   ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
   ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowTypeError: Could not convert {'tag1': 'value1', 'tag2': 'value2'} with type dict: was not a sequence or recognized null for conversion to list type

pyarrow/error.pxi:122: ArrowTypeError

We are using pyarrow == 7.0.0, but it does not work with pyarrow == 9.0.0 either


drin · 2022-09-14T19:54:15Z

I believe the problem is that you are incorrectly providing data to from_pylist and you are incorrectly structuring the data for tags.

Here is a code snippet that I think does what you're looking for:

from pyarrow import Table
from pyarrow import int64, map_, schema, string, field

def test_map_type():
    # 2 column schema; a map array is a list of tuples
    pyarrow_schema = schema([
         field('id'  , int64()                 , False)
        ,field('tags', map_(string(), string()), False)
    ])

    # each row should have: <key count> <= <column count>
    first_row ={
         'id'  : 1
        ,'tags': [
              ('tag1', 'value1')
             ,('tag2', 'value2' )
         ]
    }
    second_row = {
         'id'  : 2
        ,'tags': [
              ('tag1', 'value1')
             ,('tag2', 'value2' )
         ]
    }

    tags_updated = [first_row, second_row]
    table        = Table.from_pylist(mapping=tags_updated, schema=pyarrow_schema)

    print(table.to_pydict())

test_map_type()

output:

>> python test.py
{'id': [1, 2], 'tags': [[('tag1', 'value1'), ('tag2', 'value2')], [('tag1', 'value1'), ('tag2', 'value2')]]}

oleksandr-yatsuk · 2022-09-14T21:56:47Z

@drin thank you for the response.
Our input is JSON, which deserializes to a Python dict as I posted; it is not an array of tuples.
Changing the pyarrow_schema to 'array of arrays'

pyarrow_schema = schema([
        field('id', int64(), False),
        field('tags', list_(list_(string())), False)
    ])

works as well.
Do you know if there is any way to save a python dict as a map (but not struct) parquet column?

drin · 2022-09-15T20:20:35Z

I don't know very much about parquet, though I do understand Arrow.

If I understand the question correctly, you want to know how to go from a dict to a MapArray?

What I tried to show was that the mapping param of from_pylist seems to take a list of rows, where each row can be a dictionary mapping a column name to a value for that column. In that sense, you could just have a function that transforms a dict to a list of tuples. Logically this should be easy, though I'm not sure how important performance is for you.

# this dictionary represents a single row
tags_updated = {
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    }
}

# extract and convert the value for the "tags" column
tags_as_map = [
    (tag_key, tag_val)
    for tag_key, tag_val in tags_updated.get('tags', {}).items()
]

# then we replace the value
tags_updated['tags'] = tags_as_map

# then we can use "from_pylist" as usual
table = Table.from_pylist(mapping=[tags_updated], schema=pyarrow_schema)

drin · 2022-09-15T20:27:08Z

If you're not actually asking about how to make from_pylist work from your original example, then I don't think I understand the question and maybe you can rephrase or provide more context?

If you're writing to parquet files, then I am probably not able to help much since I don't know parquet that well, so I'm not sure what the requirements are for "saving a python dict as a map column".

If you're reading from parquet files into Arrow tables, and json is what's stored in your parquet files, then I think the above should address the problem (though maybe not very fast).

Arrow is able to read parquet files using the Datasets API, so I assume you're not trying to read parquet files into JSON and then use the JSON to construct Arrow tables?

oleksandr-yatsuk · 2022-09-16T13:48:39Z

@drin answering your question: we read JSON, validate it with a marshmallow-dataclass schema, and then store those JSON documents as parquet files.
Parquet supports the map type, and pyarrow stores map columns as expected.

The problem here is getting a pyarrow table; once we have it, it can be saved as a parquet file.

Java DTO

public class VideoScoreUpdated {
    public String id;
    public Map<String, String> tags;
}

serializes to JSON as an object

{
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    }
}

In Python it deserializes to a dict object, not to an array of tuples

{
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    }
}

Normally a Map field in any language (Java, Scala, C#, etc.) serializes to JSON as an object, not as an array of JSON tuples/objects.
That's why I would expect pyarrow to work the same way: JSON object -> python dict -> pyarrow map

Manually converting every Python dict into an array of tuples is a lot of work and is not an option for us.

The optimal option would be to let the pyarrow schema control whether a dict field becomes a map or a struct.

For example:

tags_updated = {
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    },
   "user": {
       "id": "user-1",
       "country": "ES"
    }
}

tags_updated_schema = schema([
    field("id", string(), False),
    field("tags", map_(string(), string()), False),
    field("user", struct([
        field("id", string(), False),
        field("country", string(), False)
    ]), False)
])

Correct me if I'm wrong, but is a pyarrow struct the only type that can be declared in a pyarrow schema for a Python dict field?
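
The schema-controlled behaviour asked for above can be approximated with a small pre-processing step. This is only a minimal sketch: `convert_maps` and its `map_paths` argument are hypothetical names (not part of pyarrow), and it assumes you list by hand which fields the schema declares as maps. Dicts at those paths become lists of (key, value) tuples; every other dict is left as a dict so it can still be read as a struct. The input uses a string id, as in the Java DTO, and the pyarrow part is guarded so the conversion logic can be run on its own.

```python
import json


def convert_maps(value, map_paths, path=()):
    """Copy `value`, turning any dict located at a path listed in
    `map_paths` into a list of (key, value) tuples. Other dicts stay dicts."""
    if isinstance(value, dict):
        if path in map_paths:
            return list(value.items())
        return {
            key: convert_maps(child, map_paths, path + (key,))
            for key, child in value.items()
        }
    return value


# Hypothetical input, shaped like the example above.
raw = (
    '{"id": "1", "tags": {"tag1": "value1", "tag2": "value2"}, '
    '"user": {"id": "user-1", "country": "ES"}}'
)
row = json.loads(raw)

# Only "tags" is declared as a map; "user" stays a struct-like dict.
converted = convert_maps(row, map_paths={("tags",)})
print(converted)

try:
    import pyarrow as pa
except ImportError:
    pa = None  # the conversion above does not depend on pyarrow

if pa is not None:
    schema = pa.schema([
        ("id", pa.string()),
        ("tags", pa.map_(pa.string(), pa.string())),
        ("user", pa.struct([("id", pa.string()), ("country", pa.string())])),
    ])
    table = pa.Table.from_pylist([converted], schema=schema)
    print(table.schema)
```

The sketch does not descend into lists, so a list of dicts nested inside a row would need an extra branch.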

drin · 2022-09-22T22:03:47Z

As far as I can tell, yes: only struct arrays accept dictionaries as elements; map arrays do not.

Just for completeness, though, I wanted to mention that you can use dict.items() to get a list of tuples. I realize that I don't know how complex your actual structure is or how many you have, etc; so, this is just if you have a stronger preference for map arrays than struct arrays.

    # tags_ref is a generator, so maybe not too much performance impact?
    # could also consider wrapping the dict in a class that delegates all calls except when accessing 'tags'?
    # some_row = {'id': 1, 'tags': {'tag1': 'value1', 'tag2': 'value2'}}
    tags_ref = (
        {
             'id'  : some_row.get('id')
            ,'tags': list(some_row.get('tags').items())
        }
        for some_row in tags_updated
    )

    table = Table.from_pylist(mapping=list(tags_ref), schema=pyarrow_schema)

jorisvandenbossche · 2022-09-23T16:08:40Z

To boil down the issue to a minimal example, creating a MapArray from a python sequence currently requires a list of tuples:

arr = pa.array([[('a', 1), ('b', 2)], [('c', 3)]], pa.map_(pa.string(), pa.int64()))

While I think it certainly makes sense that the following could also work (using dicts instead, which would avoid the manual conversion step from dicts to list of tuples in advance):

arr = pa.array([{'a': 1, 'b': 2}, {'c': 3}], pa.map_(pa.string(), pa.int64()))

I opened https://issues.apache.org/jira/browse/ARROW-17832 to track this.

drin · 2022-09-23T18:03:28Z

I tried adding an initial comment to help give a quick starting point for anyone who can work on the issue. If it sits around for long enough I can try and pick it up. If you have time @oleksandr-yatsuk to do it, I can also try and help you out where useful

jorisvandenbossche · 2023-12-01T08:09:34Z

In the meantime, this conversion of dicts to map type has been implemented:

ARROW-17832: [Python] Construct MapArray from sequence of dicts (instead of list of tuples) #14547

and thus my minimal example above now works:

In [9]: arr = pa.array([{'a': 1, 'b': 2}, {'c': 3}], pa.map_(pa.string(), pa.int64()))

In [10]: arr
Out[10]: 
<pyarrow.lib.MapArray object at 0x7f8b4e93d8a0>
[
  keys:
  [
    "a",
    "b"
  ]
  values:
  [
    1,
    2
  ],
  keys:
  [
    "c"
  ]
  values:
  [
    3
  ]
]

So this issue can be closed now, I think.
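
For code that has to run against both older and newer pyarrow releases, one option is to try the dict form first and fall back to lists of tuples. A minimal sketch, assuming older releases fail with the ArrowTypeError shown in the original report; `map_array_from_dicts` is a hypothetical helper name, not a pyarrow function. The pyarrow import is guarded so the fallback form can be inspected without it.

```python
dicts = [{"a": 1, "b": 2}, {"c": 3}]

# The list-of-tuples form, which was the required input before the change.
items = [list(d.items()) for d in dicts]
print(items)

try:
    import pyarrow as pa
except ImportError:
    pa = None  # the rest of the sketch is skipped without pyarrow


def map_array_from_dicts(dicts, map_type):
    """Build a MapArray from dicts, falling back to lists of tuples
    when the installed pyarrow rejects dicts for a map type."""
    try:
        return pa.array(dicts, type=map_type)
    except pa.ArrowTypeError:
        return pa.array([list(d.items()) for d in dicts], type=map_type)


if pa is not None:
    arr = map_array_from_dicts(dicts, pa.map_(pa.string(), pa.int64()))
    print(arr.to_pylist())
```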

asfimport mentioned this issue Nov 9, 2022

[Python] Construct MapArray from sequence of dicts (instead of list of tuples) #33053

Closed

HyukjinKwon mentioned this issue Jan 30, 2023

[Python] Construct nested MapArray from nested sequence of dicts #33928

Open

haixuanTao mentioned this issue Apr 5, 2023

Share events to Python without copying via arrow crate dora-rs/dora#228

Merged

jorisvandenbossche closed this as completed Dec 1, 2023

jorisvandenbossche added the Component: Python label Dec 1, 2023
