Python dict to map<string, string> #14116

Closed
oleksandr-yatsuk opened this issue Sep 14, 2022 · 9 comments

oleksandr-yatsuk commented Sep 14, 2022

Hello guys,

We are trying to write Python dictionaries to Parquet as a map<string, string> column, and we are hitting an error when converting them to a pyarrow.Table.

The simple code snippet

def test_map_type(self):
    from pyarrow import Table
    from pyarrow import int64, map_, schema, string, field

    tags_updated = {
        "id": 1,
        "tags": {
            "tag1": "value1",
            "tag2": "value2"
        }
    }
    pyarrow_schema = schema([
        field("id", int64(), False),
        field("tags", map_(string(), string()), False)
    ])

    table = Table.from_pylist(mapping=[tags_updated], schema=pyarrow_schema)

    print(table.to_pydict())

fails with an error

pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
   ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
   ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowTypeError: Could not convert {'tag1': 'value1', 'tag2': 'value2'} with type dict: was not a sequence or recognized null for conversion to list type

pyarrow/error.pxi:122: ArrowTypeError

We are using pyarrow == 7.0.0, and the same error occurs with pyarrow == 9.0.0.

Contributor

drin commented Sep 14, 2022

I believe the problem is in how the data is passed to from_pylist, and specifically in how the value for tags is structured.

Here is a code snippet that I think does what you're looking for:

from pyarrow import Table
from pyarrow import int64, map_, schema, string, field

def test_map_type():
    # 2 column schema; a map array is a list of tuples
    pyarrow_schema = schema([
         field('id'  , int64()                 , False)
        ,field('tags', map_(string(), string()), False)
    ])

    # each row should have: <key count> <= <column count>
    first_row ={
         'id'  : 1
        ,'tags': [
              ('tag1', 'value1')
             ,('tag2', 'value2' )
         ]
    }
    second_row = {
         'id'  : 2
        ,'tags': [
              ('tag1', 'value1')
             ,('tag2', 'value2' )
         ]
    }

    tags_updated = [first_row, second_row]
    table        = Table.from_pylist(mapping=tags_updated, schema=pyarrow_schema)

    print(table.to_pydict())

test_map_type()

output:

>> python test.py
{'id': [1, 2], 'tags': [[('tag1', 'value1'), ('tag2', 'value2')], [('tag1', 'value1'), ('tag2', 'value2')]]}

Author

oleksandr-yatsuk commented Sep 14, 2022

@drin thank you for the response.
Our input is JSON, which is deserialized to a Python dict as I posted; it is not an array of tuples.
Changing the pyarrow_schema to an 'array of arrays'

pyarrow_schema = schema([
    field('id', int64(), False),
    field('tags', list_(list_(string())), False)
])

works as well.
Do you know of any way to save a Python dict as a map (rather than a struct) Parquet column?

Contributor

drin commented Sep 15, 2022

I don't know very much about parquet, though I do understand Arrow.

If I understand the question correctly, you want to know how to go from a dict to a MapArray?

What I tried to show was that the mapping param of from_pylist seems to take a list of rows, where each row can be a dictionary mapping a column name to a value for that column. In that sense, you could just have a function that transforms a dict to a list of tuples. Logically this should be easy, though I'm not sure how important performance is for you.

# this dictionary represents a single row
tags_updated = {
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    }
}

# extract and convert the value for the "tags" column
tags_as_map = [
    (tag_key, tag_val)
    for tag_key, tag_val in tags_updated.get('tags', {}).items()
]

# then we replace the value
tags_updated['tags'] = tags_as_map

# then we can use "from_pylist" as usual
table = Table.from_pylist(mapping=[tags_updated], schema=pyarrow_schema)

Contributor

drin commented Sep 15, 2022

If you're not actually asking about how to make from_pylist work from your original example, then I don't think I understand the question and maybe you can rephrase or provide more context?

If you're writing to parquet files, then I am probably not able to help much since I don't know parquet that well, so I'm not sure what the requirements are for "saving a python dict as a map column".

If you're reading from parquet files into Arrow tables, and json is what's stored in your parquet files, then I think the above should address the problem (though maybe not very fast).

Arrow is able to read parquet files using the Datasets API, so I assume you're not trying to read parquet files into JSON and then use the JSON to construct Arrow tables?

Author

oleksandr-yatsuk commented Sep 16, 2022

@drin answering your question: we read JSON, validate it with marshmallow-dataclass schema, and then store those JSONs as parquet files.
Parquet supports map type and pyarrow stores map columns as expected.

The problem here is getting a pyarrow table; once we have it, it can be saved as a parquet file.

Java DTO

public class VideoScoreUpdated {
    public String id;
    public Map<String, String> tags;
}

serializes to JSON as an object

{
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    }
}

In Python, it deserializes to a dict object, not to an array of tuples

{
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    }
}

Normally, a Map field in any language (Java, Scala, C#, etc.) serializes to JSON as an object, not as an array of JSON tuples/objects.
That's why I would expect pyarrow to work the same way: JSON object -> Python dict -> pyarrow map

Manually converting Python dicts into arrays of tuples is a very hard job for us and is not an option.

The optimal option would be to use the pyarrow schema to control which dict fields become a map and which become a struct.

For example:

tags_updated = {
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    },
    "user": {
        "id": "user-1",
        "country": "ES"
    }
}

tags_updated_schema = schema([
    field("id", string(), False),
    field("tags", map_(string(), string()), False),
    field("user", struct([
        field("id", string(), False),
        field("country", string(), False)
    ]), False)
])

Correct me if I'm wrong, but is a pyarrow struct the only type that a Python dict field can be converted to under a pyarrow schema?

Contributor

drin commented Sep 22, 2022

As far as I can tell, yes: only struct arrays accept dictionaries as elements; map arrays do not.

Just for completeness, though, I wanted to mention that you can use list(dict.items()) to get a list of tuples. I realize that I don't know how complex your actual structure is or how many of them you have, so this only applies if you have a stronger preference for map arrays than for struct arrays.

    # tags_ref is a generator, so maybe not too much performance impact?
    # could also consider wrapping the dict in a class that delegates all calls except when accessing 'tags'?
    # some_row = {'id': 1, 'tags': {'tag1': 'value1', 'tag2': 'value2'}}
    tags_ref = (
        {
             'id'  : some_row.get('id')
            ,'tags': list(some_row.get('tags').items())
        }
        for some_row in tags_updated
    )

    table = Table.from_pylist(mapping=list(tags_ref), schema=pyarrow_schema)

@jorisvandenbossche
Member

To boil down the issue to a minimal example, creating a MapArray from a python sequence currently requires a list of tuples:

arr = pa.array([[('a', 1), ('b', 2)], [('c', 3)]], pa.map_(pa.string(), pa.int64()))

I think it certainly makes sense for the following to work as well (using dicts directly, which would avoid converting the dicts to lists of tuples in advance):

arr = pa.array([{'a': 1, 'b': 2}, {'c': 3}], pa.map_(pa.string(), pa.int64()))

I opened https://issues.apache.org/jira/browse/ARROW-17832 to track this.

Contributor

drin commented Sep 23, 2022

I added an initial comment to give a quick starting point for anyone who can work on the issue. If it sits around for long enough, I can try to pick it up. If you have time to do it, @oleksandr-yatsuk, I can also try to help you out where useful.

@jorisvandenbossche
Member

In the meantime, this conversion of dicts to map type has been implemented:

and thus my minimal example above now works:

In [9]: arr = pa.array([{'a': 1, 'b': 2}, {'c': 3}], pa.map_(pa.string(), pa.int64()))

In [10]: arr
Out[10]: 
<pyarrow.lib.MapArray object at 0x7f8b4e93d8a0>
[
  keys:
  [
    "a",
    "b"
  ]
  values:
  [
    1,
    2
  ],
  keys:
  [
    "c"
  ]
  values:
  [
    3
  ]
]

So this issue can be closed now, I think.
