Python dict to map<string, string> #14116
I believe the problem is with how you are providing the data. Here is a code snippet that I think does what you're looking for:

from pyarrow import Table
from pyarrow import int64, map_, schema, string, field

def test_map_type():
    # 2 column schema; a map array is a list of tuples
    pyarrow_schema = schema([
         field('id'  , int64()                 , False)
        ,field('tags', map_(string(), string()), False)
    ])

    # each row should have: <key count> <= <column count>
    first_row = {
         'id'  : 1
        ,'tags': [
             ('tag1', 'value1')
            ,('tag2', 'value2')
        ]
    }

    second_row = {
         'id'  : 2
        ,'tags': [
             ('tag1', 'value1')
            ,('tag2', 'value2')
        ]
    }

    tags_updated = [first_row, second_row]
    table = Table.from_pylist(mapping=tags_updated, schema=pyarrow_schema)
    print(table.to_pydict())

test_map_type()

output:

>> python test.py
{'id': [1, 2], 'tags': [[('tag1', 'value1'), ('tag2', 'value2')], [('tag1', 'value1'), ('tag2', 'value2')]]}
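From there, writing the resulting table out as a parquet file is straightforward; a minimal sketch (the output path is just a placeholder):

import pyarrow.parquet as pq

# persist the table, map column included, to a parquet file
pq.write_table(table, 'tags.parquet')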
@drin thank you for the response. That works as well.
I don't know very much about parquet, though I do understand Arrow. If I understand the question correctly, you want to know how to go from a python dict to a map<string, string> column. What I tried to show was that the map type expects each value as a list of key/value tuples, so the nested dict needs to be converted first:

# this dictionary represents a single row
tags_updated = {
    "id": 1,
    "tags": {
        "tag1": "value1",
        "tag2": "value2"
    }
}

# extract and convert the value for the "tags" column
tags_as_map = [
    (tag_key, tag_val)
    for tag_key, tag_val in tags_updated.get('tags', {}).items()
]

# then we replace the value
tags_updated['tags'] = tags_as_map

# then we can use "from_pylist" as usual
table = Table.from_pylist(mapping=[tags_updated], schema=pyarrow_schema)
If you're not actually asking about how to build the Arrow table itself, it depends on which direction you're going. If you're writing to parquet files, then I am probably not able to help much, since I don't know parquet that well and I'm not sure what the requirements are for saving a python dict that way. If you're reading from parquet files into Arrow tables, and JSON is what's stored in your parquet files, then I think the above should address the problem (though maybe not very fast). Arrow is able to read parquet files using the Datasets API, so I assume you're not trying to read parquet files into JSON and then use the JSON to construct Arrow tables?
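For reference, reading parquet files into an Arrow table with the Datasets API looks roughly like this (the path here is just a placeholder):

import pyarrow.dataset as ds

# point a dataset at a directory (or a single file) of parquet files
dataset = ds.dataset('data/', format='parquet')

# materialize it as an Arrow table, map columns included
table = dataset.to_table()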
@drin answering your question: we read JSON, validate it with a marshmallow-dataclass schema, and then store those JSONs as parquet files. The problem here is to get a pyarrow table; once we have it, it can be saved as a parquet file.

A Java DTO with a Map<String, String> field serializes to JSON as an object, and in python that object deserializes as a dict. Manually converting every such python dict into a list of tuples before building the table is the step we would like to avoid. The alternative would be to control, with the help of the pyarrow schema, which Arrow type each python value is converted to.

Correct me if I'm wrong, but the only option to create a column from python dicts is a struct type?
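To illustrate the distinction being discussed (the field names and values here are just illustrative):

import pyarrow as pa

# a struct column accepts python dicts directly, so dict-shaped JSON values
# convert naturally when the schema uses a struct type ...
struct_type = pa.struct([('tag1', pa.string()), ('tag2', pa.string())])
struct_array = pa.array([{'tag1': 'value1', 'tag2': 'value2'}], type=struct_type)

# ... whereas a map column expects each value as a list of (key, value) tuples
map_type = pa.map_(pa.string(), pa.string())
map_array = pa.array([[('tag1', 'value1'), ('tag2', 'value2')]], type=map_type)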
As far as I can tell, yes, only struct arrays take elements as a dictionary, not a map array. Just for completeness, though, I wanted to mention that you can use a generator expression for the conversion:

# tags_ref is a generator, so maybe not too much performance impact?
# could also consider wrapping the dict in a class that delegates all calls
# except when accessing 'tags'? (a sketch of that idea follows below)
# some_row = {'id': 1, 'tags': {'tag1': 'value1', 'tag2': 'value2'}}
tags_ref = (
    {
         'id'  : some_row.get('id')
        ,'tags': list(some_row.get('tags').items())
    }
    for some_row in tags_updated
)

table = Table.from_pylist(mapping=list(tags_ref), schema=pyarrow_schema)
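The wrapper idea mentioned in the comments could look something like the sketch below. This is rough and untested; the class name is made up, and it assumes from_pylist only needs mapping-style access (key lookup and membership tests) on each row:

from collections.abc import Mapping

class LazyMapRow(Mapping):
    """Delegate all lookups to the wrapped row, converting only 'tags' on access."""

    def __init__(self, row):
        self._row = row

    def __getitem__(self, key):
        value = self._row[key]
        if key == 'tags' and isinstance(value, dict):
            # convert the nested dict to the (key, value) tuples a map column expects
            return list(value.items())
        return value

    def __iter__(self):
        return iter(self._row)

    def __len__(self):
        return len(self._row)

table = Table.from_pylist(
    mapping=[LazyMapRow(row) for row in tags_updated],
    schema=pyarrow_schema,
)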
To boil down the issue to a minimal example, creating a MapArray from a python sequence currently requires a list of tuples:
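Something along these lines (the values are just illustrative):

import pyarrow as pa

# constructing a MapArray: each element is a list of (key, value) tuples
pa.array(
    [[('tag1', 'value1'), ('tag2', 'value2')]],
    type=pa.map_(pa.string(), pa.string()),
)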
While I think it certainly makes sense that the following could also work (using dicts instead, which would avoid the manual conversion step from dicts to list of tuples in advance):
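That is, something like the following, which is the form that fails before the change tracked in the ticket below:

import pyarrow as pa

# proposed: let each element be a plain python dict
pa.array(
    [{'tag1': 'value1', 'tag2': 'value2'}],
    type=pa.map_(pa.string(), pa.string()),
)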
I opened https://issues.apache.org/jira/browse/ARROW-17832 to track this.
I tried adding an initial comment to help give a quick starting point for anyone who can work on the issue. If it sits around for long enough, I can try and pick it up. If you have time to do it, @oleksandr-yatsuk, I can also try and help you out where useful.
In the meantime, this conversion of dicts to map type has been implemented, and thus my minimal example above now works:
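On a recent pyarrow (10.0 or later, as far as I can tell), the dict-based form now yields a MapArray instead of raising:

import pyarrow as pa

arr = pa.array(
    [{'tag1': 'value1', 'tag2': 'value2'}],
    type=pa.map_(pa.string(), pa.string()),
)
print(arr.type)  # map<string, string>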
So this issue can be closed now, I think.
Hello guys,
We are trying to write python dictionaries into a map<string, string> parquet column and are facing an issue converting them to a pyarrow.Table. A simple code snippet along the lines of the sketch below fails with an error.
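A sketch of the failing pattern (the field names and values here are illustrative, not the exact original snippet):

import pyarrow as pa
from pyarrow import Table

pyarrow_schema = pa.schema([
    pa.field('id', pa.int64()),
    pa.field('tags', pa.map_(pa.string(), pa.string())),
])

rows = [{'id': 1, 'tags': {'tag1': 'value1', 'tag2': 'value2'}}]

# on pyarrow 7.x / 9.x this raises a conversion error, because the map
# column does not accept plain python dicts as values
table = Table.from_pylist(mapping=rows, schema=pyarrow_schema)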
We are using pyarrow == 7.0.0, but it does not work with pyarrow == 9.0.0 either.