Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: replace deprecated arrow::json::reader::Decoder #1226

Merged
merged 8 commits into from
May 12, 2023

Conversation

rtyler
Copy link
Member

@rtyler rtyler commented Mar 16, 2023

Description

This test will fail with the following error:

---- writer::record_batch::tests::test_divide_record_batch_with_map_single_partition stdout ----
thread 'writer::record_batch::tests::test_divide_record_batch_with_map_single_partition' panicked at 'not implemented: Take not supported for data type Map(Field { name: "key_value", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false)', /home/tyler/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-select-33.0.0/src/take.rs:234:14

As far as I can tell this should be supported behavior and a supported schema. Note that with no partition columns the write works just fine.

=# Related Issue(s)

apache/arrow-rs#3875

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Mar 16, 2023
@rtyler
Copy link
Member Author

rtyler commented Mar 16, 2023

The problem appears to be the take for batch_data in the divide_by_partition_values function. That kernel is not supported by Arrow, and I think the function can be fixed to avoid that for Map data but RecordBatches are sufficiently difficult to work with that I haven't yet landed on a more succinct test.

@wjones127
Copy link
Collaborator

I created a PR to add the kernel upstream.

@rtyler rtyler force-pushed the testing-map-types-with-partitions branch 2 times, most recently from 53a4fe2 to ddf70f1 Compare March 25, 2023 22:32
@rtyler
Copy link
Member Author

rtyler commented Mar 25, 2023

Now looking at these clippy errors 😒

@rtyler
Copy link
Member Author

rtyler commented Apr 1, 2023

This will be addressed by #1249 , so once that pull request is ready, this test should be merged into that branch

@github-actions
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@rtyler rtyler force-pushed the testing-map-types-with-partitions branch from e36a074 to 4c43eb9 Compare April 29, 2023 19:52
@rtyler rtyler changed the title Introduce a failing test for writing with map columns and partitions fix: replace deprecated arrow::json::reader::Decoder Apr 29, 2023
@rtyler rtyler force-pushed the testing-map-types-with-partitions branch 2 times, most recently from 3ebf6be to 12c02a8 Compare May 9, 2023 21:52
@github-actions github-actions bot added the binding/python Issues for the Python package label May 9, 2023
@rtyler rtyler force-pushed the testing-map-types-with-partitions branch 2 times, most recently from 3629d55 to 4265844 Compare May 9, 2023 23:14
@rtyler rtyler marked this pull request as ready for review May 11, 2023 01:06
@rtyler rtyler force-pushed the testing-map-types-with-partitions branch from 4c2ce43 to 4265844 Compare May 11, 2023 15:06
@@ -9,7 +11,7 @@ def test_table_schema():
table_path = "../rust/tests/data/simple_table"
dt = DeltaTable(table_path)
schema = dt.schema()
assert schema.json() == {
assert json.loads(schema.to_json()) == {
Copy link
Collaborator

@wjones127 wjones127 May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. This seems like an API-breaking change? What happened? nevermind I can't read. Looks good 👍

let mut buf = vec![];
for message in message_buffer {
buf.write_all(
serde_json::to_string(&message)
Copy link
Collaborator

@wjones127 wjones127 May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are making you re-serialize the messages?! Ouch...

Copy link
Collaborator

@wjones127 wjones127 May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider this API? https://docs.rs/arrow-json/39.0.0/arrow_json/reader/struct.Decoder.html#method.serialize

That's what Raphael built for the use case you were asking about, I think. It avoids the overhead of serializing the data to the buffer.

rtyler added 5 commits May 12, 2023 08:12
This commit incorporates the latest upstream error changes and ensures that a
RecordBatch which contains map data _with_ a partition can be segmented out
correctly.

This depends on the upstream work to support take() on MapArray that @wjones127
recently committed
@rtyler rtyler force-pushed the testing-map-types-with-partitions branch 2 times, most recently from 667dbea to cf881f5 Compare May 12, 2023 15:52
@rtyler rtyler force-pushed the testing-map-types-with-partitions branch from cf881f5 to aecd898 Compare May 12, 2023 15:56
Comment on lines 397 to 402
let mut buf = vec![];
for res in jsons {
let json = res?;
buf.write_all(serde_json::to_string(&json)?.as_bytes())?;
}
let mut consumed = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, right? we shouldn't need to serialize to json string if in the end we are trying to make a parquet file?

@rtyler rtyler merged commit e94128d into delta-io:main May 12, 2023
@rtyler rtyler deleted the testing-map-types-with-partitions branch May 12, 2023 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants