Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW2: support avro #1061

Closed
houqp opened this issue Sep 28, 2021 · 12 comments
Closed

ARROW2: support avro #1061

houqp opened this issue Sep 28, 2021 · 12 comments
Labels
bug Something isn't working
Milestone

Comments

@houqp
Copy link
Member

houqp commented Sep 28, 2021

Avro integration needs to be fixed.

@houqp houqp added the bug Something isn't working label Sep 28, 2021
@houqp houqp added this to the arrow2 milestone Sep 28, 2021
@Igosuki
Copy link
Contributor

Igosuki commented Oct 2, 2021

Is this about the dict issue or is it something else ?

@houqp
Copy link
Member Author

houqp commented Oct 2, 2021

This is something else, we need to port your avro implementation to https://github.com/jorgecarleitao/arrow2/tree/main/src/io/avro for #68 ;)

@houqp
Copy link
Member Author

houqp commented Oct 2, 2021

@Igosuki if you are interested in taking on this work, please coordinate with @jorgecarleitao .

@Igosuki
Copy link
Contributor

Igosuki commented Oct 5, 2021

@houqp will do

@jorgecarleitao
Copy link
Member

The approach I took in arrow2 was to not use AvroValue and work directly with the byte stream. The reason for this is that AvroValue takes bytes by value and not by reference, which implies that when we perform the conversion File -> AvroValue -> Arrow for a Utf8 or Binary, we end up with the transformation bytes -> AvroValue::String(String) -> Utf8Array, performing an extra allocation per item.

Since the avro format is relatively simple to read from, I just implemented a reader from bytes directly to arrow. I did not implemented it for all types, only for the basic ones, but the idea stands.

So, if I understood, the goal is to generalize the parser to more types. Which ones are needed here?

@houqp
Copy link
Member Author

houqp commented Oct 6, 2021

I think struct and list are the main ones: https://github.com/apache/arrow-datafusion/blob/5cc4e9f53fab29e81ea7c98baac8ce277a0cb54a/datafusion/src/avro_to_arrow/arrow_array_reader.rs#L530

@Igosuki
Copy link
Contributor

Igosuki commented Oct 6, 2021 via email

@jorgecarleitao
Copy link
Member

as parquet a was meant to be a columnar version of Avro

ahhh, now it makes sense why they are so similar! Thanks a lot for sharing that.

I've implemented the remaining types here: jorgecarleitao/arrow2#493 . We are missing Struct and Decimal.

@Igosuki
Copy link
Contributor

Igosuki commented Oct 9, 2021

@jorgecarleitao Nice one, seems like you got the encoding part almost rounded up

@houqp
Copy link
Member Author

houqp commented Jan 14, 2022

@Igosuki I believe you have completed the migration for all avro related code right? Is there any left over for this issue?

@Igosuki
Copy link
Contributor

Igosuki commented Jan 14, 2022

No, besides the patch I mentioned which enables arrays to be null https://github.com/jorgecarleitao/arrow2/blob/main/src/io/avro/read/deserialize.rs#L101

@houqp
Copy link
Member Author

houqp commented Jan 24, 2022

Thanks @Igosuki for all your work on Avro :)

@houqp houqp closed this as completed Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants