-
Notifications
You must be signed in to change notification settings - Fork 851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow Row Format #2677
Comments
Hello, Would this help with processing row/record-based data, where the number of columns (that is, the structure of the row/record) vary depending on the type of row/record in the file. Example: An ndjson file where rows can be of |
This will still not allow operations on data with multiple schema, same as with RecordBatch, if that is what you're asking for? That being said, in the case of rows with different variants, nulls will be inserted by the JSON reader for the columns found in other variants but not present in the current record. The data is effectively unified to a single schema. This trades off memory efficiency for the ability to efficiently process data in a columnar fashion, without per-value dynamic dispatch. This is likely acceptable, however, partitioning the data so that different schema aren't interleaved will lead to better performance and memory efficiency. The row format could help with implementing this, but does not alter the nature of arrow schema |
Thanks for clarifying. As you say, I'll have to make a separate stream for each row-type. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I think this crate has pretty good stories for operating on individual columns, either by downcasting to a concrete type, or invoking a
dyn
kernel.The stories for multi-column operations are substantially weaker, with patchy support for common multi-column operations such as sorts, groupings, aggregations, reassembly, etc... We have some pieces such as
MutableArrayData
,DynComparator
, but they're not especially performant, making extensive use of dynamic dispatch at the row-level, nor easy to use.Describe the solution you'd like
Having a first-class row representation will not only allow us to implement more performant versions of existing kernels such as lexsort, but also provide a pretty compelling primitive to downstreams with which to implement more advanced operations such as streaming merges, joins, aggregates, etc... There is also precedent, with the C++ arrow library providing its own row format.
Goals
Non-Goals
Describe alternatives you've considered
We could extend the row format in DataFusion, however, this would limit its benefits to DataFusion. I think a row-oriented representation is such a fundamental primitive that it makes sense for inclusion in arrow-rs, so that it can be both used in its kernels and by downstreams that don't make use of DataFusion.
Additional context
The text was updated successfully, but these errors were encountered: