Consider removing RecordBatch
#673
Comments
👍 datafusion is already considering rolling its own enum-based record batch abstraction. I also think it's a waste to clone and pass the same schema over and over again throughout the code base.
@sundy-li @ritchie46, do either of you use the schemas on each of the batches coming from arrow2?
Nothing we cannot refactor. I think it's a good idea. 👍
I do agree it's better to remove the schema inside the batch. So would there be a better name, `ArrayGroup`?
That's ok; in TiDB, a Chunk is a list of columns with the same length: https://github.com/pingcap/tidb/blob/master/util/chunk/chunk.go#L36-L50
Yeah... maybe it's also generic enough to not be confusing. :)
Chunk sounds like a good name. I also think
A reason I can think of for introducing a struct for this would be to validate that all arrays have the same length when the struct is created (and to document the struct's invariant), but it is a bit weak xD
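The invariant mentioned above can be sketched as follows. This is not arrow2's actual API; the `Array` trait and `Int32Array` below are simplified stand-ins for illustration, showing only a constructor that checks that all columns have the same length:

```rust
use std::sync::Arc;

// Simplified stand-in for arrow2's `Array` trait; only the length matters here.
trait Array {
    fn len(&self) -> usize;
}

struct Int32Array(Vec<i32>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// A schema-less collection of equal-length columns.
/// The constructor enforces the invariant so downstream code can rely on it.
struct Chunk {
    arrays: Vec<Arc<dyn Array>>,
}

impl Chunk {
    fn try_new(arrays: Vec<Arc<dyn Array>>) -> Result<Self, String> {
        if let Some(first) = arrays.first() {
            let len = first.len();
            if arrays.iter().any(|a| a.len() != len) {
                return Err("all arrays in a Chunk must have the same length".into());
            }
        }
        Ok(Chunk { arrays })
    }

    /// Number of rows; zero for an empty chunk.
    fn len(&self) -> usize {
        self.arrays.first().map_or(0, |a| a.len())
    }
}

fn main() {
    let a: Arc<dyn Array> = Arc::new(Int32Array(vec![1, 2, 3]));
    let b: Arc<dyn Array> = Arc::new(Int32Array(vec![4, 5, 6]));
    let chunk = Chunk::try_new(vec![a.clone(), b]).unwrap();
    assert_eq!(chunk.len(), 3);

    // Mismatched lengths are rejected at construction time.
    let short: Arc<dyn Array> = Arc::new(Int32Array(vec![7]));
    assert!(Chunk::try_new(vec![a, short]).is_err());
    println!("ok");
}
```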
One downside of no longer having a
I just realized that importing an array via the C data interface only requires the array's datatype; everything else is unused. In this context, the field
For historical reasons, we have `RecordBatch`. A `RecordBatch` represents a collection of columns with a schema. I see a couple of problems with `RecordBatch`:

- it mixes metadata (`Schema`) with data (`Array`). In all IO cases we have, the `Schema` is known when the metadata from the file is read, way before data is read. I.e. the user has access to the `Schema` very early, and does not really need to pass it to an iterator or stream of data for the stream to contain the metadata. However, it is required to do so by our APIs, because our APIs currently return a `RecordBatch` (and thus need a schema on them) even though all the schemas are the same.
- it is not part of the arrow spec. A RecordBatch is only mentioned in the IPC, and it does not contain a schema (only columns).
- it is a struct that can easily be recreated by users that need it.
- it indirectly drives design decisions to use it as the data carrier, even though it is not a good one. For example, in DataFusion (apache/arrow-datafusion) the physical nodes return a stream of `RecordBatch`, which requires piping schemas all the way to the physical nodes so that they can in turn use them to create a `RecordBatch`. This could have been replaced by `Vec<Arc<dyn Array>>`, or even more exotic carriers (e.g. an enum with scalar and vector variants).