This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Consider removing RecordBatch #673

Closed
jorgecarleitao opened this issue Dec 11, 2021 · 11 comments · Fixed by #717
Labels
help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged. no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@jorgecarleitao
Owner

jorgecarleitao commented Dec 11, 2021

For historical reasons, we have RecordBatch. RecordBatch represents a collection of columns with a schema.

I see a couple of problems with RecordBatch:

  1. It mixes metadata (`Schema`) with data (`Array`). In all our IO cases, the `Schema` is known as soon as the file's metadata is read, well before any data is read. I.e. the user has access to the `Schema` very early and does not really need the iterator or stream of data to carry the metadata. However, our APIs force this, because they currently return a `RecordBatch` (and thus need a schema on it) even though all the schemas are the same.

  2. It is not part of the Arrow spec. A `RecordBatch` is only mentioned in the IPC format, and there it does not contain a schema (only columns).

  3. It is a struct that users who need it can easily recreate.

  4. It indirectly drives design decisions to use it as the data carrier, even though it is not a good one. For example, in DataFusion (apache/arrow-datafusion) the physical nodes return a stream of `RecordBatch`, which requires piping schemas all the way down to the physical nodes so that they can in turn use them to create a `RecordBatch`. This could be replaced by `Vec<Arc<dyn Array>>`, or by even more exotic carriers (e.g. an enum with scalar and vector variants).
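The "enum with scalar and vector variants" in point 4 could look like the following sketch. Everything here (`ColumnarValue`, `Int64Array`, and the minimal `Array` trait) is a hypothetical stand-in for illustration, not an arrow2 or DataFusion API:

```rust
use std::sync::Arc;

// Hypothetical minimal stand-in for an arrow array trait.
trait Array {
    fn len(&self) -> usize;
}

struct Int64Array(Vec<i64>);
impl Array for Int64Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// A schema-less data carrier: either a scalar (one value logically
// broadcast to every row of the batch) or a materialized column.
enum ColumnarValue {
    Scalar(i64),
    Array(Arc<dyn Array>),
}

impl ColumnarValue {
    // Number of rows this value represents within a batch of `batch_rows`.
    fn num_rows(&self, batch_rows: usize) -> usize {
        match self {
            ColumnarValue::Scalar(_) => batch_rows,
            ColumnarValue::Array(a) => a.len(),
        }
    }
}
```

The point of the sketch is that no `Schema` travels with the data; operators that need the metadata already received it once, out of band.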

@jorgecarleitao jorgecarleitao added help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged. labels Dec 11, 2021
@houqp
Collaborator

houqp commented Dec 11, 2021

👍 datafusion is already considering rolling its own enum-based record batch abstraction. I also think it's a waste to clone and pass the same schema over and over again throughout the code base.

@jorgecarleitao
Owner Author

@sundy-li @ritchie46, do any of you use the schemas on each of the batches coming from arrow2?

@ritchie46
Collaborator

Nothing we cannot refactor. I think it's a good idea. 👍

@sundy-li
Collaborator

I do agree it's better to remove the schema inside the batch.

So there will be a better name, `type Chunk = Vec<Arc<dyn Array>>`?

@ritchie46
Collaborator

> So there will be a better name, `type Chunk = Vec<Arc<dyn Array>>`?

ChunkedArrays are vertical in pyarrow and polars, so that might be confusing.

`ArrayGroup`?

@sundy-li
Collaborator

sundy-li commented Dec 12, 2021

That's ok. Chunk comes from a well-known database naming style, but arrow2 can still have its own name for it.

A Chunk is a list of columns with the same length:

TIDB: https://github.com/pingcap/tidb/blob/master/util/chunk/chunk.go#L36-L50

ClickHouse: https://github.com/ClickHouse/ClickHouse/blob/3c348a2998079ec0908d76fc35095223f362f7ad/src/Processors/Chunk.h#L18-L34

@ritchie46
Collaborator

Yeah.. maybe it's also generic enough to not be confusing. :)

@houqp
Collaborator

houqp commented Dec 14, 2021

Chunk sounds like a good name. I also think `Vec<Arc<dyn Array>>` is quite readable by itself :P

@jorgecarleitao
Owner Author

One reason I can think of to introduce a struct for this would be to validate that all arrays have the same length when the struct is created (and to document that invariant on the struct), but it is a bit weak xD
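That invariant check could be as small as the following sketch. The `Chunk` struct, `Int32Array`, and the minimal `Array` trait are hypothetical stand-ins for illustration, not arrow2's actual types:

```rust
use std::sync::Arc;

// Hypothetical minimal stand-in for an arrow array trait.
trait Array {
    fn len(&self) -> usize;
}

struct Int32Array(Vec<i32>);
impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// A schema-less collection of equal-length columns.
struct Chunk(Vec<Arc<dyn Array>>);

impl Chunk {
    // Validates the invariant (all arrays share one length) at construction.
    fn try_new(arrays: Vec<Arc<dyn Array>>) -> Result<Self, String> {
        if let Some(first) = arrays.first() {
            let len = first.len();
            if arrays.iter().any(|a| a.len() != len) {
                return Err("all arrays in a Chunk must have the same length".to_string());
            }
        }
        Ok(Chunk(arrays))
    }

    // Number of rows (0 for a chunk with no columns).
    fn len(&self) -> usize {
        self.0.first().map(|a| a.len()).unwrap_or(0)
    }
}
```

Compared with a bare `type Chunk = Vec<Arc<dyn Array>>` alias, the newtype can refuse mismatched lengths at construction instead of letting them surface later in a kernel.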

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jan 14, 2022
@multimeric

One downside of no longer having a `RecordBatch` is that it makes it harder to implement conversion traits for DataFrames/Tables. I.e. if we had one, each library could implement `From<MyTable> for RecordBatch` and vice versa, and then we could use the FFI to convert between them. But as it is, there is no struct to hook these conversions onto.

@jorgecarleitao
Owner Author

I just realized that importing an array via the C data interface only requires the array's datatype; everything else is unused.

In this context, the field `Field::new("", array.data_type().clone(), false)` is sufficient for a consumer to correctly read the array. I created #854 based on this. Let me know what you think.
