This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Consider removing RecordBatch #673

Closed
jorgecarleitao opened this issue Dec 11, 2021 · 11 comments · Fixed by #717
Labels
help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged. no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@jorgecarleitao
Owner

jorgecarleitao commented Dec 11, 2021

For historical reasons, we have RecordBatch. RecordBatch represents a collection of columns with a schema.

I see a couple of problems with RecordBatch:

  1. It mixes metadata (`Schema`) with data (`Array`). In all our IO cases, the `Schema` is known as soon as the file's metadata is read, well before any data is read. I.e. the user has access to the `Schema` very early and does not really need the iterator or stream of data to carry the metadata. However, our APIs force this, because they currently return a `RecordBatch` (and thus need a schema on it) even though all the schemas are the same.

  2. It is not part of the Arrow spec. A `RecordBatch` is only mentioned in the IPC format, and there it does not contain a schema (only columns).

  3. It is a struct that users who need it can easily recreate.

  4. It indirectly drives design decisions to use it as the data carrier, even though it is not a good one. For example, in DataFusion (apache/arrow-datafusion) the physical nodes return a stream of `RecordBatch`, which requires piping schemas all the way down to the physical nodes so that they can in turn use them to create a `RecordBatch`. This could be replaced by `Vec<Arc<dyn Array>>`, or by even more exotic carriers (e.g. an enum with scalar and vector variants).
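The "enum with scalar and vector variants" in point 4 could look like the following sketch. Everything here (`ColumnarValue`, `Int64Array`, and the minimal `Array` trait) is a hypothetical stand-in for illustration, not an arrow2 or DataFusion API:

```rust
use std::sync::Arc;

// Hypothetical minimal stand-in for an arrow array trait.
trait Array {
    fn len(&self) -> usize;
}

struct Int64Array(Vec<i64>);
impl Array for Int64Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// A schema-less data carrier: either a scalar (one value logically
// broadcast to every row of the batch) or a materialized column.
enum ColumnarValue {
    Scalar(i64),
    Array(Arc<dyn Array>),
}

impl ColumnarValue {
    // Number of rows this value represents within a batch of `batch_rows`.
    fn num_rows(&self, batch_rows: usize) -> usize {
        match self {
            ColumnarValue::Scalar(_) => batch_rows,
            ColumnarValue::Array(a) => a.len(),
        }
    }
}
```

The point of the sketch is that no `Schema` travels with the data; operators that need the metadata already received it once, out of band.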

@jorgecarleitao jorgecarleitao added help wanted Extra attention is needed investigation Issues or PRs that are investigations. Prs may or may not be merged. labels Dec 11, 2021
@houqp
Collaborator

houqp commented Dec 11, 2021

👍 datafusion is already considering rolling its own enum-based record batch abstraction. I also think it's a waste to clone and pass the same schema over and over again throughout the code base.

@jorgecarleitao
Owner Author

@sundy-li @ritchie46, do any of you use the schemas on each of the batches coming from arrow2?

@ritchie46
Collaborator

Nothing we cannot refactor. I think it's a good idea. 👍

@sundy-li
Collaborator

I do agree it's better to remove the schema inside the batch.

So there will be a better name, `type Chunk = Vec<Arc<dyn Array>>`?

@ritchie46
Collaborator

> So there will be a better name, `type Chunk = Vec<Arc<dyn Array>>`?

ChunkedArrays are vertical in pyarrow and polars, so that might be confusing.

`ArrayGroup`?

@sundy-li
Collaborator

sundy-li commented Dec 12, 2021

That's ok. Chunk comes from a well-known database naming style, but arrow2 can still have its own name for it.

A Chunk is a list of columns with the same length:

TIDB: https://github.com/pingcap/tidb/blob/master/util/chunk/chunk.go#L36-L50

ClickHouse: https://github.com/ClickHouse/ClickHouse/blob/3c348a2998079ec0908d76fc35095223f362f7ad/src/Processors/Chunk.h#L18-L34

@ritchie46
Collaborator

Yeah.. maybe it's also generic enough to not be confusing. :)

@houqp
Collaborator

houqp commented Dec 14, 2021

Chunk sounds like a good name. I also think `Vec<Arc<dyn Array>>` is quite readable by itself :P

@jorgecarleitao
Owner Author

One reason I can think of to introduce a struct for this would be to validate that all arrays have the same length when the struct is created (and to document that invariant on the struct), but it is a bit weak xD
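That invariant check could be as small as the following sketch. The `Chunk` struct, `Int32Array`, and the minimal `Array` trait are hypothetical stand-ins for illustration, not arrow2's actual types:

```rust
use std::sync::Arc;

// Hypothetical minimal stand-in for an arrow array trait.
trait Array {
    fn len(&self) -> usize;
}

struct Int32Array(Vec<i32>);
impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// A schema-less collection of equal-length columns.
struct Chunk(Vec<Arc<dyn Array>>);

impl Chunk {
    // Validates the invariant (all arrays share one length) at construction.
    fn try_new(arrays: Vec<Arc<dyn Array>>) -> Result<Self, String> {
        if let Some(first) = arrays.first() {
            let len = first.len();
            if arrays.iter().any(|a| a.len() != len) {
                return Err("all arrays in a Chunk must have the same length".to_string());
            }
        }
        Ok(Chunk(arrays))
    }

    // Number of rows (0 for a chunk with no columns).
    fn len(&self) -> usize {
        self.0.first().map(|a| a.len()).unwrap_or(0)
    }
}
```

Compared with a bare `type Chunk = Vec<Arc<dyn Array>>` alias, the newtype can refuse mismatched lengths at construction instead of letting them surface later in a kernel.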

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jan 14, 2022
@multimeric

One downside of no longer having a `RecordBatch` is that it makes it harder to implement conversion traits for DataFrames/Tables. I.e. if we had one, each library could implement `From<MyTable> for RecordBatch` and vice versa, and then we could use the FFI to convert between them. But as it is, there is no struct to hook these conversions onto.

@jorgecarleitao
Owner Author

I just realized that importing an array via the C data interface only requires the array's datatype; everything else is unused.

In this context, the field `Field::new("", array.data_type().clone(), false)` is sufficient for a consumer to correctly read the array. I created #854 based on this. Let me know what you think.
