Decouple IPC-specific information from `Schema` #581

jorgecarleitao · 2021-11-06T20:59:36Z

Background

The schema declared in the Arrow IPC specification covers aspects related to sharing data over IPC. However, once the data is in memory, not all parts of that schema are necessary. Two notable examples are:

endianness information
dict_id

As a user of the crate, I do not care about the schema endianness, because my architecture has set it and the crate has already decoded values according to the incoming endianness (swapping bytes as needed) at the IPC boundary. Likewise, dict_id is used to keep track of dictionaries throughout the IPC file / stream, but is not necessary once the data has been loaded in memory.
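To make the point about endianness concrete, here is a minimal sketch of decoding at the IPC boundary. The `Endianness` enum and the `decode_i32` function are hypothetical names used only for illustration; they are not the crate's API. Once the values are decoded into native integers, the source endianness carries no further information for the in-memory data.

```rust
/// Endianness declared by the incoming IPC schema (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
}

/// Decodes a buffer of 4-byte integers written with `source` endianness
/// into native `i32` values. Trailing bytes that do not form a full
/// 4-byte group are ignored.
pub fn decode_i32(bytes: &[u8], source: Endianness) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|chunk| {
            let raw = [chunk[0], chunk[1], chunk[2], chunk[3]];
            // The byte swap (if any) happens here, once, at the boundary.
            match source {
                Endianness::Little => i32::from_le_bytes(raw),
                Endianness::Big => i32::from_be_bytes(raw),
            }
        })
        .collect()
}
```

After this step the caller only holds a `Vec<i32>`, so nothing in the in-memory schema needs to remember how the bytes were laid out in the file.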

The proof of this observation is that the C data interface has neither dict_id nor endianness; other mechanisms are used to ensure lossless roundtrips.

The existence of dict_id in Field makes it difficult to offer a good UX to use dictionaries effectively over the IPC. The gist is that dict_id is expected to be a number uniquely representing a dictionary array.

Currently, users are required to pass a dict_id for dictionary fields. However, setting this number correctly requires managing global state, as we need to track how many dictionaries were created and which ids were assigned to them. As a consequence, most people disregard that number, rendering the transfer of dictionaries via IPC useless and/or wrong, since setting the same id for two different dictionaries may lead to a "dictionary replacement" in the IPC.
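The failure mode can be illustrated with a deliberately simplified model of the reader side. `DictionaryStore` and its methods are hypothetical and only mimic the idea that dictionaries are looked up by id; they are not the crate's types. When two unrelated dictionaries are written with the same id, the second silently replaces the first, and indices that were meant for the first now resolve against the wrong values.

```rust
use std::collections::HashMap;

/// Simplified reader-side state: the dictionary values currently
/// registered under each dict_id (hypothetical type for this sketch).
#[derive(Default)]
pub struct DictionaryStore {
    dictionaries: HashMap<i64, Vec<String>>,
}

impl DictionaryStore {
    /// Registers a dictionary under `dict_id`. Returns `true` if an
    /// existing dictionary with the same id was replaced.
    pub fn register(&mut self, dict_id: i64, values: Vec<String>) -> bool {
        self.dictionaries.insert(dict_id, values).is_some()
    }

    /// Resolves an index of a dictionary-encoded column to its value.
    pub fn resolve(&self, dict_id: i64, index: usize) -> Option<&str> {
        self.dictionaries
            .get(&dict_id)?
            .get(index)
            .map(String::as_str)
    }
}
```

If a user registers the dictionary of column "a" and the dictionary of column "b" both under id 0, `resolve(0, i)` for column "a" returns values from column "b".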

dict_id is only used when writing to IPC, to build the relationships between dictionaries and their corresponding index arrays. Likewise, endianness information is only necessary to declare to the IPC which endianness the file has been written in.

This issue is also tied to why it is currently difficult to support nested Dictionaries (#499), and to some back and forth about what datatypes must contain to ensure a lossless roundtrip over the IPC (e.g. #501 and #439).

Goals

The goal of this issue is to investigate declaring a separate schema (IpcSchema, IpcDataType, IpcField) specifically for IPC, and only exposing this schema when handling the IPC format. Offer a From implementation that allows moving from Schema to IpcSchema and vice-versa. This would be very similar to what happens in parquet, where parquet has its own schema that we map to arrow. The core difference is that IpcSchema <-> Schema is lossless, since by design this crate adopts all and only the logical types declared in arrow.

This allows us to remove dict_id from the fields and, more importantly, makes it possible to write dictionary arrays effectively over the IPC boundary without user effort (by calculating the dict_id at the IPC boundary instead of leaving it to the user).
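A rough sketch of what such a split could look like is given below. All types here are simplified stand-ins defined for the example (a three-variant `DataType`, a `Field` without metadata, and so on), the field names `dictionary_id` and `is_little_endian` are assumptions, and nested dictionaries are not handled. The idea shown is only that the conversion into the IPC schema assigns dictionary ids and records the endianness, so neither has to live in the in-memory schema.

```rust
/// Simplified logical types (stand-in for the crate's `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    /// Dictionary-encoded values of the inner type.
    Dictionary(Box<DataType>),
}

/// In-memory field: no dict_id.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

/// In-memory schema: no endianness.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<Field>,
}

/// IPC-only view of a field, carrying the dictionary id (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct IpcField {
    pub field: Field,
    pub dictionary_id: Option<i64>,
}

/// IPC-only view of a schema, carrying the endianness (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct IpcSchema {
    pub fields: Vec<IpcField>,
    pub is_little_endian: bool,
}

impl From<Schema> for IpcSchema {
    fn from(schema: Schema) -> Self {
        // Ids are computed here, at the IPC boundary, not by the user.
        let mut next_id = 0i64;
        let fields = schema
            .fields
            .into_iter()
            .map(|field| {
                let dictionary_id = match &field.data_type {
                    DataType::Dictionary(_) => {
                        let id = next_id;
                        next_id += 1;
                        Some(id)
                    }
                    _ => None,
                };
                IpcField { field, dictionary_id }
            })
            .collect();
        IpcSchema {
            fields,
            // Data in memory is in native order, so declare that.
            is_little_endian: cfg!(target_endian = "little"),
        }
    }
}

impl From<IpcSchema> for Schema {
    fn from(ipc: IpcSchema) -> Self {
        // Dropping the IPC-only information loses nothing about the data itself.
        Schema {
            fields: ipc.fields.into_iter().map(|f| f.field).collect(),
        }
    }
}
```

With this shape, a schema with two dictionary fields gets distinct ids automatically, and converting back yields the original `Schema`.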

jorgecarleitao · 2021-12-28T06:23:53Z

Closed by #713

jorgecarleitao added the investigation label Nov 6, 2021

jorgecarleitao self-assigned this Nov 6, 2021

jorgecarleitao mentioned this issue Dec 26, 2021

Moved dict_id to IPC-specific IO #713

Merged

jorgecarleitao closed this as completed Dec 28, 2021

jorgecarleitao added the no-changelog label Dec 28, 2021
