This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
Decouple IPC-specific information from Schema
#581
Labels
investigation
Issues or PRs that are investigations. PRs may or may not be merged.
no-changelog
Issues whose changes are covered by a PR and thus should not be shown in the changelog
Background
The schema declared in the Arrow IPC specification covers aspects of sharing data over IPC. Once the data is in memory, however, not all parts of that schema are necessary. Two important bits are:
endianness
dict_id
As a user of the crate, I do not care about the schema endianness, because my architecture has set it and the crate has already decoded the values according to the incoming endianness (swapping bytes as needed) at the IPC boundary. Likewise, dict_id is used to keep track of dictionaries throughout the IPC file / stream, but is not necessary once the data has been loaded in memory.
The proof of this observation is that the C data interface has neither dict_id nor endianness: other mechanisms are used to ensure lossless roundtrips.
The existence of dict_id in Field makes it difficult to offer a good UX for using dictionaries effectively over IPC. The gist is that dict_id is expected to be a number uniquely representing a dictionary array.
Currently, users are required to pass a dict_id for dictionary fields. However, setting this number correctly requires managing global state, as we need to track how many dictionaries were created and which ids were assigned to them. As a corollary, most people disregard that number, rendering the transfer of dictionaries via IPC useless and/or wrong, since setting the same id for two different dictionaries may lead to a "dictionary replacement" in the IPC.
dict_id is only used when writing to IPC, when building the relationships between dictionaries and their corresponding index arrays. Likewise, endianness information is only needed to declare to the IPC which endianness the file was written in.
This issue is also tied to why it is currently difficult to support nested dictionaries (#499), and to some back and forth over what datatypes must contain to ensure a lossless roundtrip over IPC (e.g. #501 and #439).
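To make the "dictionary replacement" hazard concrete, here is a minimal sketch of what can go wrong when two different dictionaries share a dict_id. It is a simplified model, not the crate's API: DictionaryBatch, read_dictionaries, decode and colliding_columns are hypothetical names, and the reader is assumed to track dictionaries in a map keyed by id.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a dictionary message in an IPC stream:
/// an id plus the dictionary values it carries.
struct DictionaryBatch {
    dict_id: i64,
    values: Vec<String>,
}

/// Simplified model of the reader side (an assumption, not the crate's code):
/// dictionaries are tracked by id, so a later batch with the same id
/// replaces the earlier one.
fn read_dictionaries(batches: Vec<DictionaryBatch>) -> HashMap<i64, Vec<String>> {
    let mut dictionaries = HashMap::new();
    for batch in batches {
        dictionaries.insert(batch.dict_id, batch.values);
    }
    dictionaries
}

/// Resolve dictionary keys against the dictionary registered under `dict_id`.
fn decode(dictionaries: &HashMap<i64, Vec<String>>, dict_id: i64, keys: &[usize]) -> Vec<String> {
    let values = &dictionaries[&dict_id];
    keys.iter().map(|&k| values[k].clone()).collect()
}

/// Two unrelated dictionary columns that were both given `dict_id = 0`,
/// as happens when users disregard the number.
fn colliding_columns() -> Vec<String> {
    let batches = vec![
        DictionaryBatch { dict_id: 0, values: vec!["red".into(), "green".into()] },
        DictionaryBatch { dict_id: 0, values: vec!["small".into(), "large".into()] },
    ];
    let dictionaries = read_dictionaries(batches);
    // These keys were written for the first (colour) column, but they now
    // resolve against the second dictionary that took over id 0.
    decode(&dictionaries, 0, &[0, 1, 0])
}
```

In this model the keys of the first column silently decode to values of the second dictionary, which is the kind of wrong result described above. Avoiding it requires that every distinct dictionary gets its own id, which is exactly the global bookkeeping the user is currently asked to do.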
Goals
The goal of this issue is to investigate declaring a separate schema (IpcSchema, IpcDataType, IpcField) specifically for IPC, and exposing this schema only when handling the IPC format. We would offer a From implementation that allows moving from Schema to IpcSchema and vice versa. This would be very similar to what happens in parquet, where parquet has its own schema that we map to arrow. The core difference is that IpcSchema <-> Schema is lossless, since by design this crate adopts all and only the logical types declared in arrow.
This allows us to remove dict_id from the fields and, more importantly, to write dictionary arrays effectively over the IPC boundary without user effort (by computing the dict_id at the IPC boundary instead of asking the user for it).
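As a rough illustration of the idea, the sketch below derives the IPC-only metadata from an in-memory schema and assigns each dict_id while doing so. Everything here is an assumption made for illustration: the DataType, Field and Schema types are heavily simplified stand-ins for the crate's own, the layout of IpcField and IpcSchema is hypothetical, and only the Schema to IpcSchema direction is shown.

```rust
/// Simplified in-memory types (stand-ins; the real crate has many more variants).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    /// Dictionary-encoded values of the inner type. Note: no `dict_id` here.
    Dictionary(Box<DataType>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

/// IPC-only metadata for one field (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct IpcField {
    /// Assigned at the IPC boundary; `None` for non-dictionary fields.
    dict_id: Option<i64>,
    /// IPC metadata of nested types, in the same order as the children.
    children: Vec<IpcField>,
}

#[derive(Debug, Clone, PartialEq)]
struct IpcSchema {
    fields: Vec<IpcField>,
    is_little_endian: bool,
}

/// Walk a data type depth-first, handing out a fresh id per dictionary.
fn assign(data_type: &DataType, next_id: &mut i64) -> IpcField {
    match data_type {
        DataType::Dictionary(values) => {
            let id = *next_id;
            *next_id += 1;
            IpcField { dict_id: Some(id), children: vec![assign(values, next_id)] }
        }
        DataType::Struct(fields) => IpcField {
            dict_id: None,
            children: fields.iter().map(|f| assign(&f.data_type, next_id)).collect(),
        },
        _ => IpcField { dict_id: None, children: vec![] },
    }
}

impl From<&Schema> for IpcSchema {
    fn from(schema: &Schema) -> Self {
        // The counter lives only for this conversion: no global state and
        // no user-supplied ids.
        let mut next_id = 0;
        let fields = schema
            .fields
            .iter()
            .map(|f| assign(&f.data_type, &mut next_id))
            .collect();
        // Endianness is declared from the writer's architecture.
        IpcSchema { fields, is_little_endian: cfg!(target_endian = "little") }
    }
}
```

Because the counter is local to a single conversion, ids are unique within one file or stream by construction, including for dictionaries nested inside other types. Whether the real design should keep the IPC metadata next to the schema or in a fully separate type is part of what this investigation would need to settle.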