This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Decouple IPC-specific information from Schema #581

Closed
jorgecarleitao opened this issue Nov 6, 2021 · 1 comment
Assignees
Labels
investigation Issues or PRs that are investigations. Prs may or may not be merged. no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

jorgecarleitao commented Nov 6, 2021

Background

The schema declared in the Arrow IPC specification covers aspects of sharing data over IPC. Once the data is in memory, however, not all parts of that schema are necessary. Two notable examples are:

  • endianness information
  • dict_id

As a user of the crate, I do not care about the schema endianness: my architecture has set it, and the crate has already decoded the values according to the incoming endianness (swapping bytes as needed) at the IPC boundary. Likewise, dict_id is used to keep track of dictionaries throughout the IPC file / stream, but is not necessary once the data has been loaded in memory.
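
To illustrate the byte swapping mentioned above, here is a minimal sketch assuming a buffer of 32-bit integers. The helper `decode_i32` and its `is_little_endian` flag are hypothetical names used only for illustration, not the crate's API.

```rust
/// Hypothetical helper (not part of the crate): decodes a buffer of `i32`
/// values that was written with the declared endianness into native values.
/// Trailing bytes that do not form a full 4-byte value are ignored.
fn decode_i32(bytes: &[u8], is_little_endian: bool) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|chunk| {
            let raw = [chunk[0], chunk[1], chunk[2], chunk[3]];
            if is_little_endian {
                i32::from_le_bytes(raw)
            } else {
                // Bytes are reordered here, at the boundary, so nothing
                // downstream needs to know how the data was written.
                i32::from_be_bytes(raw)
            }
        })
        .collect()
}
```

After this step the values are plain native integers, which is why the in-memory schema would have no further use for the endianness flag.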

Evidence for this observation is that the C data interface has neither dict_id nor endianness; other mechanisms are used to ensure lossless roundtrips.

The existence of dict_id in Field makes it difficult to offer a good UX to use dictionaries effectively over the IPC. The gist is that dict_id is expected to be a number uniquely representing a dictionary array.

Currently, users are required to pass a dict_id for dictionary fields. However, setting this number correctly requires managing global state, as we need to track how many dictionaries were created and which ids were assigned to them. As a consequence, most people disregard that number, rendering the transfer of dictionaries via IPC useless and/or wrong, since setting the same id for two different dictionaries may lead to a "dictionary replacement" in the IPC.
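
The replacement problem can be sketched with a toy model of the state a reader keeps. `DictionaryTracker` and its methods are hypothetical and heavily simplified (string dictionaries only); they are not the crate's implementation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for reader-side state: dictionary values keyed by
/// the `dict_id` under which they arrived.
#[derive(Default)]
struct DictionaryTracker {
    dictionaries: HashMap<i64, Vec<String>>,
}

impl DictionaryTracker {
    /// Registers a dictionary batch. Returns `true` when a dictionary with
    /// the same id already existed and has now been replaced.
    fn on_dictionary_batch(&mut self, dict_id: i64, values: Vec<String>) -> bool {
        self.dictionaries.insert(dict_id, values).is_some()
    }

    /// Resolves an index of a dictionary-encoded array to its value.
    fn resolve(&self, dict_id: i64, index: usize) -> Option<&str> {
        self.dictionaries
            .get(&dict_id)?
            .get(index)
            .map(String::as_str)
    }
}
```

If two unrelated dictionaries, say colors and animals, are both written under id 0, the second silently replaces the first, and index 0 of the color column then resolves to an animal.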

dict_id is only used when writing to IPC, when building the relationships between dictionaries and their corresponding index arrays. Likewise, endianness information is only necessary to declare to the IPC which endianness the file has been written in.

This issue is also tied to why it is currently difficult to support nested Dictionaries (#499) and to some back and forth about what datatypes must contain to ensure a lossless roundtrip over the IPC (e.g. #501 and #439).

Goals

The goal of this issue is to investigate declaring a separate schema (IpcSchema, IpcDataType, IpcField) specifically for IPC, and only exposing this schema when handling the IPC format. We would offer a From implementation that allows moving from Schema to IpcSchema and vice-versa. This would be very similar to what happens in parquet, where parquet has its own schema that we map to arrow. The core difference is that IpcSchema <-> Schema is lossless since, by design, this crate adopts all and only the logical types declared in arrow.

This allows us to remove dict_id from the fields and, more importantly, to write dictionary arrays effectively over the IPC boundary without user effort (by computing the dict_id at the IPC boundary instead of asking the user for it).
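
A rough sketch of how the goals above could fit together, under heavy simplifying assumptions: the `DataType`, `Field` and `Schema` below are reduced stand-ins for the crate's types, the member names of `IpcField` and `IpcSchema` are hypothetical, `IpcDataType` is omitted, and nested dictionaries are not handled.

```rust
/// Simplified stand-ins for the in-memory types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    /// Dictionary-encoded values of the inner type.
    Dictionary(Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

/// Hypothetical IPC-only view of a field: the field plus its dictionary id.
#[derive(Debug, Clone, PartialEq)]
struct IpcField {
    field: Field,
    dict_id: Option<i64>,
}

/// Hypothetical IPC-only schema, carrying what only matters on the wire.
#[derive(Debug, Clone, PartialEq)]
struct IpcSchema {
    fields: Vec<IpcField>,
    is_little_endian: bool,
}

impl From<Schema> for IpcSchema {
    fn from(schema: Schema) -> Self {
        // Ids are assigned here, at the IPC boundary, from a local counter,
        // so the user never has to pick or track them.
        let mut next_id = 0i64;
        let fields: Vec<IpcField> = schema
            .fields
            .into_iter()
            .map(|field| {
                let dict_id = if matches!(field.data_type, DataType::Dictionary(_)) {
                    let id = next_id;
                    next_id += 1;
                    Some(id)
                } else {
                    None
                };
                IpcField { field, dict_id }
            })
            .collect();
        IpcSchema {
            fields,
            is_little_endian: cfg!(target_endian = "little"),
        }
    }
}

impl From<IpcSchema> for Schema {
    fn from(ipc: IpcSchema) -> Self {
        // The IPC-specific information is dropped once the data is in memory.
        Schema {
            fields: ipc.fields.into_iter().map(|f| f.field).collect(),
        }
    }
}
```

In this sketch, Schema -> IpcSchema -> Schema returns the original schema, and every dictionary field receives a distinct id without any input from the user.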

@jorgecarleitao jorgecarleitao added the investigation Issues or PRs that are investigations. Prs may or may not be merged. label Nov 6, 2021
@jorgecarleitao jorgecarleitao self-assigned this Nov 6, 2021
@jorgecarleitao
Owner Author

Closed by #713

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Dec 28, 2021