Minimal component-transform interface #1882

emilk · 2023-04-17T12:23:17Z

This would be a first step towards the bigger RFC: https://github.com/rerun-io/rerun/blob/main/design/component_datatypes.md

We would add something that transforms:

Transforms (quaternions, Accept quaternions in both wxyz and xyzw format #883)
Rectangles (all our different formats)
Jpegs

More design needed here

Wumpf · 2023-04-17T13:07:17Z

Imho the original RFC has several somewhat separable layers/steps:

converters, implementing some trait, exist to take a given component and convert it to another component
converters are globally registered and can be used to direct conversions
converters may be applied during query automatically (with either opt-in or opt-out), ignoring ambiguities for now (to be resolved later with semantics)
conversions by converters that are marked as "heavy" are cached automatically
conversions may go through several hops, choosing a minimal path
components are "split" into semantic & datatype
semantics are used to restrict what's logged on a single path (e.g. don't allow two kinds of box2 representations)
semantics are used to graph out possible conversions (... starting to think that maybe semantics are irrelevant for conversions 🤔)
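The converter, registration, and minimal-path points above could be sketched roughly as follows. This is a minimal illustration, not Rerun's API: `Value`, `ConverterRegistry`, the datatype strings, and both converter functions are hypothetical names, and a breadth-first search is assumed as the way to pick the chain with the fewest hops.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical type-erased value; a real system would carry arrow data.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    RectXywh([f32; 4]),       // x, y, width, height
    RectMinMax([f32; 4]),     // min_x, min_y, max_x, max_y
    RectCenterHalf([f32; 4]), // center_x, center_y, half_w, half_h
}

impl Value {
    pub fn datatype(&self) -> &'static str {
        match self {
            Value::RectXywh(_) => "rect_xywh",
            Value::RectMinMax(_) => "rect_min_max",
            Value::RectCenterHalf(_) => "rect_center_half",
        }
    }
}

/// A converter returns `None` if it is handed the wrong datatype.
pub type ConvertFn = fn(&Value) -> Option<Value>;

/// Global-registry stand-in: a directed graph of datatype -> datatype edges.
#[derive(Default)]
pub struct ConverterRegistry {
    edges: HashMap<&'static str, Vec<(&'static str, ConvertFn)>>,
}

impl ConverterRegistry {
    pub fn register(&mut self, from: &'static str, to: &'static str, f: ConvertFn) {
        self.edges.entry(from).or_default().push((to, f));
    }

    /// Breadth-first search, so the first chain found has the fewest hops.
    pub fn path(&self, from: &'static str, to: &'static str) -> Option<Vec<ConvertFn>> {
        let mut queue: VecDeque<(&'static str, Vec<ConvertFn>)> = VecDeque::new();
        queue.push_back((from, Vec::new()));
        let mut seen: HashSet<&'static str> = HashSet::from([from]);
        while let Some((node, chain)) = queue.pop_front() {
            if node == to {
                return Some(chain);
            }
            for (next, f) in self.edges.get(node).into_iter().flatten() {
                if seen.insert(*next) {
                    let mut longer = chain.clone();
                    longer.push(*f);
                    queue.push_back((*next, longer));
                }
            }
        }
        None
    }

    /// Apply the shortest chain of converters, if one exists.
    pub fn convert(&self, value: &Value, to: &'static str) -> Option<Value> {
        let chain = self.path(value.datatype(), to)?;
        chain.iter().try_fold(value.clone(), |v, f| f(&v))
    }
}

pub fn xywh_to_min_max(v: &Value) -> Option<Value> {
    match v {
        Value::RectXywh([x, y, w, h]) => Some(Value::RectMinMax([*x, *y, x + w, y + h])),
        _ => None,
    }
}

pub fn min_max_to_center_half(v: &Value) -> Option<Value> {
    match v {
        Value::RectMinMax([x0, y0, x1, y1]) => Some(Value::RectCenterHalf([
            (x0 + x1) / 2.0,
            (y0 + y1) / 2.0,
            (x1 - x0) / 2.0,
            (y1 - y0) / 2.0,
        ])),
        _ => None,
    }
}
```

With only the two converters registered, asking for `rect_center_half` from a `rect_xywh` value goes through two hops. Caching of "heavy" conversions and automatic application at query time are left out of this sketch.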

Adding the semantics concept is probably the hardest part of this!

Anyways, I think the only thing needed as our first steps would be 1 through 4. If we keep doing custom non-systemized code for jpeg, we can get away with just (1), but what's the point then :)

Proposal for an easy & non-invasive way: Every component is required to have a phantom type field which refers to a semantic trait which itself is never implemented. This way semantic information is accessible but incurs no runtime cost and doesn't change the way we store things.
For most component types the relationship between semantic and data is 1:1 - e.g. a jpeg is always a tensor. For types where several semantics are possible this solution gets trickier. E.g. a 4x4 matrix may refer to a host of different things. In this model we need to either define several Mat4 components for different semantic use or we provide a
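One possible reading of the phantom-type proposal in this comment, as a sketch only: the semantic is a zero-sized type parameter on the component. The marker names `WorldFromObject` and `ClipFromView`, the `Mat4<S>` struct, and the use of uninhabited enums instead of unimplemented traits are all assumptions made for illustration.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Hypothetical semantic markers. They are uninhabited, so no value of
/// these types can ever exist; they live purely at the type level.
pub enum WorldFromObject {}
pub enum ClipFromView {}

/// Same 4x4 storage for every semantic; `S` only tags the meaning.
pub struct Mat4<S> {
    pub cols: [[f32; 4]; 4],
    _semantic: PhantomData<S>,
}

impl<S> Mat4<S> {
    pub fn new(cols: [[f32; 4]; 4]) -> Self {
        Self { cols, _semantic: PhantomData }
    }

    /// Reinterpreting the data under another semantic must be explicit.
    pub fn with_semantic<T>(self) -> Mat4<T> {
        Mat4::new(self.cols)
    }
}

/// Only accepts matrices tagged as object-to-world transforms.
pub fn translation(m: &Mat4<WorldFromObject>) -> [f32; 3] {
    [m.cols[3][0], m.cols[3][1], m.cols[3][2]]
}

/// Extra bytes the semantic tag adds on top of the raw matrix.
pub fn overhead_bytes() -> usize {
    size_of::<Mat4<WorldFromObject>>() - size_of::<[[f32; 4]; 4]>()
}
```

Passing a `Mat4<ClipFromView>` to `translation` is a compile error, while the stored bytes are identical for both semantics. This covers the "several Mat4 components for different semantic use" option with a single generic struct.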

jleibs · 2023-04-17T19:57:13Z

My thought is to approach this by doing some form of 1, 3, and part of 4.

Rename the current Component trait to LogDataType. This will be the trait implemented by what we currently call Components. Each LogDataType will map to a single (though not necessarily unique) arrow DataType. This will be the new definition of what a DataCell contains, which is the role currently filled by Component.

Now, re-introduce a new Component trait. Each implementation of Component should be a struct that represents the preferred way of using the data within Rerun. In most cases this will just be a wrapper struct around a specific LogDataType.

Each DataCell will also store the ComponentName -- this will need to be provided at log-time and book-kept through the serialization. The component name will go into the arrow-table metadata for the column, though columns will continue to be named after the LogDataType. (Note: multiple columns may have the same ComponentName.)

Buckets in the data-store, however, should continue to be indexed by component. They will no longer be single-typed, but I believe this is technically ok with the new DataCell architecture (at least until we bring back compaction? @teh-cmc to verify).

Now, Component needs to implement a function that takes in a DataCell (i.e. type-erased LogDataType) and implements the conversion, returning the destination Component. This implementation would simply be manual for all non-trivial Components right now but could be replaced with a lookup in the registration-system in the future.

All of the query logic stays the same, but the DataCells returned for the query will all have the same component, with possibly varying LogDataTypes. The implementation of to_native and related functions on the DataCell now dispatches to the Component's implementation of the converter, so the returned Component always ends up with the correct type.

This basically gives us the (component, datatype) tuple for our logged data as described by the rfc, and a hand-rolled way of mapping (component, X) -> (component, Y) at query-time. It just punts on all the dynamic / registration / lookup stuff.
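A rough sketch of this proposal, under stated assumptions: `CellData` stands in for the type-erased LogDataType payload (a real DataCell would hold arrow data), and `Rotation3D`, `from_cell`, and the `"rotation3d"` name are hypothetical. Only `Component`, `DataCell`, and `to_native` are names taken from the comment, and their signatures here are invented.

```rust
/// Stand-in for the type-erased LogDataType payload of a cell.
#[derive(Clone, Debug, PartialEq)]
pub enum CellData {
    QuatXyzw([f32; 4]),
    QuatWxyz([f32; 4]),
}

/// Each cell carries its component name next to the (varying) datatype.
pub struct DataCell {
    pub component_name: &'static str,
    pub data: CellData,
}

/// The re-introduced `Component` trait: the preferred in-app representation,
/// with a hand-written conversion from every datatype it accepts.
pub trait Component: Sized {
    const NAME: &'static str;
    fn from_cell(cell: &DataCell) -> Option<Self>;
}

/// Hypothetical component whose preferred layout is xyzw.
#[derive(Debug, PartialEq)]
pub struct Rotation3D {
    pub xyzw: [f32; 4],
}

impl Component for Rotation3D {
    const NAME: &'static str = "rotation3d";

    fn from_cell(cell: &DataCell) -> Option<Self> {
        if cell.component_name != Self::NAME {
            return None;
        }
        // Manual mapping (component, X) -> (component, Y) at query time.
        Some(match &cell.data {
            CellData::QuatXyzw(q) => Rotation3D { xyzw: *q },
            CellData::QuatWxyz([w, x, y, z]) => Rotation3D { xyzw: [*x, *y, *z, *w] },
        })
    }
}

impl DataCell {
    /// Dispatches to the component's own converter.
    pub fn to_native<C: Component>(&self) -> Option<C> {
        C::from_cell(self)
    }
}
```

Two cells logged with the same component name but different quaternion layouts both come back as the same `Rotation3D`, which is the behavior the comment describes for `to_native`.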

teh-cmc · 2023-04-18T07:47:51Z

> Buckets in the data-store, however, should continue to be indexed by component. They will no longer be single-typed, but I believe this is technically ok with the new DataCell architecture (at least until we bring back compaction? @teh-cmc to verify).

Single columns containing cells of distinct datatypes will definitely be challenging (assuming we don't rely on native arrow unions obviously), or at the very least full of surprises... but I think that's feasible:

When the data is loaded in memory (i.e. it's a DataTable), having heterogeneous columns isn't an issue: individual cells can point to whatever datatype they want and reference a slice in a larger array somewhere that has the same type.
When the data is getting in and out of the system (i.e. it's serialized as a batch in an arrow Chunk), that would obviously fail, but that's fine: either a change in datatype marks the end of the current batch and the start of the next, or, as you just said, different datatypes yield different columns when in transit.

Compaction and serialization/batching are one and the same these days, so same conclusion.
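The first option (a datatype change ends the current batch) can be sketched as below. `Cell` and `split_into_batches` are hypothetical names, and the datatype is reduced to a string tag; this is an illustration of the batching rule, not the datastore's actual code.

```rust
/// Hypothetical cell: a datatype tag plus an opaque payload.
#[derive(Clone, Debug, PartialEq)]
pub struct Cell {
    pub datatype: &'static str,
    pub payload: Vec<u8>,
}

/// Split a heterogeneous column into runs of consecutive cells that share
/// a datatype, so each run can be serialized as one homogeneous batch.
pub fn split_into_batches(column: &[Cell]) -> Vec<&[Cell]> {
    let mut batches = Vec::new();
    let mut start = 0;
    for i in 1..=column.len() {
        // Close the current run at the end of the column or on a type change.
        if i == column.len() || column[i].datatype != column[start].datatype {
            batches.push(&column[start..i]);
            start = i;
        }
    }
    batches
}
```

Note that cell order is preserved, so a datatype that reappears later starts a new batch rather than being merged into an earlier one.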

Wumpf · 2023-04-18T08:42:29Z

> Note: multiple columns may have the same ComponentName.

I'm probably not following entirely, but wouldn't that mean that we again allow several representations of the same thing on a path? E.g. two ways of representing a box. I thought we wanted to make this a runtime error.

I'm not sure I like the idea of a single preferred data representation. Different parts of the application may have different requirements. On the other hand, it does have advantages, as we can keep the number of conversions down and predictable.

Sidenote: This entire discussion cuts very deep into the data/semantic separation, which I would have liked to punt on since it is obviously a big topic.

jleibs · 2023-04-18T11:00:58Z

> I thought we wanted to make this a runtime error.

Only if we have to?

Wumpf · 2023-04-18T12:16:55Z

What's the alternative? I mean, what do we render/return when there are two conflicting definitions of the same thing? That is something that might easily happen with transforms, for example.
But sure, I guess it doesn't need to be an error: we could pick the first and make it a warning. Some feedback/detection is surely required, though.
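The "pick the first and make it a warning" option might look like this. It is a minimal sketch: `Resolved`, `resolve_representations`, and the representation names passed in are hypothetical, and "first" is assumed to mean first in the order the representations were logged.

```rust
/// Outcome of resolving the representations logged on one path.
#[derive(Debug, PartialEq)]
pub struct Resolved<'a> {
    pub chosen: Option<&'a str>,
    pub warning: Option<String>,
}

/// Pick the first logged representation; warn instead of failing when
/// more than one representation of the same thing is present.
pub fn resolve_representations<'a>(path: &str, logged: &[&'a str]) -> Resolved<'a> {
    let chosen = logged.first().copied();
    let warning = (logged.len() > 1).then(|| {
        format!(
            "{path}: {} conflicting representations logged, using {:?}",
            logged.len(),
            logged[0]
        )
    });
    Resolved { chosen, warning }
}
```

The caller decides what to do with the warning (log it, surface it in the UI), which keeps the detection separate from the policy.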

Wumpf · 2023-08-11T11:51:49Z

Different representations of Rect highlight that the current proposal of component-datatype-conversion doesn't cover all cases well enough:
A Box2D is now going to be represented as a HalfExtend2D component plus (optionally) an Origin.

This means that representing a Rect with min/max would require changing the semantics of several components. I.e. it becomes a transform of the archetype and not merely of components.
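To make the mismatch concrete, here is a small sketch of a min/max rectangle converted into the two-component form. The struct names follow the spelling in the comment, `RectMinMax` and `Origin2D` are hypothetical, and the origin is assumed to be the box center, which the thread does not state.

```rust
/// Hypothetical single-component rectangle in min/max form.
pub struct RectMinMax {
    pub min: [f32; 2],
    pub max: [f32; 2],
}

#[derive(Debug, PartialEq)]
pub struct HalfExtend2D(pub [f32; 2]);

/// Assumption: the origin is the center of the box.
#[derive(Debug, PartialEq)]
pub struct Origin2D(pub [f32; 2]);

/// One input component yields two output components, so this does not fit
/// a converter that maps a single component to a single component.
pub fn min_max_to_box(rect: &RectMinMax) -> (HalfExtend2D, Origin2D) {
    let half = [
        (rect.max[0] - rect.min[0]) / 2.0,
        (rect.max[1] - rect.min[1]) / 2.0,
    ];
    let center = [
        (rect.min[0] + rect.max[0]) / 2.0,
        (rect.min[1] + rect.max[1]) / 2.0,
    ];
    (HalfExtend2D(half), Origin2D(center))
}
```

The return type is the point: the conversion produces a group of components, which is why it reads as an archetype-level transform.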

emilk added the 💬 discussion label Apr 17, 2023

emilk added the ⛃ re_datastore affects the datastore itself label Apr 18, 2023

teh-cmc mentioned this issue Apr 18, 2023

re_datastore: schema registry / runtime payload validation #447

Closed

Wumpf mentioned this issue Apr 22, 2023

Support affine transformations (allow scaling!) #1956

Closed

jleibs mentioned this issue Sep 27, 2023

New component/archetype for encoded images #3494

Open

emilk mentioned this issue Oct 7, 2024

Convert components to arrow early in all log APIs #7620

Closed

emilk mentioned this issue Nov 4, 2024

Eager/early serialize of components to arrow in Rust and C++ #7245

Closed

