Minimal component-transform interface #1882

Open
emilk opened this issue Apr 17, 2023 · 7 comments
Labels
💬 discussion ⛃ re_datastore affects the datastore itself

Comments

@emilk
Member

emilk commented Apr 17, 2023

This would be a first step towards the bigger RFC: https://github.com/rerun-io/rerun/blob/main/design/component_datatypes.md

We would add something that transforms:

More design needed here

@Wumpf
Member

Wumpf commented Apr 17, 2023

Imho the original RFC has several somewhat separable layers/steps:

  1. converters, implementing some trait, exist to take a given component and convert it to another component
  2. converters are globally registered and can be used to direct conversions
  3. converters may be applied during query automatically (with either opt-in or opt-out), ignoring ambiguities for now (to be resolved later with semantics)
  4. conversions for converters that are marked as "heavy" are cached automatically
  5. conversions may go through several hops, choosing a minimal path
  6. components are "split" into semantic & datatype
  7. semantics are used to restrict what's logged on a single path (e.g. don't allow two kinds of box2 representations)
  8. semantics are used to graph out possible conversions (... starting to think that maybe semantics are irrelevant for conversions 🤔)

Adding the semantics concept is probably the hardest part of this!

Anyways, I think the only things needed as our first steps would be 1 through 4. If we keep doing custom non-systemized code for jpeg, we can get away with just (1), but what's the point then :)
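
To make steps 1, 2 and 5 of the list above more concrete, here is a minimal sketch. It is not the RFC's design: `Cell`, `Converter`, `FnConverter` and `Registry` are hypothetical names, the payload is a plain byte vector rather than an arrow array, and caching of "heavy" converters (step 4) is only represented by a flag.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical type-erased value; a real cell would wrap an arrow array.
#[derive(Clone, Debug, PartialEq)]
pub struct Cell {
    pub component: &'static str,
    pub bytes: Vec<u8>,
}

/// Step 1: a converter takes one component and produces another.
pub trait Converter {
    fn source(&self) -> &'static str;
    fn target(&self) -> &'static str;
    /// Step 4 (not implemented here): "heavy" results would be cached.
    fn is_heavy(&self) -> bool {
        false
    }
    fn convert(&self, input: &Cell) -> Cell;
}

/// Simple converter backed by a function pointer, for illustration only.
pub struct FnConverter {
    pub source: &'static str,
    pub target: &'static str,
    pub heavy: bool,
    pub f: fn(&[u8]) -> Vec<u8>,
}

impl Converter for FnConverter {
    fn source(&self) -> &'static str {
        self.source
    }
    fn target(&self) -> &'static str {
        self.target
    }
    fn is_heavy(&self) -> bool {
        self.heavy
    }
    fn convert(&self, input: &Cell) -> Cell {
        Cell { component: self.target, bytes: (self.f)(&input.bytes) }
    }
}

/// Step 2: converters are registered in one place.
#[derive(Default)]
pub struct Registry {
    converters: Vec<Box<dyn Converter>>,
}

impl Registry {
    pub fn register(&mut self, converter: Box<dyn Converter>) {
        self.converters.push(converter);
    }

    /// Step 5: breadth-first search for the chain with the fewest hops.
    /// Returns indices into the registration order.
    pub fn shortest_path(&self, from: &str, to: &str) -> Option<Vec<usize>> {
        // Maps a reached component to the converter that first reached it.
        let mut reached_via: HashMap<&str, usize> = HashMap::new();
        let mut queue = VecDeque::from([from]);
        while let Some(current) = queue.pop_front() {
            if current == to {
                // Walk back to the start; `from` is never stored as a key.
                let mut path = Vec::new();
                let mut node = current;
                while let Some(&idx) = reached_via.get(node) {
                    path.push(idx);
                    node = self.converters[idx].source();
                }
                path.reverse();
                return Some(path);
            }
            for (idx, c) in self.converters.iter().enumerate() {
                let next = c.target();
                if c.source() == current && next != from && !reached_via.contains_key(next) {
                    reached_via.insert(next, idx);
                    queue.push_back(next);
                }
            }
        }
        None
    }

    /// Applies the shortest chain of converters, if one exists.
    pub fn convert(&self, cell: &Cell, to: &str) -> Option<Cell> {
        let path = self.shortest_path(cell.component, to)?;
        Some(path.iter().fold(cell.clone(), |acc, &i| self.converters[i].convert(&acc)))
    }
}
```

Ambiguities (two equally short paths) are resolved here by registration order, which is exactly the kind of thing step 3 says to ignore for now.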


Proposal for an easy & non-invasive way: Every component is required to have a phantom type field which refers to a semantic trait which itself is never implemented. This way semantic information is accessible but incurs no runtime cost and doesn't change the way we store things.
For most component types the relationship between semantic and data is 1:1, e.g. a jpeg is always a tensor. For types where several semantics are possible this solution gets trickier. E.g. a 4x4 matrix may refer to a host of different things. In this model we need to either define several Mat4 components for different semantic uses or we provide a
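
The phantom-field idea can be sketched roughly as below. This is only one reading of the proposal: the marker types `WorldFromCamera` and `Projection`, the `Semantic` trait with its `NAME` constant, and the generic `Mat4<S>` are all hypothetical, and the markers are uninhabited enums so they can never be constructed at runtime.

```rust
use std::marker::PhantomData;

/// Hypothetical semantic marker trait; it carries a name and nothing else.
pub trait Semantic {
    const NAME: &'static str;
}

/// Uninhabited marker types: they exist only at the type level.
pub enum WorldFromCamera {}
pub enum Projection {}

impl Semantic for WorldFromCamera {
    const NAME: &'static str = "world_from_camera";
}
impl Semantic for Projection {
    const NAME: &'static str = "projection";
}

/// The stored data is just the matrix; the semantic is a zero-sized field.
pub struct Mat4<S: Semantic> {
    pub columns: [[f32; 4]; 4],
    _semantic: PhantomData<S>,
}

impl<S: Semantic> Mat4<S> {
    pub fn new(columns: [[f32; 4]; 4]) -> Self {
        Self { columns, _semantic: PhantomData }
    }

    /// The semantic is available without being stored per value.
    pub fn semantic_name(&self) -> &'static str {
        S::NAME
    }
}
```

Because `PhantomData<S>` is zero-sized, `Mat4<Projection>` occupies the same 64 bytes as the bare `[[f32; 4]; 4]`, which matches the "no runtime cost" claim for this particular layout.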

@jleibs
Member

jleibs commented Apr 17, 2023

My thought is to approach this by doing some form of 1, 3, and part of 4.

Rename the current Component trait to LogDataType. This will be the trait implemented by what we currently call Components. Each LogDataType will map to a single (though not necessarily unique) arrow DataType. This will be the new definition of what a DataCell contains, which is the role currently filled by Component.

Now, re-introduce a new Component trait. Each implementation of Component should be a struct that represents the preferred way of using the data within Rerun. In most cases this will just be a wrapper struct around a specific LogDataType.

Each DataCell will also store the ComponentName -- this will need to be provided at log-time and book-kept through the serialization. The component name will go into the arrow-table metadata for the column, though columns will continue to be named after the LogDataType. (Note: multiple columns may have the same ComponentName.)

Buckets in the data-store, however, should continue to be indexed by component. They will no longer be single-typed, but I believe this is technically ok with the new DataCell architecture (at least until we bring back compaction? @teh-cmc to verify).

Now, Component needs to implement a function that takes in a DataCell (i.e. type-erased LogDataType) and implements the conversion, returning the destination Component. This implementation would simply be manual for all non-trivial Components right now but could be replaced with a lookup in the registration-system in the future.

All of the query logic stays the same, but the DataCells returned by a query will all have the same component, with possibly varying LogDataTypes. The implementation of to_native and related functions on the DataCell now dispatches to the Component's implementation of the converter, so the returned Component always ends up with the correct type.

This basically gives us the (component, datatype) tuple for our logged data as described by the rfc, and a hand-rolled way of mapping (component, X) -> (component, Y) at query-time. It just punts on all the dynamic / registration / lookup stuff.
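
A rough sketch of that shape, under heavy assumptions: `Payload` stands in for "a type-erased LogDataType" as a closed enum instead of an arrow array, `Rect2D` is a made-up component, and `from_cell` is a hypothetical name for the hand-rolled conversion function. Only the dispatch from `to_native` to the `Component` implementation is meant to mirror the description above.

```rust
/// Hypothetical stand-in for a type-erased LogDataType payload.
#[derive(Clone, Debug, PartialEq)]
pub enum Payload {
    /// A rectangle logged as [x, y, width, height].
    Xywh([f32; 4]),
    /// A rectangle logged as [min_x, min_y, max_x, max_y].
    MinMax([f32; 4]),
}

/// A cell carries its payload plus the component name given at log time.
#[derive(Clone, Debug, PartialEq)]
pub struct DataCell {
    pub component_name: &'static str,
    pub payload: Payload,
}

/// The re-introduced Component trait: the preferred in-viewer representation.
pub trait Component: Sized {
    const NAME: &'static str;
    /// Hand-rolled mapping (component, X) -> (component, Y).
    fn from_cell(cell: &DataCell) -> Option<Self>;
}

#[derive(Debug, PartialEq)]
pub struct Rect2D {
    pub min: [f32; 2],
    pub max: [f32; 2],
}

impl Component for Rect2D {
    const NAME: &'static str = "rect2d";

    fn from_cell(cell: &DataCell) -> Option<Self> {
        match cell.payload {
            Payload::Xywh([x, y, w, h]) => Some(Rect2D { min: [x, y], max: [x + w, y + h] }),
            Payload::MinMax([x0, y0, x1, y1]) => Some(Rect2D { min: [x0, y0], max: [x1, y1] }),
        }
    }
}

impl DataCell {
    /// Dispatches to the Component's converter, whatever datatype was logged.
    pub fn to_native<C: Component>(&self) -> Option<C> {
        if self.component_name != C::NAME {
            return None;
        }
        C::from_cell(self)
    }
}
```

A query returning cells with both payload variants would yield the same `Rect2D` type from each, which is the "(component, datatype) tuple" behavior without any registration or lookup machinery.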

@teh-cmc
Member

teh-cmc commented Apr 18, 2023

> Buckets in the data-store, however, should continue to be indexed by component. They will no longer be single-typed, but I believe this is technically ok with the new DataCell architecture (at least until we bring back compaction? @teh-cmc to verify).

Single columns containing cells of distinct datatypes will definitely be challenging (assuming we don't rely on native arrow unions obviously), or at the very least full of surprises... but I think that's feasible:

  • When the data is loaded in memory (i.e. it's a DataTable), having heterogeneous columns isn't an issue: individual cells can point to whatever datatype they want and reference a slice in a larger array somewhere that has the same type.
  • When the data is getting in and out of the system (i.e. it's serialized as a batch in an arrow Chunk) that would obviously fail, but that's fine: either it just means that a change in datatype has to mark the end of the current batch and the start of the next or it's as you just said: different datatypes yield different columns when in transit.

Compaction and serialization/batching are one and the same these days, so same conclusion.
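
The first option for data in transit (a change in datatype ends the current batch) could look roughly like this. `DataType`, `Cell` and `split_into_batches` are hypothetical and heavily simplified; a real implementation would operate on arrow arrays rather than a slice of small structs.

```rust
/// Hypothetical stand-in for an arrow datatype.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DataType {
    A,
    B,
}

/// Hypothetical cell: only the datatype matters for batching.
#[derive(Clone, Debug, PartialEq)]
pub struct Cell {
    pub datatype: DataType,
    pub row: u64,
}

/// Splits one heterogeneous column into maximal runs of a single datatype,
/// so that each run can be serialized as a homogeneous batch.
pub fn split_into_batches(cells: &[Cell]) -> Vec<&[Cell]> {
    let mut batches = Vec::new();
    let mut start = 0;
    for i in 1..=cells.len() {
        // Close the current run at the end of input or on a datatype change.
        if i == cells.len() || cells[i].datatype != cells[start].datatype {
            batches.push(&cells[start..i]);
            start = i;
        }
    }
    batches
}
```

Note that a column alternating between two datatypes degenerates into one batch per cell under this scheme, which is one of the surprises the alternative (different datatypes yield different columns in transit) would avoid.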

@Wumpf
Member

Wumpf commented Apr 18, 2023

> Note: multiple columns may have the same ComponentName.

I'm probably not following entirely, but wouldn't that mean that we again allow several representations of the same thing on a path? E.g. two ways of representing a box. I thought we wanted to make this a runtime error.

I'm not sure I like the idea of a single preferred data representation. Different parts of the application may have different requirements. On the other hand, it does have advantages, as we can keep the number of conversions down and predictable.

Sidenote: This entire discussion cuts very deep into the data/semantic separation, which I would have liked to punt on as it is obviously quite involved.

@jleibs
Member

jleibs commented Apr 18, 2023

> I thought we wanted to make this a runtime error.

Only if we have to?

@Wumpf
Member

Wumpf commented Apr 18, 2023

What's the alternative? I mean, what do we render/return when there are two conflicting definitions of the same thing? Something that might happen easily with transforms, for example.
But sure, I guess it doesn't need to be an error: we could pick the first and make it a warning. But some feedback/detection is surely required.
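
The "pick the first and warn" policy might be sketched as follows. `pick_first` and its signature are hypothetical; what counts as "first" (here simply slice order) is an assumption, since the thread does not define it.

```rust
/// Returns the first candidate representation for an entity path, plus a
/// warning message if more than one representation was logged.
/// Each candidate is a (datatype name, value) pair in some agreed order.
pub fn pick_first<'a, T>(
    entity_path: &str,
    candidates: &'a [(&str, T)],
) -> (Option<&'a T>, Option<String>) {
    let first = candidates.first().map(|(_, value)| value);
    let warning = if candidates.len() > 1 {
        let names: Vec<&str> = candidates.iter().map(|(name, _)| *name).collect();
        Some(format!(
            "{entity_path}: conflicting representations {names:?}, using {}",
            names[0]
        ))
    } else {
        None
    };
    (first, warning)
}
```

The caller decides how to surface the warning (log line, UI badge), so detection stays separate from presentation.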

@Wumpf
Member

Wumpf commented Aug 11, 2023

Different representations of Rect highlight that the current proposal of component-datatype-conversion doesn't cover all cases well enough:
A Box2D is now going to be represented as a HalfExtend2D component plus (optionally) an Origin.

This means that representing a Rect with min/max would require changing the semantics of several components, i.e. it becomes a transform of the archetype and not merely of individual components.
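
To illustrate why this is an archetype-level transform, here is a small sketch in which a single min/max value fans out into two components. The struct layouts are hypothetical, and treating the Origin as the box center is an assumption made only for this example.

```rust
/// Hypothetical component: half the box size along each axis.
#[derive(Debug, PartialEq)]
pub struct HalfExtend2D(pub [f32; 2]);

/// Hypothetical component: assumed here to be the box center.
#[derive(Debug, PartialEq)]
pub struct Origin(pub [f32; 2]);

/// Alternative logged representation of the same rectangle.
#[derive(Debug, PartialEq)]
pub struct RectMinMax {
    pub min: [f32; 2],
    pub max: [f32; 2],
}

/// One input value produces two output components, so this cannot be
/// expressed as a converter from one component to one other component.
pub fn box_from_min_max(rect: &RectMinMax) -> (HalfExtend2D, Origin) {
    let half = [
        (rect.max[0] - rect.min[0]) * 0.5,
        (rect.max[1] - rect.min[1]) * 0.5,
    ];
    let center = [
        (rect.max[0] + rect.min[0]) * 0.5,
        (rect.max[1] + rect.min[1]) * 0.5,
    ];
    (HalfExtend2D(half), Origin(center))
}
```

The return type is the point: a per-component converter has one output, while this mapping has to populate two components of the same archetype at once.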
