Send dataframe API #7204

Open
Famok opened this issue Aug 15, 2024 · 2 comments
Labels
enhancement New feature or request 🪵 Log & send APIs Affects the user-facing API for all languages

Comments

@Famok

Famok commented Aug 15, 2024

Describe the solution you'd like
I'd like to send a whole dataframe (e.g. pandas and/or Arrow) in a single call. All columns (e.g. x, y, z) share the same timeline, and most often the index holds the time, either in microseconds, in seconds, or as a pd.TimedeltaIndex.
Something like this would be great:

send_dataframe(
    base_entity_path="mydataframe",
    timeline="mytimeline",
    data=df,
    time_column="index",  # Union[None, str]; None would always select the index
    columns=["x", "y"],   # Union[None, List[str]]; None would select all columns
)
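
The time handling mentioned above (an index in microseconds, in seconds, or as timedeltas) could be normalized before sending. The following is a minimal sketch only: `index_to_seconds` and its `unit` parameter are hypothetical names, not part of the Rerun SDK, and it assumes plain Python numbers or `datetime.timedelta` values (`pd.Timedelta` subclasses `datetime.timedelta`, so iterating a `pd.TimedeltaIndex` should yield compatible values).

```python
from datetime import timedelta
from typing import List, Sequence, Union

TimeValue = Union[int, float, timedelta]


def index_to_seconds(index: Sequence[TimeValue], unit: str = "s") -> List[float]:
    """Convert a time index to float seconds (hypothetical helper).

    `unit` applies to plain numbers only: "s" for seconds, "us" for microseconds.
    timedelta values carry their own unit and ignore `unit`.
    """
    divisors = {"s": 1, "us": 1_000_000}
    if unit not in divisors:
        raise ValueError(f"unsupported unit: {unit!r}")
    divisor = divisors[unit]

    seconds = []
    for t in index:
        if isinstance(t, timedelta):
            seconds.append(t.total_seconds())
        else:
            seconds.append(t / divisor)
    return seconds


print(index_to_seconds([0, 500_000, 1_000_000], unit="us"))
print(index_to_seconds([timedelta(milliseconds=250), timedelta(seconds=1)]))
```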

Describe alternatives you've considered
Sending each column in a separate call. This works, but might generate more overhead than necessary.

@Famok Famok added enhancement New feature or request 👀 needs triage This issue needs to be triaged by the Rerun team labels Aug 15, 2024
@abey79
Member

abey79 commented Aug 15, 2024

If I understand correctly, your proposed API would result in the following data being logged:

  • entity mydataframe/x with index timestamps and a component with df["x"] as content,
  • entity mydataframe/y with index timestamps and a component with df["y"] as content,

both on the mytimeline timeline.

Is that correct?
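
The interpretation above can be sketched without touching the SDK. This is an illustration under assumptions: `plan_sub_entities` is a hypothetical helper, the table is modeled as a plain dict of columns rather than a real dataframe, and nothing is actually logged. It only lists which entity path would receive which timestamps and values.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Plan = List[Tuple[str, List[float], List[float]]]


def plan_sub_entities(
    base_entity_path: str,
    index: Sequence[float],
    columns: Dict[str, Sequence[float]],
    selected: Optional[List[str]] = None,  # None selects all columns
) -> Plan:
    """Return one (entity_path, timestamps, values) triple per selected column."""
    names = list(columns) if selected is None else selected
    plan: Plan = []
    for name in names:
        values = list(columns[name])
        if len(values) != len(index):
            raise ValueError(f"column {name!r} does not match the index length")
        # Each column becomes its own sub-entity below the base path.
        plan.append((f"{base_entity_path}/{name}", list(index), values))
    return plan


plan = plan_sub_entities(
    "mydataframe",
    index=[0.0, 0.1, 0.2],
    columns={"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "z": [7.0, 8.0, 9.0]},
    selected=["x", "y"],
)
for entity_path, times, values in plan:
    print(entity_path, times, values)
```

Each triple in the plan would correspond to one separate send call for one entity.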

In general, a dataframe-based API is a very good fit for our new columnar stuff. I see at least two points here:

  • If the send_dataframe API ends up logging to multiple "sub-entities" (as I think you suggest here), there would be little performance gain w.r.t. separate send_columns calls. Chunks (our new fundamental data structure) always apply to a single entity, so multiple chunks would need to be emitted here in any case. (This is not to say that a convenience API wouldn't be useful.)
  • If the send_dataframe API logs columns to a single entity, but as different components, then we'd need to figure out a mapping from Python-side column dtype/label to component type (with the restriction that each component of a single entity must have a unique type). In particular, your example seems ambiguous as to what component type should be used.
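
To make the uniqueness restriction in the second point concrete, here is a small sketch. It is an assumption-laden illustration: `check_single_entity_mapping` is a hypothetical helper, and the component type names are used as plain strings for illustration only, not as a lookup into the SDK.

```python
from collections import Counter
from typing import Dict


def check_single_entity_mapping(column_to_component: Dict[str, str]) -> None:
    """Raise if two columns map to the same component type on one entity."""
    counts = Counter(column_to_component.values())
    clashes = sorted(name for name, n in counts.items() if n > 1)
    if clashes:
        raise ValueError(f"component types used more than once: {clashes}")


# Two plain numeric columns would both want the same type: ambiguous.
ambiguous = {"x": "Scalar", "y": "Scalar"}
try:
    check_single_entity_mapping(ambiguous)
    accepted = True
except ValueError as err:
    accepted = False
    print(err)

# Distinct types per column satisfy the restriction.
check_single_entity_mapping({"x": "Scalar", "y": "Radius"})
```

With columns like x and y that share a dtype, some explicit user-provided mapping would be needed, which is the ambiguity noted above.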

@Famok
Author

Famok commented Aug 17, 2024

Creating sub-entities seems to be the easiest way.

I can't see how the second option would work, since I don't know enough about the inner workings of Rerun.

But maybe there is a third option, if there were a dataframe entity type? Or is that against the design principles?

@Wumpf Wumpf added 🪵 Log & send APIs Affects the user-facing API for all languages and removed 👀 needs triage This issue needs to be triaged by the Rerun team labels Oct 11, 2024
3 participants