Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame.collect() should be extensible #62

Closed
andygrove opened this issue Apr 25, 2021 · 2 comments
Closed

DataFrame.collect() should be extensible #62

andygrove opened this issue Apr 25, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@andygrove
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Ballista provides its own execution context but uses the DataFusion DataFrame. Calling collect on the DataFrame will run the query in-memory rather than distributed and Ballista users must instead extract the logical plan from the DataFrame and call BallistaContext.collect instead. This is not good UX.

Describe the solution you'd like
As a user, I would just like to call DataFrame.collect() and have it run either in-memory or distributed depending on how I created the context.

I think the way to do this is by making it possible to customize ExecutionContext and override the behavior when a DataFrame is collected.

Describe alternatives you've considered
None

Additional context
None

@andygrove andygrove added the enhancement New feature or request label Apr 25, 2021
@jorgecarleitao
Copy link
Member

One way to address this:

  • Create a trait ExecutionContext with necessary APIs
  • make DataFrame hold Box<dyn ExecutionContext>

This allows DataFusions' DataFrame to have different contexts resolved at run time (e.g. whether local or cluster is passed?).

@andygrove
Copy link
Member Author

I am closing this since I found a different solution to this issue. Ballista uses the DataFusion context but provides its own query planner which results in a DistributedQueryExec physical plan, so DataFrame.collect() invokes that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants