Add Roadmap to Documentation #1104

Merged
merged 9 commits into from
Oct 19, 2021

Contributor

@alamb alamb commented Oct 11, 2021

Which issue does this PR close?

Closes #1102 suggested by @xudong963

**All suggestions welcome**

Rationale for this change

  1. Coordinate development work
  2. Help developers new to DataFusion understand where we are going

See also:

What changes are included in this PR?

Add roadmap to docs published on https://arrow.apache.org/datafusion/

Are there any user-facing changes?

Docs

@alamb alamb added the documentation and development-process labels Oct 11, 2021
@alamb alamb marked this pull request as draft October 11, 2021 11:27

## Vision

DataFusion's goal is to become _the de facto query engine_ of choice
Member

@xudong963 xudong963 Oct 11, 2021

_the de facto query engine_ typo?

Member

It means the query engine of choice by fact

Contributor Author

I will reword this to less idiomatic English

Member

Got it. My poor English needs improving (/ω\)

Member

It's okay :-) @xudong963, it was not English to begin with; it is a Latin phrase.

@alamb I think de facto is a common phrase; one way to make it read more smoothly is to put it in italics (as in many books containing Latin expressions).

I'd probably hammer it home by including more examples inline, or even by using a footnote (not sure GitHub Markdown supports that).

Contributor Author

👍 Makes sense. I do think that, in general, minimizing 'flowery prose' and idiomatic English is probably the best plan for what have become worldwide communities.

@andygrove
Member

I will contribute some content for the Ballista vision

@xudong963
Member

xudong963 commented Oct 11, 2021

2022 is less than three months away. I plan to work on the following for DataFusion in the remaining time:

  1. Implement the rest of Set Operators: INTERSECT, EXCEPT, etc #1082
  2. Make subqueries more complete: make some of the tests pass https://github.com/apache/arrow-datafusion/issues?q=is%3Aopen+is%3Aissue+milestone%3ATPC-H. Later I'll pick a specific issue and split the task into small pieces.
  3. Some small, easy fixes

@houqp houqp mentioned this pull request Oct 12, 2021
@jimexist
Member

I think it's good to merge as is, more detailed sections can be subsequent pull requests.

@xudong963
Member

I think it's good to merge as is, more detailed sections can be subsequent pull requests.

@andygrove said he'll contribute some content for the Ballista vision👀

@alamb
Contributor Author

alamb commented Oct 15, 2021

Thanks @jimexist and @xudong963. I plan to leave this open over the weekend so that anyone who has more time then can contribute, and then merge it early next week.

@alamb alamb marked this pull request as ready for review October 15, 2021 21:40
@alamb alamb changed the title Add Roadmap (WIP) Add Roadmap to Documentation Oct 15, 2021
Member

@houqp houqp left a comment

thanks @alamb for putting this up.

If I recall correctly, @jorgecarleitao has plans to decouple datafusion into smaller reusable crates too. But if @jorgecarleitao and @andygrove are too busy to add their items right now, we could merge and add those items later as follow-up PRs.

@Dandandan do you want to add your tokomak optimizer to the list?

@houqp
Member

houqp commented Oct 18, 2021

also cc @yjshen in case we missed any item needed from your native spark executor work.

@Dandandan
Contributor

@Dandandan do you want to add your tokomak optimizer to the list?
Yes, added!

@yjshen
Member

yjshen commented Oct 19, 2021

also cc @yjshen in case we missed any item needed from your native spark executor work.

Thanks, @houqp. I think what I need most is covered by the Resource Management section. I'm currently prototyping a memory-limited version of SortExec.

On the Ballista side, I feel a broadcast join would be great to add. Besides, we could have a sort-based shuffle writer, which is friendlier to memory usage, and write a single map output file per task to avoid creating too many small files when the number of output partitions is large.

Contributor

@rdettai rdettai left a comment

Would it be the right place to rate the maturity of each component? I feel that there is a pretty big gap between the DataFusion part of the codebase and Ballista (hard coded parameters, huge functions to be decomposed, un-supported SQL...). This information would be very important for a newcomer, but I am not sure how to formulate it 😄

@alamb
Contributor Author

alamb commented Oct 19, 2021

Would it be the right place to rate the maturity of each component? I feel that there is a pretty big gap between the DataFusion part of the codebase and Ballista (hard coded parameters, huge functions to be decomposed, un-supported SQL...). This information would be very important for a newcomer, but I am not sure how to formulate it 😄

This would be a great thing to describe, but the roadmap is probably not the right place for it (maybe the roadmap could have an entry like "* Mature ballista (see for details)").

How about a document somewhere in the user guide https://arrow.apache.org/datafusion/#toc-guide?

Co-authored-by: Carlos <[email protected]>
Co-authored-by: rdettai <[email protected]>
@alamb alamb merged commit ff243a4 into apache:master Oct 19, 2021
@alamb alamb deleted the alamb/roadmap branch October 19, 2021 16:25
@alamb
Contributor Author

alamb commented Oct 19, 2021

We can add changes as follow on PRs perhaps. Thanks everyone for all the help!

@xudong963
Member

We can add changes as follow on PRs perhaps. Thanks everyone for all the help!

🎉

@houqp
Member

houqp commented Oct 19, 2021

Thank you @alamb for making this happen. I can help deploy the doc change later today.

Reminder for @jorgecarleitao and @andygrove to add your entries before the upcoming datafusion 6.0.0 release ;)

Labels
development-process: Related to development process of DataFusion
documentation: Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

Add roadmap for datafusion
9 participants