[DISCUSSION] We need a Hero for datafusion-python #440
Comments
I'm willing to lend a hand 😄. Are there any requirements 😅? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've made (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.
This looks like a great idea! I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content. As @mesejo mentioned, they've been making great contributions to the DataFusion backend, so there's currently some momentum we can take advantage of. I'll be up front about it: there's still a lot of work to do, and the DataFusion backend is missing a lot of functionality. The good news is that we've made it really easy to see what functionality is missing from any given backend using our backend support matrix app. Anyone can take a pass at implementing the operations that have a 🚫 in the matrix. What do you say we ...
I propose we leave the decision of where to take this project and what to focus on to whatever hero(s) step forward. What I think ...
Thank you @cpcloud -- this is an excellent idea and it would be awesome to see the DataFusion ibis backend become more full featured. I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.
That is one of the cleverest summaries I have seen in a long time. Nicely done 👏
Thank you @mesejo -- that is great. Like many projects, I think what would be most valuable in this project is
Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.
Just to throw out an idea related to this:
if we agree Ibis is a delightful dataframe API and we can close the gaps in the DataFusion backend, then you could avoid a lot of work in defining a new dataframe API by wrapping Ibis so that code looks like:

```python
In [3]: t = datafusion.read_parquet("penguins.parquet")

In [4]: t
Out[4]:
DatabaseTable: _ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse
  species            string
  island             string
  bill_length_mm     float64
  bill_depth_mm      float64
  flipper_length_mm  int64
  body_mass_g        int64
  sex                string
  year               int64

In [5]: datafusion.options.interactive = True

In [6]: t
Out[6]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │ 39.1           │ 18.7          │ 181               │ 3750        │ male   │ 2007  │
│ Adelie  │ Torgersen │ 39.5           │ 17.4          │ 186               │ 3800        │ female │ 2007  │
│ Adelie  │ Torgersen │ 40.3           │ 18.0          │ 195               │ 3250        │ female │ 2007  │
│ Adelie  │ Torgersen │ nan            │ nan           │ NULL              │ NULL        │ NULL   │ 2007  │
│ Adelie  │ Torgersen │ 36.7           │ 19.3          │ 193               │ 3450        │ female │ 2007  │
│ Adelie  │ Torgersen │ 39.3           │ 20.6          │ 190               │ 3650        │ male   │ 2007  │
│ Adelie  │ Torgersen │ 38.9           │ 17.8          │ 181               │ 3625        │ female │ 2007  │
│ Adelie  │ Torgersen │ 39.2           │ 19.6          │ 195               │ 4675        │ male   │ 2007  │
│ Adelie  │ Torgersen │ 34.1           │ 18.1          │ 193               │ 3475        │ NULL   │ 2007  │
│ Adelie  │ Torgersen │ 42.0           │ 20.2          │ 190               │ 4250        │ NULL   │ 2007  │
│ …       │ …         │ …              │ …             │ …                 │ …           │ …      │ …     │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

In [7]: t.group_by(["species", "island"]).agg(datafusion._.count())
Out[7]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ CountStar(_ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string    │ string    │ int64                                                    │
├───────────┼───────────┼──────────────────────────────────────────────────────────┤
│ Adelie    │ Biscoe    │ 44                                                       │
│ Adelie    │ Torgersen │ 52                                                       │
│ Adelie    │ Dream     │ 56                                                       │
│ Chinstrap │ Dream     │ 68                                                       │
│ Gentoo    │ Biscoe    │ 124                                                      │
└───────────┴───────────┴──────────────────────────────────────────────────────────┘

In [8]: t.group_by(["species", "island"]).agg(datafusion._.count().name("count"))
Out[8]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species   ┃ island    ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string    │ string    │ int64 │
├───────────┼───────────┼───────┤
│ Adelie    │ Biscoe    │ 44    │
│ Adelie    │ Torgersen │ 52    │
│ Chinstrap │ Dream     │ 68    │
│ Gentoo    │ Biscoe    │ 124   │
│ Adelie    │ Dream     │ 56    │
└───────────┴───────────┴───────┘
```
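To make the wrapping idea concrete, here is a minimal, hypothetical sketch of the facade pattern being proposed: a `datafusion`-style namespace whose top-level functions simply delegate to a wrapped DataFrame backend (ibis, in the proposal). The `_Backend` class below is a stand-in, not the real ibis or datafusion-python API.

```python
# Hypothetical sketch of the proposed facade: a thin module-level API
# that delegates to a wrapped DataFrame backend (ibis in the proposal).
# _Backend is a stand-in; real code would hold an ibis DataFusion
# connection and return lazy table expressions instead of strings.

class _Backend:
    def read_parquet(self, path):
        # Real implementation: register the file with the engine and
        # return a lazy table expression over it.
        return f"table({path})"

_backend = _Backend()

def read_parquet(path):
    """Top-level convenience function that delegates to the backend."""
    return _backend.read_parquet(path)

print(read_parquet("penguins.parquet"))  # → table(penguins.parquet)
```

The appeal of this design is that the facade stays tiny: every new backend capability shows up in the top-level namespace without any binding work.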
@alamb Thanks for the feedback
I opened a PR with a draft for the User Guide 😃. While writing the guide, I noticed two issues that have a huge impact on the UX and are simple to solve:
To solve 1., we could follow the PyO3 guide and add type information in .pyi files. For 2., one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does; see this for example). What are your thoughts?
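As an illustration of alternative 2, here is a hedged sketch of the docstring-wrapping pattern: a compiled binding is re-exported through a thin Python function that carries the documentation. `_internal_col` is a hypothetical stand-in for a function exported from the Rust extension, not an actual datafusion-python symbol.

```python
# Sketch of the docstring-wrapping idea (similar to polars' approach).
# _internal_col stands in for a function exported by the PyO3 module;
# the Python wrapper adds the docstring and type hints that Sphinx,
# IDEs, and help() can actually see.

def _internal_col(name):
    # Pretend this came from the compiled extension module.
    return ("col", name)

def col(name: str):
    """Create an expression referring to the column ``name``.

    Examples
    --------
    >>> col("species")
    ('col', 'species')
    """
    return _internal_col(name)
```

For alternative 1, the matching `.pyi` stub would carry only signatures, e.g. `def col(name: str) -> Expr: ...`, so type checkers and editors get completions even for the compiled parts.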
That seems like a very good idea to me
I think this is likewise a great idea. Thank you @mesejo
Just a note that with manual ...
I'm late to this discussion (and new to this project in general), but the contributions I've been focused on over the past month or so have been aimed at solving some of the gaps I see as a heavy Python user with a Data Science / Engineering background. Particularly for ETL use cases, it needs to be easy to move and transform data between various formats and object stores, leveraging every available core to the maximum extent possible. Default options need to be well tuned, since most of these users, in my opinion, won't give DataFusion a second look if they run their job and it is much slower than polars or whatever tool they currently use. I haven't gotten to actually looking much at the Python interface yet, but it is on my list. I am very much on board with the vision you describe @alamb.
Hi everyone. I'm happy to help out with this. I think it might be a good idea to get a sense of what people think this should ultimately look like, as well as what features they think a good DataFrame library should have. To that end, I've started an issue which hopefully will help gather ideas and fodder for documentation: #462
Folks! I've created some issues to tackle the missing functions in the Python bindings. These are a perfect fit for a good first issue label, so contributions are more than welcome. (@alamb perhaps we could label the issues as such and promote them on Twitter to increase involvement with the project?)
Thanks @mesejo -- I marked the tickets as good first issue and posted a tweet: https://twitter.com/andrewlamb1111/status/1699827809462440353 |
Is there anyone currently adding pyi files for datafusion-python? I have experience in this area and I would like to be involved in this work |
Thanks @woxiaosa -- I do not know of anyone currently adding pyi files |
I am also late to this, but as I am trying to evaluate DataFusion (Python) I can give some input. I am sure folks know that the documentation is scant, with most API functions having no more than the method name and args (auto-extracted by Sphinx). My next idea was to test SQL vs. native expression filtering. I got the SQL to work, but I cannot see how to use an 'and' expr/function: since it is a reserved word in Python, I saw no way to apply or import it. Would the native expression filtering be faster than the equivalent SQL? So yes, complete examples (showing all the imports, etc.) for all the functions and expressions would be great. Hope this is useful.
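On the `and` question: because `and` is a Python keyword that cannot be overloaded, DataFrame libraries in this space conventionally overload the bitwise `&` operator for logical conjunction on expressions (datafusion-python's expressions follow this convention too, as far as I can tell). A toy sketch of the pattern; the `Expr` class here is illustrative, not the real datafusion API:

```python
# Toy illustration of why `df.filter(a & b)` works where `a and b` cannot:
# `and` short-circuits via truthiness and is not overloadable, while `&`
# dispatches to __and__, letting the library build a combined expression.

class Expr:
    def __init__(self, desc):
        self.desc = desc

    def __and__(self, other):
        # Combine two predicates into a single AND expression node.
        return Expr(f"({self.desc} AND {other.desc})")

a = Expr("bill_length_mm > 40")
b = Expr("sex = 'male'")
print((a & b).desc)  # → (bill_length_mm > 40 AND sex = 'male')
```

The same reasoning explains `|` for OR and `~` for NOT in these APIs; parenthesize the operands, since `&` binds more tightly than comparison operators.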
I think github got a little excited about closing this |
somebody at GitHub is going to use this as evidence for LLM-based issue closing instead of the current rules |
Given the work that @timsaucer, @Michael-J-Ward, @kosiew, and @ion-elgreco have done with this repository recently, I think we can say "the call has been answered" 🦸 🚀. I think we should close this issue. Thank you again for all your hard work.
Also a big thank you to @mesejo!
Closing per recommendation from @alamb and no additional requests. |
I think this project needs someone who wants to take the helm and make a world-class Python DataFrame library and user experience. I will argue why I think this is a compelling opportunity to build a great piece of technology and have a wide impact across the data analytics space:
What this project could be
I think this project could be one of the most widely used data analysis libraries out there. Imagine a system that offers BOTH a fast DataFrame API (à la pol.rs) and first-class SQL support (à la duckdb), both screaming fast (thanks to all the effort that goes into https://github.com/apache/arrow-datafusion), easy to plug into the ecosystem (Arrow / Parquet), and extensible (UDFs, UDAFs, etc.)
DataFusion already posts great benchmark numbers, and I will post DataFusion 28.0.0 benchmarks when we have them.
How is this different from the mission of DataFusion?
DataFusion is a great project but is currently focused on building the core analytic engine:
This repository contains basic Python bindings, but the user experience (UX) could be improved in many ways.
The opportunity
This would be a great opportunity for someone to: