[DISCUSSION] We need a Hero for datafusion-python #440
Comments
I'm willing to lend a hand 😄. Are there any requirements 😅? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've made (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.
This looks like a great idea! I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content. As @mesejo mentioned, they've been making great contributions to the DataFusion backend, so there's currently some momentum we can take advantage of. I'll be up front about it: there's still a lot of work to do, and the DataFusion backend is missing a lot of functionality. The good news is that we've made it really easy to see what functionality is missing from any given backend using our backend support matrix app. Anyone can take a pass at implementing the operations that have a 🚫 in the matrix. What do you say we ...
I propose we leave the decision of where to take this project and what to focus on to whatever hero(s) step forward. What I think ...
Thank you @cpcloud -- this is an excellent idea and it would be awesome to see the DataFusion ibis backend become more full featured. I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.
That is one of the cleverest summaries I have seen in a long time. Nicely done 👏
Thank you @mesejo -- that is great. Like many projects, I think what would be most valuable in this project is
Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.
Just to throw out an idea related to this:
if we agree Ibis is a delightful dataframe API and we can close the gaps in the DataFusion backend, then you could avoid a lot of work in defining a new dataframe API by wrapping Ibis so that code looks like:

```python
In [3]: t = datafusion.read_parquet("penguins.parquet")

In [4]: t
Out[4]:
DatabaseTable: _ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse
  species            string
  island             string
  bill_length_mm     float64
  bill_depth_mm      float64
  flipper_length_mm  int64
  body_mass_g        int64
  sex                string
  year               int64

In [5]: datafusion.options.interactive = True

In [6]: t
Out[6]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │ 39.1           │ 18.7          │ 181               │ 3750        │ male   │ 2007  │
│ Adelie  │ Torgersen │ 39.5           │ 17.4          │ 186               │ 3800        │ female │ 2007  │
│ Adelie  │ Torgersen │ 40.3           │ 18.0          │ 195               │ 3250        │ female │ 2007  │
│ Adelie  │ Torgersen │ nan            │ nan           │ NULL              │ NULL        │ NULL   │ 2007  │
│ Adelie  │ Torgersen │ 36.7           │ 19.3          │ 193               │ 3450        │ female │ 2007  │
│ Adelie  │ Torgersen │ 39.3           │ 20.6          │ 190               │ 3650        │ male   │ 2007  │
│ Adelie  │ Torgersen │ 38.9           │ 17.8          │ 181               │ 3625        │ female │ 2007  │
│ Adelie  │ Torgersen │ 39.2           │ 19.6          │ 195               │ 4675        │ male   │ 2007  │
│ Adelie  │ Torgersen │ 34.1           │ 18.1          │ 193               │ 3475        │ NULL   │ 2007  │
│ Adelie  │ Torgersen │ 42.0           │ 20.2          │ 190               │ 4250        │ NULL   │ 2007  │
│ …       │ …         │ …              │ …             │ …                 │ …           │ …      │ …     │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

In [7]: t.group_by(["species", "island"]).agg(datafusion._.count())
Out[7]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ CountStar(_ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string    │ string    │ int64                                                    │
├───────────┼───────────┼──────────────────────────────────────────────────────────┤
│ Adelie    │ Biscoe    │ 44                                                       │
│ Adelie    │ Torgersen │ 52                                                       │
│ Adelie    │ Dream     │ 56                                                       │
│ Chinstrap │ Dream     │ 68                                                       │
│ Gentoo    │ Biscoe    │ 124                                                      │
└───────────┴───────────┴──────────────────────────────────────────────────────────┘

In [8]: t.group_by(["species", "island"]).agg(datafusion._.count().name("count"))
Out[8]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species   ┃ island    ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string    │ string    │ int64 │
├───────────┼───────────┼───────┤
│ Adelie    │ Biscoe    │ 44    │
│ Adelie    │ Torgersen │ 52    │
│ Chinstrap │ Dream     │ 68    │
│ Gentoo    │ Biscoe    │ 124   │
│ Adelie    │ Dream     │ 56    │
└───────────┴───────────┴───────┘
```
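To make the wrapping idea concrete, here is a minimal, hypothetical sketch of the facade pattern being proposed: a `datafusion`-style namespace whose top-level functions simply delegate to a wrapped DataFrame backend (ibis, in the proposal). The `_Backend` class below is a stand-in, not the real ibis or datafusion-python API.

```python
# Hypothetical sketch of the proposed facade: a thin module-level API
# that delegates to a wrapped DataFrame backend (ibis in the proposal).
# _Backend is a stand-in; real code would hold an ibis DataFusion
# connection and return lazy table expressions instead of strings.

class _Backend:
    def read_parquet(self, path):
        # Real implementation: register the file with the engine and
        # return a lazy table expression over it.
        return f"table({path})"

_backend = _Backend()

def read_parquet(path):
    """Top-level convenience function that delegates to the backend."""
    return _backend.read_parquet(path)

print(read_parquet("penguins.parquet"))  # → table(penguins.parquet)
```

The appeal of this design is that the facade stays tiny: every new backend capability shows up in the top-level namespace without any binding work.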
@alamb Thanks for the feedback
I opened a PR with a draft for the User Guide 😃. While writing the guide, I noticed two issues that have a huge impact on the UX and are simple to solve:
To solve 1., we could follow the PyO3 guide and add type information in .pyi files. For 2., one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does; see this for example). What are your thoughts?
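As an illustration of alternative 2, here is a hedged sketch of the docstring-wrapping pattern: a compiled binding is re-exported through a thin Python function that carries the documentation. `_internal_col` is a hypothetical stand-in for a function exported from the Rust extension, not an actual datafusion-python symbol.

```python
# Sketch of the docstring-wrapping idea (similar to polars' approach).
# _internal_col stands in for a function exported by the PyO3 module;
# the Python wrapper adds the docstring and type hints that Sphinx,
# IDEs, and help() can actually see.

def _internal_col(name):
    # Pretend this came from the compiled extension module.
    return ("col", name)

def col(name: str):
    """Create an expression referring to the column ``name``.

    Examples
    --------
    >>> col("species")
    ('col', 'species')
    """
    return _internal_col(name)
```

For alternative 1, the matching `.pyi` stub would carry only signatures, e.g. `def col(name: str) -> Expr: ...`, so type checkers and editors get completions even for the compiled parts.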
That seems like a very good idea to me
I think this is likewise a great idea. Thank you @mesejo
Just a note that with manual ...
I'm late to this discussion (and new to this project in general), but the contributions I've been focused on over the past month or so have been aimed at solving some of the gaps I see as a heavy Python user with a Data Science / Engineering background. Particularly for ETL use cases, it needs to be easy to move and transform data between various formats and object stores, leveraging every available core to the maximum extent possible. Default options need to be well tuned, since most of these users, in my opinion, won't give DataFusion a second look if they run their job and it is much slower than polars or whatever tool they currently use. I haven't gotten to actually looking much at the Python interface yet, but it is on my list. I am very much on board with the vision you describe @alamb.
Hi everyone. I'm happy to help out with this. I think it might be a good idea to get a sense of what people think this should ultimately look like, as well as what features they think a good DataFrame library should have. To that end, I've started an issue which hopefully will help gather ideas and fodder for documentation: #462
Folks! I've created some issues to tackle the missing functions in the Python bindings. These are a perfect fit for a good first issue label, so contributions are more than welcome. (@alamb perhaps we could label the issues as such and promote them on Twitter to increase involvement with the project?)
Thanks @mesejo -- I marked the tickets as good first issue and posted a tweet: https://twitter.com/andrewlamb1111/status/1699827809462440353 |
Is there anyone currently adding pyi files for datafusion-python? I have experience in this area and I would like to be involved in this work |
Thanks @woxiaosa -- I do not know of anyone currently adding pyi files |
I am also late to this, but as I am trying to evaluate DataFusion (Python) I can give some input. I am sure folks know that the documentation is scant, with most API functions having no more than the method name and args (auto-extracted by Sphinx). My next idea was to test SQL vs. native expression filtering. I got the SQL to work, but I cannot see how to use an 'and' expr/function: since it is a reserved word in Python, I saw no way to apply or import it. Would the native expression filtering be faster than the equivalent SQL? So yes, complete examples (showing all the imports, etc.) for all the functions and expressions would be great. Hope this is useful.
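On the `and` question: because `and` is a Python keyword that cannot be overloaded, DataFrame libraries in this space conventionally overload the bitwise `&` operator for logical conjunction on expressions (datafusion-python's expressions follow this convention too, as far as I can tell). A toy sketch of the pattern; the `Expr` class here is illustrative, not the real datafusion API:

```python
# Toy illustration of why `df.filter(a & b)` works where `a and b` cannot:
# `and` short-circuits via truthiness and is not overloadable, while `&`
# dispatches to __and__, letting the library build a combined expression.

class Expr:
    def __init__(self, desc):
        self.desc = desc

    def __and__(self, other):
        # Combine two predicates into a single AND expression node.
        return Expr(f"({self.desc} AND {other.desc})")

a = Expr("bill_length_mm > 40")
b = Expr("sex = 'male'")
print((a & b).desc)  # → (bill_length_mm > 40 AND sex = 'male')
```

The same reasoning explains `|` for OR and `~` for NOT in these APIs; parenthesize the operands, since `&` binds more tightly than comparison operators.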
I think github got a little excited about closing this |
somebody at GitHub is going to use this as evidence for LLM-based issue closing instead of the current rules |
Given the work that @timsaucer, @Michael-J-Ward, @kosiew, and @ion-elgreco have done with this repository recently, I think we can say "the call has been answered" 🦸 🚀. I think we should close this issue. Thank you again for all your hard work.
Also a big thank you to @mesejo!
Closing per recommendation from @alamb and no additional requests. |
I think this project needs someone who wants to take the helm and make a world-class Python DataFrame library and user experience. I will argue why I think this is a compelling opportunity to build a great piece of technology and have a wide impact across the data analytics space:
What this project could be
I think this project could be one of the most widely used data analysis libraries out there. Imagine a system that offers BOTH a fast DataFrame API (à la pol.rs) and first-class SQL support (à la duckdb), both screaming fast (thanks to all the effort that goes into https://github.com/apache/arrow-datafusion), easy to plug into the ecosystem (Arrow / Parquet), and extensible (UDFs, UDAFs, etc.)
DataFusion already posts great benchmark numbers, and I will post DataFusion 28.0.0 benchmarks when we have them.
How is this different from the mission of DataFusion?
DataFusion is a great project but is currently focused on building the core analytic engine:
This repository contains basic Python bindings, but the user experience (UX) could be improved in many ways.
The opportunity
This would be a great opportunity for someone to: