Enabling the sub-classing of core data types #2846
Comments
I have created a PR for tasks 1-3 as outlined above here: #2859
This PR introduces the necessary changes such that all methods of `DataFrame` and `LazyFrame` which return a new `DataFrame` or `LazyFrame` object, respectively, preserve the type when these classes have been inherited from. This PR implements solutions for tasks 1-3 as outlined in #2846. Preservation of data types after round-trips from `DataFrame` to `GroupBy`/`LazyFrame` and back still remains to be done. I have a solution in mind for tasks 4 and 5 as outlined in the issue, but I will introduce that one in a separate PR in order to make the review process a bit simpler ☺️
Any updates on this?
Is there any workaround for this now?
We have decided not to support subclassing our core data types. You can work around this in a few ways.
```python
import polars as pl
from typing import Any

class MyDF:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self._df = pl.DataFrame(*args, **kwargs)

    def __getattr__(self, name: str) -> Any:
        """Redirect any calls to the underlying Polars DataFrame."""
        return getattr(self._df, name)

    def select(self, column: str) -> None:
        """Custom select function."""
        ...
```
I want to highlight the insufficiency of these approaches.
This is not desirable when you want to have several different classes that wrap Polars DataFrames, each with a different source of data, where you want to expose methods on only those classes. E.g., given a frame like `customers = pl.DataFrame(...)`, a registered namespace forces calls like `customers.customers.top_buyers(sales)`, and to have this namespace exposed on a DataFrame where it does not make sense is, to me, a deal-breaker when it comes to design.
Because the way Python detects which dunders are available is not through getting an attribute, you need to also add all dunders to the wrapper class. This is an extensive list and (I think) prone to changing through various versions of Polars. If you wish to retain your class through operations, you must implement logic wrapping each method, so using `__getattr__` alone does not suffice.
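A minimal sketch of that pitfall, reusing the wrapper pattern from the comment above:

```python
import polars as pl
from typing import Any

class MyDF:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self._df = pl.DataFrame(*args, **kwargs)

    def __getattr__(self, name: str) -> Any:
        # Only consulted for *regular* attribute lookups that fail.
        return getattr(self._df, name)

df = MyDF({"a": [1, 2]})
print(df.height)  # 2 -- forwarded to the inner DataFrame via __getattr__
len(df)           # TypeError: __len__ is looked up on type(df), bypassing __getattr__
```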
I completely respect Polars' decision, but I do think it would be helpful to understand why this decision was made.
To be sure, it is my opinion that the approach of using composition, exposing a public DataFrame:

```python
class MyDF:
    def __init__(self, df: pl.DataFrame) -> None:
        self.df = df
```

is slightly inconvenient, a little ugly, but overall a fine and working solution to this problem.
Enabling the sub-classing of core data types
Description
Currently, the following test will fail on the last line:
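A minimal sketch of such a test (the subclass name is illustrative):

```python
import polars as pl

class MyDataFrame(pl.DataFrame):
    pass

df = MyDataFrame({"a": [1, 2, 3]})
assert isinstance(df, MyDataFrame)          # passes
assert isinstance(df.head(1), MyDataFrame)  # fails: head() returns a plain pl.DataFrame
```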
That is, `pl.DataFrame` does support sub-classing, but will "lose" the sub-class once you invoke any method which produces a new object. The idea is that any method that yields a new data frame should use the constructor of its own class.

Motivation
The idea is that if Polars has first-class support for sub-classing, it will allow library users to extend the functionality of the library in an object-oriented manner. Such code might end up looking something like this:
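A minimal sketch, with made-up names, of what such library code could look like:

```python
import polars as pl

class TimeSeriesFrame(pl.DataFrame):
    """Hypothetical library extension of DataFrame."""

    def fill_gaps(self) -> "TimeSeriesFrame":
        # The annotation only holds if polars preserves the subclass
        # on the frame returned by sort().
        return self.sort("timestamp")
```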
On one hand, this would enable library authors to extend Polars with domain-specific functionality; on the other, product developers could create DataFrame structures representing specific business logic, such as:
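For instance, a sketch with hypothetical names, echoing the `top_buyers` idea from the discussion above:

```python
import polars as pl

class CustomerFrame(pl.DataFrame):
    """Hypothetical frame representing the customers table."""

    def top_buyers(self, sales: pl.DataFrame, n: int = 10) -> pl.DataFrame:
        # Join customers against their sales and keep the n largest buyers.
        return (
            self.join(sales, on="customer_id")
            .sort("amount", descending=True)
            .head(n)
        )
```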
It would also be a very distinguishing factor for Polars, adding value relative to the existing solutions out there. I know that it would provide a huge benefit for the type of code I write!
Changes required
Now, this is easier said than done, for sure. There are several complications when writing a generic library that supports arbitrary sub-classing of its core data structures. I will try to outline some of the solutions that need to be applied in order to support this use-case.
1. Dynamic class references
The main idea is that any hard-coded references to `DataFrame`/`DataFrame.__init__`/`DataFrame.__new__` need to be replaced with dynamic class references. The problem is the hard-coded reference to `DataFrame`, which needs to be replaced with a dynamic reference to `cls` instead. That way the type is preserved.
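A rough sketch of the pattern and its fix, with the method body assumed from the surrounding description:

```python
# Before: the helper hard-codes DataFrame, so any subclass is lost.
@staticmethod
def _from_pydf(py_df: "PyDataFrame") -> "DataFrame":
    df = DataFrame.__new__(DataFrame)  # hard-coded class reference
    df._df = py_df
    return df

# After: as a classmethod, cls is whatever class the call was made on,
# so a subclass gets an instance of itself back.
@classmethod
def _from_pydf(cls, py_df: "PyDataFrame") -> "DataFrame":
    df = cls.__new__(cls)  # dynamic class reference
    df._df = py_df
    return df
```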
2. Adapting constructor functions
Most constructions of new `DataFrame` objects are done through the `wrap_df` utility function, defined as follows:
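A sketch of that helper (assumed close to, but not necessarily verbatim, the actual definition):

```python
def wrap_df(df: "PyDataFrame") -> "DataFrame":
    return DataFrame._from_pydf(df)  # hard-coded DataFrame reference
```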
Here we also have a hard-coded reference to `DataFrame`. An example invocation is in `DataFrame.__mul__`:
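A sketch of the shape of that invocation (the exact binding call is assumed):

```python
def __mul__(self, other):
    # self._df.mul(...) returns a PyDataFrame; wrap_df then wraps it
    # in a plain DataFrame, dropping any subclass information.
    return wrap_df(self._df.mul(other))
```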
The problem here is how `wrap_df` accepts a `PyDataFrame` object instead of the `DataFrame` object. The type information is therefore lost. I guess `PyDataFrame` is a class implemented in Rust and made available through the Rust bindings? Some different approaches come to mind:

1. Add a `cls` parameter to `wrap_df()` with default value `DataFrame`, and pass in the correct class in `DataFrame` methods where we would like to preserve the type.
2. Add `DataFrame._wrap_df()`, a classmethod implemented much like `wrap_df`, only that it uses `cls` to construct a new dataframe instance instead.
3. Use `self._from_pydf()` directly instead of `wrap_df()` in most methods of `DataFrame`. Any reasons not to do so?
4. Attach the `PyDataFrame` object to the class which is supposed to wrap it. Here my expertise is lacking to evaluate the feasibility, since I'm not that familiar with Rust.

I'm currently leaning toward option 3 (sketched below).
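As a sketch, option 3 would amount to something like this inside each method:

```python
def head(self, length: int = 5) -> "DataFrame":
    # _from_pydf is a classmethod, so type(self) is preserved.
    return self._from_pydf(self._df.head(length))
```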
3. Change type annotations (optional)
To illustrate this change, take the following example code:
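A minimal sketch of the problem:

```python
class Foo:
    def bar(self) -> "Foo":
        return self.__class__()

class SubFoo(Foo):
    ...

sub = SubFoo().bar()  # type checkers infer Foo here, not SubFoo
```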
Here the problem is how the type annotation of the return type of `Foo.bar` is hard-coded to `Foo`, although it is really the type of `self`. The solution is to type it as follows:
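A sketch of the standard `TypeVar` pattern:

```python
from typing import TypeVar

FooType = TypeVar("FooType", bound="Foo")

class Foo:
    def bar(self: FooType) -> FooType:
        return self.__class__()

class SubFoo(Foo):
    ...

sub = SubFoo().bar()  # now inferred as SubFoo
```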
For Polars this would mean that we would have to type annotate most of the `DataFrame` methods with `DataFrameType = TypeVar("DataFrameType", bound="DataFrame")` in order for sub-classes to get correct type inference from most type checkers. This will become even easier to type once PEP 673 -- Self Type is available for use as well.

4. Preserve sub-classes after `LazyFrame` round-trips
One of the main problems with Polars dataframes is how they are often cast between `DataFrame` and `LazyFrame` (and back). We would ideally like to see `.lazy().collect()` return the same type as the original object. In other words, we would have to store the original wrapper class on `LazyFrame` and then use it when going back with `.collect()` and so on. The same goes for other classes such as `GBSelection`.
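Concretely, the desired round-trip behaviour, sketched with a hypothetical subclass:

```python
import polars as pl

class MyDataFrame(pl.DataFrame):
    pass

df = MyDataFrame({"a": [1, 2, 3]})
out = df.lazy().filter(pl.col("a") > 1).collect()
assert isinstance(out, MyDataFrame)  # the desired behaviour; fails today
```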
5. Allow users to sub-class `pl.DataFrame`, `pl.LazyFrame`, and `pl.Series` and connect them together
This is starting to become a bit too complex perhaps, and it is not strictly necessary, but I will write it here anyway. You could imagine users being able to specify something like this:
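Presumably something along these lines, with a hypothetical `SERIES_CLASS` hook:

```python
import polars as pl

class MySeries(pl.Series):
    ...

class MyDataFrame(pl.DataFrame):
    SERIES_CLASS = MySeries  # hypothetical hook connecting the classes
```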
Any methods on `pl.DataFrame` that construct series would then use `self.SERIES_CLASS` instead of hard-coded references to `pl.Series`, for instance. But I will write up a separate issue to discuss it if we get there!

Conclusion
As you can see, there are quite a lot of things to keep in mind, but I think it would offer some really nice benefits to the library users. One important thing to note is that not all five of the aforementioned tasks are required; I/we could implement any subset of them.
If this use case makes sense to you, I could make an attempt at solving tasks 1, 2, and 3 ☺️ Tasks 4 and 5 might come later in separate PRs.
I also totally understand that you would have to see how a PR looks in practice in order to make the final decision.