Enabling the sub-classing of core data types #2846

Closed
JakobGM opened this issue Mar 7, 2022 · 5 comments
Labels
enhancement (New feature or an improvement of an existing feature) · reference (Reference issue for recurring topics)

Comments

@JakobGM
Contributor

JakobGM commented Mar 7, 2022

Enabling the sub-classing of core data types

Description

Currently, the following test will fail on the last line:

def test_preservation_of_subclasses() -> None:
    class SubClassedDataFrame(pl.DataFrame):
        pass

    df = SubClassedDataFrame({"column_1": [1, 2, 3]})
    assert isinstance(df, pl.DataFrame)  # OK, is super-class
    assert isinstance(df, SubClassedDataFrame)  # OK

    expanded_df = df.with_column(pl.lit(2).alias("column_2"))
    assert isinstance(expanded_df, pl.DataFrame)  # OK
    assert isinstance(expanded_df, SubClassedDataFrame)  # Fail!

That is, pl.DataFrame does support sub-classing, but will "lose" the sub-class once you invoke any method which produces a new object. The idea is that any method that yields a new data frame should use the constructor of its own class.

Motivation

The idea is that if Polars has first-class support for sub-classing, it will allow library users to extend the functionality of the library in an object-oriented manner. Such code might end up looking something like this:

class ExtendedDataFrame(polars.DataFrame):
    def __init__(self, *args, **kwargs):
        # Code for supporting other data types here...
        super().__init__(*args, **kwargs)

    def validate(self, schema: pydantic.BaseModel) -> "ExtendedDataFrame":
        # Custom validation logic based on pydantic model here...
        ...

    def clean(self) -> "ExtendedDataFrame":
        # Custom data cleaning logic
        ...

    # And so on...

On one hand, this would enable library authors to extend Polars with domain-specific functionality; on the other hand, product developers could create DataFrame structures representing specific business logic, such as:

class Products(polars.DataFrame):
    def for_sale(self) -> "Products":
        return self.filter(pl.col("status") == "for_sale")

    # And so on...

It would also be a distinguishing factor for Polars, adding value relative to the existing solutions out there. I know that it would provide a huge benefit for the type of code I write!

Changes required

Now, this is easier said than done, for sure. There are several complications when writing a generic library that supports arbitrary sub-classing of its core data structures. I will try to outline some of the solutions that need to be applied in order to support this use case.

1. Dynamic class references

The main idea is that any hard-coded references to DataFrame/DataFrame.__init__/DataFrame.__new__ need to be replaced with dynamic class references. Take the following snippet as an example:

    @classmethod
    def _from_dicts(cls, data: Sequence[Dict[str, Any]]) -> "DataFrame":
        pydf = PyDataFrame.read_dicts(data)
        return DataFrame._from_pydf(pydf)

The problem here is the hard-coded reference to DataFrame, which needs to be replaced with a dynamic reference to cls instead. That way the type is preserved.

    @classmethod
    def _from_dicts(cls, data: Sequence[Dict[str, Any]]) -> "DataFrame":
        pydf = PyDataFrame.read_dicts(data)
        return cls._from_pydf(pydf)

2. Adapting constructor functions

Most construction of new DataFrame objects is done through the wrap_df utility function, defined as follows:

def wrap_df(df: "PyDataFrame") -> "DataFrame":
    return DataFrame._from_pydf(df)

Here we also have a hard-coded reference to DataFrame. An example invocation is in DataFrame.__mul__:

    def __mul__(
        self, other: Union["DataFrame", "pli.Series", int, float, bool]
    ) -> "DataFrame":
        if isinstance(other, DataFrame):
            return wrap_df(self._df.mul_df(other._df))

The problem here is that wrap_df accepts a PyDataFrame object instead of the DataFrame object, so the type information is lost. I guess PyDataFrame is a class implemented in Rust and made available through the Rust bindings? A few different approaches come to mind:

  1. Add an additional cls parameter to wrap_df() with default value DataFrame. Pass in the correct class in DataFrame methods where we would like to preserve the type.
  2. Implement DataFrame._wrap_df(), a classmethod implemented much like wrap_df, except that it uses cls to construct a new DataFrame instance instead.
  3. It also looks like we can invoke self._from_pydf() directly instead of wrap_df() in most methods of DataFrame. Any reasons not to do so?
  4. Add some (optional?) reference on the PyDataFrame object to the class that is supposed to wrap it. My expertise is lacking here to evaluate the feasibility, since I'm not that familiar with Rust.

I'm currently leaning toward option 3.
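
As an illustration, option 3 applied to the __mul__ snippet above could look roughly like this (just a sketch, not the actual implementation):

    def __mul__(
        self, other: Union["DataFrame", "pli.Series", int, float, bool]
    ) -> "DataFrame":
        if isinstance(other, DataFrame):
            # Wrap the result via the instance's own class instead of the
            # module-level wrap_df, so a sub-class is preserved.
            return self._from_pydf(self._df.mul_df(other._df))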

3. Change type annotations (optional)

To illustrate this change, take the following example code:

class Foo:
    @classmethod
    def bar(cls) -> "Foo":
        return cls()


class SubFoo(Foo):
    pass


reveal_type(Foo.bar())  # Foo (correct)
reveal_type(SubFoo.bar())  # Foo (incorrect)

Here the problem is that the return type annotation of Foo.bar is hard-coded to Foo, although it should really be the type of cls. The solution is to type it as follows:

from typing import Type, TypeVar

FooType = TypeVar("FooType", bound="Foo")


class Foo:
    @classmethod
    def bar(cls: Type[FooType]) -> FooType:
        return cls()


class SubFoo(Foo):
    pass


reveal_type(Foo.bar())  # Foo (correct)
reveal_type(SubFoo.bar())  # SubFoo (correct)

For Polars, this would mean annotating most of the DataFrame methods with DataFrameType = TypeVar("DataFrameType", bound="DataFrame") in order for sub-classes to get correct type inference from most type checkers. This will become even easier once PEP 673 (Self Type) is available for use as well.
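
Applied to the _from_dicts snippet from task 1, the annotation could look something like this (a sketch combining tasks 1 and 3):

# Module-level type variable, as discussed above
DataFrameType = TypeVar("DataFrameType", bound="DataFrame")

    @classmethod
    def _from_dicts(
        cls: Type[DataFrameType], data: Sequence[Dict[str, Any]]
    ) -> DataFrameType:
        pydf = PyDataFrame.read_dicts(data)
        return cls._from_pydf(pydf)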

4. Preserve sub-classes after LazyFrame round-trips

One of the main complications is that Polars dataframes are often cast between DataFrame and LazyFrame (and back). We would ideally like .lazy().collect() to return the same type as the original object. In other words, we would have to store the original wrapper class on LazyFrame and then use it when going back with .collect() and so on. The same goes for other classes such as GBSelection.
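
As a self-contained toy sketch of the idea (the _dataframe_class attribute is purely hypothetical and not part of Polars):

from __future__ import annotations

from typing import Type


class DataFrame:
    def lazy(self) -> LazyFrame:
        lf = LazyFrame()
        # Remember which (sub-)class created this LazyFrame ...
        lf._dataframe_class = type(self)
        return lf


class LazyFrame:
    # Hypothetical attribute; defaults to the plain DataFrame class.
    _dataframe_class: Type[DataFrame] = DataFrame

    def collect(self) -> DataFrame:
        # ... and construct the result with that class again.
        return self._dataframe_class()


class MyDataFrame(DataFrame):
    pass


assert isinstance(MyDataFrame().lazy().collect(), MyDataFrame)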

5. Allow users to sub-class pl.DataFrame, pl.LazyFrame, and pl.Series and connect them together

This is starting to become a bit too complex perhaps, and it is not strictly necessary. But I will write it here anyway. You could imagine users being able to specify something like this:

class MySeries(pl.Series):
    ...

class MyDataFrame(pl.DataFrame):
    SERIES_CLASS = MySeries
    ...

Any method on pl.DataFrame that constructs a Series would then use self.SERIES_CLASS instead of a hard-coded reference to pl.Series. But I will write up a separate issue to discuss it if we get there!
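
As a toy sketch of the mechanism (the names are illustrative and not the actual Polars internals):

from __future__ import annotations

from typing import Type


class Series:
    ...


class DataFrame:
    # Hypothetical hook: sub-classes point this at their own Series type.
    SERIES_CLASS: Type[Series] = Series

    def get_column(self, name: str) -> Series:
        # Methods that construct a Series use the hook instead of a
        # hard-coded reference to Series.
        return self.SERIES_CLASS()


class MySeries(Series):
    ...


class MyDataFrame(DataFrame):
    SERIES_CLASS = MySeries


assert isinstance(MyDataFrame().get_column("a"), MySeries)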

Conclusion

As you can see, there are quite a lot of things to keep in mind, but I think it would offer some really nice benefits to library users. It is important to note that not all five of the aforementioned tasks are required; I/we could implement any subset of them.

If this use case makes sense to you, I could make an attempt at solving tasks 1, 2, and 3 ☺️ Tasks 4 and 5 might come later in separate PRs.

I also totally understand that you would have to see what a PR would look like in practice in order to make the final decision.

@JakobGM
Contributor Author

JakobGM commented Mar 8, 2022

I have created a PR for tasks 1-3 as outlined above here: #2859

ritchie46 pushed a commit that referenced this issue Mar 9, 2022
This PR introduces the necessary changes such that all methods of DataFrame and LazyFrame that return new DataFrame or LazyFrame objects, respectively, preserve the type when these classes have been inherited from.

This PR implements solutions for tasks 1-3 as outlined in #2846. Preservation of data types after round-trips from DataFrame to GroupBy/LazyFrame and back still remains to be done. I have a solution in mind for tasks 4 and 5 as outlined in the issue, but I will introduce that in a separate PR in order to make the review process a bit simpler ☺️
@tomasruizt

Any updates on this?

@lfagliano

Is there any workaround for this now?

@stinodego added the enhancement label (New feature or an improvement of an existing feature) and removed the feature label on Jul 14, 2023
@stinodego
Member

We have decided not to support subclassing our core data types.

You can work around this in a few ways.

  1. Extend the core classes with your own methods using custom namespaces (see the sketch after the composition example below).
  2. Wrap the core classes with your own class using composition, e.g.:
import polars as pl
from typing import Any

class MyDF:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self._df = pl.DataFrame(*args, **kwargs)

    def __getattr__(self, name: str) -> Any:
        """Redirect any calls to the underlying Polars DataFrame."""
        return getattr(self._df, name)

    def select(self, column: str) -> None:
        """Custom select function."""
        ...
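
For the first option, a rough sketch using the register_dataframe_namespace decorator from polars.api (available in recent Polars versions; the "products" namespace name is just an example):

import polars as pl


@pl.api.register_dataframe_namespace("products")
class ProductsNamespace:
    def __init__(self, df: pl.DataFrame) -> None:
        self._df = df

    def for_sale(self) -> pl.DataFrame:
        """Domain-specific method, exposed as df.products.for_sale()."""
        return self._df.filter(pl.col("status") == "for_sale")


df = pl.DataFrame({"status": ["for_sale", "sold"]})
for_sale = df.products.for_sale()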

@rhshadrach-8451

rhshadrach-8451 commented Nov 23, 2024

I want to highlight the insufficiency of these approaches.

Extend the core classes with your own methods using custom namespaces.

This is not desirable when you want several different classes that wrap Polars DataFrames, each backed by a different source of data, where you want to expose certain methods on only those classes. E.g. three DataFrames of Customers, Products, and Sales: you may want a convenience method Customers.top_buyers(sales: Sales) or Products.top_sellers(sales: Sales). Using this approach, you end up with code like:

customers = pl.DataFrame(...)
customers.customers.top_buyers(sales)

and having this namespace exposed on DataFrames where it does not make sense is, to me, a deal-breaker when it comes to design.

Wrap the core classes with your own class using composition, e.g.:

Because Python does not look up dunder methods through __getattr__, you also need to add all dunders to the wrapper class. This is an extensive list and (I think) prone to changing across Polars versions. If you wish to retain your wrapper class through operations, you must implement logic wrapping each method, so using __getattr__ does not work here either. Keeping this up to date, or perhaps even functional across versions, appears to me to be a very intensive task.
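
A small illustration of the dunder problem, assuming the MyDF wrapper sketched above:

mydf = MyDF({"a": [1, 2]})
mydf.height   # works: resolved through __getattr__ on the wrapper
mydf["a"]     # TypeError: dunders such as __getitem__ are looked up on the
              # class itself, so __getattr__ is never consulted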

We have decided not to support subclassing our core data types.

I completely respect Polars' decision, but I do think it would be helpful to understand why it was made. Is it:

  • Technical limitations.
  • Maintenance burden.
  • A perception that not many users would find it useful.
  • Something else?

To be sure, it is my opinion that the approach of using composition while exposing a public DataFrame attribute:

class MyDF:
    def __init__(self, df: pl.DataFrame) -> None:
        self.df = df

is slightly inconvenient, a little ugly, but overall a fine and working solution to this problem.
