Preserve types of DataFrame subclasses #2859

Merged
merged 4 commits into from
Mar 9, 2022

Conversation

JakobGM
Contributor

@JakobGM JakobGM commented Mar 8, 2022

Preserve types of DataFrame subclasses

TL;DR: This PR introduces the changes needed so that all methods of DataFrame and LazyFrame that return new DataFrame or LazyFrame objects, respectively, preserve the type when these classes have been subclassed.

This PR implements solutions for tasks 1-3 as outlined in #2846. Preserving DataFrame subclass types across roundtrips from DataFrame to GroupBy/LazyFrame and back still remains to be done. I have a solution in mind for tasks 4 and 5 as outlined in the issue, but I will introduce it in a separate PR to make the review process a bit simpler ☺️

Here is an outline of what has been done in this PR:

Replacing the use of wrap_df

Many methods on polars.DataFrame that return new DataFrame objects follow this pattern:

  1. Delegate to PyDataFrame method on self._df.
  2. Return new DataFrame object by wrapping the resulting PyDataFrame with wrap_df.
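
The two-step pattern above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Polars source: the PyDataFrame class here is a hypothetical pure-Python stand-in for the Rust-backed object, head is just an example method, and MyFrame is an invented subclass. It shows how the subclass type gets lost when the result is wrapped by a module-level helper.

```python
class PyDataFrame:
    """Hypothetical stand-in for the FFI object (not the real Polars class)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def head(self, n):
        return PyDataFrame(self.rows[:n])


class DataFrame:
    def __init__(self, rows):
        self._df = PyDataFrame(rows)

    @staticmethod
    def _from_pydf(py_df):
        # The class is hard-coded here, so the result is always a plain DataFrame.
        df = DataFrame.__new__(DataFrame)
        df._df = py_df
        return df

    def head(self, n=5):
        # Step 1: delegate to the PyDataFrame method on self._df.
        # Step 2: wrap the resulting PyDataFrame with wrap_df.
        return wrap_df(self._df.head(n))


def wrap_df(py_df):
    # Only the PyDataFrame is passed in; the caller's type is unknown here.
    return DataFrame._from_pydf(py_df)


class MyFrame(DataFrame):
    """Invented subclass, used only for this illustration."""


result = MyFrame([1, 2, 3]).head(2)
print(type(result).__name__)  # the subclass type is not preserved
```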

The problem is that wrap_df knows nothing about the original DataFrame type and must therefore use the DataFrame._from_pydf constructor, which always produces a plain DataFrame. My solution is to replace the following call-stack:

  • self.X() -> wrap_df() -> DataFrame._from_pydf()

With the following call-stack:

  • self.X() -> self._from_pydf()

In other words, I have removed the use of wrap_df() altogether. The next step is then to make DataFrame._from_pydf() preserve the type of self with the following implementation:

class DataFrame:
    ...

    @classmethod
    def _from_pydf(cls: Type[DataFrameType], py_df: "PyDataFrame") -> DataFrameType:
        """
        Construct Polars DataFrame from FFI PyDataFrame object.
        """
        df = cls.__new__(cls)
        df._df = py_df
        return df

The same as ☝️ has been done for LazyFrame, replacing invocations of wrap_ldf with self._from_pyldf().
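
To see the effect end to end, here is a small self-contained sketch of the classmethod approach described above. As assumptions for illustration only: PyDataFrame is a hypothetical pure-Python stand-in for the FFI object, head is an arbitrary example method, and MyFrame is an invented subclass.

```python
from typing import List, Type, TypeVar

DataFrameType = TypeVar("DataFrameType", bound="DataFrame")


class PyDataFrame:
    """Hypothetical stand-in for the FFI object (not the real Polars class)."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = list(rows)

    def head(self, n: int) -> "PyDataFrame":
        return PyDataFrame(self.rows[:n])


class DataFrame:
    _df: PyDataFrame

    def __init__(self, rows: List[int]) -> None:
        self._df = PyDataFrame(rows)

    @classmethod
    def _from_pydf(cls: Type[DataFrameType], py_df: PyDataFrame) -> DataFrameType:
        # cls is the class the method was looked up on, so a subclass
        # instance calling self._from_pydf(...) gets its own type back.
        df = cls.__new__(cls)
        df._df = py_df
        return df

    def head(self: DataFrameType, n: int = 5) -> DataFrameType:
        # self.X() -> self._from_pydf(), with no wrap_df() in between.
        return self._from_pydf(self._df.head(n))


class MyFrame(DataFrame):
    """Invented subclass, used only for this illustration."""


result = MyFrame([1, 2, 3]).head(2)
print(type(result).__name__)
```

Note that cls.__new__(cls) creates the instance without calling __init__, so in this sketch any attributes a subclass sets in its own __init__ would not be present on the returned object.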

Replacing hard-coded return type annotations with dynamic ones

Take the following type annotation:

class DataFrame:
    ...
    def join(self, other: "DataFrame", ...) -> "DataFrame": ...

This type annotation becomes wrong for subclasses of DataFrame since the return type is hard-coded to DataFrame. Until PEP 673 -- Self Type is usable, the solution is to define a type variable:

# A type variable used to refer to a polars.DataFrame or any subclass of it.
# Used to annotate DataFrame methods which return the same type as self.
DataFrameType = TypeVar("DataFrameType", bound="DataFrame")

This type variable, DataFrameType, can refer to DataFrame itself or to any sub-type of DataFrame. We can now annotate DataFrame.join() in the following way:

class DataFrame:
    ...
    def join(self: DataFrameType, other: "DataFrame", ...) -> DataFrameType: ...

This annotation essentially says the following:

  • The return type is the exact same type as self.
  • The other parameter can be a DataFrame or any sub-type of it, but not necessarily the same type as self.

This allows users to join sub-classes with regular polars.DataFrame objects, mixing and matching as desired. The return type will be the same as the "left object", i.e. x.join(y) will yield the type of x, not y. That is why DataFrame is used to annotate other, not DataFrameType.
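
The "left object" rule can be illustrated with a toy example. This is only a sketch under stated assumptions: the DataFrame class below is a simplified stand-in that stores rows as (key, value) tuples, its join is a toy inner join rather than the Polars implementation, and Orders is an invented subclass.

```python
from typing import List, Tuple, TypeVar

DataFrameType = TypeVar("DataFrameType", bound="DataFrame")


class DataFrame:
    """Simplified stand-in: rows are (key, value) tuples."""

    rows: List[Tuple[int, str]]

    def __init__(self, rows: List[Tuple[int, str]]) -> None:
        self.rows = rows

    def join(self: DataFrameType, other: "DataFrame") -> DataFrameType:
        # Toy inner join on the key; `other` may be any DataFrame sub-type.
        lookup = dict(other.rows)
        joined = [
            (key, f"{value}|{lookup[key]}")
            for key, value in self.rows
            if key in lookup
        ]
        # Build the result from type(self), so the left object decides the type.
        out = type(self).__new__(type(self))
        out.rows = joined
        return out


class Orders(DataFrame):
    """Invented subclass, used only for this illustration."""


orders = Orders([(1, "a"), (2, "b")])
plain = DataFrame([(2, "x"), (3, "y")])

left_is_subclass = orders.join(plain)  # type checkers infer Orders
left_is_plain = plain.join(orders)     # type checkers infer DataFrame
print(type(left_is_subclass).__name__, type(left_is_plain).__name__)
```

Had other been annotated with DataFrameType as well, a type checker would try to resolve both arguments to one common type, which is not the intended "left object wins" behaviour.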

Most documentation for the mypy library and the typing stdlib module uses really short variable names for TypeVar annotations. See here for examples. I, on the other hand, have usually opted for longer names in my code, but we might consider using shorter ones here as well. I guess DataFrameType could simply be renamed to DF, and LazyFrameType likewise to LDF, or something like that. I will leave that decision up to you and implement the necessary changes if required ☺️

Sorry for the wall of text; thanks for taking the time to review this PR 🙇

@github-actions github-actions bot added the python Related to Python Polars label Mar 8, 2022
@ritchie46
Member

Thanks for the excellent write up! That really makes it easier for me to understand the rationale of the changes. 👍 And I learn something new about typing. :)

Most documentation for the mypy library and the typing stdlib module use really short variable names for TypeVar annotations. See here for examples. I, on the other hand, have usually opted for longer names in my code, but we might consider using shorter ones here as well. I guess DataFrameType could be simply renamed to DF, and likewise for LazyFrameType to LDF, or something like that. I will leave that decision up to you, and possibly implement the necessary changes if required ☺️

I like the DF and LDF one. Especially since we are going to write Type[DF] a lot, it will keep the function headers more readable I think.

@JakobGM JakobGM force-pushed the preserve-subclasses branch from efaf2b5 to d1b0662 Compare March 9, 2022 07:56
@JakobGM
Contributor Author

JakobGM commented Mar 9, 2022

Thanks for the excellent write up! That really makes it easier for me to understand the rationale of the changes. 👍 And I learn something new about typing. :)

Happy to hear it ☺️

I like the DF and LDF one. Especially since we are going to write Type[DF] a lot, it will keep the function headers more readable I think.

Makes sense! I have renamed these two type variables in a40fce5 now ✅

PS: Fixed some import orderings in order to make isort happy 😅

@JakobGM
Contributor Author

JakobGM commented Mar 9, 2022

I just got make pre-commit to work on my machine, so I'm following up on any issues there now. I will post here again when the checks pass.

@JakobGM JakobGM force-pushed the preserve-subclasses branch from a40fce5 to aed0133 Compare March 9, 2022 08:16
JakobGM added 3 commits March 9, 2022 09:19
All methods on polars.DataFrame which return new DataFrame objects
now preserve the types of self in the case of subclasses of
DataFrame.
All methods on polars.LazyFrame which return new LazyFrame objects
now preserve the types of self in the case of subclasses of
LazyFrame.
@JakobGM JakobGM force-pushed the preserve-subclasses branch from aed0133 to 5544eb2 Compare March 9, 2022 08:19
@JakobGM
Contributor Author

JakobGM commented Mar 9, 2022

@ritchie46 I think make pre-commit should pass now, so perhaps you can try to restart the GitHub actions?

@ritchie46
Member

Thanks a lot @JakobGM. Excellent PR.

@ritchie46 ritchie46 merged commit 3c1501a into pola-rs:master Mar 9, 2022
@JakobGM
Contributor Author

JakobGM commented Mar 9, 2022

Thanks a lot @JakobGM. Excellent PR.

Likewise; thanks for the review! I will get going with preserving the types of DataFrame.lazy().collect() then 🤓
