-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve types of DataFrame subclasses #2859
Conversation
Thanks for the excellent write up! That really makes it easier for me to understand the rationale of the changes. 👍 And I learn something new about typing. :)
I like the |
1 similar comment
Thanks for the excellent write up! That really makes it easier for me to understand the rationale of the changes. 👍 And I learn something new about typing. :)
I like the |
efaf2b5
to
d1b0662
Compare
Happy to hear it
Makes sense! I have renamed these two type variables in a40fce5 now ✅ PS: Fixed some import orderings in order to make |
I just got |
a40fce5
to
aed0133
Compare
All methods on polars.DataFrame which return new DataFrame objects now preserve the types of self in the case of subclasses of DataFrame.
All methods on polars.LazyFrame which return new LazyFrame objects now preserve the types of self in the case of subclasses of LazyFrame.
aed0133
to
5544eb2
Compare
@ritchie46 I think |
Thanks a lot @JakobGM. Excellent PR. |
Likewise; thanks for the review! I will get going with preserving the types of |
Preserve types of DataFrame subclasses
TL;DR: This PR introduces the necessary changes such that all methods of
DataFrame
andLazyFrame
which return a newDataFrame
orLazyFrame
objects, respectively, preserve the type when these classes have been inherited from.This PR implements solutions for tasks 1-3 as outlined in #2846. Preservation of data types after roundtrips from☺️
DataFrame
toGroupBy
/LazyFrame
and back still remains to be done as well. I have a solution in mind for task 4 and 5 as outlined in the issue, but I will introduce that one in a separate PR in order to make the review process a bit simplerHere is an outline of what has been done in this PR:
Replacing the use of
wrap_df
A lot of methods on
polars.DataFrame
that return newDataFrame
objects follow the following pattern:PyDataFrame
method onself._df
.DataFrame
object by wrapping the resultingPyDataFrame
withwrap_df
.The problem is how
wrap_df
knows nothing about the originalDataFrame
type and must therefore use theDataFrame._from_df
constructor. My solution is to replace the following call-stack:self.X()
->wrap_df()
->DataFrame._from_pydf()
With the following call-stack:
self.X()
->self._from_pydf()
With other words, I have removed the use of
wrap_df()
altogether. The next step is then to makeDataFrame._from_pydf()
preserve the type ofself
with the following implementation.The same as ☝️ has been done for
LazyFrame
, replacing invocations ofwrap_ldf
withself._from_pyldf()
.Replacing hard-coded return type annotations with dynamic ones
Take the following type annotation:
This type annotation becomes wrong for subclasses of
DataFrame
since the return type is hard-coded toDataFrame
. Until PEP 673 -- Self Type is usable, the solution is to define a type variable:This type,
DataFrameType
, referencesDataFrame
, but also any sub-type of ofDataFrame
. We can now type annotateDataFrame.join()
in the following way:This annotation says basically the following:
self
.other
parameter can have any type or sub-type ofDataFrame
, but not necessarily the same type asself
.This allows users to join sub-classes with regular
polars.DataFrame
objects, mixing and matching as desired. The return type will be the same as the "left object", i.e.x.join(y)
will yield the type ofx
, noty
. That is whyDataFrame
is used to annotateother
, notDataFrameType
.Most documentation for the☺️
mypy
library and thetyping
stdlib module use really short variable names forTypeVar
annotations. See here for examples. I, on the other hand, have usually opted for longer names in my code, but we might consider using shorter ones here as well. I guessDataFrameType
could be simply renamed toDF
, and likewise forLazyFrameType
toLDF
, or something like that. I will leave that decision up to you, and possibly implement the necessary changes if requiredSorry for the wall of text; thanks for taking the time to review this PR 🙇