Running queries on a dataframe with a lot of columns is exponentially slower for columns near the end of the dataframe. #2755
Comments
I think we must refactor this. See: #2788
The speed is already way better than before, but it is still way slower than `df[list_of_columns]`.
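For reference, the two spellings being compared (both yield the same frame; the column names and sizes below are illustrative, not the original benchmark):

```python
import polars as pl

# Hypothetical wide frame; the real report used ~100k columns.
df = pl.DataFrame({f"col{i}": [0] for i in range(10_000)})
cols = df.columns[:1_000]

df1 = df[cols]                                # direct column indexing
df2 = df.select([pl.col(c) for c in cols])    # expression-based select
assert df1.frame_equal(df2)                   # same result, different cost
```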
Thanks, I have found another place where we still do an O(n) search. I will follow up on this. |
It also seems that `schema` is slow.
You called that on a …? For a `DataFrame`, the `schema` property is:

```python
@property
def schema(self) -> Dict[str, Type[DataType]]:
    return {c: self[c].dtype for c in self.columns}
```

And now that I look at it, the … So we can improve that as well.
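The comprehension above does one `self[c]` lookup per column; if that lookup is itself a linear scan over the columns, building the schema is quadratic in the column count. A minimal sketch of the difference (the function names here are illustrative, not the actual polars internals):

```python
from typing import Dict

def schema_quadratic(df) -> Dict[str, object]:
    # n name lookups, each potentially O(n) -> O(n^2) overall
    return {c: df[c].dtype for c in df.columns}

def schema_linear(df) -> Dict[str, object]:
    # one pass over the materialized Series -> O(n) overall
    return {s.name: s.dtype for s in df.get_columns()}
```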
I called it on `DataFrame`.
#2795 does not seem to make a big difference for the simple select query:

```python
In [11]: %time df2 = df.select([pl.col(x) for x in df.columns[:1000]])
CPU times: user 541 ms, sys: 175 ms, total: 716 ms
Wall time: 715 ms

In [12]: %time df2 = df.select([pl.col(x) for x in df.columns[:10000]])
CPU times: user 622 ms, sys: 1.22 s, total: 1.84 s
Wall time: 1.84 s

In [13]: %time df2 = df.select([pl.col(x) for x in df.columns[:100000]])
CPU times: user 1.03 s, sys: 11.4 s, total: 12.4 s
Wall time: 12.4 s

In [14]: %time df2 = df.select([pl.col(x) for x in df.columns[-100000:]])
CPU times: user 1.04 s, sys: 11.3 s, total: 12.4 s
Wall time: 12.4 s

In [8]: %time df2 = df.select([pl.col(x) for x in df.columns])
CPU times: user 7.25 s, sys: 2min 4s, total: 2min 11s
Wall time: 2min 12s
```
That is strange. Then it is time for a flamegraph, I think. :/
I am compiling polars now with the best optimisations to see if it makes a difference.
Could you do a lazy `select` as well?
With `.lazy()`:

```python
In [4]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns[-10000:]]).collect()
CPU times: user 814 ms, sys: 261 ms, total: 1.08 s
Wall time: 1.03 s

In [5]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns[-100000:]]).collect()
CPU times: user 1.51 s, sys: 970 ms, total: 2.48 s
Wall time: 2.29 s

In [6]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns]).collect()
CPU times: user 12.5 s, sys: 12 s, total: 24.5 s
Wall time: 21.9 s
```
Yes, that will always be faster; we do a lot less work there. I think we must now ensure that querying columns at the end of the `DataFrame` is as fast as querying at the start.
With the latest code, querying the first X columns and querying the last X columns now take the same time.
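A quick way to check that claim (a minimal sketch; the frame construction and sizes are illustrative, not the original benchmark):

```python
import time
import polars as pl

# Hypothetical wide frame standing in for the original ~100k-column case.
df = pl.DataFrame({f"col{i}": [0] for i in range(50_000)})

for label, cols in [("first 1000", df.columns[:1000]),
                    ("last 1000", df.columns[-1000:])]:
    t0 = time.perf_counter()
    df.select([pl.col(c) for c in cols])
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
```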
What language are you using?
Python.
What version of polars are you using?
0.13.5
What operating system are you using polars on?
CentOS 7
What language version are you using?
Python 3.10.2
Describe your bug.
Selecting columns, or running queries on them via expressions, gets exponentially slower for columns further into the dataframe when there are a lot of columns.
What are the steps to reproduce the behavior?
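The original reproduction snippet did not survive extraction; a minimal sketch that exhibits the reported pattern (names and sizes are illustrative) could be:

```python
import polars as pl

# Build a frame with many columns, then select at increasing offsets.
n = 100_000
df = pl.DataFrame({f"col{i}": [0] for i in range(n)})

# On polars 0.13.5, the further toward the end of the frame the
# selected columns sit, the slower the select becomes.
df.select([pl.col(c) for c in df.columns[:1000]])    # fast
df.select([pl.col(c) for c in df.columns[-1000:]])   # much slower pre-fix
```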
What is the solution?
The location of the columns in the dataframe shouldn't matter.
When getting columns via expressions, the same check as in #1028 should be implemented, so that when there are a lot of expressions and a lot of columns, a hashmap is used to look up the correct columns.
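To illustrate the proposed change, a toy model of the asymptotics (plain Python, not the actual polars internals):

```python
# With a plain list, every name lookup scans the columns:
# O(n) per expression, O(n * k) for k expressions.
columns = [f"col{i}" for i in range(100_000)]

def find_linear(name: str) -> int:
    return columns.index(name)          # scans from the front each time

# Building a name -> index map once makes each lookup O(1),
# so k lookups cost O(n + k) instead of O(n * k).
name_to_idx = {name: i for i, name in enumerate(columns)}

def find_hashed(name: str) -> int:
    return name_to_idx[name]
```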
As seen in the timings above, the speed difference is big (both approaches produce the same dataframe).