Running queries on a dataframe with a lot of columns is exponentially slower for columns near the end of the dataframe. #2755
Comments
I think we must refactor this. See: #2788
The speed is already way better than before, but it is still way slower than `df[list_of_columns]`.
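For reference, the two spellings being compared (both yield the same frame; the column names and sizes below are illustrative, not the original benchmark):

```python
import polars as pl

# Hypothetical wide frame; the real report used ~100k columns.
df = pl.DataFrame({f"col{i}": [0] for i in range(10_000)})
cols = df.columns[:1_000]

df1 = df[cols]                                # direct column indexing
df2 = df.select([pl.col(c) for c in cols])    # expression-based select
assert df1.frame_equal(df2)                   # same result, different cost
```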
Thanks, I have found another place where we still do an O(n) search. I will follow up on this. |
It also seems that `schema` is slow.
You called that on a …? For a `DataFrame`, the `schema` property is:

```python
@property
def schema(self) -> Dict[str, Type[DataType]]:
    return {c: self[c].dtype for c in self.columns}
```

And now that I look at it, the … So we can improve that as well.
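The comprehension above does one `self[c]` lookup per column; if that lookup is itself a linear scan over the columns, building the schema is quadratic in the column count. A minimal sketch of the difference (the function names here are illustrative, not the actual polars internals):

```python
from typing import Dict

def schema_quadratic(df) -> Dict[str, object]:
    # n name lookups, each potentially O(n) -> O(n^2) overall
    return {c: df[c].dtype for c in df.columns}

def schema_linear(df) -> Dict[str, object]:
    # one pass over the materialized Series -> O(n) overall
    return {s.name: s.dtype for s in df.get_columns()}
```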
I called it on `DataFrame`.
#2795 does not seem to make a big difference for the simple select query:

```python
In [11]: %time df2 = df.select([pl.col(x) for x in df.columns[:1000]])
CPU times: user 541 ms, sys: 175 ms, total: 716 ms
Wall time: 715 ms

In [12]: %time df2 = df.select([pl.col(x) for x in df.columns[:10000]])
CPU times: user 622 ms, sys: 1.22 s, total: 1.84 s
Wall time: 1.84 s

In [13]: %time df2 = df.select([pl.col(x) for x in df.columns[:100000]])
CPU times: user 1.03 s, sys: 11.4 s, total: 12.4 s
Wall time: 12.4 s

In [14]: %time df2 = df.select([pl.col(x) for x in df.columns[-100000:]])
CPU times: user 1.04 s, sys: 11.3 s, total: 12.4 s
Wall time: 12.4 s

In [8]: %time df2 = df.select([pl.col(x) for x in df.columns])
CPU times: user 7.25 s, sys: 2min 4s, total: 2min 11s
Wall time: 2min 12s
```
That is strange. Then it is time for a flamegraph, I think. :/
I am compiling polars now with the best optimisations to see if it makes a difference.
Could you do a lazy `select` as well?
With `.lazy()`:

```python
In [4]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns[-10000:]]).collect()
CPU times: user 814 ms, sys: 261 ms, total: 1.08 s
Wall time: 1.03 s

In [5]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns[-100000:]]).collect()
CPU times: user 1.51 s, sys: 970 ms, total: 2.48 s
Wall time: 2.29 s

In [6]: %time df2 = df.lazy().select([pl.col(x) for x in df.columns]).collect()
CPU times: user 12.5 s, sys: 12 s, total: 24.5 s
Wall time: 21.9 s
```
Yes, that will always be faster; we do a lot less work there. I think we must now ensure that querying columns at the end of the `DataFrame` is as fast as querying at the start.
With the latest code, querying the first X columns and querying the last X columns now take the same time.
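A quick way to check that claim (a minimal sketch; the frame construction and sizes are illustrative, not the original benchmark):

```python
import time
import polars as pl

# Hypothetical wide frame standing in for the original ~100k-column case.
df = pl.DataFrame({f"col{i}": [0] for i in range(50_000)})

for label, cols in [("first 1000", df.columns[:1000]),
                    ("last 1000", df.columns[-1000:])]:
    t0 = time.perf_counter()
    df.select([pl.col(c) for c in cols])
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
```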
What language are you using?
Python.
What version of polars are you using?
0.13.5
What operating system are you using polars on?
CentOS 7
What language version are you using?
Python 3.10.2
Describe your bug.
Selecting columns, or running queries on them via expressions, gets exponentially slower for columns further into the dataframe when there are a lot of columns.
What are the steps to reproduce the behavior?
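The original reproduction snippet did not survive extraction; a minimal sketch that exhibits the reported pattern (names and sizes are illustrative) could be:

```python
import polars as pl

# Build a frame with many columns, then select at increasing offsets.
n = 100_000
df = pl.DataFrame({f"col{i}": [0] for i in range(n)})

# On polars 0.13.5, the further toward the end of the frame the
# selected columns sit, the slower the select becomes.
df.select([pl.col(c) for c in df.columns[:1000]])    # fast
df.select([pl.col(c) for c in df.columns[-1000:]])   # much slower pre-fix
```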
What is the solution?
The location of the columns in the dataframe shouldn't matter.
When getting columns via expressions, the same check as in #1028 should be implemented, so that when there are a lot of expressions and a lot of columns, a hashmap is used to look up the correct columns.
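To illustrate the proposed change, a toy model of the asymptotics (plain Python, not the actual polars internals):

```python
# With a plain list, every name lookup scans the columns:
# O(n) per expression, O(n * k) for k expressions.
columns = [f"col{i}" for i in range(100_000)]

def find_linear(name: str) -> int:
    return columns.index(name)          # scans from the front each time

# Building a name -> index map once makes each lookup O(1),
# so k lookups cost O(n + k) instead of O(n * k).
name_to_idx = {name: i for i, name in enumerate(columns)}

def find_hashed(name: str) -> int:
    return name_to_idx[name]
```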
As seen in the timings above, the speed difference is big (both approaches produce the same dataframe).