Selecting thousands/2M of columns is slow #1023
What are your timings if you run

That should not be thrown on the threadpool.
# For 20000 columns:
In [8]: %time dfr = df[list(region_ids)]
CPU times: user 5min 14s, sys: 7.56 s, total: 5min 22s
Wall time: 5min 21s

When giving a numpy array of column names to df (df[np_array]), polars returns None.
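The fall-through to None can be illustrated with a pure-Python sketch (this is not the actual polars dispatch, just a hypothetical stand-in): a `__getitem__`-style dispatcher that only recognizes Python lists of strings silently returns None when handed a numpy array of names, and converting the array to a list of plain `str` first sidesteps the mismatch.

```python
import numpy as np

def select_columns(frame, key):
    # frame is a plain dict of name -> column here, standing in for a DataFrame
    if isinstance(key, list) and all(isinstance(k, str) for k in key):
        return {k: frame[k] for k in key}
    return None  # unrecognized key type falls through

frame = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
names = np.array(["a", "c"])

print(select_columns(frame, names))                    # numpy array falls through: None
print(select_columns(frame, [str(n) for n in names]))  # list of str works
```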
Oh, it hits the same branch: polars/py-polars/polars/eager/frame.py, line 915 in 305df5d. I will have a look.
Also, at the beginning of the function there is a:
So I assume it never hits the first branch (polars/py-polars/polars/eager/frame.py, lines 921 to 942 in 305df5d) but hits the Series branch (which doesn't support strings, as far as I can tell).
Uf, I am pleasantly surprised that the IPC reader can chew through 2M columns. We would certainly benefit from projection pushdown, though.
I am surprised it is able to load the file relatively fast, but that the later subsetting is so slow (I assume due to Python function call overhead). Also, using a polars Series with
Now that I think of it: you have random permutations, and every named lookup is a linear search through the column names. Given that there are 2 million columns, this gets slow. Maybe we should hash above a certain threshold. Another issue is that every column name lookup goes through 3 layers of indirection. So yeah, that's slow.
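The cost difference described above can be sketched in plain Python (not polars internals): resolving each name by scanning the column list is O(k * n) for k selected columns out of n, while building a name-to-index hash map once makes the whole selection O(n + k).

```python
def select_linear(columns, wanted):
    # every lookup is a linear scan through the column names: O(k * n)
    return [columns.index(name) for name in wanted]

def select_hashed(columns, wanted):
    # one pass builds the hash map, then each lookup is O(1): O(n + k)
    index = {name: i for i, name in enumerate(columns)}
    return [index[name] for name in wanted]

columns = [f"col_{i}" for i in range(100_000)]
wanted = columns[::1000]  # every 1000th name, in order

# both strategies resolve the same indices; only the cost differs
assert select_linear(columns, wanted) == select_hashed(columns, wanted)
```

Hashing only above a width threshold, as suggested, avoids paying the O(n) map-construction cost for narrow frames where a few linear scans are cheaper.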
Yes, that would be a valuable addition.
A polars DataFrame does not allow that, so it would already have errored.
I also managed to trigger a panic when trying to select columns via numerical indices:
Could you create an issue with an example? I will take a look. I will also add a hashing algorithm to the column selection code.
Hashing for column selection improved performance enormously (arrow2 branch):

In [12]: region_ids = np.random.permutation(df.columns)[0:2000]
In [13]: %time dfr = df[list(region_ids)]
CPU times: user 384 ms, sys: 61.1 ms, total: 445 ms
Wall time: 444 ms
In [14]: dfr.shape
Out[14]: (24453, 2000)
In [15]: region_ids = np.random.permutation(df.columns)[0:20000]
In [16]: %time dfr = df[list(region_ids)]
CPU times: user 346 ms, sys: 157 ms, total: 503 ms
Wall time: 528 ms
In [17]: dfr.shape
Out[17]: (24453, 20000)
In [18]: region_ids = np.random.permutation(df.columns)[0:200000]
In [19]: %time dfr = df[list(region_ids)]
CPU times: user 575 ms, sys: 116 ms, total: 691 ms
Wall time: 708 ms
In [20]: dfr.shape
Out[20]: (24453, 200000)
@jorgecarleitao I will try to test the Rust IPC reader now. I just saw that in this case it was actually using pyarrow's IPC reader (this feather file was in Feather v1 format, not v2 format (IPC on disk)). pyarrow had problems with reading this file in Feather v2 format in the past (it could write it) due to Flatbuffer verification problems, as it could only handle 1_000_000 columns (500 000 real data columns): https://issues.apache.org/jira/projects/ARROW/issues/ARROW-10344

Is there a way to get all column names from an IPC (Feather v2) file without reading the whole Feather file? In pyarrow this is possible with the dataset API (at least for Feather v2 files):

feather_v2_dataset = ds.dataset(feather_file, format="feather")
column_names = feather_v2_dataset.schema.names
@jorgecarleitao In (py)arrow, it was solved with this commit: apache/arrow#9447

In [33]: %time df2 = pl.read_ipc('test.v2.feather', use_pyarrow=False)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<timed exec> in <module>
/software/polars/py-polars/polars/io.py in read_ipc(file, use_pyarrow, storage_options)
415 tbl = pa.feather.read_table(data)
416 return pl.DataFrame.from_arrow(tbl)
--> 417 return pl.DataFrame.read_ipc(data)
418
419
/software/polars/py-polars/polars/eager/frame.py in read_ipc(file)
606 """
607 self = DataFrame.__new__(DataFrame)
--> 608 self._df = PyDataFrame.read_ipc(file)
609 return self
610
RuntimeError: Any(ArrowError(Ipc("Unable to get root as footer: TooManyTables")))
@jorgecarleitao https://docs.rs/flatbuffers/2.0.0/src/flatbuffers/get_root.rs.html#39-49
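A pure-Python sketch of the failure mode: flatbuffers verifies a buffer before handing out the root table and aborts once it has counted more tables than a fixed cap (1_000_000 in the linked Rust crate's defaults, stated here as an assumption). A 2M-column IPC footer overflows that cap, producing the TooManyTables error above, so the fix is to verify the footer with a larger limit.

```python
DEFAULT_MAX_TABLES = 1_000_000  # assumed default cap, as in the linked verifier

class TooManyTables(Exception):
    pass

def verify_table_count(num_tables, max_tables=DEFAULT_MAX_TABLES):
    # stand-in for the verifier's table counter
    if num_tables > max_tables:
        raise TooManyTables(f"{num_tables} tables > cap of {max_tables}")
    return True

verify_table_count(500_000)                          # within the default cap
# verify_table_count(2_000_000)                      # would raise TooManyTables
verify_table_count(2_000_000, max_tables=4_000_000)  # raised cap, as in the fix
```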
I will close this because the issue was resolved, not to stop the discussion. Please carry on. :)
@jorgecarleitao I implemented it in the same way as the arrow C++ implementation solved it: jorgecarleitao/arrow2#240
@ritchie46 For reading Feather v2 files with compression
@ritchie46 My patch for reading Feather v2 files with arrow2 was merged. Can you add
Nice! Could you make a PR for that? |
Yes. PR in preparation. |
See: #1096 (failing tests due
Are you using Python or Rust?
Python
What version of polars are you using?
0.8.12 (arrow2 branch)
What operating system are you using polars on?
CentOS 7
Describe your bug.
Selecting a huge number of columns from an existing dataframe seems to take more time than it should: