Selecting thousands/2M of columns is slow #1023
What are your timings if you run

That should not be thrown on the threadpool.
# For 20000 columns:
In [8]: %time dfr = df[list(region_ids)]
CPU times: user 5min 14s, sys: 7.56 s, total: 5min 22s
Wall time: 5min 21s

When giving a numpy array of column names to df (df[np_array]), polars returns None.
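The fall-through to None can be illustrated with a pure-Python sketch (this is not the actual polars dispatch, just a hypothetical stand-in): a `__getitem__`-style dispatcher that only recognizes Python lists of strings silently returns None when handed a numpy array of names, and converting the array to a list of plain `str` first sidesteps the mismatch.

```python
import numpy as np

def select_columns(frame, key):
    # frame is a plain dict of name -> column here, standing in for a DataFrame
    if isinstance(key, list) and all(isinstance(k, str) for k in key):
        return {k: frame[k] for k in key}
    return None  # unrecognized key type falls through

frame = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
names = np.array(["a", "c"])

print(select_columns(frame, names))                    # numpy array falls through: None
print(select_columns(frame, [str(n) for n in names]))  # list of str works
```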
Oh, it hits the same branch: polars/py-polars/polars/eager/frame.py, line 915 in 305df5d. I will have a look.
Also, at the beginning of the function there is a:
So I assume it never hits the first branch (polars/py-polars/polars/eager/frame.py, lines 921 to 942 in 305df5d) but hits the Series branch (which doesn't support strings, as far as I can tell).
Uf, I am pleasantly surprised that the IPC reader can chew through 2M columns. We would certainly benefit from projection pushdown, though.
I am surprised it is able to load the file relatively fast, but that the later subsetting is so slow (I assume due to Python function call overhead). Also, using a polars Series with
Now that I think of it: you have random permutations, and every named lookup is a linear search through the column names. Given that there are 2 million columns, this gets slow. Maybe we should hash above a certain threshold. Another issue is that every column name lookup goes through 3 layers of indirection. So yeah, that's slow.
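The cost difference described above can be sketched in plain Python (not polars internals): resolving each name by scanning the column list is O(k * n) for k selected columns out of n, while building a name-to-index hash map once makes the whole selection O(n + k).

```python
def select_linear(columns, wanted):
    # every lookup is a linear scan through the column names: O(k * n)
    return [columns.index(name) for name in wanted]

def select_hashed(columns, wanted):
    # one pass builds the hash map, then each lookup is O(1): O(n + k)
    index = {name: i for i, name in enumerate(columns)}
    return [index[name] for name in wanted]

columns = [f"col_{i}" for i in range(100_000)]
wanted = columns[::1000]  # every 1000th name, in order

# both strategies resolve the same indices; only the cost differs
assert select_linear(columns, wanted) == select_hashed(columns, wanted)
```

Hashing only above a width threshold, as suggested, avoids paying the O(n) map-construction cost for narrow frames where a few linear scans are cheaper.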
Yes, that would be a valuable addition.
A polars DataFrame does not allow that, so it would already have errored.
I also managed to trigger a panic when trying to select columns via numerical indices:
Could you create an issue with an example? I will take a look. I will also add a hashing algorithm to the column selection code.
Hashing for column selection improved performance enormously (arrow2 branch):

In [12]: region_ids = np.random.permutation(df.columns)[0:2000]
In [13]: %time dfr = df[list(region_ids)]
CPU times: user 384 ms, sys: 61.1 ms, total: 445 ms
Wall time: 444 ms
In [14]: dfr.shape
Out[14]: (24453, 2000)
In [15]: region_ids = np.random.permutation(df.columns)[0:20000]
In [16]: %time dfr = df[list(region_ids)]
CPU times: user 346 ms, sys: 157 ms, total: 503 ms
Wall time: 528 ms
In [17]: dfr.shape
Out[17]: (24453, 20000)
In [18]: region_ids = np.random.permutation(df.columns)[0:200000]
In [19]: %time dfr = df[list(region_ids)]
CPU times: user 575 ms, sys: 116 ms, total: 691 ms
Wall time: 708 ms
In [20]: dfr.shape
Out[20]: (24453, 200000)
@jorgecarleitao I will try to test the Rust IPC reader now. I just saw that in this case it was actually using pyarrow's IPC reader (this feather file was in Feather v1 format, not v2 format (IPC on disk)). pyarrow had problems with reading this file in Feather v2 format in the past (it could write it) due to Flatbuffer verification problems, as it could only handle 1_000_000 columns (500 000 real data columns): https://issues.apache.org/jira/projects/ARROW/issues/ARROW-10344

Is there a way to get all column names from an IPC (Feather v2) file without reading the whole Feather file? In pyarrow this is possible with the dataset API (at least for Feather v2 files):

feather_v2_dataset = ds.dataset(feather_file, format="feather")
column_names = feather_v2_dataset.schema.names
@jorgecarleitao In (py)arrow, it was solved with this commit: apache/arrow#9447

In [33]: %time df2 = pl.read_ipc('test.v2.feather', use_pyarrow=False)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<timed exec> in <module>
/software/polars/py-polars/polars/io.py in read_ipc(file, use_pyarrow, storage_options)
415 tbl = pa.feather.read_table(data)
416 return pl.DataFrame.from_arrow(tbl)
--> 417 return pl.DataFrame.read_ipc(data)
418
419
/software/polars/py-polars/polars/eager/frame.py in read_ipc(file)
606 """
607 self = DataFrame.__new__(DataFrame)
--> 608 self._df = PyDataFrame.read_ipc(file)
609 return self
610
RuntimeError: Any(ArrowError(Ipc("Unable to get root as footer: TooManyTables")))
@jorgecarleitao https://docs.rs/flatbuffers/2.0.0/src/flatbuffers/get_root.rs.html#39-49
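A pure-Python sketch of the failure mode: flatbuffers verifies a buffer before handing out the root table and aborts once it has counted more tables than a fixed cap (1_000_000 in the linked Rust crate's defaults, stated here as an assumption). A 2M-column IPC footer overflows that cap, producing the TooManyTables error above, so the fix is to verify the footer with a larger limit.

```python
DEFAULT_MAX_TABLES = 1_000_000  # assumed default cap, as in the linked verifier

class TooManyTables(Exception):
    pass

def verify_table_count(num_tables, max_tables=DEFAULT_MAX_TABLES):
    # stand-in for the verifier's table counter
    if num_tables > max_tables:
        raise TooManyTables(f"{num_tables} tables > cap of {max_tables}")
    return True

verify_table_count(500_000)                          # within the default cap
# verify_table_count(2_000_000)                      # would raise TooManyTables
verify_table_count(2_000_000, max_tables=4_000_000)  # raised cap, as in the fix
```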
I will close this because the issue was resolved, not to stop the discussion. Please carry on. :)
@jorgecarleitao I implemented it in the same way as the arrow C++ implementation solved it: jorgecarleitao/arrow2#240
@ritchie46 For reading Feather v2 files with compression
@ritchie46 My patch for reading Feather v2 files with arrow2 was merged. Can you add
Nice! Could you make a PR for that? |
Yes. PR in preparation. |
See: #1096 (failing tests due
Are you using Python or Rust?
Python
What version of polars are you using?
0.8.12 (arrow2 branch)
What operating system are you using polars on?
CentOS 7
Describe your bug.
Selecting a huge number of columns from an existing dataframe seems to take more time than it should: