Speedup df.schema. #2792

Merged
merged 1 commit into from
Feb 28, 2022

ghuls
Collaborator

@ghuls ghuls commented Feb 28, 2022

No description provided.

@github-actions github-actions bot added the python Related to Python Polars label Feb 28, 2022
@ghuls
Collaborator Author

ghuls commented Feb 28, 2022

This fixes the very slow df.schema on a DataFrame with 1M columns (it now runs in about 1 second). See #2755.

@ritchie46
Member

Yes, much better. :)

@alexander-beedie
Collaborator

alexander-beedie commented Feb 28, 2022

FYI: this is even faster still (~25% or so), as it avoids temporary assignment into "c" and "dtype" ;)

return dict(zip(self.columns, self.dtypes))

(The difference between the two calls is quite notable if you look at the output after disassembly via dis)
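
A minimal sketch of that comparison, for anyone who wants to reproduce it. It uses plain Python lists as stand-ins for `self.columns` and `self.dtypes`, and the function names `schema_comprehension` and `schema_dict_zip` are made up for illustration; the comprehension form is only a guess at what the first revision of this PR looked like. Exact bytecode and timings will vary by Python version.

```python
import dis
import timeit

# Hypothetical stand-ins for self.columns / self.dtypes (not real Polars objects).
columns = [f"col_{i}" for i in range(10_000)]
dtypes = [int] * len(columns)


def schema_comprehension(columns, dtypes):
    # Unpacks every pair into the temporaries c and dtype before storing it.
    return {c: dtype for c, dtype in zip(columns, dtypes)}


def schema_dict_zip(columns, dtypes):
    # Hands the zip iterator straight to the dict constructor.
    return dict(zip(columns, dtypes))


# Both forms build the same mapping.
assert schema_comprehension(columns, dtypes) == schema_dict_zip(columns, dtypes)

# Inspect the bytecode of each form.
dis.dis(schema_comprehension)
dis.dis(schema_dict_zip)

# Rough timing comparison; absolute numbers depend on machine and interpreter.
t_comp = timeit.timeit(lambda: schema_comprehension(columns, dtypes), number=20)
t_zip = timeit.timeit(lambda: schema_dict_zip(columns, dtypes), number=20)
print(f"comprehension: {t_comp:.4f} s, dict(zip): {t_zip:.4f} s")
```

The `dict(zip(...))` form leaves the per-item loop to the C implementation of the dict constructor, whereas the comprehension executes unpack and store instructions in the interpreter for every pair.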

@ghuls ghuls force-pushed the speedup_df_schema branch from b13d799 to d2493d3 Compare February 28, 2022 09:39
@ghuls
Collaborator Author

ghuls commented Feb 28, 2022

In [24]: %time columns = dfs.columns
CPU times: user 0 ns, sys: 170 ms, total: 170 ms
Wall time: 169 ms

In [25]: %time dtypes = dfs.dtypes
CPU times: user 257 ms, sys: 451 ms, total: 709 ms
Wall time: 705 ms

In [26]: %time schema = dfs.schema
CPU times: user 905 ms, sys: 159 ms, total: 1.06 s
Wall time: 1.06 s

In [28]: %time s1 = { c: d for c, d in  zip(columns, dtypes) }
CPU times: user 245 ms, sys: 11.2 ms, total: 256 ms
Wall time: 254 ms

In [29]: %time s2 = dict(zip(columns, dtypes))
CPU times: user 174 ms, sys: 1.21 ms, total: 175 ms
Wall time: 174 ms

Getting dtypes is relatively slow compared with just getting column names.

@ritchie46
Member

This is a good improvement for now. We can improve it further by creating the schema on the Rust bindings side. Then we don't have to allocate two lists/vecs and can create a dict directly.

@ritchie46 ritchie46 merged commit 2d56037 into pola-rs:master Feb 28, 2022