python: prefer pyarrow when we can memory map the file #4182

Merged
merged 1 commit into from
Jul 29, 2022

Conversation

ritchie46
Member

@ritchie46 ritchie46 commented Jul 29, 2022

This change prefers the pyarrow reader whenever we can memory map the file. With a memory-mapped file, data is only read from disk when we actually access the columns, and only for the parts that we access.
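To make the mechanism concrete, here is a minimal sketch using only the standard library `mmap` module. It is not the polars or pyarrow code path: the toy file layout (contiguous little-endian int64 blocks) and the helper names `write_columns` and `read_column` are assumptions made up for illustration. The point is that mapping the file reads nothing up front, and accessing one column only touches that column's byte range.

```python
import mmap
import os
import struct
import tempfile


def write_columns(path, columns):
    """Write each column as a contiguous block of little-endian int64 values.

    Hypothetical toy format. Returns {name: (byte_offset, n_values)}.
    """
    layout = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            layout[name] = (f.tell(), len(values))
            f.write(struct.pack(f"<{len(values)}q", *values))
    return layout


def read_column(mm, layout, name):
    """Decode a single column straight from the mapping."""
    offset, n = layout[name]
    # Only the bytes in [offset, offset + 8 * n) are accessed here, so the
    # OS only has to page in that part of the file.
    return list(struct.unpack_from(f"<{n}q", mm, offset))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    layout = write_columns(path, {"a": [1, 2, 3], "b": [10, 20, 30]})

    # Mapping the file does not read its contents yet.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            col_b = read_column(mm, layout, "b")  # column "a" is never touched

print(col_b)
```

Because column `a` is never accessed through the mapping, the process never asks for its bytes; whether the OS reads ahead into neighbouring pages is up to the kernel.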

In lazy mode we can apply projection and predicate pushdown on the memory-mapped columns. Columns that are filtered out never trigger a read call.

Another advantage is that we might be able to work on larger-than-memory data, as the OS can evict pages that are not in use when it needs more RAM.

If the files easily fit in RAM, we can query them straight from the file system. Subsequent queries on the same file may be much faster, as the file, or large parts of it, may still be in RAM.

I have some ideas for being able to do this natively as well. Might follow up on that.

@github-actions github-actions bot added the python Related to Python Polars label Jul 29, 2022
@codecov-commenter

Codecov Report

Merging #4182 (6b0efa7) into master (08f6f73) will increase coverage by 0.01%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master    #4182      +/-   ##
==========================================
+ Coverage   78.75%   78.77%   +0.01%     
==========================================
  Files         458      458              
  Lines       75783    75785       +2     
==========================================
+ Hits        59685    59701      +16     
+ Misses      16098    16084      -14     
| Impacted Files | Coverage Δ | |
|---|---|---|
| py-polars/polars/io.py | 72.82% <50.00%> (+2.49%) | ⬆️ |
| ...ars/polars-core/src/chunked_array/ops/any_value.rs | 77.39% <0.00%> (-0.87%) | ⬇️ |
| ...polars-time/src/chunkedarray/rolling_window/mod.rs | 71.53% <0.00%> (-0.77%) | ⬇️ |
| py-polars/polars/testing.py | 94.08% <0.00%> (-0.54%) | ⬇️ |
| ...lars/polars-core/src/chunked_array/ops/take/mod.rs | 63.12% <0.00%> (-0.34%) | ⬇️ |
| polars/polars-io/src/csv/buffer.rs | 79.39% <0.00%> (-0.26%) | ⬇️ |
| polars/polars-arrow/src/kernels/take.rs | 87.40% <0.00%> (+0.24%) | ⬆️ |
| polars/polars-core/src/vector_hasher.rs | 77.15% <0.00%> (+0.26%) | ⬆️ |
| ...olars/polars-core/src/frame/groupby/into_groups.rs | 60.47% <0.00%> (+0.29%) | ⬆️ |
| ...s/polars-core/src/chunked_array/ops/unique/rank.rs | 96.19% <0.00%> (+0.34%) | ⬆️ |
| ... and 6 more | | |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08f6f73...6b0efa7.
