python: prefer pyarrow when we can memory map the file #4182

ritchie46 · 2022-07-29T18:59:15Z

This change prefers the pyarrow reader when we can memory map the file. An IO read syscall is then only triggered when we actually read the columns, and only for the parts that we read.

In the lazy API we can do projection/predicate pushdown on the memory-mapped columns. Columns that are projected away never trigger a read call.
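To make the mechanism concrete, here is a minimal sketch that uses only the Python standard library. It does not use pyarrow or the Arrow IPC format: the file layout, `write_columns`, and `read_projected` are hypothetical names invented for illustration. It only shows the idea that, with a memory-mapped file, selecting one column touches just the byte range backing that column.

```python
import mmap
import os
import tempfile
from array import array


def write_columns(path, columns):
    """Write each column as contiguous float64 values.

    Toy layout for illustration only (not Arrow IPC).
    Returns {name: (byte_offset, value_count)}.
    """
    layout = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            layout[name] = (f.tell(), len(values))
            f.write(array("d", values).tobytes())
    return layout


def read_projected(path, layout, wanted):
    """Memory map the file and materialize only the requested columns."""
    out = {}
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for name in wanted:
                offset, count = layout[name]
                col = array("d")
                nbytes = count * col.itemsize
                # Slicing the map only touches the pages behind this byte
                # range; columns that are not requested are never accessed.
                col.frombytes(mm[offset:offset + nbytes])
                out[name] = col.tolist()
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    layout = write_columns(
        path, {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]}
    )
    # Project only column "b"; the bytes of column "a" are never read.
    result = read_projected(path, layout, ["b"])
    print(result)
```

In the real change this bookkeeping is handled by pyarrow's reader. The sketch is only meant to show why a projected-away column costs no IO.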

Another advantage is that we might be able to work on larger-than-memory data, as the OS can evict pages that are not in use when it needs more RAM.

If the files easily fit in RAM, we can query directly from the file system. Subsequent queries on a file may be much faster, as the file (or large parts of it) can still be in RAM.

I have some ideas for being able to do this natively as well. Might follow up on that.

codecov-commenter · 2022-07-29T19:55:34Z

Codecov Report

Merging #4182 (6b0efa7) into master (08f6f73) will increase coverage by 0.01%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master    #4182      +/-   ##
==========================================
+ Coverage   78.75%   78.77%   +0.01%     
==========================================
  Files         458      458              
  Lines       75783    75785       +2     
==========================================
+ Hits        59685    59701      +16     
+ Misses      16098    16084      -14

| Impacted Files | Coverage Δ | |
|---|---|---|
| py-polars/polars/io.py | `72.82% <50.00%> (+2.49%)` | ⬆️ |
| ...ars/polars-core/src/chunked_array/ops/any_value.rs | `77.39% <0.00%> (-0.87%)` | ⬇️ |
| ...polars-time/src/chunkedarray/rolling_window/mod.rs | `71.53% <0.00%> (-0.77%)` | ⬇️ |
| py-polars/polars/testing.py | `94.08% <0.00%> (-0.54%)` | ⬇️ |
| ...lars/polars-core/src/chunked_array/ops/take/mod.rs | `63.12% <0.00%> (-0.34%)` | ⬇️ |
| polars/polars-io/src/csv/buffer.rs | `79.39% <0.00%> (-0.26%)` | ⬇️ |
| polars/polars-arrow/src/kernels/take.rs | `87.40% <0.00%> (+0.24%)` | ⬆️ |
| polars/polars-core/src/vector_hasher.rs | `77.15% <0.00%> (+0.26%)` | ⬆️ |
| ...olars/polars-core/src/frame/groupby/into_groups.rs | `60.47% <0.00%> (+0.29%)` | ⬆️ |
| ...s/polars-core/src/chunked_array/ops/unique/rank.rs | `96.19% <0.00%> (+0.34%)` | ⬆️ |
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 08f6f73...6b0efa7.

python: prefer pyarrow when we can memory map the file

6b0efa7

github-actions bot added the python Related to Python Polars label Jul 29, 2022

ritchie46 merged commit 3e665fd into master Jul 29, 2022

ritchie46 deleted the mmap_ipc_read branch July 29, 2022 20:13

cbilot mentioned this pull request Jul 31, 2022

scan_ipc is causing queries on large datasets to fail due to memory usage #3360

Closed
