read_ipc and scan_ipc use more memory than needed. #17369

Open
2 tasks done
useredsa opened this issue Jul 2, 2024 · 2 comments
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments


useredsa commented Jul 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Prepare a big file:

import datetime as dt
import polars as pl

def MockupData(ts, num_rows):
  return pl.select(pl.lit(ts).alias('ts'), pl.arange(0, num_rows, eager=True).alias('val'))

if __name__ == '__main__':
  num_rows = 1 << 27 # For a df of 2 GB
  ts = dt.datetime(2024, 7, 1)
  df = MockupData(ts, num_rows)
  df.write_ipc('my_df', compression='zstd')
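
For reference, a quick size comparison (not part of the original report) makes the overhead easier to interpret. This sketch assumes the 'my_df' file written above and uses os.path.getsize and DataFrame.estimated_size:

import os
import polars as pl

df = pl.read_ipc('my_df', memory_map=False, rechunk=False)
# On-disk size of the (zstd-compressed) IPC file.
print('on disk:   {:.3f} GB'.format(os.path.getsize('my_df') / 1e9))
# In-memory size of the materialized dataframe.
print('in memory: {:.3f} GB'.format(df.estimated_size('gb')))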

Then read the file and measure memory usage

import resource
import polars as pl

if __name__ == '__main__':
  # This does not use extra memory
  # df = pl.read_ipc('my_df', memory_map=False, use_pyarrow=True, rechunk=False)
  df = pl.read_ipc('my_df', memory_map=False, rechunk=False)

  max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6
  print("max_mem: {:.5f} GB".format(max_mem))

Log output

No response

Issue description

The bug is that peak memory usage is well above 2 GB, the size of the loaded data.

If you test different file sizes, you'll see that peak usage is roughly 1.5× the dataframe size on Polars 1.0.0. On a previous version, 0.20.6 (which I no longer run, so I don't know which pyarrow version it used), it was about 2×. This bug is related to #3360, an old issue that was supposedly resolved.

Expected behavior

I would expect the memory usage to be equal to the file size plus some constant term.

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-6.5.1-41-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
useredsa added the bug, needs triage, and python labels on Jul 2, 2024
useredsa changed the title from "read_ipc" to "read_ipc and scan_ipc use more memory than needed." on Jul 2, 2024
ritchie46 (Member) commented

I would expect the memory usage to be equal to the file size plus some constant term.

You are writing compressed data? Why would you expect the memory usage to be equal?

useredsa (Author) commented Jul 3, 2024

You are writing compressed data? Why would you expect the memory usage to be equal?

I'm not expecting it to be exactly equal; some constant overhead is fine. But the overhead here is proportional to the dataframe size. Are we compressing in memory and only writing afterwards? You can compress on the fly.

The implication is that to work with a 64 GB dataframe you need a machine with 128 GB of RAM, which increases costs considerably.

Nevertheless, the problem is there even with uncompressed files, so it may not be related to compression.
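
For anyone who wants to reproduce the uncompressed case, a minimal variation of the script above (same mock-up data, with a hypothetical file name 'my_df_uncompressed') would be:

import datetime as dt
import polars as pl

num_rows = 1 << 27  # Same ~2 GB dataframe as in the reproducible example.
df = pl.select(
  pl.lit(dt.datetime(2024, 7, 1)).alias('ts'),
  pl.arange(0, num_rows, eager=True).alias('val'),
)
# Write the IPC file without compression, then measure read_ipc as before.
df.write_ipc('my_df_uncompressed', compression='uncompressed')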
