read_ipc and scan_ipc use more memory than needed. #17369

Open
2 tasks done
useredsa opened this issue Jul 2, 2024 · 2 comments
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments


useredsa commented Jul 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Prepare a big file:

import datetime as dt
import polars as pl

def MockupData(ts, num_rows):
  return pl.select(pl.lit(ts).alias('ts'), pl.arange(0, num_rows, eager=True).alias('val'))

if __name__ == '__main__':
  num_rows = 1 << 27 # For a df of 2 GB
  ts = dt.datetime(2024, 7, 1)
  df = MockupData(ts, num_rows)
  df.write_ipc('my_df', compression='zstd')
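
For reference, a quick size comparison (not part of the original report) makes the overhead easier to interpret. This sketch assumes the 'my_df' file written above and uses os.path.getsize and DataFrame.estimated_size:

import os
import polars as pl

df = pl.read_ipc('my_df', memory_map=False, rechunk=False)
# On-disk size of the (zstd-compressed) IPC file.
print('on disk:   {:.3f} GB'.format(os.path.getsize('my_df') / 1e9))
# In-memory size of the materialized dataframe.
print('in memory: {:.3f} GB'.format(df.estimated_size('gb')))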

Then read the file and measure memory usage

import resource
import polars as pl

if __name__ == '__main__':
  # This does not use extra memory
  # df = pl.read_ipc('my_df', memory_map=False, use_pyarrow=True, rechunk=False)
  df = pl.read_ipc('my_df', memory_map=False, rechunk=False)

  max_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6
  print("max_mem: {:.5f} GB".format(max_mem))

Log output

No response

Issue description

The bug is that peak memory usage is well above 2 GB, the size of the loaded data.

If you test different file sizes, you'll see that peak usage is roughly 1.5× the dataframe size on Polars 1.0.0. On a previous version, 0.20.6 (which I no longer run, so I don't know which pyarrow version it used), it was about 2×. This bug is related to #3360, an old issue that was supposedly resolved.

Expected behavior

I would expect the memory usage to be equal to the file size plus some constant term.

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-6.5.1-41-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
useredsa added the bug, needs triage, and python labels on Jul 2, 2024
useredsa changed the title from "read_ipc" to "read_ipc and scan_ipc use more memory than needed." on Jul 2, 2024
ritchie46 (Member) commented

I would expect the memory usage to be equal to the file size plus some constant term.

You are writing compressed data? Why would you expect the memory usage to be equal?

useredsa (Author) commented Jul 3, 2024

You are writing compressed data? Why would you expect the memory usage to be equal?

I'm not expecting it to be exactly equal; some constant overhead is fine. But the overhead here is proportional to the dataframe size. Are we compressing in memory and only writing afterwards? You can compress on the fly.

The implication is that to work with a 64 GB dataframe you need a machine with 128 GB of RAM, which increases costs considerably.

Nevertheless, the problem is there even with uncompressed files, so it may not be related to compression.
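
For anyone who wants to reproduce the uncompressed case, a minimal variation of the script above (same mock-up data, with a hypothetical file name 'my_df_uncompressed') would be:

import datetime as dt
import polars as pl

num_rows = 1 << 27  # Same ~2 GB dataframe as in the reproducible example.
df = pl.select(
  pl.lit(dt.datetime(2024, 7, 1)).alias('ts'),
  pl.arange(0, num_rows, eager=True).alias('val'),
)
# Write the IPC file without compression, then measure read_ipc as before.
df.write_ipc('my_df_uncompressed', compression='uncompressed')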
