
scan_ipc is causing queries on large datasets to fail due to memory usage #3360

Closed
cbilot opened this issue May 10, 2022 · 10 comments
Labels: bug (Something isn't working)

@cbilot

cbilot commented May 10, 2022

polars-u64-idx 0.13.31
Python 3.10.4
Linux Mint 20.3

Describe your bug.

Using polars-u64-idx on some very large datasets, I found that many of my queries were failing. In each case, the operating system returned "Killed" because my system ran out of RAM. (I can clearly see this by watching top in Linux.)

On a whim, I re-wrote a failed query in Eager mode, and surprisingly the query ran. I am now finding that many of my failed queries actually run in Eager mode, but not in Lazy mode.

I think I've narrowed this down to scan_ipc.

What are the steps to reproduce the behavior?

Since this bug is related to exhausting the memory on my machine (512 GB) with some very large datasets (5.9 billion records), I cannot easily provide an MWE. But I think I can provide a strong clue as to where these Lazy-mode queries are failing.

Here's one clue. This is a very large dataset that I can successfully read into RAM using read_ipc.

import polars as pl
df = pl.read_ipc('transaction_records.ipc')
df.shape
>>> (5969118070, 6)
df.estimated_size()
>>> 280548549314

One copy of this dataset (roughly 280 GB) fits comfortably in RAM, but there is not enough RAM on my system for two copies. And I think this is the clue.

Now, if I quit and restart my Python interpreter (to release the RAM) and run the following:

import polars as pl
df = pl.scan_ipc("transaction_records.ipc").collect()
>>> df = pl.scan_ipc("transaction_records.ipc").collect()
Killed

In top, I can see the python process exhaust all RAM. It's as though scan_ipc is somehow attempting to create two copies of the dataset in RAM, whereas read_ipc does not.

Some other information that may help

Many of my queries on large datasets follow a pattern like the following. The operating system kills these for running out of RAM.

query = (
    pl.scan_ipc("transaction_records.ipc")
    .join(
        pl.scan_ipc("header_records.ipc")
        .filter(< filter criteria >),
        on="key",
        how="semi",
    )
)

However, this workaround will succeed:

query = (
    pl.read_ipc("transaction_records.ipc").lazy()
    .join(
        pl.scan_ipc("header_records.ipc")
        .filter(< filter criteria >),
        on="key",
        how="semi",
    )
)

Thus, the problem doesn't seem to be related to the rest of the query, only to the scan_ipc method on a very large dataset.

scan_parquet

I am unable to test whether this occurs with scan_parquet because I am not able to create Parquet files (or even Avro files) on large datasets with Polars. The operating system kills any attempt to write large files with write_parquet or write_avro due to out-of-memory issues. (I suspect those methods are creating a copy of the dataset while writing.) I'm guessing this is related to #3120. Thus, IPC files are pretty much the only format available for reading/writing/storing very large datasets.

For what it's worth, nothing I'm doing is mission-critical nor urgent. My goal is solely to work with polars-u64-idx on very large datasets, to test it and see how it performs.

cbilot added the bug label on May 10, 2022
@ritchie46
Member

Maybe we can try to bring the problem down to a small RAM scale? Simply try to make an MWE and measure the peak memory load between eager and lazy?

Then it is something I can try to reproduce and run some tests on.
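
Something along these lines might already do it (a rough sketch, untested; the file name mwe.ipc, the row count, and the helper names are placeholders I made up): write a synthetic IPC file once, then run the eager and lazy branches in separate interpreter processes and compare peak RSS.

import resource
import sys

import numpy as np
import polars as pl

PATH = "mwe.ipc"  # placeholder test file

def write_test_file(n_rows: int = 50_000_000) -> None:
    # Synthetic data; pick n_rows so that one copy fits in RAM but two do not.
    df = pl.DataFrame({
        "key": np.random.randint(0, 1_000_000, n_rows),
        "value": np.random.rand(n_rows),
    })
    df.write_ipc(PATH)

def peak_rss_gib() -> float:
    # ru_maxrss is reported in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

if __name__ == "__main__":
    mode = sys.argv[1]  # "write", "eager" or "lazy"
    if mode == "write":
        write_test_file()
    elif mode == "eager":
        df = pl.read_ipc(PATH)
        print(f"eager peak RSS: {peak_rss_gib():.2f} GiB")
    else:
        df = pl.scan_ipc(PATH).collect()
        print(f"lazy  peak RSS: {peak_rss_gib():.2f} GiB")

Running the "write" mode once, then the "eager" and "lazy" modes in fresh processes, should show whether the lazy path really peaks at roughly twice the eager path.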

@ghuls
Collaborator

ghuls commented May 31, 2022

Does it work with:

query = (
    pl.read_ipc("transaction_records.ipc", use_pyarrow=False).lazy()
    .join(
        pl.scan_ipc("header_records.ipc")
        .filter(< filter criteria >),
        on="key",
        how="semi",
    )
)

or does this also run out of RAM?

By default, read_ipc will use pyarrow (if installed) for reading IPC files, while scan_ipc() or read_ipc(..., use_pyarrow=False) will use the arrow2 reader.
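
Roughly, the two paths look like this from user code (just an illustration of which reader is involved, not the actual polars internals):

import polars as pl
import pyarrow.ipc

# pyarrow-based path, comparable to what read_ipc uses by default when pyarrow is installed
table = pyarrow.ipc.open_file("transaction_records.ipc").read_all()
df_via_pyarrow = pl.from_arrow(table)

# arrow2-based path, used by read_ipc(..., use_pyarrow=False) and by scan_ipc
df_via_arrow2 = pl.read_ipc("transaction_records.ipc", use_pyarrow=False)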

@cbilot
Author

cbilot commented May 31, 2022

Wow, it looks like you've located the problem, @ghuls. The query runs out of RAM with use_pyarrow=False.

Distilling it down, either of these two statements by itself runs out of RAM (outside of any query):

pl.read_ipc("transaction_records.ipc", use_pyarrow=False)

pl.scan_ipc("transaction_records.ipc").collect()

While this statement succeeds:

pl.read_ipc("transaction_records.ipc", use_pyarrow=True)

I do have pyarrow installed. And I just ran the above queries using polars-u64-idx 0.13.40.

@ghuls
Collaborator

ghuls commented Jun 1, 2022

Can you try with rechunk=False to see if the extra copy happens due to rechunking or while reading?

For writing to IPC, arrow2 makes at least an extra copy in RAM:
jorgecarleitao/arrow2#928
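
To be concrete about the rechunk check (a minimal sketch; as far as I understand, rechunk=False keeps one chunk per IPC record batch, and it is the later concatenation into contiguous buffers that needs the extra allocation):

import polars as pl

# Read without rechunking: the DataFrame keeps one chunk per record batch in the file.
df = pl.read_ipc("transaction_records.ipc", use_pyarrow=False, rechunk=False)
print(df.n_chunks())

# Rechunking afterwards concatenates the chunks into contiguous memory,
# which temporarily needs roughly a second copy of the data.
df = df.rechunk()
print(df.n_chunks())  # 1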

@cbilot
Author

cbilot commented Jun 1, 2022

scan_ipc succeeds with rechunk=False

>>> pl.scan_ipc("transaction_records.ipc", rechunk=False).collect()
shape: (5969118070, 6)

read_ipc fails with use_pyarrow=False and rechunk=False

>>> pl.read_ipc("transaction_records.ipc", use_pyarrow=False, rechunk=False)
Killed

@jorgecarleitao
Collaborator

I am investigating on the arrow2 side. I do not think this is expected behavior; memory-wise we should be using about the same.

@ritchie46
Member

We need to default to the arrow2 reader, though. I am surprised we still favor pyarrow in read_ipc. I will change that.

@jorgecarleitao
Collaborator

jorgecarleitao commented Jun 2, 2022

I couldn't repro this in arrow2 :/

Procedure:

  1. Write a 4 GB file
  2. Read the 4 GB file
cargo build --release --example ipc_file_write --features io_ipc
/bin/time -v ./target/release/examples/ipc_file_write test.arrow

# Maximum resident set size (kbytes): 8391072

ls -lh test.arrow

# 4.1G Jun  2 04:02 test.arrow

cargo build --release --example ipc_file_read --features io_ipc
/bin/time -v ./target/release/examples/ipc_file_read test.arrow

# Maximum resident set size (kbytes): 4196708

When writing, we use double the amount, as @ghuls mentions; when reading, we use the same amount.

@cbilot
Author

cbilot commented Jul 31, 2022

I'll close this. As of polars-u64-idx 0.13.59, the OOM is gone. For documentation:

import polars as pl
import time

start = time.perf_counter()
pl.scan_ipc("transaction_records.ipc").collect().shape
print(time.perf_counter() - start)
>>> pl.scan_ipc("transaction_records.ipc").collect().shape
(5969118070, 6)
>>> print(time.perf_counter() - start)
0.1289807640023355

Something that may be of interest: the changes in #4182 and #4193 work extremely well on a system with lots of RAM and fast I/O.

I'll clear the Linux caches:

sudo sysctl vm.drop_caches=3

Now, I'll time this query on nearly 6 billion records. The IPC file is stored on a RAID0 array of 4 Gen4 NVMe drives:

import polars as pl
import time

start = time.perf_counter()
(
    pl.scan_ipc("transaction_records.ipc")
    .select([
        pl.col('xtrct_dt').max().suffix('_max'),
        pl.col('xtrct_dt').min().suffix('_min'),
    ])
    .collect()
)
print(time.perf_counter() - start)
shape: (1, 2)
┌──────────────┬──────────────┐
│ xtrct_dt_max ┆ xtrct_dt_min │
│ ---          ┆ ---          │
│ date         ┆ date         │
╞══════════════╪══════════════╡
│ 2022-03-31   ┆ 2012-05-31   │
└──────────────┴──────────────┘
>>> print(time.perf_counter() - start)
5.242266568002378

And the same query after Linux has cached the file in RAM:

>>> print(time.perf_counter() - start)
0.7242601340003603

With the changes in #4182 and #4193, exploratory data analysis on large files is incredibly fast.

cbilot closed this as completed on Jul 31, 2022
@ritchie46
Member

Very interesting results, @cbilot. The good news is that we are now exploring this upstream in arrow2 as well. @jorgecarleitao has already done a few tests.
