
scan_ipc is causing queries on large datasets to fail due to memory usage #3360

Closed
cbilot opened this issue May 10, 2022 · 10 comments
Labels: bug (Something isn't working)

@cbilot

cbilot commented May 10, 2022

polars-u64-idx 0.13.31
Python 3.10.4
Linux Mint 20.3

Describe your bug.

Using polars-u64-idx on some very large datasets, I found that many of my queries were failing. In each case, the operating system returned "Killed" because my system ran out of RAM. (I can clearly see this by watching top in Linux.)

On a whim, I re-wrote a failed query in Eager mode, and surprisingly the query ran. I am now finding that many of my failed queries actually run in Eager mode, but not in Lazy mode.

I think I've narrowed this down to scan_ipc.

What are the steps to reproduce the behavior?

Since this bug is related to exhausting the memory on my machine (512 GB) with some very large datasets (5.9 billion records), I cannot easily provide an MWE. But I think I can provide a strong clue as to where these Lazy-mode queries are failing.

Here's one clue. This is a very large dataset that I can successfully read into RAM using read_ipc.

import polars as pl
df = pl.read_ipc('transaction_records.ipc')
df.shape
>>> (5969118070, 6)
df.estimated_size()
>>> 280548549314

One copy of this dataset (roughly 280 GB) fits comfortably in RAM, but there is not enough RAM on my system for two copies. And I think this is the clue.

Now, if I quit and restart my Python interpreter (to release the RAM) and run the following:

import polars as pl
df = pl.scan_ipc("transaction_records.ipc").collect()
>>> df = pl.scan_ipc("transaction_records.ipc").collect()
Killed

In top, I can see the python process exhaust all RAM. It's as though scan_ipc is somehow attempting to create two copies of the dataset in RAM, whereas read_ipc does not.

Some other information that may help

Many of my queries on large datasets follow a pattern like the following. The operating system kills these for running out of RAM.

query = (
    pl.scan_ipc("transaction_records.ipc")
    .join(
        pl.scan_ipc("header_records.ipc")
        .filter(< filter criteria >),
        on="key",
        how="semi",
    )
)

However, this workaround will succeed:

query = (
    pl.read_ipc("transaction_records.ipc").lazy()
    .join(
        pl.scan_ipc("header_records.ipc")
        .filter(< filter criteria >),
        on="key",
        how="semi",
    )
)

Thus, the problem doesn't seem to be related to the rest of the query, only to the scan_ipc method on a very large dataset.

scan_parquet

I am unable to test whether this occurs with scan_parquet because I am not able to create Parquet files (or even Avro files) on large datasets with Polars. The operating system kills any attempt to write large files with write_parquet or write_avro due to out-of-memory issues. (I suspect those methods are creating a copy of the dataset while writing.) I'm guessing this is related to #3120. Thus, IPC files are pretty much the only format available for reading/writing/storing very large datasets.

For what it's worth, nothing I'm doing is mission-critical nor urgent. My goal is solely to work with polars-u64-idx on very large datasets, to test it and see how it performs.

cbilot added the bug label on May 10, 2022
@ritchie46
Member

Maybe we can try to bring the problem down to a small RAM scale? Simply try to make an MWE and measure the peak memory load between eager and lazy?

Then it is something I can try to reproduce and run some tests on.
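
Something along these lines might already do it (a rough sketch, untested; the file name mwe.ipc, the row count, and the helper names are placeholders I made up): write a synthetic IPC file once, then run the eager and lazy branches in separate interpreter processes and compare peak RSS.

import resource
import sys

import numpy as np
import polars as pl

PATH = "mwe.ipc"  # placeholder test file

def write_test_file(n_rows: int = 50_000_000) -> None:
    # Synthetic data; pick n_rows so that one copy fits in RAM but two do not.
    df = pl.DataFrame({
        "key": np.random.randint(0, 1_000_000, n_rows),
        "value": np.random.rand(n_rows),
    })
    df.write_ipc(PATH)

def peak_rss_gib() -> float:
    # ru_maxrss is reported in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

if __name__ == "__main__":
    mode = sys.argv[1]  # "write", "eager" or "lazy"
    if mode == "write":
        write_test_file()
    elif mode == "eager":
        df = pl.read_ipc(PATH)
        print(f"eager peak RSS: {peak_rss_gib():.2f} GiB")
    else:
        df = pl.scan_ipc(PATH).collect()
        print(f"lazy  peak RSS: {peak_rss_gib():.2f} GiB")

Running the "write" mode once, then the "eager" and "lazy" modes in fresh processes, should show whether the lazy path really peaks at roughly twice the eager path.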

@ghuls
Collaborator

ghuls commented May 31, 2022

Does it work with:

query = (
    pl.read_ipc("transaction_records.ipc", use_pyarrow=False).lazy()
    .join(
        pl.scan_ipc("header_records.ipc")
        .filter(< filter criteria >),
        on="key",
        how="semi",
    )
)

or does this also run out of RAM?

By default, read_ipc will use pyarrow (if installed) for reading IPC files, while scan_ipc() or read_ipc(..., use_pyarrow=False) will use the arrow2 reader.
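
Roughly, the two paths look like this from user code (just an illustration of which reader is involved, not the actual polars internals):

import polars as pl
import pyarrow.ipc

# pyarrow-based path, comparable to what read_ipc uses by default when pyarrow is installed
table = pyarrow.ipc.open_file("transaction_records.ipc").read_all()
df_via_pyarrow = pl.from_arrow(table)

# arrow2-based path, used by read_ipc(..., use_pyarrow=False) and by scan_ipc
df_via_arrow2 = pl.read_ipc("transaction_records.ipc", use_pyarrow=False)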

@cbilot
Author

cbilot commented May 31, 2022

Wow, it looks like you've located the problem, @ghuls. The query runs out of RAM with use_pyarrow=False.

Distilling it down, either of these two statements by itself runs out of RAM (outside of any query):

pl.read_ipc("transaction_records.ipc", use_pyarrow=False)

pl.scan_ipc("transaction_records.ipc").collect()

While this statement succeeds:

pl.read_ipc("transaction_records.ipc", use_pyarrow=True)

I do have pyarrow installed. And I just ran the above queries using polars-u64-idx 0.13.40.

@ghuls
Collaborator

ghuls commented Jun 1, 2022

Can you try with rechunk=False to see if the extra copy happens due to rechunking or while reading?

For writing to IPC, arrow2 makes at least an extra copy in RAM:
jorgecarleitao/arrow2#928
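
To be concrete about the rechunk check (a minimal sketch; as far as I understand, rechunk=False keeps one chunk per IPC record batch, and it is the later concatenation into contiguous buffers that needs the extra allocation):

import polars as pl

# Read without rechunking: the DataFrame keeps one chunk per record batch in the file.
df = pl.read_ipc("transaction_records.ipc", use_pyarrow=False, rechunk=False)
print(df.n_chunks())

# Rechunking afterwards concatenates the chunks into contiguous memory,
# which temporarily needs roughly a second copy of the data.
df = df.rechunk()
print(df.n_chunks())  # 1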

@cbilot
Author

cbilot commented Jun 1, 2022

scan_ipc succeeds with rechunk=False

>>> pl.scan_ipc("transaction_records.ipc", rechunk=False).collect()
shape: (5969118070, 6)

read_ipc fails with use_pyarrow=False and rechunk=False

>>> pl.read_ipc("transaction_records.ipc", use_pyarrow=False, rechunk=False)
Killed

@jorgecarleitao
Collaborator

I am investigating on the arrow2 side. I do not think this is expected behavior; memory-wise we should be using about the same.

@ritchie46
Member

We need to default to the arrow2 reader, though. I am surprised we still favor pyarrow in read_ipc. I will change that.

@jorgecarleitao
Collaborator

jorgecarleitao commented Jun 2, 2022

I couldn't repro this in arrow2 :/

Procedure:

  1. Write a 4 GB file
  2. Read the 4 GB file
cargo build --release --example ipc_file_write --features io_ipc
/bin/time -v ./target/release/examples/ipc_file_write test.arrow

# Maximum resident set size (kbytes): 8391072

ls -lh test.arrow

# 4.1G Jun  2 04:02 test.arrow

cargo build --release --example ipc_file_read --features io_ipc
/bin/time -v ./target/release/examples/ipc_file_read test.arrow

# Maximum resident set size (kbytes): 4196708

When writing, we use double the amount, as @ghuls mentions; when reading, we use the same amount.

@cbilot
Author

cbilot commented Jul 31, 2022

I'll close this. As of polars-u64-idx 0.13.59, the OOM is gone. For documentation:

import polars as pl
import time

start = time.perf_counter()
pl.scan_ipc("transaction_records.ipc").collect().shape
print(time.perf_counter() - start)
>>> pl.scan_ipc("transaction_records.ipc").collect().shape
(5969118070, 6)
>>> print(time.perf_counter() - start)
0.1289807640023355

Something that may be of interest: the changes in #4182 and #4193 work extremely well on a system with lots of RAM and fast I/O.

I'll clear the Linux caches:

sudo sysctl vm.drop_caches=3

Now, I'll time this query on nearly 6 billion records. The IPC file is stored on a RAID0 array of 4 Gen4 NVMe drives:

import polars as pl
import time

start = time.perf_counter()
(
    pl.scan_ipc("transaction_records.ipc")
    .select([
        pl.col('xtrct_dt').max().suffix('_max'),
        pl.col('xtrct_dt').min().suffix('_min'),
    ])
    .collect()
)
print(time.perf_counter() - start)
shape: (1, 2)
┌──────────────┬──────────────┐
│ xtrct_dt_max ┆ xtrct_dt_min │
│ ---          ┆ ---          │
│ date         ┆ date         │
╞══════════════╪══════════════╡
│ 2022-03-31   ┆ 2012-05-31   │
└──────────────┴──────────────┘
>>> print(time.perf_counter() - start)
5.242266568002378

And the same query after Linux has cached the file in RAM:

>>> print(time.perf_counter() - start)
0.7242601340003603

With the changes in #4182 and #4193, exploratory data analysis on large files is incredibly fast.

cbilot closed this as completed on Jul 31, 2022
@ritchie46
Member

Very interesting results, @cbilot. The good news is that we are now exploring this upstream in arrow2 as well. @jorgecarleitao has already done a few tests.
