scan_ipc is causing queries on large datasets to fail due to memory usage #3360
Comments
Maybe we can try to bring the problem down to a small-RAM scale? Simply make an MWE and measure the peak memory load between eager and lazy. Then it is something I can try to reproduce and do some tests on.
Does it work with:

```python
query = (
    pl.read_ipc("transaction_records.ipc", use_pyarrow=False).lazy()
    .join(
        pl.scan_ipc("header_records.ipc")
        .filter(< filter criteria >),
        on="key",
        how="semi",
    )
)
```

or does this also run out of RAM? By default, read_ipc will use pyarrow (if installed) for reading IPC files, while scan_ipc always uses the Rust-native reader.
Wow, it looks like you've located the problem @ghuls. The query runs out of RAM with use_pyarrow=False. Distilling it down, either of these two statements by itself runs out of RAM (outside of any query):

```python
pl.read_ipc("transaction_records.ipc", use_pyarrow=False)
pl.scan_ipc("transaction_records.ipc").collect()
```

While this statement succeeds:

```python
pl.read_ipc("transaction_records.ipc", use_pyarrow=True)
```

I do have pyarrow installed. And I just ran the above queries using …
Can you try with …? For writing to IPC, arrow2 makes at least an extra copy in RAM: …
I am investigating on the arrow2 side. I do not think this is expected behavior; memory-wise we should be using about the same.
We need to default to the arrow2 reader, though. I am surprised we still favor pyarrow in read_ipc. I will change that.
I couldn't repro this in arrow2 :/ Procedure: …
When writing we use double the amount, as @ghuls mentions; when reading we use the same amount.
I'll close this. As of polars_u64_idx …, the following now runs:

```python
import polars as pl
import time

start = time.perf_counter()
pl.scan_ipc("transaction_records.ipc").collect().shape
print(time.perf_counter() - start)
```
Something that may be of interest: the changes in #4182 and #4193 play extremely well on a system with lots of RAM and fast I/O. I'll clear the Linux caches: …
Now, I'll time this query on nearly 6 billion records. The IPC file is stored on a RAID0 array of 4 Gen4 NVMe drives:

```python
import polars as pl
import time

start = time.perf_counter()
(
    pl.scan_ipc("transaction_records.ipc")
    .select([
        pl.col('xtrct_dt').max().suffix('_max'),
        pl.col('xtrct_dt').min().suffix('_min'),
    ])
    .collect()
)
print(time.perf_counter() - start)
```
And the same query after Linux has cached the file in RAM:
With the changes in #4182 and #4193, exploratory data analysis on large files is incredibly fast.
Very interesting results @cbilot. The good news is that we are now exploring this upstream in arrow2 as well. @jorgecarleitao has already done a few tests.
polars-u64-idx 0.13.31
Python 3.10.4
Linux Mint 20.3
Describe your bug.
Using polars-u64-idx on some very large datasets, I found that many of my queries were failing. In each case, the operating system returned "Killed" due to my system running out of RAM. (I can clearly see this watching top in Linux.)

On a whim, I re-wrote a failed query in Eager mode, and surprisingly the query ran. I am now finding that many of my failed queries actually run in Eager mode, but not in Lazy mode.
I think I've narrowed this down to scan_ipc.

What are the steps to reproduce the behavior?
Since this bug is related to exhausting the memory on my machine (512 GB) with some very large datasets (5.9 billion records), I cannot easily provide an MWE. But I think I can provide a strong clue as to where these Lazy-mode queries are failing.
Here's one clue. This is a very large dataset that I can successfully read into RAM using read_ipc. One copy of this dataset fits comfortably in RAM, but there is not enough RAM on my system for 2 copies. And I think this is the clue.
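The original snippet did not survive extraction; a minimal sketch of the eager read, assuming the same file name used later in this issue, might look like:

```python
import polars as pl

# Hypothetical reconstruction (not the original snippet): eagerly read the
# large IPC file into a single in-memory DataFrame.
df = pl.read_ipc("transaction_records.ipc")
print(df.shape)
```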
Now, if I quit and restart my Python interpreter (to release the RAM) and run the following:
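The snippet itself is missing from the extracted thread; judging from the statements quoted earlier in the discussion, it was presumably the lazy scan collected immediately, along the lines of:

```python
import polars as pl

# Presumed equivalent of the elided snippet: lazily scan the same IPC file,
# then collect, which materializes the full dataset in RAM.
df = pl.scan_ipc("transaction_records.ipc").collect()
print(df.shape)
```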
In top, I can see the python process exhaust all RAM. It's as though scan_ipc is somehow attempting to create two copies of the dataset in RAM, whereas read_ipc does not.

Some other information that may help
Many of my queries on large datasets can be reduced to a pattern like the following. The operating system will kill these for running out of RAM.
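The concrete query is not reproduced in the extracted text; based on the join discussed earlier in the thread, the failing pattern presumably looks roughly like this (file names, join key, and filter criteria are placeholders):

```python
import polars as pl

# Rough sketch of the failing pattern (placeholders, not the original query):
# both inputs are scanned lazily with scan_ipc, then joined and collected.
result = (
    pl.scan_ipc("transaction_records.ipc")
    .join(
        pl.scan_ipc("header_records.ipc").filter(pl.col("some_column") > 0),
        on="key",
        how="semi",
    )
    .collect()
)
```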
However, this workaround will succeed:
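Again, the snippet is missing; following the suggestion earlier in the thread, the workaround presumably swaps scan_ipc for an eager read_ipc wrapped in .lazy() on the large file (same placeholders as above):

```python
import polars as pl

# Rough sketch of the workaround (placeholders, not the original query):
# read the large file eagerly (read_ipc defaulted to pyarrow at the time),
# then continue the rest of the query lazily.
result = (
    pl.read_ipc("transaction_records.ipc").lazy()
    .join(
        pl.scan_ipc("header_records.ipc").filter(pl.col("some_column") > 0),
        on="key",
        how="semi",
    )
    .collect()
)
```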
Thus, the problem doesn't seem to be related to the rest of the query, only to the scan_ipc method on a very large dataset.

scan_parquet
I am unable to test whether this occurs with scan_parquet, because I am not able to create parquet files (or even avro files) on large datasets with Polars. The operating system kills any attempt to write large files with write_parquet or write_avro due to out-of-memory issues. (I suspect those methods are creating a copy of the dataset while writing.) I'm guessing this is related to #3120. Thus, IPC files are pretty much the only format available for reading/writing/storing very large datasets.

For what it's worth, nothing I'm doing is mission-critical nor urgent. My goal is solely to work with polars-u64-idx on very large datasets, to test it and see how it performs.