write_parquet performs very badly on large files compared to write_csv #3845
@ritchie46, the dataframe gets written in a single row group, right? In general we should split it into multiple groups, or we need to hold the whole file in memory.
Yes, we do. I shall add that.
Edit: that's not correct. I will have to investigate this one.
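For context, a single row group means the whole encoded file must be buffered before anything hits disk. Here is a minimal sketch of the chunked alternative, written against pyarrow directly rather than polars' internals; column names and sizes are illustrative:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative table: 10M rows instead of the 1B used in the reports below.
table = pa.table({
    'doc_id': np.random.randint(0, 1_000_000, size=10_000_000),
    'score': np.random.random(size=10_000_000),
})

# Writing batch-by-batch flushes each row group to disk as it is encoded,
# so peak memory is bounded by the batch size, not the file size.
with pq.ParquetWriter('chunked.parquet', table.schema) as writer:
    for batch in table.to_batches(max_chunksize=1_000_000):
        writer.write_table(pa.Table.from_batches([batch]))
```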
I see we heap-allocate the pages for every write. I will see if we can improve that.
Updated the labels to reflect that the output is as expected, just not at the desired performance level. @vikigenius: could you provide feedback on whether the `row_group_size` argument helps?
@zundertj I just ran:

```python
import polars as pl
import numpy as np

df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=1000000000),
    'passage_id': np.random.randint(0, high=100, size=1000000000),
    'score': np.random.random(size=1000000000)
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))

df.write_parquet('test.bin', row_group_size=1000000)
```

This basically runs out of memory too, just much slower than not passing `row_group_size`.
@vikigenius can you try for a batch of …?
@ritchie46 Tried again with …
Thanks for the additional info. I will take a look.
I did a local test and can confirm the memory stays stable if we set the row group size to something like 1 million. Is this the case for you/others as well?
@ritchie46 Nope, still getting the same problem with 0.14.18. It ran out of memory after about 2 hours, and it still didn't manage to finish writing the parquet file.
@ritchie46 Is there anything else I can do to help reproduce the issue? The fact that it only happens for parquet and not CSV is strange. Also, it doesn't seem to matter what `row_group_size` I use. A very high …
Strange that I cannot reproduce it. We only keep around the metadata from the row groups. If your row groups are too small, it could be that the amount of metadata we store goes OOM. How did your run go with a row group size of 1 million? That should only be 1000 metadata entries, right?
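A back-of-the-envelope check of that claim (the per-group metadata size below is an assumption, not a measured number):

```python
total_rows = 1_000_000_000
row_group_size = 1_000_000
n_groups = total_rows // row_group_size  # 1000 row groups, as stated above

meta_per_group = 1_000  # assumption: on the order of 1 KB of metadata per row group
print(n_groups * meta_per_group)  # ~1 MB total, so metadata alone cannot explain the OOM
```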
@ritchie46 I am testing the code again with a row group size of 1 million, as you suggested, but this time with a memory profiler to track memory usage over time. I have also switched to a 512 GB VM so that I don't hit OOM and can actually find the maximum memory usage.

The whole snippet took more than 10 hours to execute. The final parquet file is 43 GB. The maximum memory usage reported by the profiler is 234114 MB (more than 200 GB). This is not good at all: it should not take 200 GB of RAM, and it should not take 10 hours to write a ~40 GB parquet file either. One thing I noticed was that this was using just one CPU.

Here is the Python script I used to reproduce this:

```python
import polars as pl
import numpy as np
from memory_profiler import memory_usage

def ram_test():
    df = pl.DataFrame({
        'doc_id': np.random.randint(0, high=1000000, size=1000000000),
        'passage_id': np.random.randint(0, high=100, size=1000000000),
        'score': np.random.random(size=1000000000),
    }).with_column(pl.lit('40240gh32152n').alias('memo_id'))
    df.write_parquet('test.bin', row_group_size=1000000)

mem_usage = memory_usage(ram_test, interval=10)  # sample memory usage every 10 seconds
print('Maximum memory usage: %s' % max(mem_usage))
```
Doing the same thing for CSV:

```python
import polars as pl
import numpy as np
from memory_profiler import memory_usage

def ram_test():
    df = pl.DataFrame({
        'doc_id': np.random.randint(0, high=1000000, size=1000000000),
        'passage_id': np.random.randint(0, high=100, size=1000000000),
        'score': np.random.random(size=1000000000),
    }).with_column(pl.lit('40240gh32152n').alias('memo_id'))
    df.write_csv('test.csv')

mem_usage = memory_usage(ram_test, interval=10)
print('Maximum memory usage: %s' % max(mem_usage))
```

Time taken: 10 s. (Just 10 seconds?)

Obviously, the fact that writing a CSV takes 10 s and uses 43 GB of RAM, while writing a parquet file takes 10 hours and uses 200 GB of RAM, is ridiculous. I have no idea what's causing this disparity. If you are not able to reproduce this, maybe it has something to do with the arrow version? For reference, I am using polars 0.14.24 (latest on PyPI).
Yes, it certainly is. I shall investigate when I can get my hands on a bigger machine. Seems like something quadratic to me. Any insight from the profiler?
Ok, I have been able to reproduce this at the edges of my RAM capacity. I am exploring now.
@jorgecarleitao could you help me a bit with this one? I ran this snippet:

```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let n = 500_000_000u32;
    let ca = Utf8Chunked::from_iter((0..n).map(|_| "40240gh32152n"));
    let s = ca.into_series();
    let s1 = Series::new("", (0..n).collect::<Vec<_>>());
    let mut df = df![
        "a" => s.clone(),
        "b" => s1.clone(),
        "c" => s1.clone(),
        "d" => s1.clone(),
    ]?;
    let f = std::fs::File::create("test.parquet").unwrap();
    ParquetWriter::new(f)
        .with_row_group_size(Some(1_000_000))
        .finish(&mut df)?;
    Ok(())
}
```

And found that this code is fast until the … The culprit seems to be the … Here is the flamegraph: [flamegraph image]

EDIT: Only the …
I did a run where I checked …
@vikigenius I found and fixed the issue upstream: jorgecarleitao/arrow2#1293. I will release a patch this weekend. Thanks for being patient and helping me remember this issue when it got stale. :)
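For anyone hitting this on versions predating the fix: one possible workaround is to avoid materializing the full frame at all and use the lazy streaming sink. This API only exists in polars releases newer than the 0.13/0.14 versions discussed here, so treat this as a sketch rather than a verified fix for this issue:

```python
import polars as pl

# Stream from a source instead of holding the whole DataFrame in memory;
# sink_parquet writes row groups incrementally with bounded memory.
lf = pl.scan_csv('test.csv')
lf.sink_parquet('test.parquet', row_group_size=1_000_000)
```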
What language are you using?
Python
Which feature gates did you use?
This can be ignored by Python & JS users.
Have you tried latest version of polars?
If the problem was resolved, please update polars. :)
What version of polars are you using?
0.13.50
What operating system are you using polars on?
Linux data-vm 4.19.0-17-cloud-amd64 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux
What language version are you using?
Python 3.9
Describe your bug.
I am computing a large dataframe and trying to write it as parquet. It has around 100 million rows and 4 columns.
I tried two approaches: `write_csv` and `write_parquet`.

`write_csv` takes around 10 s to write the file and uses around 10 GB of RAM. The CSV file occupies 5 GB of disk space.

`write_parquet` takes more than a minute and uses more than 50 GB of RAM. The parquet file occupies 4.9 GB of disk space. I tried to set the compression method to uncompressed, but that did not improve things at all.
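For reference, a minimal sketch of the comparison described above, assuming a `df` built as in the snippets later in the thread (`compression='uncompressed'` expresses the no-compression experiment):

```python
df.write_csv('test_scores.csv')                                  # reported: ~10 s, ~10 GB extra RAM
df.write_parquet('test_scores.bin', compression='uncompressed')  # reported: >1 min, >50 GB RAM
```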
What are the steps to reproduce the behavior?
Here is an example to reproduce the behaviour.
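A sketch consistent with the snippets later in the thread, scaled down to the ~100 million rows described above:

```python
import polars as pl
import numpy as np

df = pl.DataFrame({
    'doc_id': np.random.randint(0, high=1000000, size=100_000_000),
    'passage_id': np.random.randint(0, high=100, size=100_000_000),
    'score': np.random.random(size=100_000_000),
}).with_column(pl.lit('40240gh32152n').alias('memo_id'))
```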
The dataframe occupies around 35 GB in RAM (not very reliable, since I am relying on `top` to see process memory usage).
`df.write_csv('../data/test_scores.csv')` takes around 120 s and does not seem to use any additional RAM.

`df.write_parquet('../data/test_scores.bin')` runs out of memory; I am using a machine with 128 GB of RAM.