write_parquet function in polars-u64-idx does not support large data frames #3120
I think we must split these large dataframes up into different row groups. I already opened a discussion/issue for this upstream: jorgecarleitao/arrow2#932. Maybe we should also default to another compression.
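A minimal user-side sketch of that row-group idea, assuming one is willing to write several smaller files instead of a single huge one (the chunk size, file names, and reduced row count are illustrative, not part of the report):
import polars as pl

# The report uses n=2**32; a smaller n keeps the sketch runnable on ordinary hardware.
df = pl.select(pl.repeat(0, n=10_000_000).alias('col_1'))

chunk_rows = 2_000_000  # chosen so each file's pages stay far below i32::MAX bytes
for i, offset in enumerate(range(0, df.height, chunk_rows)):
    df.slice(offset, chunk_rows).write_parquet(f'part_{i}.parquet')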
@jnthnhss I want to change the default compression algorithm.
I tried out all the options for compression below. With no compression I also get an error. I should mention that it does work with
import polars as pl
df = pl.select(pl.repeat(0,n=2**32).alias('col_1'))
df.write_parquet('tmp.parquet',compression='uncompressed')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/frame.py", line 1400, in write_parquet
self._df.to_parquet(file, compression, statistics)
exceptions.ArrowErrorException: ExternalFormat("A page can only contain i32::MAX uncompressed bytes. This one contains 34896609285")
df.write_parquet('tmp.parquet',compression='snappy')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/frame.py", line 1400, in write_parquet
self._df.to_parquet(file, compression, statistics)
exceptions.ArrowErrorException: ExternalFormat("underlying snap error: snappy: input buffer (size = 34359738368) is larger than allowed (size = 4294967295)")
df.write_parquet('tmp.parquet',compression='gzip')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/frame.py", line 1400, in write_parquet
self._df.to_parquet(file, compression, statistics)
exceptions.ArrowErrorException: ExternalFormat("A page can only contain i32::MAX uncompressed bytes. This one contains 34896609285")
df.write_parquet('tmp.parquet',compression='lzo')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/frame.py", line 1400, in write_parquet
self._df.to_parquet(file, compression, statistics)
exceptions.ArrowErrorException: ExternalFormat("Compression Lzo is not supported")
df.write_parquet('tmp.parquet',compression='brotli')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/frame.py", line 1400, in write_parquet
self._df.to_parquet(file, compression, statistics)
exceptions.ArrowErrorException: ExternalFormat("A page can only contain i32::MAX uncompressed bytes. This one contains 34896609285")
df.write_parquet('tmp.parquet',compression='lz4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/frame.py", line 1400, in write_parquet
self._df.to_parquet(file, compression, statistics)
exceptions.ArrowErrorException: ExternalFormat("A page can only contain i32::MAX uncompressed bytes. This one contains 34896609285")
df.write_parquet('tmp.parquet',compression='zstd')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/frame.py", line 1400, in write_parquet
self._df.to_parquet(file, compression, statistics)
exceptions.ArrowErrorException: ExternalFormat("A page can only contain i32::MAX uncompressed bytes. This one contains 34896609285")
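A quick back-of-the-envelope check (plain Python, not polars) on why every codec fails the same way: the raw i64 column alone is roughly 34 GB, many times the ~2.1 GB per-page cap.
I32_MAX = 2**31 - 1          # 2_147_483_647 bytes allowed per page
raw_bytes = 2**32 * 8        # 2**32 i64 values -> 34_359_738_368 bytes
print(raw_bytes)             # matches the 34359738368 in the snappy error above
print(raw_bytes // I32_MAX)  # 16: the column is ~16x the per-page limit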
This should be fixed by #3181. We now split large pages before writing.
Thank you so much. I will give it a try today.
I got the same error implementing this in the Rust API using 0.20.0 yesterday. Is the mentioned patch only available in a pre-release version? Amazing work, by the way. For completeness:
FWIW, Snappy > Lzo as a default. At least Cloudera thinks so, and they helped develop parquet.
Also, is there an option to control the page size?
No, there is not. The fix upstream now halves the page size when it is estimated to hit the limit.
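Roughly what that halving strategy looks like, sketched in Python for illustration (the real logic lives in arrow2 and is written in Rust; the function and names here are invented):
I32_MAX = 2**31 - 1

def rows_per_page(total_rows: int, estimated_total_bytes: int) -> int:
    # Keep halving the rows per page until the estimated page size fits under i32::MAX.
    rows = total_rows
    while rows > 1 and estimated_total_bytes * rows // total_rows > I32_MAX:
        rows //= 2
    return rows

# 2**32 i64 values (~34 GB) end up split into pages of 2**27 rows (~1 GiB) each.
print(rows_per_page(2**32, 2**32 * 8))  # 134217728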
Did you take the PR branch? It is not yet on master.
Apologies, that was a noob error. I have rebuilt using the PR branch this time. I used Snappy with this test on a 43GB TSV file being converted to parquet. I get a new error:
Full code is just based on your example:
Using
Thanks for giving it another try. I will see if I can try to reproduce this in a few days.
Hi @sammysheep, could you give it another try? There might still be a problem, but that would then mean you have a single value that is huge. Could that be the case?
I'll give it a shot and let you know. This file has been converted before via Spark, which I think also uses arrow, albeit a different implementation. The largest single values are about 32k bytes or so.
This time it does something different. It just says
However, I watched the process and it died after it reached the memory limit of the server, which is about 187G. The file size this time was 56G; my mistake was that I did not save the earlier version of the file. I can try to do something closer to the original test on Monday. I converted the same file yesterday on a similar server using local mode and coalescing to one partition. This is the closest to what I am doing with polars.
More on Monday with different sizes.
I installed polars-u64-idx (0.13.26), and ran the above code. No errors. (And reading the resultant parquet file showed no problems.) If there's anything I can do to test/benchmark/whatever, please let me know. Just point me to the files. I have only 0.5TB of RAM, so my machine may not be able to read the absolute largest of files.
Sorry for the delay. I chunked the data by lines equally and it ran successfully*:
The runtime stats show it is using around 150 to 160G memory for each file:
My system didn't like the parquet files
*Some oddities include that the compressed parquet is much larger than the original file (over 2x). I don't know if the behavior I am showing is expected or not, but if not, they likely fit into separate issues. Thanks for everything, and let me know if you have other stuff I can test down the road.
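For reference, a sketch of that chunk-by-lines workflow; the file names, chunk layout, and the assumption that each chunk carries its own header row are guesses rather than the exact commands used:
import glob
import polars as pl

# Assumes the large TSV was first split into equal line-count chunks
# (e.g. with the Unix split utility), each with its own header row.
for i, path in enumerate(sorted(glob.glob('chunks/part_*.tsv'))):
    df = pl.read_csv(path, sep='\t')
    df.write_parquet(f'out/part_{i}.parquet', compression='snappy')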
I will ping @jorgecarleitao in here as he has more understanding of this. What compression algorithm did you use? And do you see a difference with other ones?
I used Snappy because that is the cluster default we use. Happy to try another one. What do you recommend?
I am sorry this slipped through the cracks in my triage. @ritchie46, didn't you fix this with the split of the array into multiple pages?
I believe so. My machine is a mere 16GB, so I am not able to test this. But @cbilot's successful run gives me confidence that it is fixed now.
Hmmm, when I ran that code, I was using
So just to re-confirm the result, today, using
import polars as pl
import numpy as np
df = pl.DataFrame({
    'col1': np.random.normal(0, 1, 10_000_000_000),
}).with_row_count()
df
df.write_parquet("tmp.parquet")
df2 = pl.read_parquet('tmp.parquet')
df2
I can't compare the two files using
Hi @jorgecarleitao! Would you mind taking a look at the example run I posted above? Questions:
@ritchie46 I think it is "fixed" in terms of it running now, but if you have memory like yours (16GB), or too large a file even on a server, then when you try to convert a large file you will get a failure when the OS kills your process due to OOM.
For the alternative I am trying to replace, my hunch is that Spark is chunking the data as it is read and writing to a fixed page/block size that typically won't blow up memory (more than the JVM's usual). In Polars, am I basically slurping everything into memory at once and trying to do the conversion in a couple (2?) pages?
In terms of large parquet being splittable in distributed systems (Cloud, Hadoop), it looks like you'd want to be able to set a smallish (128MB) page/block size in parquet, no? If it is something that is exposed in pyarrow2, it would be great to have available in polars.
Use-case / scenarios:
There is. We can/should expose that in polars.
What do you mean by conversion? Polars is in-memory. So it will always read a file completely into memory (save for memory mapped ipc/feather files).
This is possible for polars datasets that don't have struct data. Polars will write the chunks of your dataframe to parquet. If the dataframe is contiguous, it will write a single chunk.
I will close this issue as it has been resolved. Discussion may continue.
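A small sketch of what that can look like from Python, where the chunk size and the use of pl.concat(..., rechunk=False) to build a non-contiguous frame are illustrative assumptions:
import polars as pl

df = pl.select(pl.repeat(0, n=10_000_000).alias('col_1'))
chunk_rows = 2_000_000

# Concatenating slices without rechunking keeps them as separate chunks,
# so the frame is no longer one contiguous block.
chunked = pl.concat(
    [df.slice(o, chunk_rows) for o in range(0, df.height, chunk_rows)],
    rechunk=False,
)
print(chunked.n_chunks())  # 5
chunked.write_parquet('tmp_chunked.parquet')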
Thanks, looking forward to this!
Conversion from CSV to parquet using Polars. I was wondering where one could set the parquet page size (the functionality you added recently). If it exists, I'd play with the setting to try to get less peak memory allocated (assuming it re-uses a buffer or de-allocates after writing a page). My earlier analysis indicated that trying to convert a 28G CSV to parquet needed 150G+ of memory, which is OOM for most machines.
So if I am understanding you, I have to chunk the data frame to possibly lower the page sizes being written? Note: earlier I meant "arrow2", not "pyarrow2".
By the bye, I really enjoyed your talk at the Data and AI Summit 2022 this year. It was definitely one of the better ones, at least to me. Thanks so much for your wonderful contributions to the community!
What language are you using?
Python
What version of polars are you using?
0.13.21
What operating system are you using polars on?
Ubuntu 20.04.1 LTS
What language version are you using?
Python 3.8.5
Describe your bug.
I'm using the 64-bit version of Polars. However, the write_parquet function does not seem to support large data frames.
What are the steps to reproduce the behavior?
What is the actual behavior?