Extreme memory usage when reading Parquet files written by Polars in PySpark #4393
A fix is coming up in #4406. Note that the size unit is number of rows, so I think you want something smaller than the 64*1024*1024 you passed. Regarding the large memory usage: smaller row groups should fix that, I assume. Polars defaults to a single chunk as it can often read that faster. |
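For illustration, a minimal sketch of a row-count-based `row_group_size` (the file name and the 128 * 1024 value are placeholders, not a recommendation from this thread):

```python
import polars as pl

df = pl.DataFrame({"a": list(range(1_000_000))})
# row_group_size counts rows, not bytes, so a value in this range produces
# several small row groups instead of one huge one.
df.write_parquet(
    "small_row_groups.parquet",
    compression="snappy",
    statistics=True,
    row_group_size=128 * 1024,  # ~131k rows per row group
)
```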
@ritchie46 Do you have any recommendation for this value? |
@ritchie46 I do not think the row group size is the issue.
As you can see, I still have all data in a single row group. |
Can you show the output of pyarrow's parquet metadata for one written by polars and one written by pyarrow? You can set |
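For anyone following along, a sketch of how that metadata can be dumped with pyarrow (the two paths are assumptions standing in for the polars- and pyarrow-written files):

```python
import pyarrow.parquet as pq

for path in ["polars.parquet", "pyarrow.parquet"]:
    meta = pq.read_metadata(path)
    print(path, meta)                   # number of row groups, rows, schema summary
    print(meta.row_group(0).column(0))  # per-column encodings, sizes and statistics
```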
Now this gets interesting:
For the record, the files were written like this:

```python
(
    aggregated_df
    .sort(["chrom", "start", "end", "ref", "alt", "gene", "subtissue"])
    .collect()
    .write_parquet(f"{os.environ.get('TMP')}/pyarrow.parquet", compression="snappy", statistics=True, use_pyarrow=True)
)
(
    aggregated_df
    .sort(["chrom", "start", "end", "ref", "alt", "gene", "subtissue"])
    .collect()
    .write_parquet(f"{os.environ.get('TMP')}/polars.parquet", compression="snappy", statistics=True, use_pyarrow=False)
)
```
|
There are quite large differences between the schemas. |
Thanks a lot for sharing these. I suspect that the root cause is that the two writers pick different encodings by default (compare the metadata above):
Native:
Pyarrow:
It is likely that your data is very suitable for dictionary encoding (i.e. this column has something like 417/8 ≈ 50 distinct values). |
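To gauge how suitable each column is, a quick cardinality check could look like this (a sketch, assuming the `aggregated_df` LazyFrame from the snippet above):

```python
df = aggregated_df.collect()
for name in df.columns:
    ratio = df[name].n_unique() / df.height
    # A low ratio of distinct values to rows is exactly the case where
    # dictionary encoding shrinks a column dramatically.
    print(f"{name}: {ratio:.4%} distinct")
```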
@jorgecarleitao do you know if pyarrow decides this in a data-driven way? I have been thinking about encodings today, as we keep a
For categoricals we also know the number of unique values per column without doing any compute. Any idea what good heuristics would be?
Another thought: if we write with |
pyarrow does not use a heuristic afaik - it always writes as dict encoded unless specified otherwise. For example:

```python
import io

import pyarrow as pa
import pyarrow.parquet

# 10**8 integers, every value unique - the worst possible case for a dictionary.
a = pa.array(range(0, 10**8))
t = pa.table(data=[a], schema=pa.schema([pa.field("a", a.type, False)]))
bytes = io.BytesIO()
pa.parquet.write_table(t, bytes, write_statistics=True, compression=None)
bytes.seek(0)
meta = pa.parquet.read_metadata(bytes)
# Show the encodings chosen for the single column chunk.
print(meta.row_group(0).column(0))
```

prints
i.e. it does create and write a dictionary, even though every value is unique. That would be awesome - if we have some information about the cardinality, we could use it. The API for writing parquet dictionaries in arrow2 is that we convert to a |
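For reference, pyarrow's writer exposes a `use_dictionary` flag that switches this behaviour off globally or per column; a small sketch (table contents and file names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

t = pa.table({"a": list(range(1000))})
# Default: dictionary-encoded pages, even when every value is unique.
pq.write_table(t, "dict_default.parquet")
# Opt out entirely; a list of column names keeps it for selected columns only.
pq.write_table(t, "plain.parquet", use_dictionary=False)
```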
So for sorted integer data:
I read that the plain dictionary encoding is deprecated, so we should go for the RLE-based one, I guess? |
Yes, but Yes, |
I tried to add this but I cannot seem to encode an |
Not sure what is going on here, but just as an FYI I hit this issue today:
When I say the file starts behaving strangely...
I'm using polars 0.15.2 and pyarrow 8.0.0
|
Can you send the dataframe? It's hard to fix if we cannot reproduce it. Can you remove the columns that you can read? Those are clutter. |
It's probably the utf8 columns. Here is a minimal example:

```python
import os

import polars as pl

df = pl.DataFrame({'textCol': ['Hello World!', 'and', 'Happy', 'Holidays'] * int(1e6)})
df.write_parquet('test-T1.parquet', use_pyarrow=True)
df.write_parquet('test-T2.parquet', use_pyarrow=False)
t1 = os.path.getsize('test-T1.parquet') / 1000
t2 = os.path.getsize('test-T2.parquet') / 1000
print(f'{t1=:,.0f}, {t2=:,.0f}')
# output: t1=3, t2=510  (kB)
```
|
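One way to see where the size gap comes from is to compare the column chunk metadata of the two files; a sketch, run after the snippet above:

```python
import pyarrow.parquet as pq

# Compare encodings and compressed size of the single text column in each file.
for path in ["test-T1.parquet", "test-T2.parquet"]:
    col = pq.read_metadata(path).row_group(0).column(0)
    print(path, col.encodings, col.total_compressed_size)
```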
This still doesn't show the bug, right? You reported an end-of-stream error? |
Oh, found it:

```python
import os

import polars as pl

df = pl.DataFrame({'listIntCol': [[1, 1, 1], [1, 2, 3], [None, 2, None]] * int(1e6)})
df.write_parquet('test-T1.parquet', use_pyarrow=True)
df.write_parquet('test-T2.parquet', use_pyarrow=False)
t1 = os.path.getsize('test-T1.parquet') / 1000
t2 = os.path.getsize('test-T2.parquet') / 1000
print(f'{t1=:,.0f} kb, {t2=:,.0f} kb')
# output: t1=1 kb, t2=264,645 kb  <<< 💣💣💣💣
```
|
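As a sanity check, one could also try to read the natively written file back (a sketch, run after the snippet above; whether it succeeds on the affected versions is exactly what this report is about):

```python
import polars as pl

# On a healthy file this prints (3000000, 1) and the first few list rows.
back = pl.read_parquet("test-T2.parquet")
print(back.shape)
print(back.head())
```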
pd.read_parquet('test-T2.parquet') results in |
Thanks. I will investigate. |
What language are you using?
Python
Have you tried latest version of polars?
yes
What version of polars are you using?
'0.13.57'
What operating system are you using polars on?
Rocky Linux 8
What language version are you using?
python 3.8
Describe your bug.
When reading, in PySpark, a Parquet file written by Polars, I observe very high peak memory usage.
Also, printing the first 10 rows of the file is surprisingly slow.
What is the actual behavior?
Slow reading and high memory usage when the file was written with the native Parquet writer.
What is the expected behavior?
Reading should be just as fast and memory-efficient as when the file is written with PyArrow.
What are the steps to reproduce the behavior?
I am writing a large dataframe with 19,464,707 rows to Parquet. Then I try reading it with PySpark:
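A minimal PySpark read of this kind (session setup and path are assumptions) is the sort of thing being timed here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Read the polars-written file and materialise only a handful of rows.
spark.read.parquet("test.parquet").limit(10).show()
```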
This gives me a peak memory usage of 20GB and takes quite a while.
Next, I tried writing it with a limited row group size:

```python
aggregated_df.collect().write_parquet("test.parquet", compression="snappy", statistics=True, row_group_size=64*1024*1024)
```

This results in a panic:
As a workaround, one can set `use_pyarrow=True`. Writing a file like this works as intended:

```python
aggregated_df.collect().write_parquet("test.parquet", compression="snappy", statistics=True, row_group_size=64*1024*1024, use_pyarrow=True)
```
Such a file can be read by PySpark immediately, with low memory usage, and `.limit(10)` is very fast.