Parquet writes all values of sliced arrays? #1323
Labels: bug
Exponential size

The parquet file size also seems to grow exponentially beyond a certain row count:

import os
import matplotlib.pyplot as plt
import numpy as np
import polars as pl

x = np.arange(100, step=10)
pas = []
pls = []
for i in x:
    df = pl.DataFrame({'listIntCol': [[1, 1, 1], [1, 2, 3], [None, 2, None]] * int(i * 1e4)})
    df.write_parquet('test-T1.parquet', use_pyarrow=True)
    df.write_parquet('test-T2.parquet', use_pyarrow=False)
    t1 = os.path.getsize('test-T1.parquet') / 1000
    t2 = os.path.getsize('test-T2.parquet') / 1000
    pas.append(t1)
    pls.append(t2)
    print(f'{t1=:,.0f} kb, {t2=:,.0f} kb')

plt.plot(x, pas, label="pyarrow")
plt.plot(x, pls, label="arrow2")
plt.title("parquet file size")
plt.xlabel("df size")
plt.ylabel("parquet size (kb)")
plt.legend()

Linear

This doesn't seem to be the case for really small row counts:

import os
import matplotlib.pyplot as plt
import numpy as np
import polars as pl

x = np.arange(100, step=10)
pas = []
pls = []
for i in x:
    df = pl.DataFrame({'listIntCol': [[1, 1, 1], [1, 2, 3], [None, 2, None]] * int(i * 1e2)})
    df.write_parquet('test-T1.parquet', use_pyarrow=True)
    df.write_parquet('test-T2.parquet', use_pyarrow=False)
    t1 = os.path.getsize('test-T1.parquet') / 1000
    t2 = os.path.getsize('test-T2.parquet') / 1000
    pas.append(t1)
    pls.append(t2)
    print(f'{t1=:,.0f} kb, {t2=:,.0f} kb')

plt.plot(x, pas, label="pyarrow")
plt.plot(x, pls, label="arrow2")
plt.title("parquet file size")
plt.xlabel("df size")
plt.ylabel("parquet size (kb)")
plt.legend()
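A hypothetical probe, not reported in the thread: per the issue title, slicing should make the difference obvious. If the arrow2 writer flushes the entire values buffer, a tiny slice of a large frame should still produce a large file, while pyarrow's writer should emit only the sliced rows. The file names and the expected outcome below are assumptions:

```python
import os
import polars as pl

big = pl.DataFrame({'listIntCol': [[1, 1, 1], [1, 2, 3], [None, 2, None]] * 100_000})
small = big.slice(0, 30)  # a view that shares the parent's buffers

small.write_parquet('slice-pyarrow.parquet', use_pyarrow=True)
small.write_parquet('slice-arrow2.parquet', use_pyarrow=False)

# If the hypothesis holds, the arrow2 file is far larger than the pyarrow
# one, because the full values buffer was serialized rather than 30 rows.
print(os.path.getsize('slice-pyarrow.parquet'))
print(os.path.getsize('slice-arrow2.parquet'))
```

This would also account for the superlinear growth above: once a column spans multiple pages, each page would re-serialize the full values buffer, so file size would scale roughly with pages × values length rather than with row count.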
This still appears to be a problem, see #1356 (comment).
During parquet writes we write smaller pages because of the i32::MAX limit, and because smaller pages improve performance when reading. Nested structures such as lists and utf8 arrays are then sliced by their offsets, but the whole values buffer is then sent to the page writer. I haven't confirmed this yet, but I believe this is what happens, and that it is the reason for:

1. invalid parquet files
2. extreme memory usage
3. extreme file sizes

all reported in: pola-rs/polars#4393
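The slicing semantics behind this are easy to see from Python. A minimal illustration with pyarrow (purely illustrative; the Arrow memory layout arrow2 implements is the same), using values matching the examples above:

```python
import pyarrow as pa

arr = pa.array([[1, 1, 1], [1, 2, 3], [None, 2, None]])
sliced = arr.slice(1, 1)  # logically just [[1, 2, 3]]

# Slicing only adjusts the offsets; the child values buffer is untouched.
print(len(sliced.values))     # 9 -> the full, unsliced values
print(len(sliced.flatten()))  # 3 -> only the values the slice refers to
```

A writer that takes the values child directly, instead of projecting it through the offsets as flatten() does here, would serialize all nine values even though the slice only covers three.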