$ java -jar parquet-tools-1.10.99.7.2.15.0-147.jar meta uk_cities_rust.parquet
....
row group 1: RC:37 TS:1595 OFFSET:4
--------------------------------------------------------------------------------
city: BINARY SNAPPY DO:4 FPO:710 SZ:815/1115/1.37 VC:37 ENC:PLAIN,RLE_DICTIONARY,RLE ST:[min: Aberdeen, Aberdeen City, UK, max: Worthing, West Sussex, UK, num_nulls not defined]
lat: DOUBLE SNAPPY DO:907 FPO:1224 SZ:390/383/0.98 VC:37 ENC:PLAIN,RLE_DICTIONARY,RLE ST:[min: 50.376289, max: 57.653484, num_nulls not defined]
lng: DOUBLE SNAPPY DO:1349 FPO:1666 SZ:390/383/0.98 VC:37 ENC:PLAIN,RLE_DICTIONARY,RLE ST:[min: -7.318268, max: 0.573453, num_nulls not defined]
Describe the bug
The row group total_byte_size currently written to the Parquet file is the compressed size, not the uncompressed size as expected.
To Reproduce
For example uk_cities_with_headers.csv converted to parquet with schema
shows the following stats
Expected behavior
The total size is expected to be the sum of the uncompressed column sizes, 1115 + 383 + 383 = 1881, not the sum of the compressed sizes, 815 + 390 + 390 = 1595.
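To make the arithmetic easy to re-check, here is a minimal sketch that sums the SZ fields from the parquet-tools column lines quoted in this report. It assumes the SZ field reads compressed/uncompressed/ratio, which is consistent with the numbers above; `sum_sizes` and `META_LINES` are hypothetical names used only for this illustration.

```python
import re

# Column lines copied from the parquet-tools output in this report
# (encodings and statistics trimmed for brevity).
META_LINES = [
    "city: BINARY SNAPPY DO:4 FPO:710 SZ:815/1115/1.37 VC:37",
    "lat: DOUBLE SNAPPY DO:907 FPO:1224 SZ:390/383/0.98 VC:37",
    "lng: DOUBLE SNAPPY DO:1349 FPO:1666 SZ:390/383/0.98 VC:37",
]

# Assumption: SZ is printed as compressed/uncompressed/ratio.
SZ_PATTERN = re.compile(r"SZ:(\d+)/(\d+)/")


def sum_sizes(lines):
    """Return (compressed_total, uncompressed_total) over the column lines."""
    compressed = uncompressed = 0
    for line in lines:
        match = SZ_PATTERN.search(line)
        if match is None:
            continue  # skip lines without an SZ field
        compressed += int(match.group(1))
        uncompressed += int(match.group(2))
    return compressed, uncompressed


compressed_total, uncompressed_total = sum_sizes(META_LINES)
print(compressed_total, uncompressed_total)  # prints "1595 1881"
```

The compressed total equals the reported TS:1595, while the uncompressed total is the expected 1881.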
Additional context
The same CSV converted to Parquet using Python pyarrow shows
Here the total size matches the sum of the columns' uncompressed sizes:
1945 = 1123 + 411 + 411
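The same comparison could be run directly against a file's metadata. The sketch below assumes pyarrow is installed and uses its `ParquetFile` metadata accessors; `byte_size_matches`, `check_file`, and the `path` argument are hypothetical names for illustration, not part of either library.

```python
def byte_size_matches(total_byte_size, uncompressed_sizes):
    """True when the row group total equals the sum of uncompressed column sizes."""
    return total_byte_size == sum(uncompressed_sizes)


def check_file(path):
    """Return one boolean per row group of the Parquet file at `path`."""
    # Imported lazily so the pure helper above works without pyarrow.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    results = []
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)
        sizes = [
            row_group.column(j).total_uncompressed_size
            for j in range(row_group.num_columns)
        ]
        results.append(byte_size_matches(row_group.total_byte_size, sizes))
    return results


# Numbers quoted in this report:
print(byte_size_matches(1945, [1123, 411, 411]))  # pyarrow-written file: True
print(byte_size_matches(1595, [1115, 383, 383]))  # Rust-written file: False
```

Calling `check_file` on the Rust-written file would be expected to return `[False]` while the bug is present, given the figures above.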