Finding approximation of "compression ratio" for parquet storage #3774
Replies: 3 comments 1 reply
-
A few things that may shed some light on this. First, are you running in distributed mode with RF3? In that case Tempo is writing ~3x the data to object storage. We attempt to dedupe this data with the compactors, but we have recently discovered that deduplication is not as effective as previously thought. cc @mapno. Also, know that we are currently investigating an RF1 architecture to reduce Tempo's TCO and improve performance.
I have not personally used Python pandas to load our parquet files, so I can't say what the issue may be. When I want to investigate a file, I use this tool: https://github.com/stoewer/parquet-cli It has a number of options and is easy to debug and modify for custom investigations of a file.
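If you don't want to build the Go tool, much of the same metadata is also visible from Python via pyarrow. A minimal sketch (the path is a placeholder, not a real block location):

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/data.parquet")  # placeholder path to a block's data.parquet
meta = pf.metadata
print(f"row groups: {meta.num_row_groups}, rows: {meta.num_rows}")

# Codec plus compressed vs. uncompressed bytes for each column of the first row group
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    ratio = col.total_compressed_size / col.total_uncompressed_size if col.total_uncompressed_size else 0.0
    print(f"{col.path_in_schema}: {col.compression}, "
          f"{col.total_compressed_size} / {col.total_uncompressed_size} bytes "
          f"(ratio {ratio:.2f})")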
-
It's difficult to compare compressed proto, uncompressed proto, and parquet formats, but the fact that it's larger does help us understand our system a bit better. Our backend size falling somewhere in between these values is promising.
Perhaps consider running more compactors or using a larger compaction window to see if that pushes the storage size down.
This counts all instances of data and ignores cardinality. It is meant to approximate the popularity of a given column and whether it's worth pulling out. cc @stoewer
You can move them into their own columns using dedicated columns to improve performance: https://grafana.com/docs/tempo/latest/operations/dedicated_columns/
My impression is that Arrow only specifies an in-memory format. We'd need a serialized binary format, a performant library, and an implementation.
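As a rough way to see how much of a block the generic attribute columns account for (and therefore how much dedicated columns might help), you can sum the sizes of columns whose path looks attribute-related. This is only a sketch: the exact column paths depend on the vParquet schema version, so the substring match below is an assumption.

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/data.parquet")  # placeholder path

attr_bytes = 0
total_bytes = 0
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    for j in range(rg.num_columns):
        col = rg.column(j)
        total_bytes += col.total_compressed_size
        # Loose match for generic attribute key/value columns; exact paths vary by schema version.
        if "attr" in col.path_in_schema.lower():
            attr_bytes += col.total_compressed_size

print(f"attribute-related columns: {attr_bytes/1024/1024:.1f} MiB "
      f"of {total_bytes/1024/1024:.1f} MiB ({100*attr_bytes/max(total_bytes,1):.1f}%)")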
-
For completeness, I will add a column usage analysis of the same block of about 300 MB, done with a Python script (generated with help from ChatGPT using the Tempo parquet schema):

import pyarrow.parquet as pq

parquet_file_path = '/home/ovilinux/git/tempo/cmd/data.parquet'
parquet_file = pq.ParquetFile(parquet_file_path)

# Sum the compressed size of each column across all row groups
def get_column_sizes(parquet_file):
    column_sizes = {}
    for i in range(parquet_file.num_row_groups):
        row_group = parquet_file.metadata.row_group(i)
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            column_name = column.path_in_schema
            if column_name not in column_sizes:
                column_sizes[column_name] = 0
            column_sizes[column_name] += column.total_compressed_size
    return column_sizes

# Get column sizes
column_sizes = get_column_sizes(parquet_file)

# Sort columns by size in descending order
sorted_column_sizes = sorted(column_sizes.items(), key=lambda x: x[1], reverse=True)

# Print the columns and their sizes
s = 0
for column, size in sorted_column_sizes:
    s += size
    print(f"{column}: {size/1024/1024} MiB")
print(f"\n\nTotal: {s/1024/1024} MiB")
-
One way I tried to find this out was to divide the total network traffic received by the distributor (since it is all OTLP) by the size of the storage bucket. Something doesn't look right, since the total storage for 10 days is larger than the network traffic the distributor received in those 10 days: OTLP network traffic received over 10 days is 430 GB, while total storage in the bucket is ~1.5 TB for a 10-day retention (this includes meta.json and index.json, but they are small).
Interestingly, trying to load a 300 MB parquet (v2) file from a sample block in Python pandas gives an out-of-memory error.
I would expect the total size of the bucket to be less than the raw OTLP spans received for the same time frame. Does someone know why this is not the case?
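On the pandas out-of-memory error: pandas.read_parquet materializes the whole file (all columns, fully decompressed) at once, and a Tempo block expands to many times its on-disk size. Reading the file in bounded batches with pyarrow avoids that. A minimal sketch, assuming pyarrow is installed and using a placeholder path:

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/data.parquet")  # placeholder path

rows = 0
for batch in pf.iter_batches(batch_size=500):
    # Each batch is a pyarrow.RecordBatch; convert just this slice if pandas is needed:
    # df = batch.to_pandas()
    rows += batch.num_rows
print(f"total rows (traces): {rows}")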