Finding approximation of "compression ratio" for parquet storage #3774
Replies: 3 comments 1 reply
-
A few things that may shed some light on this. First, are you running in distributed mode with RF3? In that case Tempo is writing ~3x the data to object storage. We attempt to dedupe this data with the compactors, but we have recently discovered that deduplication is not as effective as previously thought. cc @mapno. Also, know that we are currently investigating an RF1 architecture to reduce Tempo's TCO and improve performance.
I have not personally used Python pandas to load our parquet files, so I can't say what the issue may be. When I want to investigate a file, I use this tool: https://github.com/stoewer/parquet-cli It has a number of options and is easy to debug and modify for custom investigations of a file.
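If you don't want to build the Go tool, much of the same metadata is also visible from Python via pyarrow. A minimal sketch (the path is a placeholder, not a real block location):

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/data.parquet")  # placeholder path to a block's data.parquet
meta = pf.metadata
print(f"row groups: {meta.num_row_groups}, rows: {meta.num_rows}")

# Codec plus compressed vs. uncompressed bytes for each column of the first row group
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    ratio = col.total_compressed_size / col.total_uncompressed_size if col.total_uncompressed_size else 0.0
    print(f"{col.path_in_schema}: {col.compression}, "
          f"{col.total_compressed_size} / {col.total_uncompressed_size} bytes "
          f"(ratio {ratio:.2f})")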
-
It's difficult to compare compressed proto, uncompressed proto, and parquet formats, but the fact that it's larger does help us understand our system a bit better. Our backend size falling somewhere in between these values is promising.
Perhaps consider running more compactors or using a larger compaction window to see if that pushes the storage size down.
This counts all instances of data and ignores cardinality. It is meant to approximate the popularity of a given column and whether it's worth pulling out. cc @stoewer
You can move them into their own columns using dedicated columns to improve performance: https://grafana.com/docs/tempo/latest/operations/dedicated_columns/
My impression is that Arrow only specifies an in-memory format. We'd need a serialized binary format, a performant library, and an implementation.
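As a rough way to see how much of a block the generic attribute columns account for (and therefore how much dedicated columns might help), you can sum the sizes of columns whose path looks attribute-related. This is only a sketch: the exact column paths depend on the vParquet schema version, so the substring match below is an assumption.

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/data.parquet")  # placeholder path

attr_bytes = 0
total_bytes = 0
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    for j in range(rg.num_columns):
        col = rg.column(j)
        total_bytes += col.total_compressed_size
        # Loose match for generic attribute key/value columns; exact paths vary by schema version.
        if "attr" in col.path_in_schema.lower():
            attr_bytes += col.total_compressed_size

print(f"attribute-related columns: {attr_bytes/1024/1024:.1f} MiB "
      f"of {total_bytes/1024/1024:.1f} MiB ({100*attr_bytes/max(total_bytes,1):.1f}%)")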
-
For completeness, I will add a column usage analysis of the same block of about 300 MB, done with a Python script (generated with help from ChatGPT using the Tempo parquet schema):

import pyarrow.parquet as pq

parquet_file_path = '/home/ovilinux/git/tempo/cmd/data.parquet'
parquet_file = pq.ParquetFile(parquet_file_path)

# Sum the compressed size of each column across all row groups
def get_column_sizes(parquet_file):
    column_sizes = {}
    for i in range(parquet_file.num_row_groups):
        row_group = parquet_file.metadata.row_group(i)
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            column_name = column.path_in_schema
            if column_name not in column_sizes:
                column_sizes[column_name] = 0
            column_sizes[column_name] += column.total_compressed_size
    return column_sizes

# Get column sizes
column_sizes = get_column_sizes(parquet_file)

# Sort columns by size in descending order
sorted_column_sizes = sorted(column_sizes.items(), key=lambda x: x[1], reverse=True)

# Print the columns and their sizes
s = 0
for column, size in sorted_column_sizes:
    s += size
    print(f"{column}: {size/1024/1024} MiB")
print(f"\n\nTotal: {s/1024/1024} MiB")
-
One way I tried to find this out was to divide the total network traffic received by the distributor (since it is all OTLP) by the size of the storage bucket. Something doesn't look right, since the total storage for 10 days is larger than the network traffic the distributor received in those 10 days: OTLP network traffic received over 10 days is 430 GB, while total storage in the bucket is ~1.5 TB for a 10-day retention (this includes meta.json and index.json, but they are small).
Interestingly, trying to load a 300 MB parquet (v2) file from a sample block in Python pandas gives an out-of-memory error.
I would expect the total size of the bucket to be less than the raw OTLP spans received for the same time frame. Does someone know why this is not the case?
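On the pandas out-of-memory error: pandas.read_parquet materializes the whole file (all columns, fully decompressed) at once, and a Tempo block expands to many times its on-disk size. Reading the file in bounded batches with pyarrow avoids that. A minimal sketch, assuming pyarrow is installed and using a placeholder path:

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/data.parquet")  # placeholder path

rows = 0
for batch in pf.iter_batches(batch_size=500):
    # Each batch is a pyarrow.RecordBatch; convert just this slice if pandas is needed:
    # df = batch.to_pandas()
    rows += batch.num_rows
print(f"total rows (traces): {rows}")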