Ingester memory keeps growing until OOM #2859

itheodoro · 2023-08-28T21:29:39Z

itheodoro
Aug 28, 2023

Hello,

We are using Tempo (version 2.2.0, microservices mode) deployed on a k8s cluster. We are using S3 as backend storage, and we have persistent volume enabled on ingesters.
We’ve been running load tests to achieve an ingestion rate of nearly 30 million spans per minute. While we were successful in reaching this volume, we observed that the ingester’s memory keeps growing as the throughput increases, eventually leading to the pods being terminated due to OOM. However, after the test ends, we could see that the memory consumption decreases and remains stable. Our goal is to maintain this high throughput while ensuring stable memory usage.

We've already tried out a few configuration options. Our latest attempt involved reducing max_block_duration to 5m and max_block_bytes to ~50mb, which provided some relief by delaying the OOM occurrence. However, the issue of memory growth persisted.

Here are the resources allocated to our k8s replicas and the configuration settings we've applied to Tempo:

distributor: X 35, 2 CPU, 4GB
ingester: x 40, 3 CPU, 12GB
compactor: x 4, 1 CPU, 6GB

compactor:
  compaction:
    block_retention: 2h
    compacted_block_retention: 1h
    compaction_cycle: 30s
    compaction_window: 1h
    max_block_bytes: 107374182400
    max_compaction_objects: 6000000
    max_time_per_tenant: 5m
    retention_concurrency: 10
    v2_in_buffer_bytes: 5242880
    v2_out_buffer_bytes: 20971520
    v2_prefetch_traces_count: 1000
  ring:
    kvstore:
      store: memberlist
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: endpoint:4317
        http:
          endpoint: endpoint:4318
  ring:
    kvstore:
      store: memberlist
ingester:
  complete_block_timeout: 11m
  concurrent_flushes: 10
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
    tokens_file_path: /var/tempo/tokens.json
  max_block_bytes: 52428800
  max_block_duration: 5m
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
  - tempo-gossip-ring
metrics_generator:
  processor:
    service_graphs:
      dimensions:
      - http.status_code
      - service.tier
      - host.name
      histogram_buckets:
      - 0.1
      - 0.2
      - 0.4
      - 0.8
      - 1.6
      - 3.2
      - 6.4
      - 12.8
      max_items: 1000000
      wait: 10s
      workers: 20
    span_metrics:
      dimensions:
      - http.status_code
      - service.tier
      - host.name
      histogram_buckets:
      - 0.002
      - 0.004
      - 0.008
      - 0.016
      - 0.032
      - 0.064
      - 0.128
      - 0.256
      - 0.512
      - 1.02
      - 2.05
      - 4.1
  registry:
    collection_interval: 15s
    external_labels: {}
    stale_duration: 5m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal
    remote_write:
    - queue_config:
        batch_send_deadline: 2s
        capacity: 0
        max_backoff: 5s
        max_samples_per_send: 1200
        max_shards: 200
        min_backoff: 30ms
        min_shards: 1
        retry_on_http_429: false
      send_exemplars: true
      url: url
    remote_write_flush_deadline: 1m
multitenancy_enabled: false
overrides:
  ingestion_burst_size_bytes: 200000000
  ingestion_rate_limit_bytes: 350000000
  max_bytes_per_tag_values_query: 10000000
  max_bytes_per_trace: 200000000
  max_traces_per_user: 3000000
  metrics_generator_forwarder_queue_size: 100000
  metrics_generator_forwarder_workers: 5
  metrics_generator_processors:
  - service-graphs
  - span-metrics
  per_tenant_override_config: /conf/overrides.yaml
querier:
  frontend_worker:
    frontend_address: query-frontend:9095
  max_concurrent_queries: 200
  search:
    external_endpoints: []
    external_hedge_requests_at: 8s
    external_hedge_requests_up_to: 2
    prefer_self: 10
    query_timeout: 30s
  trace_by_id:
    query_timeout: 30s
query_frontend:
  max_retries: 2
  search:
    concurrent_jobs: 2000
    target_bytes_per_job: 104857600
  trace_by_id:
    hedge_requests_at: 2s
    hedge_requests_up_to: 2
    query_shards: 50
server:
  grpc_server_max_recv_msg_size: 4194304
  grpc_server_max_send_msg_size: 4194304
  http_listen_port: 3100
  http_server_read_timeout: 2m
  http_server_write_timeout: 2m
  log_format: logfmt
  log_level: debug

And here is our throughput and the ingesters' memory usage in one of the load tests:
Throughput (per second):

Ingester memory usage:

Do you think we could make any configuration changes to achieve better results and optimize resource usage?Any input you have would be really helpful. Thanks!

zalegrala · 2023-08-28T21:42:09Z

zalegrala
Aug 28, 2023
Maintainer

Hello, Some of this will depend on the shape of your traces, so it might be hard to make generalizations. You might try to reduce ingester.flush_check_period and ingester.trace_idle_period from their defaults to get the spans out of memory and onto disk faster. Naturally, this will create more blocks and may increase pressure on your compactors, but you might have an easier time keeping control over the memory use in the ingester. Given that it is a load test, do you know if you have long running traces? These two settings may need to be low enough that spans arriving for long running traces are flushed to disk before the next spans arrive.

https://grafana.com/docs/tempo/latest/configuration/#ingester

That may be a reasonable starting place. Test again and perhaps we can see if that is an improvement. Perhaps others will chime in also.

2 replies

itheodoro Aug 29, 2023
Author

Hello @zalegrala, thanks for your insights! We're planning to dive into the settings you mentioned and test them to see how they work with our setup. The only thing we're considering is that this approach might lead to fractured traces, as mentioned by @joe-elliott.

Quick question – you mentioned 'long running traces'. Does that mean traces with significant delays between spans?

zalegrala Aug 30, 2023
Maintainer

That's right. I just mean that traces can sometimes run for quite a while and is workload-dependent.

joe-elliott · 2023-08-29T17:39:05Z

joe-elliott
Aug 29, 2023
Maintainer

We've already tried out a few configuration options. Our latest attempt involved reducing max_block_duration to 5m and max_block_bytes to ~50mb

This is quite small and will put a lot of pressure on your compactors. Cutting blocks quickly though does smooth out CPU and memory consumption in the ingester as you've found.

You might try to reduce ingester.flush_check_period and ingester.trace_idle_period from their defaults to get the spans out of memory and onto disk faster. Naturally, this will create more blocks and may increase pressure on your compactors,

Reducing these two settings should also reduce memory usage, but not at the cost of creating more blocks. Doing this will create more fractured traces. This is important if you value queries that assert conditions across the trace such as structural operators.

Both of these overrides are quite high. Reducing these will better protect your ingesters. 200MB is very large for a trace and can put memory and CPU pressure on Tempo. I'd consider reducing that to something like 50MB.

  max_bytes_per_trace: 200000000
  max_traces_per_user: 3000000

nearly 30 million spans per minute

This is ~500k/second? Those resources do seem a bit high for that span count. Do you know how many bytes/second you are ingesting?

6 replies

joe-elliott Aug 30, 2023
Maintainer

Assuming RF3 that means you are receiving roughly 253MB/s at the distributors? tempo_distributor_bytes_received_total. That is putting your average span size around 500 bytes which is quite reasonable. Given those numbers I would expect lower resource usage for sure.

Let's see what happens when you lower those limits. For the record internally we run:

        ingestion_rate_limit_bytes: 600_000_000     // generally see ~450MB/s and 1.75M spans/s in our largest tenants
        max_bytes_per_trace: 50 _000_000
        max_traces_per_user: 100_000

joe-elliott Aug 31, 2023
Maintainer

Another thing occurred to me. It's possible you are running into disk I/O saturation. Can you double check your metrics on your disks?

itheodoro Sep 5, 2023
Author

Assuming RF3 that means you are receiving roughly 253MB/s at the distributors? tempo_distributor_bytes_received_total. That is putting your average span size around 500 bytes which is quite reasonable. Given those numbers I would expect lower resource usage for sure.

That's right, we are seeing about 253MB/s at the distributors.

Another thing occurred to me. It's possible you are running into disk I/O saturation. Can you double check your metrics on your disks?

Regarding the disk I/O, we didn't find anything unusual with the metrics that could suggest any saturation.

But we also have some good news. We ran another load test, but this time we've used a different strategy by using our OTel Collector exporters intead of k6 script with xk6-distributed-tracing. Given our existing ingestion rate of approximately 50k spans/s on the OTel Collector, we multiplied the exporters to export approximately 300k spans/s to Tempo.

Surprisingly, the tests ran smoothly, and we didn't encounter any issues with ingesters being terminated due to OOM. This suggests that the load test script might be generating excessive pressure on the environment, potentially causing the previous OOM incidents.

joe-elliott Sep 5, 2023
Maintainer

Nice investigation. I have no idea why the k6 extension would create cause such issues, but I will make an issue to track. Thank you for reporting back.

itheodoro Sep 8, 2023
Author

Thanks! If you need more information, please let me know.

dhanvsagar · 2023-09-01T13:24:05Z

dhanvsagar
Sep 1, 2023

Observed a similar behavior while doing a load test in our environment using xk6-client-tracing with the standard template file.
My test environment is with 7 distributors and 21 ingesters with r5.4xlarge's.
Whenever the load crosses around ~350k spans/sec, seeing a sudden increase in the live traces count, reaching few millions.
Live trace count increases for all the tenants at this point irrespective of the tenant where the load is increased.
Also at this point onwards, the ingester memory starts growing rapidly and results in OOM killed (for few of the ingesters).

Till ~350k the the memory utilization is around ~7 to 9Gi and grows drastically after that
Tempo app version: 2.1.1

Ingester resources
  resources:
    limits:
      cpu: "8"
      memory: 100Gi

config

    replication_factor: 3
    trace_idle_period: 5s
    flush_check_period: 5s
    max_block_bytes: 524288000
    max_block_duration: 5m
    complete_block_timeout: 10m

I have also tried setting GOMEMLIMIT, but did not help much

  extraEnv:
    - name: GOMEMLIMIT 
      value: 70GiB

The ingester cpu utilization looks normal and the persistent volumes (EBS Provisioned IOPS SSD (io1)) are 30Gis and their usage is well under the limit.

11 replies

itheodoro Sep 5, 2023
Author

@dhanvsagar, we hit some spikes of 510k spans/sec during the test, and everything went fine.

dhanvsagar Sep 5, 2023

Nice..! Thanks. I'll also try the same

dhanvsagar Sep 6, 2023

Tried to generated the load with Locust, but could not generate sufficient load with consistent request per seconds.
Retried sending traces from xk6-client-tracing -> otel collectors -> Tempo, instead of directly sending them.
But observing the same behavior after 350k spans/sec, live trace count suddenly increased to millions and the ingester memory starts growing rapidly.
Please let me know if there are any other load test utility to generate traces that I can utilize to test the same behavior

itheodoro Sep 8, 2023
Author

@dhanvsagar, as you're using OTel Collector, maybe you could instrument a sample app, generate the load in this app using k6 (or another tool), export the data to Collector, and then change the Collector's OTLP config to export it multiple times to Tempo. For example, if you are able to generate about 2M spans/s in the app, you could export it 10 times to Tempo, and then you'll have about 20M spans/s.

dhanvsagar Sep 11, 2023

Thanks @itheodoro I will try that!

mapno · 2023-09-06T09:31:18Z

mapno
Sep 6, 2023
Maintainer

Hi @itheodoro, @dhanvsagar, could you share the scripts your using for load testing? The code that the extension uses to send traces is from the OTel collector, so I suspect that the issue is in the trace generation. Being able to reproduce this problem would be very helpful. Thanks!!

2 replies

dhanvsagar Sep 6, 2023

I used the example template.js provided in the repository.
https://github.com/grafana/xk6-client-tracing/blob/main/examples/template/template.js

itheodoro Sep 8, 2023
Author

Thanks for your help! I used a script based on the same template that @dhanvsagar provided. The only things I've changed are the sleep time, which is now controlled by an environment variable, and the template apps, which I've changed a bit. I've uploaded the code here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ingester memory keeps growing until OOM #2859

{{title}}

Replies: 4 comments 21 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Ingester memory keeps growing until OOM #2859

itheodoro Aug 28, 2023

Replies: 4 comments · 21 replies

zalegrala Aug 28, 2023 Maintainer

itheodoro Aug 29, 2023 Author

zalegrala Aug 30, 2023 Maintainer

joe-elliott Aug 29, 2023 Maintainer

joe-elliott Aug 30, 2023 Maintainer

joe-elliott Aug 31, 2023 Maintainer

itheodoro Sep 5, 2023 Author

joe-elliott Sep 5, 2023 Maintainer

itheodoro Sep 8, 2023 Author

dhanvsagar Sep 1, 2023

itheodoro Sep 5, 2023 Author

dhanvsagar Sep 5, 2023

dhanvsagar Sep 6, 2023

itheodoro Sep 8, 2023 Author

dhanvsagar Sep 11, 2023

mapno Sep 6, 2023 Maintainer

dhanvsagar Sep 6, 2023

itheodoro Sep 8, 2023 Author

itheodoro
Aug 28, 2023

Replies: 4 comments 21 replies

zalegrala
Aug 28, 2023
Maintainer

itheodoro Aug 29, 2023
Author

zalegrala Aug 30, 2023
Maintainer

joe-elliott
Aug 29, 2023
Maintainer

joe-elliott Aug 30, 2023
Maintainer

joe-elliott Aug 31, 2023
Maintainer

itheodoro Sep 5, 2023
Author

joe-elliott Sep 5, 2023
Maintainer

itheodoro Sep 8, 2023
Author

dhanvsagar
Sep 1, 2023

itheodoro Sep 5, 2023
Author

itheodoro Sep 8, 2023
Author

mapno
Sep 6, 2023
Maintainer

itheodoro Sep 8, 2023
Author