Replies: 4 comments · 21 replies
-
Hello. Some of this will depend on the shape of your traces, so it might be hard to generalize. You might try reducing the ingester settings documented at https://grafana.com/docs/tempo/latest/configuration/#ingester. That may be a reasonable starting place: test again and perhaps we can see whether it is an improvement. Perhaps others will chime in as well.
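For orientation, the block-cutting settings in that section look roughly like the sketch below. The values are illustrative, not recommendations, and defaults vary between Tempo versions:

```yaml
ingester:
  # How long to wait after the last span arrives before a trace is considered
  # complete and appended to the head block.
  trace_idle_period: 10s
  # How often the ingester sweeps for traces and blocks that are ready to flush.
  flush_check_period: 10s
  # Cut the head block when it reaches this age or this size, whichever comes first.
  max_block_duration: 30m
  max_block_bytes: 524288000   # ~500MB
  # How long completed blocks are kept in the ingester for querying before removal.
  complete_block_timeout: 15m
```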
-
That value is quite small and will put a lot of pressure on your compactors. Cutting blocks quickly does, however, smooth out CPU and memory consumption in the ingester, as you've found.
Reducing these two settings should also reduce memory usage, without the cost of creating more blocks. It will, however, create more fractured traces, which matters if you value queries that assert conditions across a whole trace, such as structural operators.

Both of these overrides are quite high, and reducing them will better protect your ingesters. 200MB is very large for a trace and can put memory and CPU pressure on Tempo. I'd consider reducing that to something like 50MB.
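For reference, a minimal sketch of what the per-trace size change could look like, assuming Tempo's flat overrides format; the other override mentioned above isn't shown since it wasn't quoted here:

```yaml
overrides:
  # Spans that would push a trace past this size are refused by the ingester.
  # ~50MB instead of the current ~200MB.
  max_bytes_per_trace: 52428800
```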
That is ~500k spans/second? Those resources do seem a bit high for that span count. Do you know how many bytes/second you are ingesting?
-
Observed similar behavior while doing a load test in our environment using xk6-client-tracing with the standard template file. Up to roughly 350k spans/second, memory utilization sits around 7 to 9Gi, and it grows drastically after that.
config
I have also tried setting GOMEMLIMIT, but it did not help much.
The ingester CPU utilization looks normal, and the persistent volumes (EBS Provisioned IOPS SSD, io1) are 30Gi each; their usage is well under the limit.
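For what it's worth, GOMEMLIMIT was set roughly as in the sketch below, assuming a plain Kubernetes container spec for the ingester; the image tag and sizes are illustrative, not our exact manifests:

```yaml
containers:
  - name: ingester
    image: grafana/tempo:2.2.0
    resources:
      limits:
        memory: 12Gi
    env:
      # GOMEMLIMIT is a soft limit for the Go runtime: the GC works harder as the
      # live heap approaches it. Keeping it somewhat below the container limit
      # gives the GC a chance to reclaim memory before the kernel OOM-kills the pod.
      - name: GOMEMLIMIT
        value: "10GiB"
```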
-
Hi @itheodoro, @dhanvsagar, could you share the scripts you're using for load testing? The code the extension uses to send traces comes from the OTel Collector, so I suspect the issue is in the trace generation. Being able to reproduce this problem would be very helpful. Thanks!!
-
Hello,
We are using Tempo (version 2.2.0, microservices mode) deployed on a k8s cluster. We are using S3 as backend storage, and we have persistent volume enabled on ingesters.
We’ve been running load tests to achieve an ingestion rate of nearly 30 million spans per minute. While we were successful in reaching this volume, we observed that the ingester’s memory keeps growing as the throughput increases, eventually leading to the pods being terminated due to OOM. However, after the test ends, we could see that the memory consumption decreases and remains stable. Our goal is to maintain this high throughput while ensuring stable memory usage.
We've already tried out a few configuration options. Our latest attempt involved reducing max_block_duration to 5m and max_block_bytes to ~50MB, which provided some relief by delaying the OOM occurrence; however, the memory growth persisted. Here are the resources allocated to our k8s replicas and the configuration settings we've applied to Tempo:
distributor: x 35, 2 CPU, 4GB
ingester: x 40, 3 CPU, 12GB
compactor: x 4, 1 CPU, 6GB
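Roughly, the change described above looks like this in our ingester block (a sketch; the byte value is approximate):

```yaml
ingester:
  # Cut blocks after 5 minutes or ~50MB, whichever comes first.
  max_block_duration: 5m
  max_block_bytes: 52428800   # ~50MB
```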
And here is our throughput and the ingesters' memory usage in one of the load tests:
Throughput (per second):
Ingester memory usage:
Do you think we could make any configuration changes to achieve better results and optimize resource usage? Any input you have would be really helpful. Thanks!