
Rightsizing Tempo Ingesters when trace sizes vary to prevent OOM kills #4412

Open
adhinneupane opened this issue Dec 4, 2024 · 1 comment


adhinneupane commented Dec 4, 2024

Is your feature request related to a problem? Please describe.

We want to enforce a rate limit to protect our Tempo infrastructure. While load testing Tempo 2.5 running in a Kubernetes cluster (managed via Tanka) to determine a supportable global rate limit, our ingesters (10 GiB memory) are OOM killed even at low write volumes (18 MiB/s) when generating load with the xk6-client-tracing extension and incoming trace sizes ranging from 5 KiB to 250 KiB.

We observed that, despite setting a low global rate limit and discarding many spans, we cannot protect the ingesters from OOMing when the incoming traces are large (see the table below). In production, where our average trace size does not exceed 20 KiB, we are able to support much higher write volumes without any rate limit in place.

We want to understand the logic behind this behavior and determine a suitable global rate limit for our distributors. With OOM kills happening even at low write volumes, we are unable to settle on a rate limit that actually protects our infrastructure.

Describe the solution you'd like
Being able to enforce a rate limit that reliably protects the Tempo infrastructure.
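For context, the knobs involved look roughly like the sketch below. This assumes the legacy inline overrides format; the byte values are illustrative placeholders rather than the exact settings from the table, and max_bytes_per_trace is included only as the override that appears most directly related to large traces.

# Per-tenant overrides sketch -- values are illustrative placeholders, not our exact settings
overrides:
  ingestion_rate_strategy: global        # the configured limit is shared across all distributors
  ingestion_rate_limit_bytes: 14680064   # ~14 MiB/s accepted before spans are refused
  ingestion_burst_size_bytes: 17825792   # ~17 MiB burst allowance
  max_bytes_per_trace: 5000000           # caps a single trace; oversized traces are rejected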

Load Test Results at set burst_size_bytes and rate_limit_bytes:

| OOM Kills | burst_size_bytes | rate_limit_bytes | Average Trace Size (Bytes) | Live Traces (30k) | Distributor bytes limit (burst + rate) | Distributor (N) x Ingester (N) | Ingester Memory (Max) | Rate Limit Strategy | Time Under Test | Average Trace Size * Live Traces (MiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 MiB | 14 MiB | 57000 | 15000 | 29 MiB | 3 x 3 | 80% | Global | 25m | 815.3915405 |
| 0 | 17 MiB | 14 MiB | 48000 | 18000 | 29 MiB | 3 x 3 | 70% | Global | 25m | 823.9746094 |
| 0 | 17 MiB | 14 MiB | 38000 | 25000 | 28 MiB | 3 x 3 | 60% | Global | 25m | 905.9906006 |
| 1 | 17 MiB | 14 MiB | 187000 | 2000 | 18 MiB | 3 x 3 | N/A | Global | < 10m | 356.6741943 |
| 1 | 17 MiB | 14 MiB | 219000 | 1200 | 18.9 MiB | 3 x 3 | N/A | Global | < 10m | 250.6256104 |

To estimate the average trace size we used the following query (replication factor: 3):

(
  sum(rate(tempo_distributor_bytes_received_total{cluster=""}[$__interval])) by (cluster)
  /
  ( sum(rate(tempo_ingester_traces_created_total{cluster=""}[$__interval])) by (cluster) / 3 )
) / 1024 / 1024


Additional Context

xk6-client-tracing param.js

import { sleep } from 'k6';
import tracing from 'k6/x/tracing';

export const options = {
    vus: 120,
    stages: [
        { duration: '2m', target: 120 },
        { duration: '10s', target: 120 },
        { duration: '2m', target: 120 },
        { duration: '10s', target: 120 },
        { duration: '2m', target: 120 },
        { duration: '10s', target: 120 },
        { duration: '2m', target: 120 },
        { duration: '10s', target: 120 },
        { duration: '2m', target: 120 },
    ],
};

const endpoint = __ENV.ENDPOINT || "https://<>:443"
const client = new tracing.Client({
    endpoint,
    exporter: tracing.EXPORTER_OTLP,
    tls: {
      insecure: true,
    }
});

export default function () {
    let pushSizeTraces = 50;   // traces generated per iteration
    let pushSizeSpans = 0;
    let t = [];
    for (let i = 0; i < pushSizeTraces; i++) {
        let c = 100;           // spans per trace
        pushSizeSpans += c;
        t.push({
            random_service_name: false,
            spans: {
                count: c,
                size: 400, // changed with each load test run from 100 to 1200 to vary the average trace size
                random_name: true,
                fixed_attrs: {
                    "test": "test",
                },
            }
        });
    }

    let gen = new tracing.ParameterizedGenerator(t)
    let traces = gen.traces()
    sleep(5)
    console.log(traces);
    client.push(traces);
}

export function teardown() {
    client.shutdown();
}
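For completeness: the script is run against the distributor with a k6 binary built via xk6 to include the client-tracing extension, roughly `./k6 run -e ENDPOINT=https://<>:443 param.js`.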
joe-elliott (Member) commented

Is this the same as issue #4424?

Can we keep the conversation in one place?
