
Provide automated load testing of KEDA #4411

Open · tomkerkhove opened this issue Mar 29, 2023 · 35 comments

Labels: feature (All issues for new features that have been committed to), stale-bot-ignore (All issues that should not be automatically closed by our stale bot)

@tomkerkhove (Member)

A frequent ask during our CNCF Graduation discussions is whether we have load testing / performance benchmarking of KEDA, which we do not have today.

There are a few reasons for that:

  • KEDA integrates with a plethora of external dependencies, which can all influence our performance and give an inconsistent/false impression
  • KEDA relies on Kubernetes' HPA, so we want to avoid doing load testing of the HPA itself
  • We know KEDA is used in large deployments with thousands of ScaledObjects

However, the latter is purely informative and not something that we manage, so I have opened #4410 so that end users can chime in and share, if they are comfortable doing so.

I do believe this is a reasonable ask but it is not something that we can easily do; however, I like challenges.

Let's brainstorm approaches!

@tomkerkhove (Member Author)

First idea:

Benchmark metric server by cutting out our scalers / HPA

Scenario: We can introduce a new external scaler for testing purposes. By doing so, we could create thousands of ScaledObjects for thousands of test apps and trigger scaling while we load-test the metrics server, as if it were Kubernetes attempting to fetch metrics. A sketch of such a test scaler is shown below.
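
To make this concrete, below is a minimal sketch of what such a test external scaler could look like. It assumes Go stubs generated from KEDA's externalscaler.proto into a hypothetical package pb (the module path, metric name, and port are placeholders); because every response is constant and instant, any latency measured at the metrics server should be KEDA's own:

```go
// A no-op external scaler for benchmarking KEDA itself (sketch).
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"

	// Hypothetical stubs generated from KEDA's externalscaler.proto.
	pb "example.com/keda-bench/externalscaler"
)

type mockScaler struct {
	pb.UnimplementedExternalScalerServer
}

// IsActive always reports the target as active so the scaling loop keeps running.
func (s *mockScaler) IsActive(ctx context.Context, ref *pb.ScaledObjectRef) (*pb.IsActiveResponse, error) {
	return &pb.IsActiveResponse{Result: true}, nil
}

// GetMetricSpec advertises a single synthetic metric with a fixed target.
func (s *mockScaler) GetMetricSpec(ctx context.Context, ref *pb.ScaledObjectRef) (*pb.GetMetricSpecResponse, error) {
	return &pb.GetMetricSpecResponse{
		MetricSpecs: []*pb.MetricSpec{{MetricName: "bench-metric", TargetSize: 10}},
	}, nil
}

// GetMetrics answers instantly with a constant value, adding no latency of its own.
func (s *mockScaler) GetMetrics(ctx context.Context, req *pb.GetMetricsRequest) (*pb.GetMetricsResponse, error) {
	return &pb.GetMetricsResponse{
		MetricValues: []*pb.MetricValue{{MetricName: "bench-metric", MetricValue: 42}},
	}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":6000")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	pb.RegisterExternalScalerServer(srv, &mockScaler{})
	log.Println("mock external scaler listening on :6000")
	log.Fatal(srv.Serve(lis))
}
```

Each test ScaledObject would then point its external trigger at this service.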

An automated service such as k6/Artillery/Azure Load Testing can be used to generate traffic against our metrics server and produce a report of its performance. I have used Azure Load Testing, which is based on JMeter and offers good reporting/comparison between runs, but I'm open to ideas.

Pro:

  • We get an understanding of the performance of our metrics server
  • We don't depend on the performance of the HPA or of external scalers

Con:

  • We'll need to use a lot of compute, but is it really worth it? (🌳🌳)
  • More work, as we need to build all of this along with a test external scaler
  • If results are poor, is KEDA slow, or is the test external scaler?

@tomkerkhove moved this from To Triage to Proposed in Roadmap - KEDA Core Mar 29, 2023
@JorTurFer (Member) commented Mar 29, 2023

I have been talking about this with @MrsDaehin (she is our performance expert, and she is really top-notch):

As we have a single endpoint to be tested (/apis/external.metrics.k8s.io/v1beta1/namespaces/X/Y), the proposal is to generate some test/benchmark cases based on 3 parameters: ScaledObject count, triggers per ScaledObject, and concurrent requests (today the HPA controller is single-threaded, but multi-threading will be released soon). This could give us a view of how KEDA performs in different scenarios.

To achieve this, we have talked about using a tool like go-wrk plus a Go script, as KEDA is all written in Go (to keep a single language for everything), but other tools like Grafana k6 could be more useful than go-wrk if we want to extend those test cases in the future to cover other features like admission webhooks, chaos tests, etc. In any case, as we want to test the metrics server, we propose to execute the tests inside the cluster and not from an external service (externally we would have to deal with authentication, whereas internally we can use a service account token). A rough sketch of such an in-cluster runner follows.
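
As a first approximation (a sketch, not the final tooling), an in-cluster runner could read the mounted service account token and fire concurrent requests at the external metrics endpoint. The namespace, metric name, and ScaledObject name below are placeholders, and the labelSelector format may differ depending on the KEDA version:

```go
// In-cluster benchmark runner (sketch): simulates N concurrent HPA clients
// querying the KEDA metrics server through the Kubernetes API.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"flag"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strings"
	"sync"
	"time"
)

func main() {
	ns := flag.String("namespace", "bench", "namespace of the ScaledObjects (placeholder)")
	metric := flag.String("metric", "s0-bench-metric", "external metric name (placeholder)")
	so := flag.String("scaledobject", "bench-so-0", "ScaledObject name (placeholder)")
	workers := flag.Int("concurrency", 10, "concurrent requesters")
	perWorker := flag.Int("requests", 100, "requests per worker")
	flag.Parse()

	// In-cluster credentials: the pod's service account token + cluster CA.
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}
	ca, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(ca)
	client := &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}},
	}
	bearer := "Bearer " + strings.TrimSpace(string(token))

	// The single endpoint under test, roughly as the HPA would call it.
	q := url.Values{}
	q.Set("labelSelector", "scaledobject.keda.sh/name="+*so)
	endpoint := fmt.Sprintf(
		"https://kubernetes.default.svc/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s?%s",
		*ns, *metric, q.Encode())

	latencies := make(chan time.Duration, (*workers)*(*perWorker))
	var wg sync.WaitGroup
	for w := 0; w < *workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < *perWorker; i++ {
				start := time.Now()
				req, _ := http.NewRequest(http.MethodGet, endpoint, nil)
				req.Header.Set("Authorization", bearer)
				resp, err := client.Do(req)
				if err != nil {
					continue // this sketch only counts completed requests
				}
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
				latencies <- time.Since(start)
			}
		}()
	}
	wg.Wait()
	close(latencies)

	var total time.Duration
	n := 0
	for d := range latencies {
		total += d
		n++
	}
	if n > 0 {
		fmt.Printf("completed %d requests, avg latency %v\n", n, total/time.Duration(n))
	}
}
```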

To avoid being affected by external scalers (and measure only KEDA, not its dependencies), we have some options:

  • Use only internal scalers like cron
  • Use a mock API to simulate an external scaler with different behaviors (this could be useful to benchmark the cache; see the sketch below)
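
For the second option, a minimal sketch of such a mock API could look like the following, assuming it is polled over HTTP (for example by something like KEDA's metrics-api scaler); the route, port, response shape, and delay parameter are all placeholders:

```go
// Mock upstream with tunable behavior (sketch): returns a constant metric
// value after an optional artificial delay.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/metric", func(w http.ResponseWriter, r *http.Request) {
		// e.g. GET /metric?delay=500ms simulates a slow external dependency.
		if d, err := time.ParseDuration(r.URL.Query().Get("delay")); err == nil {
			time.Sleep(d)
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]int{"value": 10})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Varying the delay per ScaledObject would let us observe how the scaler cache behaves against slow versus fast upstreams.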

Depending on the scope that we want to cover, options can be different.

About the compute, I think that we can reduce the impact by creating/destroying the infrastructure on demand and running the test daily/weekly/biweekly. But as you said, performance testing has been a frequent ask, and I guess that having the metrics would be useful for everybody.

Maybe I have left out something from our conversation, @MrsDaehin, so feel free to correct me where you think I'm wrong or to extend whatever you think is necessary.

@MrsDaehin

There are two ways of approaching the problem:

1. Benchmarking: measuring the performance of a component under load while comparing different configurations.
2. Load testing: finding bottlenecks/limits in order to size a given system in a "fixed" state (configuration, system under test, etc.).

So, the standard tools:

For benchmarking:

  • httperf: an old benchmark tool, more or less like Apache Bench.
  • wrk/wrk2: a "modern benchmark tool", plus all the ones developed after it, like the go-wrk I mentioned to @JorTurFer.

For load testing:

  • JMeter and its friends (BlazeMeter Taurus, JMeter-DSL, etc.), all Java-based :( I am more of a fan of Taurus because it is not XML. JMeter is a really popular tool, but it is not code-friendly.
  • k6 is a disruptive tool (previously known as Load Impact) that is really developer-friendly. The tests are written in JavaScript, BUT the modules are distributed by k6; npm packages don't work in k6 (at least not all of them). We can sort that out by creating extensions, which are written in Go, so we can contribute to the community too :)

Advantages of k6 vs JMeter: k6 is code-friendly, has chaos engineering integrated (such as Litmus HTTP faults), and is highly scalable, way more than JMeter, since goroutines scale far better than a JVM.

So consider this, and we can set up a test whenever :)

@Eldarrin (Contributor)

Just an idea: what about jobs (ScaledJobs) rather than ScaledObjects, as they avoid the HPA? A job can be a simple stamper to interface with the metrics gatherer. The last time I did smoke/load testing was about 20 years ago lol, so I may be off base.

@tomkerkhove (Member Author)

k6 is nice, but the only managed offering of it that I know of is https://k6.io/cloud. If we go with JMeter, I'm sure we can use a managed offering for it that gives the same level of insights/reporting, which is not part of the CLI output.

I really want to keep the load testing infra reduced to the minimum - the less we have to manage, the better. Hence my suggestion of Azure Load Testing, as that's a managed offering and we already have an Azure subscription (and this is unrelated to me working for Microsoft).

@javaducky commented Mar 30, 2023

> k6 is nice, but the only managed offering of it that I know of is https://k6.io/cloud.

We do now have Grafana Cloud k6 as a managed offering, so you can have the metrics results there in your Grafana dashboards.

@MrsDaehin

I am using Azure Load Testing and JMeter on a daily basis, but I am not really sure I can run an OS Process Sampler (well, I have tried with a Python script and I wasn't able to) to run the az CLI to access the Key Vault. I'm not really sure how to solve that access from the Azure test engines, tbf. It is the only problem I see with using JMeter/Azure Load Testing.
But again, I am not really sure you need a full load test rather than a benchmark.

@tomkerkhove (Member Author)

> k6 is nice, but the only managed offering of it that I know of is k6.io/cloud.
>
> We do now have Grafana Cloud k6 as a managed offering, so you can have the metrics results there in your Grafana dashboards.

Correct, @javaducky, but we do not have a subscription for it unless Grafana wants to sponsor one for us?

> I am using Azure Load Testing and JMeter on a daily basis, but I am not really sure I can run an OS Process Sampler (well, I have tried with a Python script and I wasn't able to) to run the az CLI to access the Key Vault. I'm not really sure how to solve that access from the Azure test engines, tbf.

Can you elaborate on what you mean please? I'm not sure I get what limitation you are facing.

@JorTurFer (Member)

AFAIK, we can use k6 as an open-source CLI tool, exporting the results somewhere other than Grafana Cloud.
I guess that's enough for us, am I right, @MrsDaehin?

@MrsDaehin

We are using a Grafana dashboard to show the results of k6 in real time, so it should be more than enough if we have a Grafana instance available.

> Can you elaborate on what you mean please? I'm not sure I get what limitation you are facing.

So, as Jorge told me, at some point, in order to authenticate, we need to get a value from a Key Vault? And for that we need to run a script "somehow". In JMeter, the way to run commands is to use the OS Process Sampler, but if you are running the tests in Azure Load Testing, there is no way to reach the OS to execute a command on the test engine.

@JorTurFer (Member) commented Mar 31, 2023

> So, as Jorge told me, at some point, in order to authenticate, we need to get a value from a Key Vault? And for that we need to run a script "somehow". In JMeter, the way to run commands is to use the OS Process Sampler, but if you are running the tests in Azure Load Testing, there is no way to reach the OS to execute a command on the test engine.

No no, we don't need any value from any Key Vault xD. We need a token (from a service account) with enough permissions in the cluster RBAC to request metrics. We can get that token using kubectl or however, but that token isn't static, so it needs to be retrieved on every execution as part of the test arrange step. If we run as a pod inside the cluster, we can bind the required role to the service account and read the service account token from the file system.

Another option could be (if Azure Load Testing supports it) a bash script that gets AKS credentials using some kind of Azure authentication and then uses them to execute the tests or obtain the required token.

@JorTurFer (Member) commented Mar 31, 2023

About Azure Load Testing: is it something that we can automate somehow? We cannot assume that the portal will be available (it isn't, indeed), so we need to create/manage/execute the tests using an API, and also retrieve the results that way, as only MSFT folks can access the portal.

@tomkerkhove (Member Author)

Everything is possible, but the most important point for me is this: it has to be simple, with as minimal infrastructure as we can.

Using one tool and sending its output to another tool that we have to spin up, plus hosting Grafana, is already something I want to avoid, because that means the data has to be stored somewhere, so we'll probably need Prometheus as well. Those are all constantly running resources that we can't use only when we need them - no?

Hence the proposal to keep it simple and use a PaaS/SaaS such as Azure Load Testing; if we can get cloud-based k6, that's fine for me as well.

> About Azure Load Testing: is it something that we can automate somehow? We cannot assume that the portal will be available (it isn't, indeed), so we need to create/manage/execute the tests using an API, and also retrieve the results that way, as only MSFT folks can access the portal.

This gives reporting in the Azure Portal indeed, but it can be called from GitHub Actions, so the integration should be simple - https://github.com/Azure/load-testing

Not sure what customization you want to have, because everything is based on JMeter configuration + defined thresholds.

Results can be exported, and we would only do load testing/benchmarking once a week or month, so I think that should be fine.

@JorTurFer (Member)

> Not sure what customization you want to have, because everything is based on JMeter configuration + defined thresholds.

My only concern about using it is that we should be able to do whatever we need without the Azure Portal; if we can achieve that using Terraform + GitHub Actions, it's totally okay for me 😄
I want to avoid bottlenecks related to having to access something in the portal.

@JorTurFer (Member)

BTW, I have seen that Grafana has a free tier that could cover our requirements:
[image: Grafana Cloud free tier limits]

@javaducky commented Apr 3, 2023

> we do not have a subscription for it unless Grafana wants to sponsor one for us?

Thanks to @JorTurFer for providing the info about the Grafana Free Tier. My bad for not being explicit @tomkerkhove, as this Free Tier is what I meant to convey.

@ppcano commented Apr 3, 2023

Another alternative is detailed in this post; it stores the k6 test summary using the Azure PublishTestResults task.

@JorTurFer (Member) commented Apr 23, 2023

I have been talking with Nicole (from the k6 team at Grafana Labs) about our use case during KubeCon, and she told me that Grafana Labs has an open source program that we can apply to if we run into the free tier's limits, and they will provide us with more resources.
I also asked her whether we can run the agents on our own infrastructure and push the results to Grafana Cloud (to have a place to store the information), and she told me that it's possible, so I'd explore k6 instead of building our own system from scratch.
There are multiple tools we can use for running the benchmarks, but we need to be able to use their outcome easily.

@tomkerkhove (Member Author)

What would be the value of running our own agents? Can you expand on what agents you mean here?

@JorTurFer (Member)

> What would be the value of running our own agents? Can you expand on what agents you mean here?

The principal value I see in running our own agents is that we can use a cluster service account to access the cluster, which makes things easier because we don't need to expose anything; something running inside the cluster already has access to the KEDA endpoints. But the point I wanted to share is that I had been talking with people from the k6 team, and the account for using k6 is not a problem (the free tier should be enough, and we can request an increase as an open source project).

I have been checking Azure Load Testing, and I'm not totally sure how we could access the metrics server from Azure without exposing it externally. Do you have any idea, @MrsDaehin?

We started the issue a month ago and we haven't decided anything yet. I'd hate to see this go stale, as it's important information about how KEDA performs, IMO; maybe we can discuss this during the standup...

@tomkerkhove moved this from Proposed to To Do in Roadmap - KEDA Core May 9, 2023
@tomkerkhove (Member Author) commented May 9, 2023

Most probably we will need to run an agent to be able to access the metrics server (or use a VNET-based service), but we will start with k6 and use Grafana Cloud k6 to get going.

We will ensure that there are enough docs for contributors to use it as well.

stale bot commented Jul 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jul 9, 2023
@tomkerkhove removed the stale label Jul 9, 2023
@tomkerkhove added the feature label Jul 9, 2023
@JorTurFer added the stale-bot-ignore label Jul 9, 2023
@JorTurFer (Member)

@MrsDaehin and I are working on this

@arvinder06
This will be a great feature to have. :)

@JorTurFer (Member) commented Oct 30, 2023

FYI https://github.com/kedacore/keda-performance
The work is in progress 😄

@akhilnr92 commented Feb 3, 2025

Were there any observations from the perf tests, or any information on how many ScaledObjects KEDA can handle?

We have around 2500 ScaledObjects in our cluster using the Azure Service Bus scaler, and the requests for external metrics are timing out at the KEDA operator (we tried increasing the timeout values, CPU and memory limits, etc., but to no avail).

So any details on the maximum number of ScaledObjects people have working in a single cluster would be helpful.

@JorTurFer (Member)

Do you see any errors in the KEDA operator logs? With large clusters, you could see messages announcing client throttling because of the kube-client parameters -> https://keda.sh/docs/latest/operate/cluster/#kubernetes-client-parameters

I know of clusters with 4.5K ScaledObjects working without issues.

@akhilnr92

@JorTurFer Thanks for the reply.

This is the error we are seeing in the KEDA operator:

```
ERROR azure_servicebus_scaler error getting service bus entity length {"type": "ScaledObject", "namespace": "testnamespace", "name": "testscaledobject", "error": "Get \"https://testsb.servicebus.windows.net/test/Subscriptions/test?api-version=2021-05\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/azure_servicebus_scaler.go:267
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:151
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:758
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
	/workspace/pkg/scaling/scale_handler.go:633
```

We are not seeing any throttling-related error messages in the logs, so we have not tried anything related to the kube-client parameters.

We have confirmed that when the error occurs, these requests are not reaching the Azure Service Bus side. At the same time, requests do reach the Service Bus endpoint if we try from another pod on the same cluster and node.
So we believe the issue has to do with the KEDA operator pod, where the requests are timing out even before leaving the operator pod.

We have also tried the following settings, but with no luck:

  • setting KEDA_HTTP_DISABLE_KEEP_ALIVE to true
  • increasing KEDA_HTTP_DEFAULT_TIMEOUT from the default 3 seconds to 30 seconds
  • increasing the pollingInterval setting inside our ScaledObject to 90 seconds (we have not tried useCachedMetrics yet)

@JorTurFer (Member)

Are you scraping the Prometheus metrics generated by KEDA?
Could you share the average value of keda_internal_scale_loop_latency_seconds? KEDA also exposes it via OTel as keda.internal.scale.loop.latency.seconds.

@akhilnr92

We are currently not using Prometheus or OTel. Are there any guides or docs available on implementing this to get the metrics?

@zroubalik (Member)

It's standard metrics collection, nothing specific to KEDA; if you follow any guide on Prometheus or OpenTelemetry, you should be able to get them easily.

These are all exposed Prometheus metrics: https://keda.sh/docs/2.16/integrations/prometheus/ and OTel: https://keda.sh/docs/2.16/integrations/opentelemetry/

@akhilnr92

This is the graph for avg(keda_internal_scale_loop_latency_seconds):

[image: scale loop latency graph]

@JorTurFer (Member)

That chart shows that your KEDA operator is overloaded, and each check loop is delayed by some seconds from the expected time. This can happen for 2 reasons:

  • The KEDA operator doesn't have enough CPU to process everything in time
  • The upstream is responding more slowly than the expected pollingInterval

What is the value of the metric keda_scaler_metrics_latency_seconds? (This measures the upstream response times.) Do you see CPU throttling affecting the operator pod?

@akhilnr92

Below is the graph showing the average of keda_scaler_metrics_latency_seconds:

[image: scaler metrics latency graph]

We had initially seen the KEDA operator pod going into CrashLoopBackOff due to OOM, and we had increased both memory and CPU to the values below in the KEDA operator deployment:

```yaml
resources:
  limits:
    cpu: "2"
    memory: 3000Mi
  requests:
    cpu: "1"
    memory: 2000Mi
```

Currently we don't see any high CPU or Memory usage in the operator pod:

[images: operator pod CPU and memory usage graphs]

@JorTurFer (Member)

Did the issue happen during the time of the charts? I mean, do you have metrics during the timeouts?
