
Provide automated load testing of KEDA #4411

Open · tomkerkhove opened this issue Mar 29, 2023 · 35 comments

Labels: feature (All issues for new features that have been committed to), stale-bot-ignore (All issues that should not be automatically closed by our stale bot)

@tomkerkhove (Member)

A frequent ask during our CNCF Graduation discussions is whether we have load testing / performance benchmarking of KEDA, which we do not have today.

There are a few reasons for that:

  • KEDA integrates with a plethora of external dependencies, which can all influence our performance and give an inconsistent/false impression
  • KEDA relies on Kubernetes' HPA, so we want to avoid doing load testing of the HPA itself
  • We know KEDA is used in large deployments with thousands of ScaledObjects

However, the latter is purely informative and not something that we manage, so I have opened #4410 so that end users can chime in and share, if they are comfortable doing so.

I do believe this is a reasonable ask but it is not something that we can easily do; however, I like challenges.

Let's brainstorm approaches!

@tomkerkhove (Member Author)

First idea:

Benchmark metric server by cutting out our scalers / HPA

Scenario: We can introduce a new external scaler for testing purposes. By doing so, we could create thousands of ScaledObjects for thousands of test apps and trigger scaling while we load-test the metrics server, as if it were Kubernetes attempting to fetch metrics. A sketch of such a test scaler is shown below.
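
To make this concrete, below is a minimal sketch of what such a test external scaler could look like. It assumes Go stubs generated from KEDA's externalscaler.proto into a hypothetical package pb (the module path, metric name, and port are placeholders); because every response is constant and instant, any latency measured at the metrics server should be KEDA's own:

```go
// A no-op external scaler for benchmarking KEDA itself (sketch).
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"

	// Hypothetical stubs generated from KEDA's externalscaler.proto.
	pb "example.com/keda-bench/externalscaler"
)

type mockScaler struct {
	pb.UnimplementedExternalScalerServer
}

// IsActive always reports the target as active so the scaling loop keeps running.
func (s *mockScaler) IsActive(ctx context.Context, ref *pb.ScaledObjectRef) (*pb.IsActiveResponse, error) {
	return &pb.IsActiveResponse{Result: true}, nil
}

// GetMetricSpec advertises a single synthetic metric with a fixed target.
func (s *mockScaler) GetMetricSpec(ctx context.Context, ref *pb.ScaledObjectRef) (*pb.GetMetricSpecResponse, error) {
	return &pb.GetMetricSpecResponse{
		MetricSpecs: []*pb.MetricSpec{{MetricName: "bench-metric", TargetSize: 10}},
	}, nil
}

// GetMetrics answers instantly with a constant value, adding no latency of its own.
func (s *mockScaler) GetMetrics(ctx context.Context, req *pb.GetMetricsRequest) (*pb.GetMetricsResponse, error) {
	return &pb.GetMetricsResponse{
		MetricValues: []*pb.MetricValue{{MetricName: "bench-metric", MetricValue: 42}},
	}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":6000")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	pb.RegisterExternalScalerServer(srv, &mockScaler{})
	log.Println("mock external scaler listening on :6000")
	log.Fatal(srv.Serve(lis))
}
```

Each test ScaledObject would then point its external trigger at this service.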

An automated service such as k6/Artillery/Azure Load Testing can be used to generate traffic against our metrics server and produce a report of its performance. I have used Azure Load Testing, which is based on JMeter and offers good reporting/comparison between runs, but I'm open to ideas.

Pro:

  • We get an understanding of the performance of our metrics server
  • We don't depend on the performance of the HPA or of external scalers

Con:

  • We'll need to use a lot of compute, but is it really worth it? (🌳🌳)
  • More work, as we need to build all of this along with a test external scaler
  • If results are poor, is KEDA slow, or is the test external scaler?

@tomkerkhove moved this from To Triage to Proposed in Roadmap - KEDA Core Mar 29, 2023
@JorTurFer (Member) commented Mar 29, 2023

I have been talking about this with @MrsDaehin (she is our performance expert, and she is really top-notch):

As we have a single endpoint to be tested (/apis/external.metrics.k8s.io/v1beta1/namespaces/X/Y), the proposal is to generate some test/benchmark cases based on 3 parameters: ScaledObject count, triggers per ScaledObject, and concurrent requests (today the HPA controller is single-threaded, but multi-threading will be released soon). This could give us a view of how KEDA performs in different scenarios.

To achieve this, we have talked about using a tool like go-wrk plus a Go script, as KEDA is all written in Go (to keep a single language for everything), but other tools like Grafana k6 could be more useful than go-wrk if we want to extend those test cases in the future to cover other features like admission webhooks, chaos tests, etc. In any case, as we want to test the metrics server, we propose to execute the tests inside the cluster and not from an external service (externally we would have to deal with authentication, whereas internally we can use a service account token). A rough sketch of such an in-cluster runner follows.
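
As a first approximation (a sketch, not the final tooling), an in-cluster runner could read the mounted service account token and fire concurrent requests at the external metrics endpoint. The namespace, metric name, and ScaledObject name below are placeholders, and the labelSelector format may differ depending on the KEDA version:

```go
// In-cluster benchmark runner (sketch): simulates N concurrent HPA clients
// querying the KEDA metrics server through the Kubernetes API.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"flag"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strings"
	"sync"
	"time"
)

func main() {
	ns := flag.String("namespace", "bench", "namespace of the ScaledObjects (placeholder)")
	metric := flag.String("metric", "s0-bench-metric", "external metric name (placeholder)")
	so := flag.String("scaledobject", "bench-so-0", "ScaledObject name (placeholder)")
	workers := flag.Int("concurrency", 10, "concurrent requesters")
	perWorker := flag.Int("requests", 100, "requests per worker")
	flag.Parse()

	// In-cluster credentials: the pod's service account token + cluster CA.
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}
	ca, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(ca)
	client := &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}},
	}
	bearer := "Bearer " + strings.TrimSpace(string(token))

	// The single endpoint under test, roughly as the HPA would call it.
	q := url.Values{}
	q.Set("labelSelector", "scaledobject.keda.sh/name="+*so)
	endpoint := fmt.Sprintf(
		"https://kubernetes.default.svc/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s?%s",
		*ns, *metric, q.Encode())

	latencies := make(chan time.Duration, (*workers)*(*perWorker))
	var wg sync.WaitGroup
	for w := 0; w < *workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < *perWorker; i++ {
				start := time.Now()
				req, _ := http.NewRequest(http.MethodGet, endpoint, nil)
				req.Header.Set("Authorization", bearer)
				resp, err := client.Do(req)
				if err != nil {
					continue // this sketch only counts completed requests
				}
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
				latencies <- time.Since(start)
			}
		}()
	}
	wg.Wait()
	close(latencies)

	var total time.Duration
	n := 0
	for d := range latencies {
		total += d
		n++
	}
	if n > 0 {
		fmt.Printf("completed %d requests, avg latency %v\n", n, total/time.Duration(n))
	}
}
```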

To avoid being affected by external scalers (and measure only KEDA, not its dependencies), we have some options:

  • Use only internal scalers like cron
  • Use a mock API to simulate an external scaler with different behaviors (this could be useful to benchmark the cache; see the sketch below)
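
For the second option, a minimal sketch of such a mock API could look like the following, assuming it is polled over HTTP (for example by something like KEDA's metrics-api scaler); the route, port, response shape, and delay parameter are all placeholders:

```go
// Mock upstream with tunable behavior (sketch): returns a constant metric
// value after an optional artificial delay.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/metric", func(w http.ResponseWriter, r *http.Request) {
		// e.g. GET /metric?delay=500ms simulates a slow external dependency.
		if d, err := time.ParseDuration(r.URL.Query().Get("delay")); err == nil {
			time.Sleep(d)
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]int{"value": 10})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Varying the delay per ScaledObject would let us observe how the scaler cache behaves against slow versus fast upstreams.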

Depending on the scope that we want to cover, options can be different.

About the compute, I think that we can reduce the impact by creating/destroying the infrastructure on demand and running the test daily/weekly/biweekly. But as you said, performance testing has been a frequent ask, and I guess that having the metrics would be useful for everybody.

Maybe I have left out something from our conversation, @MrsDaehin, so feel free to correct me where you think I'm wrong or to extend whatever you think is necessary.

@MrsDaehin

There are two ways of approaching the problem:

1. Benchmarking: measuring the performance of a component under load while comparing different configurations.
2. Load testing: finding bottlenecks/limits in order to size a given system in a "fixed" state (configuration, system under test, etc.).

So, the standard tools:

For benchmarking:

  • httperf: an old benchmark tool, more or less like Apache Bench.
  • wrk/wrk2: a "modern benchmark tool", plus all the ones developed after it, like the go-wrk I mentioned to @JorTurFer.

For load testing:

  • JMeter and its friends (BlazeMeter Taurus, JMeter-DSL, etc.), all Java-based :( I am more of a fan of Taurus because it is not XML. JMeter is a really popular tool, but it is not code-friendly.
  • k6 is a disruptive tool (previously known as Load Impact) that is really developer-friendly. The tests are written in JavaScript, BUT the modules are distributed by k6; npm packages don't work in k6 (at least not all of them). We can sort that out by creating extensions, which are written in Go, so we can contribute to the community too :)

Advantages of k6 vs JMeter: k6 is code-friendly, has chaos engineering integrated (such as Litmus HTTP faults), and is highly scalable, way more than JMeter, since goroutines scale far better than a JVM.

So consider this, and we can set up a test whenever :)

@Eldarrin (Contributor)

Just an idea: what about jobs (ScaledJobs) rather than ScaledObjects, as they avoid the HPA? A job can be a simple stamper to interface with the metrics gatherer. The last time I did smoke/load testing was about 20 years ago lol, so I may be off base.

@tomkerkhove (Member Author)

k6 is nice, but the only managed offering of it that I know of is https://k6.io/cloud. If we go with JMeter, I'm sure we can use a managed offering for it that gives the same level of insights/reporting, which is not part of the CLI output.

I really want to keep the load testing infra reduced to the minimum - the less we have to manage, the better. Hence my suggestion of Azure Load Testing, as that's a managed offering and we already have an Azure subscription (and this is unrelated to me working for Microsoft).

@javaducky commented Mar 30, 2023

> k6 is nice, but the only managed offering of it that I know of is https://k6.io/cloud.

We do now have Grafana Cloud k6 as a managed offering, so you can have the metrics results there in your Grafana dashboards.

@MrsDaehin

I am using Azure Load Testing and JMeter on a daily basis, but I am not really sure I can run an OS Process Sampler (well, I have tried with a Python script and I wasn't able to) to run the az CLI to access the Key Vault. I'm not really sure how to solve that access from the Azure test engines, tbf. It is the only problem I see with using JMeter/Azure Load Testing.
But again, I am not really sure you need a full load test rather than a benchmark.

@tomkerkhove (Member Author)

> k6 is nice, but the only managed offering of it that I know of is k6.io/cloud.
>
> We do now have Grafana Cloud k6 as a managed offering, so you can have the metrics results there in your Grafana dashboards.

Correct, @javaducky, but we do not have a subscription for it unless Grafana wants to sponsor one for us?

> I am using Azure Load Testing and JMeter on a daily basis, but I am not really sure I can run an OS Process Sampler (well, I have tried with a Python script and I wasn't able to) to run the az CLI to access the Key Vault. I'm not really sure how to solve that access from the Azure test engines, tbf.

Can you elaborate on what you mean please? I'm not sure I get what limitation you are facing.

@JorTurFer (Member)

AFAIK, we can use k6 as an open-source CLI tool, exporting the results somewhere other than Grafana Cloud.
I guess that's enough for us, am I right, @MrsDaehin?

@MrsDaehin

We are using a Grafana dashboard to show the results of k6 in real time, so it should be more than enough if we have a Grafana instance available.

> Can you elaborate on what you mean please? I'm not sure I get what limitation you are facing.

So, as Jorge told me, at some point, in order to authenticate, we need to get a value from a Key Vault? And for that we need to run a script "somehow". In JMeter, the way to run commands is to use the OS Process Sampler, but if you are running the tests in Azure Load Testing, there is no way to reach the OS to execute a command on the test engine.

@JorTurFer (Member) commented Mar 31, 2023

> So, as Jorge told me, at some point, in order to authenticate, we need to get a value from a Key Vault? And for that we need to run a script "somehow". In JMeter, the way to run commands is to use the OS Process Sampler, but if you are running the tests in Azure Load Testing, there is no way to reach the OS to execute a command on the test engine.

No no, we don't need any value from any Key Vault xD. We need a token (from a service account) with enough permissions in the cluster RBAC to request metrics. We can get that token using kubectl or however, but that token isn't static, so it needs to be retrieved on every execution as part of the test arrange step. If we run as a pod inside the cluster, we can bind the required role to the service account and read the service account token from the file system.

Another option could be (if Azure Load Testing supports it) a bash script that gets AKS credentials using some kind of Azure authentication and then uses them to execute the tests or obtain the required token.

@JorTurFer (Member) commented Mar 31, 2023

About Azure Load Testing: is it something that we can automate somehow? We cannot assume that the portal will be available (it isn't, indeed), so we need to create/manage/execute the tests using an API, and also retrieve the results that way, as only MSFT folks can access the portal.

@tomkerkhove (Member Author)

Everything is possible, but the most important point for me is this: it has to be simple, with as minimal infrastructure as we can.

Using one tool and sending its output to another tool that we have to spin up, plus hosting Grafana, is already something I want to avoid, because that means the data has to be stored somewhere, so we'll probably need Prometheus as well. Those are all constantly running resources that we can't use only when we need them - no?

Hence the proposal to keep it simple and use a PaaS/SaaS such as Azure Load Testing; if we can get cloud-based k6, that's fine for me as well.

> About Azure Load Testing: is it something that we can automate somehow? We cannot assume that the portal will be available (it isn't, indeed), so we need to create/manage/execute the tests using an API, and also retrieve the results that way, as only MSFT folks can access the portal.

This gives reporting in the Azure Portal indeed, but it can be called from GitHub Actions, so the integration should be simple - https://github.com/Azure/load-testing

Not sure what customization you want to have, because everything is based on JMeter configuration + defined thresholds.

Results can be exported, and we would only do load testing/benchmarking once a week or month, so I think that should be fine.

@JorTurFer (Member)

> Not sure what customization you want to have, because everything is based on JMeter configuration + defined thresholds.

My only concern about using it is that we should be able to do whatever we need without the Azure Portal; if we can achieve that using Terraform + GitHub Actions, it's totally okay for me 😄
I want to avoid bottlenecks related to having to access something in the portal.

@JorTurFer (Member)

BTW, I have seen that Grafana has a free tier that could cover our requirements:
[image: Grafana Cloud free tier limits]

@javaducky commented Apr 3, 2023

> we do not have a subscription for it unless Grafana wants to sponsor one for us?

Thanks to @JorTurFer for providing the info about the Grafana Free Tier. My bad for not being explicit @tomkerkhove, as this Free Tier is what I meant to convey.

@ppcano commented Apr 3, 2023

Another alternative is detailed in this post; it stores the k6 test summary using the Azure PublishTestResults task.

@JorTurFer (Member) commented Apr 23, 2023

I have been talking with Nicole (from the k6 team at Grafana Labs) about our use case during KubeCon, and she told me that Grafana Labs has an open source program that we can apply to if we run into the free tier's limits, and they will provide us with more resources.
I also asked her whether we can run the agents on our own infrastructure and push the results to Grafana Cloud (to have a place to store the information), and she told me that it's possible, so I'd explore k6 instead of building our own system from scratch.
There are multiple tools we can use for running the benchmarks, but we need to be able to use their outcome easily.

@tomkerkhove (Member Author)

What would be the value of running our own agents? Can you expand on what agents you mean here?

@JorTurFer (Member)

> What would be the value of running our own agents? Can you expand on what agents you mean here?

The principal value I see in running our own agents is that we can use a cluster service account to access the cluster, which makes things easier because we don't need to expose anything; something running inside the cluster already has access to the KEDA endpoints. But the point I wanted to share is that I had been talking with people from the k6 team, and the account for using k6 is not a problem (the free tier should be enough, and we can request an increase as an open source project).

I have been checking Azure Load Testing, and I'm not totally sure how we could access the metrics server from Azure without exposing it externally. Do you have any idea, @MrsDaehin?

We started the issue a month ago and we haven't decided anything yet. I'd hate to see this go stale, as it's important information about how KEDA performs, IMO; maybe we can discuss this during the standup...

@tomkerkhove moved this from Proposed to To Do in Roadmap - KEDA Core May 9, 2023
@tomkerkhove (Member Author) commented May 9, 2023

Most probably we will need to run an agent to be able to access the metrics server (or use a VNET-based service), but we will start with k6 and use Grafana Cloud k6 to get going.

We will ensure that there are enough docs for contributors to use it as well.

stale bot commented Jul 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jul 9, 2023
@tomkerkhove removed the stale label Jul 9, 2023
@tomkerkhove added the feature label Jul 9, 2023
@JorTurFer added the stale-bot-ignore label Jul 9, 2023
@JorTurFer (Member)

@MrsDaehin and I are working on this

@arvinder06
This will be a great feature to have. :)

@JorTurFer (Member) commented Oct 30, 2023

FYI https://github.com/kedacore/keda-performance
The work is in progress 😄

@akhilnr92 commented Feb 3, 2025

Were there any observations from the perf tests, or any information on how many ScaledObjects KEDA can handle?

We have around 2500 ScaledObjects in our cluster using the Azure Service Bus scaler, and the requests for external metrics are timing out at the KEDA operator (we tried increasing the timeout values, CPU and memory limits, etc., but to no avail).

So any details on the maximum number of ScaledObjects people have working in a single cluster would be helpful.

@JorTurFer (Member)

Do you see any errors in the KEDA operator logs? With large clusters, you could see messages announcing client throttling because of the kube-client parameters -> https://keda.sh/docs/latest/operate/cluster/#kubernetes-client-parameters

I know of clusters with 4.5K ScaledObjects working without issues.

@akhilnr92

@JorTurFer Thanks for the reply.

This is the error we are seeing in the KEDA operator:

```
ERROR azure_servicebus_scaler error getting service bus entity length {"type": "ScaledObject", "namespace": "testnamespace", "name": "testscaledobject", "error": "Get \"https://testsb.servicebus.windows.net/test/Subscriptions/test?api-version=2021-05\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*azureServiceBusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/azure_servicebus_scaler.go:267
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:151
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:758
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
	/workspace/pkg/scaling/scale_handler.go:633
```

We are not seeing any throttling-related error messages in the logs, so we have not tried anything related to the kube-client parameters.

We have confirmed that when the error occurs, these requests are not reaching the Azure Service Bus side. At the same time, requests do reach the Service Bus endpoint if we try from another pod on the same cluster and node.
So we believe the issue has to do with the KEDA operator pod, where the requests are timing out even before leaving the operator pod.

We have also tried the following settings, but with no luck:

  • setting KEDA_HTTP_DISABLE_KEEP_ALIVE to true
  • increasing KEDA_HTTP_DEFAULT_TIMEOUT from the default 3 seconds to 30 seconds
  • increasing the pollingInterval setting inside our ScaledObject to 90 seconds (we have not tried useCachedMetrics yet)

@JorTurFer (Member)

Are you scraping the Prometheus metrics generated by KEDA?
Could you share the average value of keda_internal_scale_loop_latency_seconds? KEDA also exposes it via OTel as keda.internal.scale.loop.latency.seconds.

@akhilnr92

We are currently not using Prometheus or OTel. Are there any guides or docs available on implementing this to get the metrics?

@zroubalik (Member)

It's standard metrics collection, nothing specific to KEDA; if you follow any guide on Prometheus or OpenTelemetry, you should be able to get them easily.

These are all exposed Prometheus metrics: https://keda.sh/docs/2.16/integrations/prometheus/ and OTel: https://keda.sh/docs/2.16/integrations/opentelemetry/

@akhilnr92

This is the graph for avg(keda_internal_scale_loop_latency_seconds):

[image: scale loop latency graph]

@JorTurFer (Member)

That chart shows that your KEDA operator is overloaded, and each check loop is delayed by some seconds from the expected time. This can happen for 2 reasons:

  • The KEDA operator doesn't have enough CPU to process everything in time
  • The upstream is responding more slowly than the expected pollingInterval

What is the value of the metric keda_scaler_metrics_latency_seconds? (This measures the upstream response times.) Do you see CPU throttling affecting the operator pod?

@akhilnr92

Below is the graph showing the average of keda_scaler_metrics_latency_seconds:

[image: scaler metrics latency graph]

We had initially seen the KEDA operator pod going into CrashLoopBackOff due to OOM, and we had increased both memory and CPU to the values below in the KEDA operator deployment:

```yaml
resources:
  limits:
    cpu: "2"
    memory: 3000Mi
  requests:
    cpu: "1"
    memory: 2000Mi
```

Currently we don't see any high CPU or Memory usage in the operator pod:

[images: operator pod CPU and memory usage graphs]

@JorTurFer (Member)

Did the issue happen during the time of the charts? I mean, do you have metrics during the timeouts?
