Deploy tektoncd/pipeline in high concurrent request scenario #1281
Comments
I'm definitely also very interested in exploring the scaling options/limits for the controller. I've run a number of load tests to measure queue latency under load (specifically for TaskRuns), and the improvements in v0.6 were very welcome. It could be helpful to define a scaling target we'd like to reach, then identify the steps to take to get there. In my tests I created 500 TaskRuns at 2 QPS. Increasing the threads per controller should help too, but we might also want to increase the resources we request for the controller. I don't think we should change the defaults themselves, but rather document how they can be increased. Users who have lower scaling needs and smaller clusters won't need the increased performance, and might pay a cost for it.
Glad to see the optimizations brought by the version upgrade. We want to use Tekton to run build tasks (such as code2image). Dozens of tasks will be submitted concurrently at peak business hours. At the same time, the time from applying the Tekton CRDs to the k8s pod entering the Running state is sensitive for us. So I think parameter configuration may be a better solution.
Can you provide more information about what peak QPS you're sending, and what performance you're seeing that you'd like to see improved? It would be helpful to get real-world data, to help set a sort of informal SLO for TaskRun queue performance. IIRC the default K8s pod scheduler enforces a 2 QPS limit for pod creations, so short of writing our own scheduler (which we could do...), that's probably going to be a theoretical ceiling on how quickly we can start TaskRuns. IOW, if you're creating TaskRuns at a rate >2 QPS, even with an optimally-tuned Tekton controller you'll see Pod creation back up at a linear rate.
Apologies for the late reply. At first, in a k8s cluster built with minikube, I submitted the pod create requests in batches and reproduced the behavior described above ("IIRC the default K8s pod scheduler enforces a 2 QPS limit for pod creations"): the QPS was between 2 and 3. But in the k8s version in our internal production environment, we have made some optimizations to the pod scheduler to improve its QPS, so here we assume its speed is acceptable. Since we were using v0.5.x and have not yet upgraded to v0.6.0, the following description is based on v0.5.x. When I submit a simple Task yaml without inputs and outputs, then submit TaskRuns concurrently (20, 40, or 60 TaskRuns) and observe the interval between the pod's startTime and the apply request, I find that the pod scheduling speed is below 1 QPS. In the processNextWorkItem method, when a work item is consumed, a "Time taken" style log is emitted. I found that a single reconcile can take more than 100ms, and some even exceed one second. Given how small the number of threads is, could the controller's I/O throughput become the bottleneck?
For example:
/kind question
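To make the processNextWorkItem discussion above concrete, here is a simplified, self-contained sketch of a knative-style work-queue loop using the classic client-go workqueue API. It is not Tekton's or knative/pkg's actual code; the key names and the 150ms sleep are illustrative. It shows where a "Time taken" style log comes from and why only 2 worker goroutines per controller can become a bottleneck when individual reconciles take hundreds of milliseconds.

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// processNextWorkItem pulls one key off the queue, runs the reconcile
// function for it, and logs how long that took -- the "Time taken" style
// log mentioned above. It returns false once the queue has been shut down.
func processNextWorkItem(q workqueue.RateLimitingInterface, reconcile func(key string) error) bool {
	item, shutdown := q.Get()
	if shutdown {
		return false
	}
	defer q.Done(item)

	key := item.(string)
	start := time.Now()
	if err := reconcile(key); err != nil {
		q.AddRateLimited(key) // retry later with backoff
		log.Printf("reconcile of %q failed after %v: %v", key, time.Since(start), err)
		return true
	}
	q.Forget(key)
	log.Printf("reconcile of %q succeeded. Time taken: %v", key, time.Since(start))
	return true
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	for _, key := range []string{"ns/tr-1", "ns/tr-2", "ns/tr-3", "ns/tr-4"} {
		q.Add(key)
	}

	// Only this many goroutines drain the queue concurrently. With the
	// default of 2, a single slow reconcile ties up half of the controller's
	// processing capacity.
	const threadsPerController = 2
	for i := 0; i < threadsPerController; i++ {
		go func() {
			for processNextWorkItem(q, func(key string) error {
				time.Sleep(150 * time.Millisecond) // stand-in for real reconcile work
				return nil
			}) {
			}
		}()
	}

	time.Sleep(time.Second)
	q.ShutDown()
}
```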
Hi all. Since pipeline version 0.0.8, I found that the Prometheus metrics feature was introduced. As mentioned above, I found that there are sometimes logs showing reconciles taking several hundred milliseconds or even several seconds. I guess that when the concurrency of PipelineRuns/TaskRuns is large, the throughput of the Tekton controller will become the bottleneck. I modified the threads per controller to 64 and deployed Tekton in two different k8s clusters. The first one receives very little traffic, and I found that most reconcile actions take less than 100ms.
But the second cluster receives heavy traffic (almost 30~40 running PipelineRuns), and the reconcile times are much higher.
I'd like to ask how I can improve Tekton's throughput. It seems that multiple Tekton pipeline controllers cannot be deployed, because the CRDs would be reconciled repeatedly. I sincerely hope to get your reply. @imjasonh @vdemeester
The performance at 30-40 concurrent pipelines is actively killing our ability to set up a CI service at scale as well. @cccfeng how much better is the performance with 64 threads vs. the default of 2 (based on searching the vendored knative dependency in this repo)?
Maybe you can look at the default QPS of the kubeClient. When many PipelineRuns/TaskRuns are submitted simultaneously, we found that the first bottleneck is the kubeClient: the QPS allocated to each controller by default is only 5. Since the reconcile code relies on the k8s apiserver, it gets throttled on the client side. So I suggest you first try increasing this value by 10-20x, then observe whether the controller work queue stops backing up and the reconcile delay stays below 100ms. Keep the reconcile delay low and the reconcile throughput will increase.
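To illustrate the client-side throttling being described: in client-go these limits live on rest.Config (QPS defaults to 5 and Burst to 10 when unset), so bumping them looks roughly like the sketch below. This is a generic client-go sketch, not Tekton's actual wiring, and the 50/100 values are only an example of the 10-20x increase suggested above.

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config; a kubeconfig-based config works the same way.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}

	// client-go defaults to QPS=5 and Burst=10 when these are unset; any
	// requests beyond that are throttled on the client side before they
	// ever reach the apiserver.
	cfg.QPS = 50
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = client // clients and informers built from cfg now share the higher limits
}
```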
Hi @cccfeng and all, I also would like to increase the client QPS. Can you please tell me how you modified yours? Is there any env var or property exposed by Tekton that I can set on the Tekton deployment? Thanks!
Hi all, we are running a concurrency test against the Tekton controller using 100 TaskRuns, and we found there is a big delay when many TaskRuns are created at once. After I changed the two settings discussed above (the threads per controller and the kubeClient QPS/Burst) and rebuilt a test image, the Tekton performance improved very much! The average TaskRun execution time dropped from 100s to 30s, which is really cool! But there is a problem: Tekton uses hard-coded values from the vendored code for these settings. Does anyone know whether there is a way to pass custom values for them, such as an env var or a property on the Tekton deployment? It is bad if there is no way to customize these settings and Tekton only supports 2 threads... that is not good for production-level usage. Thanks!
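For context on what such a rebuild changes: the thread count in question is the exported DefaultThreadsPerController variable in knative.dev/pkg/controller (older vendored copies use the github.com/knative/pkg import path). A patched cmd/controller/main.go might, roughly, do something like the sketch below; the value 32 is illustrative and this is not the actual upstream code.

```go
package main

import (
	"knative.dev/pkg/controller"
)

func main() {
	// DefaultThreadsPerController is an exported package-level variable
	// (default 2). Raising it before the controllers are constructed
	// increases the number of worker goroutines each controller uses to
	// drain its work queue.
	controller.DefaultThreadsPerController = 32

	// ... the rest of the existing controller setup (informers, reconcilers,
	// shared main) would follow unchanged.
}
```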
cc @mattmoor who has done some work scaling the knative/pkg code that Tekton uses, and might have insights about how to proceed. If these are the right settings to tweak to improve performance, it makes sense to me to expose them as either env vars or ConfigMap values.
Generally the caution I'd give when adjusting these things would be: you may be treating the symptoms instead of the disease. I'd make sure that the reconciliations are properly using lister caches and carefully auditing client requests (including K8s events) before just jacking these values up.
Thanks, and I agree, but the problem here is that the user cannot customize these settings at all right now. Maybe 100 is too high, maybe it should be based on the cache/request pattern, etc. But if I find the default values are not good for my use case, I should be able to change them myself.
With knative/pkg at HEAD it's possible to override these at the rest config level, but I agree that exposing flags to override these is also probably reasonable (certainly to treat the "symptoms" until the "disease" is cured). I'd probably define the flags here and, if they are set, use them to override the values on the rest config.
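A rough sketch of what such flag-based overrides on top of the rest config could look like is below. The flag names and the applyOverrides helper are hypothetical, chosen for illustration; the real plumbing would live in Tekton's cmd/controller/main.go and knative.dev/pkg.

```go
package main

import (
	"flag"
	"log"

	"k8s.io/client-go/rest"
)

// Hypothetical flag names; the actual names and wiring would be decided in
// the change that adds them.
var (
	kubeAPIQPS   = flag.Float64("kube-api-qps", 0, "If set, overrides the client QPS used to talk to the apiserver")
	kubeAPIBurst = flag.Int("kube-api-burst", 0, "If set, overrides the client burst used to talk to the apiserver")
)

// applyOverrides mutates the rest config with any flag-provided values before
// clients and controllers are built from it.
func applyOverrides(cfg *rest.Config) {
	if *kubeAPIQPS > 0 {
		cfg.QPS = float32(*kubeAPIQPS)
	}
	if *kubeAPIBurst > 0 {
		cfg.Burst = *kubeAPIBurst
	}
}

func main() {
	flag.Parse()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	applyOverrides(cfg)
	// ... hand cfg to the controller setup from here.
}
```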
Cool, thanks Matt! BTW, when will this change be available in a knative/pkg release that Tekton can pick up? Thanks!
We cut Knative releases every 6 weeks, and pkg always cuts a week before serving/eventing. The next pkg cut is Tuesday, so if your change is in by then, it will be in the release.
Cool, thanks Matt! I hope Tekton can pick up this new change in a new release soon. Thanks!
v0.15.0 resolves one issue where the PipelineRun controller would fetch ConfigMaps via the API instead of using the config store: #2947. It would be interesting to see whether that had an impact on performance. @imjasonh @zhangtbj I think it would be valuable to have your performance tests executed against nightly releases, and the results collected and graphed, so that we could monitor changes over time. @mrutkows @NavidZ for info, since you've been looking into metrics and tracing.
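For reference, the difference #2947 points at is roughly the one sketched below: reading ConfigMaps from an informer-backed lister cache (in line with the lister-cache caution earlier in the thread) instead of fetching them from the apiserver on every reconcile. This is a generic client-go sketch, not Tekton's actual config-store code, and the namespace/name values are only illustrative.

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// A shared informer keeps a local cache of ConfigMaps up to date via a
	// single watch, so reconcilers can read from the lister instead of
	// issuing a GET to the apiserver on every reconcile.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	cmLister := factory.Core().V1().ConfigMaps().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Cache read: no apiserver round trip, no client-side throttling.
	cm, err := cmLister.ConfigMaps("tekton-pipelines").Get("config-defaults")
	if err != nil {
		log.Fatalf("configmap not in cache: %v", err)
	}
	log.Printf("read %s/%s from the lister cache", cm.Namespace, cm.Name)
}
```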
@afrittoli Do you have any idea when the Pipelines code base will pick up this change in knative.dev/pkg so that these parameters can be configured externally?
As far as I know, the QPS, Burst, and thread count can be configured since newer Tekton releases.
Expected Behavior
When I deploy tektoncd/pipeline in a high-concurrency scenario (TaskRuns/PipelineRuns submitted concurrently), it would be better to be able to increase the number of controller threads so that queue work items are processed faster.
Actual Behavior
From v0.5.x to v0.6.0, the code in cmd/controller/main.go was moved to github.com/knative/pkg, and the param threadsPerController was changed to DefaultThreadsPerController, which defaults to 2.
Additional Info
Now I have some questions about how to increase the thread number.