
Deploy tektoncd/pipeline in high concurrent request scenario #1281

Closed
cccfeng opened this issue Sep 6, 2019 · 21 comments
Labels
area/performance Issues or PRs that are related to performance aspects. kind/question Issues or PRs that are questions around the project or a particular feature

Comments

@cccfeng
Contributor

cccfeng commented Sep 6, 2019

Expected Behavior

When I deploy tektoncd/pipeline in a high-concurrency scenario (TaskRuns/PipelineRuns are submitted concurrently), it would be better to be able to increase the number of controller threads so that queue work items are processed faster.

Actual Behavior

From v0.5.x to v0.6.x, the code in cmd/controller/main.go was moved to github.com/knative/pkg, and the parameter threadsPerController was replaced by DefaultThreadsPerController, which defaults to 2.

Additional Info

Now I have some questions about how to increase the thread count.

@imjasonh
Member

imjasonh commented Sep 6, 2019

I'm definitely also very interested in exploring the scaling options/limits for the controller. I've run a number of load tests to measure queue latency under load (specifically TaskRuns), and the improvements in v0.6 were very welcome.

It could be helpful to define a scaling target we'd like to reach, then identify steps to take to get there.

In my tests, creating 500 TaskRuns at 2 QPS (kubectl create in a loop, with ample spare cluster capacity available) resulted in a median queue time of 4 minutes (max 5:45), down from 10+ minutes with v0.5.
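For anyone who wants to reproduce this kind of load test without a kubectl loop, here is a minimal Go sketch of a rate-limited TaskRun creator. Everything in it is an assumption rather than something from this thread: it uses a recent client-go with the context-aware dynamic client, the tekton.dev/v1beta1 apiVersion, and a pre-existing Task named echo-task; adjust all of those for your Tekton release.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// TaskRun GVR; the API version depends on your Tekton release.
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "taskruns"}

	ticker := time.NewTicker(500 * time.Millisecond) // ~2 QPS
	defer ticker.Stop()

	for i := 0; i < 500; i++ {
		<-ticker.C
		tr := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "tekton.dev/v1beta1",
			"kind":       "TaskRun",
			"metadata":   map[string]interface{}{"name": fmt.Sprintf("load-test-%d", i)},
			"spec": map[string]interface{}{
				// Hypothetical pre-existing Task used as the payload.
				"taskRef": map[string]interface{}{"name": "echo-task"},
			},
		}}
		if _, err := client.Resource(gvr).Namespace("default").Create(context.TODO(), tr, metav1.CreateOptions{}); err != nil {
			fmt.Printf("create %d failed: %v\n", i, err)
		}
	}
}
```

Queue latency can then be measured from the gap between each TaskRun's creationTimestamp and its pod's startTime.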

Increasing the threads per controller should help too, but we might want to also increase the resources we request for the controller.

I don't think we should increase the defaults by default, but rather document how they can be increased. Users who have lower scaling needs and smaller clusters won't need the increased performance, and might pay for it.

@cccfeng
Contributor Author

cccfeng commented Sep 6, 2019

Glad to see the optimizations brought by the version upgrade.

We want to use Tekton to launch build tasks (such as code-to-image builds). Dozens of tasks will be submitted concurrently at peak business hours, and at the same time the latency from applying the Tekton custom resources to the Kubernetes pod entering the Running state matters to us.

So I think making these parameters configurable may be a better solution.

@imjasonh
Member

imjasonh commented Sep 6, 2019

Can you provide more information about what peak QPS you're sending, and what performance you're seeing that you'd like to see improved? It would be helpful to get real-world data, to help set a sort of informal SLO for TaskRun queue performance.

IIRC the default K8s pod scheduler enforces a 2 QPS limit for pod creations, so short of writing our own scheduler (which we could do...), that's probably going to be a theoretical ceiling on how quickly we can start TaskRuns. IOW, if you're creating TaskRuns at a rate >2 QPS, even with an optimally-tuned Tekton controller you'll see Pod creation back up at a linear rate.

@cccfeng
Contributor Author

cccfeng commented Sep 9, 2019

Sorry for the late reply.

At first, in a Kubernetes cluster built with minikube, I submitted pod create requests in batches and reproduced the behavior described as "IIRC the default K8s pod scheduler enforces a 2 QPS limit for pod creations": the QPS was between 2 and 3.

But in the Kubernetes version used in our internal production environment, we have made some optimizations to the pod scheduler to raise its QPS, so here let's assume its speed is not the limiting factor.

Since we are still on v0.5.x and have not upgraded to v0.6.0, the following description is based on v0.5.x. When I submit a simple Task YAML without inputs and outputs, and then submit TaskRuns concurrently (20, 40, or 60 TaskRuns), I observe the interval between the pod's startTime and the apply request, and find that the pod scheduling speed is below 1 QPS.

In the processNextWorkItem method, when a work item is consumed, a "Time taken" style log line is emitted. I found that a single reconcile can take more than 100ms, and some even exceed one second. Given that, since the number of threads is so small, could the controller's I/O throughput become the QPS bottleneck?

For example:
{"level":"info","logger":"controller.taskrun-controller","caller":"controller/controller.go:339","msg":"Reconcile succeeded. Time taken: 599.764148ms.","knative.dev/controller":"taskrun-controller","knative.dev/traceid":"0b1a1eca-28c2-4762-a457-47dbc7627da4","knative.dev/key":"tekton-pipelines/tekton-load-task-run-2"}

@vdemeester
Member

/kind question

@tekton-robot tekton-robot added the kind/question Issues or PRs that are questions around the project or a particular feature label Dec 9, 2019
@cccfeng
Contributor Author

cccfeng commented Feb 11, 2020

Hi all. Starting with Pipeline v0.8.0, I found that Prometheus metrics were introduced, and tekton_reconcile_latency_bucket from knative.dev/pkg records the controller's reconcile latency.

As mentioned above, I sometimes see logs where a reconcile takes several hundred milliseconds or even several seconds. I suspect that when the PipelineRun/TaskRun concurrency is high, the Tekton controller's throughput hits a bottleneck. I modified DefaultThreadsPerController to 64 and gave the container 48 CPUs and 32 GiB of memory; however, the tekton_reconcile_latency_bucket numbers are still not encouraging.

I deploy Tekton in two different Kubernetes clusters. The first one receives very little traffic, and most reconcile actions take less than 100ms:

tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="10"} 22
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="100"} 37
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="1000"} 38
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="10000"} 38
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="30000"} 38

But the second cluster receives heavy traffic (roughly 30~40 running PipelineRuns), and the reconcile times are much higher:

tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="10"} 8
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="100"} 9
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="1000"} 10
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="10000"} 24
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="30000"} 24

I'd like to ask how I can improve Tekton's throughput. It seems that multiple Tekton Pipeline controllers cannot be deployed, because the same custom resources would be reconciled repeatedly.
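For reference, this is roughly how the value can be raised without editing the vendored file: a minimal sketch assuming a custom build of the controller binary. The DefaultThreadsPerController variable is real (knative.dev/pkg/controller); Tekton's real cmd/controller/main.go also registers the taskrun/pipelinerun controller constructors, which are elided here.

```go
package main

import (
	"knative.dev/pkg/controller"
	"knative.dev/pkg/injection/sharedmain"
)

func main() {
	// knative.dev/pkg exposes this as a package-level variable (default 2).
	// It must be raised before sharedmain starts the controllers' work queues.
	controller.DefaultThreadsPerController = 64

	// Tekton's real main passes the taskrun/pipelinerun controller
	// constructors here; they are elided in this sketch.
	sharedmain.Main("controller")
}
```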

Sincerely hope to get your reply. @imjasonh @vdemeester

@aeweidne

aeweidne commented Jun 10, 2020

The performance at 30-40 concurrent pipelines is actively killing our ability to set up a CI service at scale as well.

@cccfeng, how much better is the performance with 64 threads vs the default 2 (based on searching the vendored knative dependency in this repo)?

@cccfeng
Contributor Author

cccfeng commented Jun 13, 2020

The performance at 30-40 concurrent pipelines is actively killing our ability to set up a CI service at scale as well.

@cccfeng, how much better is the performance with 64 threads vs the default 2 (based on searching the vendored knative dependency in this repo)?

Maybe you can look at the default QPS of the kubeClient.

When many PipelineRuns/TaskRuns are submitted simultaneously, we found that the first bottleneck is the kubeClient: the QPS allocated to each controller by default is only 5, so whenever the reconcile code calls the Kubernetes API server it gets throttled on the client side. I suggest you first try increasing this value 10-20x, then observe whether the controller work queue stops backing up and whether the reconcile delay stays below 100ms. Keep the reconcile delay low and the reconcile throughput will increase.
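To make the knob concrete, here is a minimal, self-contained sketch of what raising the client-side limit looks like with plain client-go. This shows the general rest.Config mechanism, not Tekton's actual wiring (which goes through knative.dev/pkg injection), and the 50/100 values are just an illustrative 10x bump.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newFastClient builds an in-cluster client with a raised client-side rate
// limit. client-go defaults to QPS=5 / Burst=10, which is what throttles a
// busy reconciler's API calls.
func newFastClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50    // ~10x the default of 5
	cfg.Burst = 100 // ~10x the default of 10
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newFastClient(); err != nil {
		fmt.Println("not running in a cluster:", err)
	}
}
```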

@zhangtbj
Contributor

zhangtbj commented Aug 5, 2020

Hi @cccfeng and all,

I would also like to increase DefaultThreadsPerController, but I found it is hardcoded to 2 in knative.dev/pkg:
https://github.com/tektoncd/pipeline/blob/release-v0.14.x/vendor/knative.dev/pkg/controller/controller.go#L56

Can you please tell me how you modified DefaultThreadsPerController to 64?

Is there any environment variable or property exposed by Tekton that I can set on the Tekton deployment?

Thanks!

@zhangtbj
Contributor

zhangtbj commented Aug 5, 2020

Hi all,

We are running a concurrency test of the Tekton controller's performance using 100 TaskRuns.

But we found there is a big delay when lots of TaskRuns are created.
I hit this issue: #1281

After I changed the two settings below (the thread count and the kubeClient QPS/Burst) and rebuilt a test image, the Tekton performance improved very much: the average TaskRun execution time dropped from 100s to 30s, which is really cool!

But there is a problem: Tekton uses the vendored knative.dev/pkg, which hardcodes DefaultThreadsPerController = 2, and QPS and Burst are also hardcoded to small values.

Does anyone know whether there is a way to pass custom values for these settings, such as an environment variable or property on the Tekton deployment?

It is bad if there is no way to customize these settings and Tekton only supports 2 threads... that is not good for production-level usage.

Thanks!

@imjasonh
Member

imjasonh commented Aug 5, 2020

cc @mattmoor who has done some work scaling the knative/pkg code that Tekton uses, and might have insights about how to proceed.

If these are the right settings to tweak to improve performance, it makes sense to me to expose them as either envs or ConfigMap values.

@mattmoor
Member

mattmoor commented Aug 5, 2020

Generally the caution I'd give when adjusting these things would be: you may be treating the symptoms instead of the disease.

I'd make sure that the reconciliations are properly using lister caches and carefully auditing client requests (including K8s events) before just jacking these values up.
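To illustrate the lister-cache point with a compile-only Go sketch (a generic ConfigMap example with hypothetical helper names, not Tekton's actual reconciler code):

```go
package reconcilersketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// Anti-pattern inside a hot Reconcile loop: every call is a live API
// round trip and burns client QPS budget.
func getConfigDirect(ctx context.Context, kc kubernetes.Interface, ns, name string) (*corev1.ConfigMap, error) {
	return kc.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
}

// Preferred: read from the shared informer's local cache; no API call.
func getConfigFromCache(lister corev1listers.ConfigMapLister, ns, name string) (*corev1.ConfigMap, error) {
	return lister.ConfigMaps(ns).Get(name)
}
```

Auditing reconciles for reads like the first function (and for per-reconcile event/API chatter) is the "disease" side of the fix; raising QPS only treats the symptom.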

@zhangtbj
Contributor

zhangtbj commented Aug 6, 2020

Thanks, and I agree, but the problem here is that the user cannot customize these settings now. Maybe 100 is too high, maybe it should be based on the cache/request patterns, etc. But if I find the default values are not good for me, I should be able to change them myself.

@mattmoor
Member

mattmoor commented Aug 6, 2020

With knative/pkg at HEAD it's possible to override these at the rest config level, but I agree that exposing flags to override these is also probably reasonable (certainly to treat the "symptoms" until the "disease" is cured).

I'd probably define the flags here and if they are set, override the cfg.QPS value a few lines below.
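The rough shape of that suggestion, as a compile-only sketch. The flag names here are hypothetical and the rest.Config would really come from injection/sharedmain; whatever Tekton eventually ships may differ.

```go
package main

import (
	"flag"

	"k8s.io/client-go/rest"
)

var (
	kubeAPIQPS   = flag.Float64("kube-api-qps", 0, "override client-go QPS (0 = keep default)")
	kubeAPIBurst = flag.Int("kube-api-burst", 0, "override client-go burst (0 = keep default)")
)

// applyOverrides mutates the rest.Config that the shared main would
// otherwise use unchanged.
func applyOverrides(cfg *rest.Config) {
	if *kubeAPIQPS > 0 {
		cfg.QPS = float32(*kubeAPIQPS)
	}
	if *kubeAPIBurst > 0 {
		cfg.Burst = *kubeAPIBurst
	}
}

func main() {
	flag.Parse()
	cfg := &rest.Config{} // in the real binary this comes from injection/sharedmain
	applyOverrides(cfg)
}
```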

@zhangtbj
Contributor

zhangtbj commented Aug 6, 2020

Cool, thanks Matt!

BTW, do you know when this change will be available in a knative/pkg release?

Thanks!

@mattmoor
Member

mattmoor commented Aug 6, 2020

We cut Knative releases every 6 weeks; knative/pkg always cuts a week before serving/eventing. The next pkg cut is Tuesday, so if your change is in by then it will be in release-0.17.

@zhangtbj
Contributor

zhangtbj commented Aug 6, 2020

Cool, thanks Matt!

Hope Tekton can pick up this new change in a new release soon.

Thanks!

@afrittoli afrittoli added the area/performance Issues or PRs that are related to performance aspects. label Aug 10, 2020
@afrittoli
Member

Generally the caution I'd give when adjusting these things would be: you may be treating the symptoms instead of the disease.

I'd make sure that the reconciliations are properly using lister caches and carefully auditing client requests (including K8s events) before just jacking these values up.

v0.15.0 resolves one issue where the pipelinerun controller would fetch configmaps via API instead of using the config store: #2947. It would be interesting to see if that had an impact on performance.

@imjasonh @zhangtbj I think it would be valuable to have your performance tests executed against nightly releases, and results collected and graphed, so that we could monitor changes over time.
It would be helpful to have Tekton instrumented for tracing (#2814) so that we could find out where the controller is spending its time.

@mrutkows @NavidZ for info - since you've been looking into metrics and tracing.

@rakhbari

@afrittoli Do you have any idea when the Pipelines code base will pick up this change in knative.dev/pkg so that these parameters can be externally configured?

@zhangtbj
Contributor

As far as I know, the QPS, Burst, and thread count can be configured since the Tekton v0.17.0 release:
https://github.com/tektoncd/pipeline/releases/tag/v0.17.0

@cccfeng cccfeng closed this as completed Nov 30, 2020