
Deploy tektoncd/pipeline in high concurrent request scenario #1281

Closed
cccfeng opened this issue Sep 6, 2019 · 21 comments
Labels
area/performance Issues or PRs that are related to performance aspects. kind/question Issues or PRs that are questions around the project or a particular feature

Comments

@cccfeng
Contributor

cccfeng commented Sep 6, 2019

Expected Behavior

When I deploy tektoncd/pipeline in a high-concurrency scenario (TaskRuns/PipelineRuns are submitted concurrently), it would be better to be able to increase the number of controller threads so that queue work items are processed faster.

Actual Behavior

From v0.5.x to v0.6.x, the code in cmd/controller/main.go was moved to github.com/knative/pkg, and the parameter threadsPerController was replaced by DefaultThreadsPerController, which defaults to 2.

Additional Info

Now I have some questions about how to increase the thread count.

@imjasonh
Member

imjasonh commented Sep 6, 2019

I'm definitely also very interested in exploring the scaling options/limits for the controller. I've run a number of load tests to measure queue latency under load (specifically TaskRuns), and the improvements in v0.6 were very welcome.

It could be helpful to define a scaling target we'd like to reach, then identify steps to take to get there.

In my tests, creating 500 TaskRuns at 2 QPS (kubectl create in a loop, with ample spare cluster capacity available) resulted in a median queue time of 4 minutes (max 5:45), down from 10+ minutes with v0.5.
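For anyone who wants to reproduce this kind of load test without a kubectl loop, here is a minimal Go sketch of a rate-limited TaskRun creator. Everything in it is an assumption rather than something from this thread: it uses a recent client-go with the context-aware dynamic client, the tekton.dev/v1beta1 apiVersion, and a pre-existing Task named echo-task; adjust all of those for your Tekton release.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// TaskRun GVR; the API version depends on your Tekton release.
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "taskruns"}

	ticker := time.NewTicker(500 * time.Millisecond) // ~2 QPS
	defer ticker.Stop()

	for i := 0; i < 500; i++ {
		<-ticker.C
		tr := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "tekton.dev/v1beta1",
			"kind":       "TaskRun",
			"metadata":   map[string]interface{}{"name": fmt.Sprintf("load-test-%d", i)},
			"spec": map[string]interface{}{
				// Hypothetical pre-existing Task used as the payload.
				"taskRef": map[string]interface{}{"name": "echo-task"},
			},
		}}
		if _, err := client.Resource(gvr).Namespace("default").Create(context.TODO(), tr, metav1.CreateOptions{}); err != nil {
			fmt.Printf("create %d failed: %v\n", i, err)
		}
	}
}
```

Queue latency can then be measured from the gap between each TaskRun's creationTimestamp and its pod's startTime.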

Increasing the threads per controller should help too, but we might want to also increase the resources we request for the controller.

I don't think we should increase the defaults by default, but rather document how they can be increased. Users who have lower scaling needs and smaller clusters won't need the increased performance, and might pay for it.

@cccfeng
Contributor Author

cccfeng commented Sep 6, 2019

Glad to see the optimizations brought by the version upgrade.

We want to use Tekton to launch build tasks (such as code-to-image builds). Dozens of tasks will be submitted concurrently at peak business hours, and at the same time the latency from applying the Tekton custom resources to the Kubernetes pod entering the Running state matters to us.

So I think making these parameters configurable may be a better solution.

@imjasonh
Member

imjasonh commented Sep 6, 2019

Can you provide more information about what peak QPS you're sending, and what performance you're seeing that you'd like to see improved? It would be helpful to get real-world data, to help set a sort of informal SLO for TaskRun queue performance.

IIRC the default K8s pod scheduler enforces a 2 QPS limit for pod creations, so short of writing our own scheduler (which we could do...), that's probably going to be a theoretical ceiling on how quickly we can start TaskRuns. IOW, if you're creating TaskRuns at a rate >2 QPS, even with an optimally-tuned Tekton controller you'll see Pod creation back up at a linear rate.

@cccfeng
Contributor Author

cccfeng commented Sep 9, 2019

Sorry for the late reply.

At first, in a Kubernetes cluster built with minikube, I submitted pod create requests in batches and reproduced the behavior described as "IIRC the default K8s pod scheduler enforces a 2 QPS limit for pod creations": the QPS was between 2 and 3.

But in the Kubernetes version used in our internal production environment, we have made some optimizations to the pod scheduler to raise its QPS, so here let's assume its speed is not the limiting factor.

Since we are still on v0.5.x and have not upgraded to v0.6.0, the following description is based on v0.5.x. When I submit a simple Task YAML without inputs and outputs, and then submit TaskRuns concurrently (20, 40, or 60 TaskRuns), I observe the interval between the pod's startTime and the apply request, and find that the pod scheduling speed is below 1 QPS.

In the processNextWorkItem method, when a work item is consumed, a "Time taken" style log line is emitted. I found that a single reconcile can take more than 100ms, and some even exceed one second. Given that, since the number of threads is so small, could the controller's I/O throughput become the QPS bottleneck?

For example:
{"level":"info","logger":"controller.taskrun-controller","caller":"controller/controller.go:339","msg":"Reconcile succeeded. Time taken: 599.764148ms.","knative.dev/controller":"taskrun-controller","knative.dev/traceid":"0b1a1eca-28c2-4762-a457-47dbc7627da4","knative.dev/key":"tekton-pipelines/tekton-load-task-run-2"}

@vdemeester
Member

/kind question

@tekton-robot tekton-robot added the kind/question Issues or PRs that are questions around the project or a particular feature label Dec 9, 2019
@cccfeng
Contributor Author

cccfeng commented Feb 11, 2020

Hi all. Starting with Pipeline v0.8.0, I found that Prometheus metrics were introduced, and tekton_reconcile_latency_bucket from knative.dev/pkg records the controller's reconcile latency.

As mentioned above, I sometimes see logs where a reconcile takes several hundred milliseconds or even several seconds. I suspect that when the PipelineRun/TaskRun concurrency is high, the Tekton controller's throughput hits a bottleneck. I modified DefaultThreadsPerController to 64 and gave the container 48 CPUs and 32 GiB of memory; however, the tekton_reconcile_latency_bucket numbers are still not encouraging.

I deploy Tekton in two different Kubernetes clusters. The first one receives very little traffic, and most reconcile actions take less than 100ms:

tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="10"} 22
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="100"} 37
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="1000"} 38
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="10000"} 38
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt1-2stnm",reconciler="TaskRun",success="true",le="30000"} 38

But the second cluster receives heavy traffic (roughly 30~40 running PipelineRuns), and the reconcile times are much higher:

tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="10"} 8
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="100"} 9
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="1000"} 10
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="10000"} 24
tekton_reconcile_latency_bucket{key="tekton-pipelines/xxxx-pt3-285h8",reconciler="TaskRun",success="true",le="30000"} 24

I'd like to ask how I can improve Tekton's throughput. It seems that multiple Tekton Pipeline controllers cannot be deployed, because the same custom resources would be reconciled repeatedly.
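For reference, this is roughly how the value can be raised without editing the vendored file: a minimal sketch assuming a custom build of the controller binary. The DefaultThreadsPerController variable is real (knative.dev/pkg/controller); Tekton's real cmd/controller/main.go also registers the taskrun/pipelinerun controller constructors, which are elided here.

```go
package main

import (
	"knative.dev/pkg/controller"
	"knative.dev/pkg/injection/sharedmain"
)

func main() {
	// knative.dev/pkg exposes this as a package-level variable (default 2).
	// It must be raised before sharedmain starts the controllers' work queues.
	controller.DefaultThreadsPerController = 64

	// Tekton's real main passes the taskrun/pipelinerun controller
	// constructors here; they are elided in this sketch.
	sharedmain.Main("controller")
}
```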

Sincerely hope to get your reply. @imjasonh @vdemeester

@aeweidne

aeweidne commented Jun 10, 2020

The performance at 30-40 concurrent pipelines is actively killing our ability to set up a CI service at scale as well.

@cccfeng, how much better is the performance with 64 threads vs the default 2 (based on searching the vendored knative dependency in this repo)?

@cccfeng
Contributor Author

cccfeng commented Jun 13, 2020

The performance at 30-40 concurrent pipelines is actively killing our ability to set up a CI service at scale as well.

@cccfeng, how much better is the performance with 64 threads vs the default 2 (based on searching the vendored knative dependency in this repo)?

Maybe you can look at the default QPS of the kubeClient.

When many PipelineRuns/TaskRuns are submitted simultaneously, we found that the first bottleneck is the kubeClient: the QPS allocated to each controller by default is only 5, so whenever the reconcile code calls the Kubernetes API server it gets throttled on the client side. I suggest you first try increasing this value 10-20x, then observe whether the controller work queue stops backing up and whether the reconcile delay stays below 100ms. Keep the reconcile delay low and the reconcile throughput will increase.
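To make the knob concrete, here is a minimal, self-contained sketch of what raising the client-side limit looks like with plain client-go. This shows the general rest.Config mechanism, not Tekton's actual wiring (which goes through knative.dev/pkg injection), and the 50/100 values are just an illustrative 10x bump.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newFastClient builds an in-cluster client with a raised client-side rate
// limit. client-go defaults to QPS=5 / Burst=10, which is what throttles a
// busy reconciler's API calls.
func newFastClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50    // ~10x the default of 5
	cfg.Burst = 100 // ~10x the default of 10
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newFastClient(); err != nil {
		fmt.Println("not running in a cluster:", err)
	}
}
```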

@zhangtbj
Contributor

zhangtbj commented Aug 5, 2020

Hi @cccfeng and all,

I would also like to increase DefaultThreadsPerController, but I found it is hardcoded to 2 in knative.dev/pkg:
https://github.com/tektoncd/pipeline/blob/release-v0.14.x/vendor/knative.dev/pkg/controller/controller.go#L56

Can you please tell me how you modified DefaultThreadsPerController to 64?

Is there any environment variable or property exposed by Tekton that I can set on the Tekton deployment?

Thanks!

@zhangtbj
Contributor

zhangtbj commented Aug 5, 2020

Hi all,

We are running a concurrency test of the Tekton controller's performance using 100 TaskRuns.

But we found there is a big delay when lots of TaskRuns are created.
I hit this issue: #1281

After I changed the two settings below (the thread count and the kubeClient QPS/Burst) and rebuilt a test image, the Tekton performance improved very much: the average TaskRun execution time dropped from 100s to 30s, which is really cool!

But there is a problem: Tekton uses the vendored knative.dev/pkg, which hardcodes DefaultThreadsPerController = 2, and QPS and Burst are also hardcoded to small values.

Does anyone know whether there is a way to pass custom values for these settings, such as an environment variable or property on the Tekton deployment?

It is bad if there is no way to customize these settings and Tekton only supports 2 threads... that is not good for production-level usage.

Thanks!

@imjasonh
Member

imjasonh commented Aug 5, 2020

cc @mattmoor who has done some work scaling the knative/pkg code that Tekton uses, and might have insights about how to proceed.

If these are the right settings to tweak to improve performance, it makes sense to me to expose them as either envs or ConfigMap values.

@mattmoor
Member

mattmoor commented Aug 5, 2020

Generally the caution I'd give when adjusting these things would be: you may be treating the symptoms instead of the disease.

I'd make sure that the reconciliations are properly using lister caches and carefully auditing client requests (including K8s events) before just jacking these values up.
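To illustrate the lister-cache point with a compile-only Go sketch (a generic ConfigMap example with hypothetical helper names, not Tekton's actual reconciler code):

```go
package reconcilersketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// Anti-pattern inside a hot Reconcile loop: every call is a live API
// round trip and burns client QPS budget.
func getConfigDirect(ctx context.Context, kc kubernetes.Interface, ns, name string) (*corev1.ConfigMap, error) {
	return kc.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
}

// Preferred: read from the shared informer's local cache; no API call.
func getConfigFromCache(lister corev1listers.ConfigMapLister, ns, name string) (*corev1.ConfigMap, error) {
	return lister.ConfigMaps(ns).Get(name)
}
```

Auditing reconciles for reads like the first function (and for per-reconcile event/API chatter) is the "disease" side of the fix; raising QPS only treats the symptom.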

@zhangtbj
Contributor

zhangtbj commented Aug 6, 2020

Thanks, and I agree, but the problem here is that the user cannot customize these settings now. Maybe 100 is too high, maybe it should be based on the cache/request patterns, etc. But if I find the default values are not good for me, I should be able to change them myself.

@mattmoor
Member

mattmoor commented Aug 6, 2020

With knative/pkg at HEAD it's possible to override these at the rest config level, but I agree that exposing flags to override these is also probably reasonable (certainly to treat the "symptoms" until the "disease" is cured).

I'd probably define the flags here and if they are set, override the cfg.QPS value a few lines below.
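The rough shape of that suggestion, as a compile-only sketch. The flag names here are hypothetical and the rest.Config would really come from injection/sharedmain; whatever Tekton eventually ships may differ.

```go
package main

import (
	"flag"

	"k8s.io/client-go/rest"
)

var (
	kubeAPIQPS   = flag.Float64("kube-api-qps", 0, "override client-go QPS (0 = keep default)")
	kubeAPIBurst = flag.Int("kube-api-burst", 0, "override client-go burst (0 = keep default)")
)

// applyOverrides mutates the rest.Config that the shared main would
// otherwise use unchanged.
func applyOverrides(cfg *rest.Config) {
	if *kubeAPIQPS > 0 {
		cfg.QPS = float32(*kubeAPIQPS)
	}
	if *kubeAPIBurst > 0 {
		cfg.Burst = *kubeAPIBurst
	}
}

func main() {
	flag.Parse()
	cfg := &rest.Config{} // in the real binary this comes from injection/sharedmain
	applyOverrides(cfg)
}
```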

@zhangtbj
Contributor

zhangtbj commented Aug 6, 2020

Cool, thanks Matt!

BTW, do you know when this change will be available in a knative/pkg release?

Thanks!

@mattmoor
Member

mattmoor commented Aug 6, 2020

We cut Knative releases every 6 weeks; knative/pkg always cuts a week before serving/eventing. The next pkg cut is Tuesday, so if your change is in by then it will be in release-0.17.

@zhangtbj
Contributor

zhangtbj commented Aug 6, 2020

Cool, thanks Matt!

Hope Tekton can pick up this new change in a new release soon.

Thanks!

@afrittoli afrittoli added the area/performance Issues or PRs that are related to performance aspects. label Aug 10, 2020
@afrittoli
Member

Generally the caution I'd give when adjusting these things would be: you may be treating the symptoms instead of the disease.

I'd make sure that the reconciliations are properly using lister caches and carefully auditing client requests (including K8s events) before just jacking these values up.

v0.15.0 resolves one issue where the pipelinerun controller would fetch configmaps via API instead of using the config store: #2947. It would be interesting to see if that had an impact on performance.

@imjasonh @zhangtbj I think it would be valuable to have your performance tests executed against nightly releases, and results collected and graphed, so that we could monitor changes over time.
It would be helpful to have Tekton instrumented for tracing (#2814) so that we could find out where the controller is spending its time.

@mrutkows @NavidZ for info - since you've been looking into metrics and tracing.

@rakhbari

@afrittoli Do you have any idea when the Pipelines code base will pick up this change in knative.dev/pkg so that these parameters can be externally configured?

@zhangtbj
Contributor

As far as I know, the QPS, Burst, and thread count can be configured since the Tekton v0.17.0 release:
https://github.com/tektoncd/pipeline/releases/tag/v0.17.0

@cccfeng cccfeng closed this as completed Nov 30, 2020