Replicate webhook pods for resiliency #3391

raballew · 2020-10-15T07:46:33Z

Changes

Closes #3386

Avoid node becoming single point of failure for webhook
Ensure availability at node failure or high load
Protect webhook in case of disruption

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

Includes tests (if functionality changed/added)
Includes docs (if user facing)
Commit messages follow commit message best practices
Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

If you are adding a new binary/image to the cmd dir, please update
the release Task to build and release this image.

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Add HorizontalPodAutoscaler, PodDisruptionBudget and podAntiAffinity to webhook

tekton-robot · 2020-10-15T07:46:48Z

Hi @raballew. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

imjasonh · 2020-10-15T20:04:18Z

/kind feature
/ok-to-test

This looks good so far, let's see what the tests think. :)

Can you squash your commits so this PR only contains one commit?

linux-foundation-easycla · 2020-10-15T20:28:54Z

The committers are authorized under a signed CLA.

✅ Paul Wallrabe (6ef6088)

afrittoli · 2020-10-15T20:33:44Z

@raballew Thank you for your contribution!!
You (or your company) will need to sign the CLA - see link in above for instructions.

Could you include release notes in the PR description?
It's in the new PR template and it needs to be filled. Once you do so, the do-not-merge/release-note-label-needed will go away automatically.

Thank you again!

At the moment, the webhook is a SPOF in certain scenarios. Under high load or when a node failure occurs the webhook becomes unavailable. Defining a HPA, PDB and affinity rules solves this issue.

raballew · 2020-10-15T21:31:01Z

@afrittoli CLA signed
@imjasonh commits are squashed now

Thanks for guiding me trough the PR!

afrittoli · 2020-10-16T09:49:12Z

I've not really worked with the HPA before, so I tried this on my kind cluster.

After I got the metrics server running with:

          - --kubelet-insecure-tls
          - --kubelet-preferred-address-types=InternalIP

The HPA was still not happy, because of missing resource requests in the webhook pod:

Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: missing request for cpu

After adding resources from in the deployment:

    resources:
      requests:
        cpu: 250m

It is now happy:

$ k get hpa -n tekton-pipelines
NAME                       REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
tekton-pipelines-webhook   Deployment/tekton-pipelines-webhook   3%/100%   1         5         1          135m

$ k describe hpa -n tekton-pipelines
Name:                                                  tekton-pipelines-webhook
Namespace:                                             tekton-pipelines
Labels:                                                app.kubernetes.io/component=webhook
                                                       app.kubernetes.io/instance=default
                                                       app.kubernetes.io/name=webhook
                                                       app.kubernetes.io/part-of=tekton-pipelines
                                                       app.kubernetes.io/version=devel
                                                       pipeline.tekton.dev/release=devel
                                                       version=devel
Annotations:                                           <none>
CreationTimestamp:                                     Fri, 16 Oct 2020 08:23:24 +0100
Reference:                                             Deployment/tekton-pipelines-webhook
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  3% (8m) / 100%
Min replicas:                                          1
Max replicas:                                          5
Deployment pods:                                       1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range

According to the HPA docs:

Please note that if some of the Pod's containers do not have the relevant resource request set, CPU utilization for the Pod will not be defined and the autoscaler will not take any action for that metric. See the algorithm details section below for more information about how the autoscaling algorithm works.

So I think that for this to work we need to set resource requests too.
The good news is that without a metrics service the HPA will be un-happy, but the deployment will still work fine, so for dev environments this is pretty much transparent. So I think it should be ok to have this the "default" deployment.

afrittoli

Thanks for this. I think we should clarify if we need to add resources.
TBH I don't know what a good value for CPU would be? @imjasonh wdyt?

afrittoli · 2020-10-15T20:37:38Z

config/webhook-hpa.yaml

@@ -0,0 +1,65 @@
+# Copyright 2019 The Tekton Authors


NIT: s/2019/2020

Sure I will bump the year.

@afrittoli I also looked through all resource definitions and most of the files were modified through the course of this year and still use 2019 copyright year. If I am not wrong, technically the copyright year should be updated when you made contributions to a file within that year.

Uhm, the practise that I've always followed and that I've seen being used in several open source projects is it to set the year to that of the creation of the file, and do not bother updating the date again.
There are a lot of different opinions about this in the internet. The feeling I get after reading a few opinions is that the year is not so critical in the copyright notice; however company lawyers usually insist on having it.

We could add guidance about this in the community repo @vdemeester @bobcatfish @imjasonh @abayer

In general the guidance I've seen is to bump the year if you're touching the file. Bulk commits to update the year everywhere aren't great for history/blame tracking, but keeping it up to date is slightly better than not...

config/webhook-hpa.yaml

Technically, the copyright year should have a value equal to the year of the last contribution made to a file.

imjasonh · 2020-10-16T11:08:10Z

Thanks for this. I think we should clarify if we need to add resources.

TBH I don't know what a good value for CPU would be? @imjasonh wdyt?

Two options:

Run a Tekton installation on your cluster and give it some load, and observe how many resources it uses, then give it 2x or so to give us some headroom.
Crib from Knative even further, and copy their resource requests: https://github.com/knative/serving/blob/master/config/core/deployments/webhook.yaml

I'd probably just borrow from Knative, I don't think either of us do anything very resource intensive in our webhook checks, and if it turns out to be too low we can bump it up in the future.

afrittoli · 2020-10-16T11:12:44Z

2. https://github.com/knative/serving/blob/master/config/core/deployments/webhook.yaml

ok, (2) sounds reasonable to me, also 100m seems small enough for a validation service.
Do we needs limits? If the limit is reached, the HPA will spawn a new instance?

        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 500Mi

raballew · 2020-10-16T11:31:10Z

If the limit is reached, the HPA will spawn a new instance?

According to HPA docs:

 [...] the controller calculates the utilization value as a percentage of the equivalent resource request on the containers in each Pod.

When a new replica should be spawned, is described here.

A resource request is required for autoscaler to take any action for a metric.

raballew · 2020-10-16T13:29:38Z

/retest

afrittoli · 2020-10-19T10:14:51Z

Thanks for the updates!

This looks good to me know, the only thing missing is documentation, but I'm ok doing that as a separate PR if you prefer.
We need to document that a metrics server is required for this to work, and in general it would be good to have an HA setup dedicated document or paragraph.

CC'ing @qu1queee since he's been looking into HA documentation for Tekton.

/lgtm

raballew · 2020-10-19T11:08:12Z

@afrittoli I will do it in this PR, so everything related to webhooks and HPA is bundled here. Could you point me to the correct file where to add the documentation?

afrittoli · 2020-10-19T11:17:17Z

I think the entry point might be the install and configuration doc.
It could be a chapter in that page, the content could be in the install page directly or perhaps in a dedicated ha.md with the HA concept and how to configure it.

If you decide to go for a dedicated new file, could you also file a PR to the website repo, so that it shows up on tekton.dev too?

raballew · 2020-10-19T18:23:59Z

@afrittoli I have added the HA docs to the install page as the amount of documentation needed for this topic is rather small. Once more components such as the controller implement a scaling mechanism as well, this section should be moved to a dedicated ha.md file.

raballew · 2020-10-19T19:03:20Z

/retest

raballew · 2020-10-27T17:55:51Z

/assign @afrittoli

afrittoli

Thanks for the docs on this!
/approve

tekton-robot · 2020-11-03T11:34:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [afrittoli]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

afrittoli · 2020-11-03T11:36:32Z

/cc @vdemeester

vdemeester · 2020-11-05T08:56:22Z

/lgtm

tekton-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 15, 2020

tekton-robot requested review from dibyom and a user October 15, 2020 07:46

tekton-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 15, 2020

tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 15, 2020

afrittoli closed this Oct 15, 2020

afrittoli reopened this Oct 15, 2020

Avoid SPOF for webhook

6ef6088

At the moment, the webhook is a SPOF in certain scenarios. Under high load or when a node failure occurs the webhook becomes unavailable. Defining a HPA, PDB and affinity rules solves this issue.

tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 15, 2020

afrittoli reviewed Oct 16, 2020

View reviewed changes

Bump copyright year

c0a4676

Technically, the copyright year should have a value equal to the year of the last contribution made to a file.

Add resource requests and limits for webhook

191214c

A resource request is required for autoscaler to take any action for a metric.

tekton-robot assigned afrittoli Oct 19, 2020

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2020

imjasonh added this to the Pipelines v0.18 milestone Oct 19, 2020

Add HA documentation

190cb55

tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2020

raballew mentioned this pull request Oct 19, 2020

Bump copyright year #3408

Closed

raballew requested a review from afrittoli October 27, 2020 17:58

afrittoli reviewed Nov 3, 2020

View reviewed changes

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2020

tekton-robot requested a review from vdemeester November 3, 2020 11:36

tekton-robot assigned vdemeester Nov 5, 2020

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2020

tekton-robot merged commit e995e7c into tektoncd:master Nov 5, 2020

imjasonh mentioned this pull request Feb 12, 2021

PodDisruptionBudget causing inability to gracefully drain a node with tekton-pipelines-webhook pod #3654

Closed

imjasonh mentioned this pull request Feb 24, 2021

Disable Webhook PDB by default, document enabling it #3787

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replicate webhook pods for resiliency #3391

Replicate webhook pods for resiliency #3391

raballew commented Oct 15, 2020 •

edited

Loading

tekton-robot commented Oct 15, 2020

imjasonh commented Oct 15, 2020

linux-foundation-easycla bot commented Oct 15, 2020 •

edited

Loading

afrittoli commented Oct 15, 2020

raballew commented Oct 15, 2020 •

edited

Loading

afrittoli commented Oct 16, 2020 •

edited

Loading

afrittoli left a comment

afrittoli Oct 15, 2020

raballew Oct 16, 2020

afrittoli Oct 16, 2020

dlorenc Oct 19, 2020

imjasonh commented Oct 16, 2020

afrittoli commented Oct 16, 2020

raballew commented Oct 16, 2020

raballew commented Oct 16, 2020

afrittoli commented Oct 19, 2020

raballew commented Oct 19, 2020

afrittoli commented Oct 19, 2020

raballew commented Oct 19, 2020

raballew commented Oct 19, 2020

raballew commented Oct 27, 2020

afrittoli left a comment

tekton-robot commented Nov 3, 2020

afrittoli commented Nov 3, 2020

vdemeester commented Nov 5, 2020

Replicate webhook pods for resiliency #3391

Replicate webhook pods for resiliency #3391

Conversation

raballew commented Oct 15, 2020 • edited Loading

Changes

Submitter Checklist

Reviewer Notes

Release Notes

tekton-robot commented Oct 15, 2020

imjasonh commented Oct 15, 2020

linux-foundation-easycla bot commented Oct 15, 2020 • edited Loading

afrittoli commented Oct 15, 2020

raballew commented Oct 15, 2020 • edited Loading

afrittoli commented Oct 16, 2020 • edited Loading

afrittoli left a comment

Choose a reason for hiding this comment

afrittoli Oct 15, 2020

Choose a reason for hiding this comment

raballew Oct 16, 2020

Choose a reason for hiding this comment

afrittoli Oct 16, 2020

Choose a reason for hiding this comment

dlorenc Oct 19, 2020

Choose a reason for hiding this comment

imjasonh commented Oct 16, 2020

afrittoli commented Oct 16, 2020

raballew commented Oct 16, 2020

raballew commented Oct 16, 2020

afrittoli commented Oct 19, 2020

raballew commented Oct 19, 2020

afrittoli commented Oct 19, 2020

raballew commented Oct 19, 2020

raballew commented Oct 19, 2020

raballew commented Oct 27, 2020

afrittoli left a comment

Choose a reason for hiding this comment

tekton-robot commented Nov 3, 2020

afrittoli commented Nov 3, 2020

vdemeester commented Nov 5, 2020

raballew commented Oct 15, 2020 •

edited

Loading

linux-foundation-easycla bot commented Oct 15, 2020 •

edited

Loading

raballew commented Oct 15, 2020 •

edited

Loading

afrittoli commented Oct 16, 2020 •

edited

Loading