
Revert CoreDNS to 1.3.1 in kube-up #78691

Merged (1 commit) on Jun 10, 2019
Conversation

@rajansandeep (Contributor) commented Jun 4, 2019

What type of PR is this?

/kind cleanup

What this PR does / why we need it:
Reverts CoreDNS to 1.3.1 for k8s 1.15
Reverts changes from #78030 and #78305

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:
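The revert amounts to pinning the CoreDNS image tag back in the kube-up addon manifest. A minimal sketch of the kind of change involved (file path and context line assumed from the kube-up addon layout, not copied from this PR's literal diff):

```diff
# cluster/addons/dns/coredns/coredns.yaml.in (sketch, not the literal diff)
-        image: k8s.gcr.io/coredns:1.5.0
+        image: k8s.gcr.io/coredns:1.3.1
```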

Does this PR introduce a user-facing change?:

Revert the CoreDNS version to 1.3.1

@k8s-ci-robot added labels cncf-cla: yes, size/S, kind/cleanup, needs-sig, needs-priority, release-note-none; added and then removed do-not-merge/release-note-label-needed (Jun 4, 2019)
@rajansandeep (Contributor, Author)

/cc @MrHohn @chrisohaver

@k8s-ci-robot added labels area/test, sig/cluster-lifecycle, sig/testing, release-note; removed needs-sig, release-note-none (Jun 4, 2019)
@rajansandeep (Contributor, Author)

/priority critical-urgent

@k8s-ci-robot added the priority/critical-urgent label and removed needs-priority (Jun 4, 2019)
@mborsz (Member) commented Jun 4, 2019

/hold

Hi,
Could you say why you are reverting this?
The upgrade to 1.5.0 is critical for keeping k8s 1.15 scalable. With 1.3.1, CoreDNS is constantly OOMing at 5000-node scale after the COS upgrade to cos-beta-73-11647-64-0.

See #78562 and #76579 (comment) for context.

/cc @wojtek-t @mm4tt @oxddr

@k8s-ci-robot requested reviews from wojtek-t and mm4tt (Jun 4, 2019)
@k8s-ci-robot added the do-not-merge/hold label (Jun 4, 2019)
@k8s-ci-robot (Contributor)

@mborsz: GitHub didn't allow me to request PR reviews from the following users: oxddr.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to the /hold comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chrisohaver (Contributor)

@mborsz,

CoreDNS 1.5.0 is being reverted due to configuration-migration troubles. A non-trivial migration of the CoreDNS configuration is required between 1.3.1 and 1.5.0. This migration step was not approved for kubeadm in time for 1.15, so it is being reverted. IMO, we should be testing the version of CoreDNS that will be used by most 1.15 deployments, which I think will be 1.3.1, since that is what kubeadm will be installing. Not everyone uses kubeadm, I realize, but I think it is widely used as a reference. Even if we upgrade CoreDNS in the test framework to make tests pass, most people in the field will be using k8s 1.15 with CoreDNS 1.3.1, which potentially has this scaling issue.

Is the COS upgrade to the beta version critical? Can that be reverted? Or would that cause more trouble (e.g. is it fixing other more important problems)?

If we leave the kube-up tests at CoreDNS 1.5.0, then I think CoreDNS 1.5.0 would become the "recommended" version of CoreDNS to use with k8s 1.15 (since that is what was tested). Perhaps that's the easiest route. Kubeadm would then need to note in its release notes that it doesn't upgrade to the version of CoreDNS recommended for 1.15, due to the manual configuration migration required (although it won't install 1.5.0 on fresh clusters either).
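For concreteness, the core of that migration is that CoreDNS 1.5.0 removed the proxy plugin in favor of forward, so an existing 1.3.1-era Corefile has to be rewritten on upgrade. A minimal sketch of the affected stanza (Corefile abbreviated; contents assumed from the default kube-up/kubeadm configuration of that era):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
-   proxy . /etc/resolv.conf      # removed in CoreDNS 1.5.0
+   forward . /etc/resolv.conf    # its replacement
    cache 30
    loop
    reload
    loadbalance
}
```

There are further wrinkles beyond this stanza (e.g. the deprecated upstream option in the kubernetes plugin), which is part of why the migration is non-trivial to automate.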

@mborsz (Member) commented Jun 4, 2019

/cc @yguo0905 @yujuhong

For the question of whether the COS upgrade is critical.

@spiffxp (Member) commented Jun 7, 2019

Thanks @chrisohaver for the summary. When in doubt, I'm going to err on the side of user upgrade experience and suggest that we revert. Going to read a bit more for context before making a call.

That said, in 1.11 kubeadm switched to coredns, while in 1.12 kube-up still used kube-dns due to scalability concerns, finally switching in 1.13... so we have prior art for two different ways of standing up clusters using two different DNS implementations.

@spiffxp (Member) commented Jun 7, 2019

My opinion:

@jdumars (Member) commented Jun 7, 2019

@spiffxp this seems like a very reasonable path forward. I also agree that upgrade UX is the highest-impact aspect in play here.

@chrisohaver (Contributor)

Sounds good to me. Basically: roll back to CoreDNS 1.3.1, release k8s 1.15 with the failing 5k-node test, and issue some kind of scaling-exception notice advising use of CoreDNS 1.5.0 for large clusters in the ballpark of 5k nodes.

Thanks @spiffxp!

@shyamjvs (Member) commented Jun 7, 2019

IIUC from the above discussion, CoreDNS 1.5.0 brings scalability improvements but has some correctness bugs. We had been using 1.3.1 until as recently as 3 weeks ago (when it was bumped to 1.5.0 in #78030), and scalability tests were passing in the last release, where we were still using the old version (though IIRC we had to manually bump the resource requests). So as long as we're not regressing from the last release with respect to CoreDNS scalability, staying on the old version to avoid the correctness bugs seems better IMHO. I might be missing something here.

@wojtek-t @mborsz - fyi

@jdumars (Member) commented Jun 7, 2019

Given release timing, is there a deadline for this decision?

@mborsz (Member) commented Jun 8, 2019

(Replying to @shyamjvs's comment above.)

That's not really the full picture.

http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadRequestCountByClient&client=coredns%2Fv0.0.0%20(linux%2Famd64)%20kubernetes&code=200&component=apiserver&group=&resource=endpoints&scope=cluster&subresource=&verb=LIST&version=v1

This shows the number of LIST requests on endpoints issued during the load test, which mostly reflects the number of coredns restarts.

There are 3 notable builds:

  • 330: this is when the COS version was bumped. The number of restarts started being really high (300-1000 restarts). Debugging a few runs after that showed that it was mostly OOMing.
  • 1115162902861451269: the number of restarts decreased significantly (to 5-20). I bet we already debugged this, but I can't remember now what the reason for the decrease was.
  • 1120236390706057221: the number of restarts goes to zero. This is because "[metrics-server addon] Restore connecting to nodes via IP addresses" #76819 got merged, which basically removed the only DNS client in the scalability tests.

So the fact that the test was passing before upgrading coredns to 1.5.0 was mostly because we had removed the last DNS client talking to it.

We can also see that, under similar conditions (builds 328 vs 330), coredns 1.3.1 with the newer COS uses more memory than before the COS upgrade.
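(Side note: the restart/OOM signal can also be checked directly on a cluster. A sketch, assuming the default kube-up CoreDNS addon labeling, where the pods carry k8s-app=kube-dns:)

```sh
# Per-pod restart counts for the CoreDNS pods
kubectl -n kube-system get pods -l k8s-app=kube-dns \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Reason for the last container termination; "OOMKilled" confirms the OOM theory
kubectl -n kube-system get pods -l k8s-app=kube-dns \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```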

@mborsz (Member) commented Jun 8, 2019

I agree with @spiffxp's suggestion.

We can also suggest increasing the memory limit for coredns in big clusters (assuming the increase isn't that big). This seems to be a safer option than asking for a coredns upgrade.
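A sketch of what that bump could look like in the addon manifest (the values here are illustrative only, not a tested recommendation; the defaults shown are the stock CoreDNS resource settings):

```diff
# cluster/addons/dns/coredns/coredns.yaml.in, CoreDNS container (illustrative)
         resources:
           limits:
-            memory: 170Mi
+            memory: 300Mi
           requests:
             cpu: 100m
             memory: 70Mi
```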

Just two concerns:

  • We should make sure that it's possible to use a custom coredns version if we are going to suggest doing so. I'm not aware of any knob in kube-up that can be used for this. Can we add something like that in 1.15?
  • What we are doing here is moving the upgrade problem from all clusters to only big clusters. Given the current timeline this seems reasonable, but we still need to solve the issue somehow.

@mborsz (Member) commented Jun 10, 2019

#78851 adds a way to customize the coredns version in the kube-up scripts. Can we get it merged in 1.15?
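Usage would then presumably look something like this (the variable name below is hypothetical; the real knob is whatever #78851 defines):

```sh
# Hypothetical knob name; check #78851 for the actual variable
COREDNS_VERSION=1.5.0 ./cluster/kube-up.sh
```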

@alejandrox1 (Contributor) commented Jun 10, 2019

Hi all, so what's the plan?
Will #78851 be merged/approved after this one, or instead of this one?

@mborsz (Member) commented Jun 10, 2019

I suggest merging this one and then #78851.

@alejandrox1 (Contributor)

Awesome, thank you @mborsz!
Could we get an approval on this?

@spiffxp (Member) commented Jun 10, 2019

/retest
/approve
/lgtm

@k8s-ci-robot added the lgtm label (Jun 10, 2019)
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rajansandeep, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Jun 10, 2019)
@jdumars (Member) commented Jun 10, 2019

/lgtm
Does someone want to remove the hold? Or I can.

@dims (Member) commented Jun 10, 2019

/hold cancel
