This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

acs-engine scale reboots masters and rewrites network config #2662

Closed
carlpett opened this issue Apr 12, 2018 · 4 comments

@carlpett
Contributor

Is this a request for help?:
Yes

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
v0.15.1

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.2

What happened:
We issued an acs-engine scale command to add more nodes to our running cluster, since we were getting low on free resources.

After the new nodes were created, the following unexpected things happened:

  • All three masters were rebooted simultaneously
  • The masters got new IP addresses, which breaks at least the etcd config (the IP addresses are hard-coded from the initial cloud-init), so etcd refuses to start. A quick check is sketched below.
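A quick way to confirm the mismatch on a master (a sketch; /etc/default/etcd is where the etcd settings end up on our masters and eth0 is the primary NIC, adjust if yours differ):

```sh
# address etcd was configured with at provisioning time (from cloud-init)
grep -n "10.240.0.4" /etc/default/etcd
# address the NIC actually has after the reboot
ip -4 addr show eth0
```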

What you expected to happen:

  • New nodes added
  • No effect on existing nodes/masters

How to reproduce it (as minimally and precisely as possible):
From what we know so far, just acs-engine scale
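For reference, the invocation looked roughly like the sketch below (values are placeholders rather than our real names, and exact flag names may differ between acs-engine versions):

```sh
acs-engine scale \
  --subscription-id <subscription-id> \
  --resource-group <resource-group> \
  --location <location> \
  --deployment-dir _output/<dnsprefix> \
  --node-pool <agent-pool-name> \
  --new-node-count <desired-count> \
  --master-FQDN <dnsprefix>.<location>.cloudapp.azure.com
```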

Anything else we need to know:
This happened ~15 minutes ago, investigations and fixing still ongoing.

@carlpett
Contributor Author

Following up on this, here's a summarized list of events and steps taken:

  1. Run acs-engine scale. Deployment reports success after ~10 minutes
  2. Try to validate new nodes being added with kubectl get nodes, but get an auth error
  3. ssh into master-0, notice it has been rebooted. Check the deployment report, note that all masters were rebooted
  4. systemctl status etcd shows it is failing to start; starting it manually reveals it cannot bind to the address specified in /etc/default/etcd, 10.240.0.4
  5. Checking the IP address of master-0 reveals it now has 10.255.255.5. The other masters' IPs have also changed, to .6 and .7 respectively
  6. Change the IP address of the node in the Azure portal, reboot => etcd starts, and kubectl starts working
  7. Here we note that all the old agent nodes are marked as NotReady (new nodes still haven't registered)
  8. Kubelet logs show that it cannot reach the load-balanced Kubernetes API. We notice that the load balancer has a frontend configuration specifying IP 10.255.255.15, while the nodes are trying to reach it at 10.240.0.14.
  9. Revert the load balancer IP change => Old nodes start reporting. New ones still aren't registered.
  10. ssh to one of the new nodes and check the kubelet logs; it is trying to reach the Kubernetes API on the address we just reverted
  11. On each of the new nodes, change the server address in /var/lib/kubelet/kubeconfig and restart kubelet => New nodes are registered (see the sketch after this list)
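Condensed, the manual recovery in steps 6, 8 and 11 looked roughly like this (a sketch; the addresses are the ones from our cluster and the sed pattern is illustrative, not a general fix):

```sh
# On each master, after restoring its original IP in the Azure portal:
# confirm etcd still points at the original 10.240.0.x address, then restart it
grep -n "10.240.0" /etc/default/etcd
sudo systemctl restart etcd

# On each new agent node: point kubelet back at the API server address the
# load balancer actually serves (10.240.0.14 in our case), then restart kubelet
sudo sed -i 's/10\.255\.255\.15/10.240.0.14/' /var/lib/kubelet/kubeconfig
sudo systemctl restart kubelet
```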

Checking the original JSON files we used when we provisioned the cluster in late January, we did not explicitly set any IP-related options in our input file, but the generated apimodel.json has "firstConsecutiveStaticIP": "10.240.0.4" in the masterProfile section. I'm guessing this is what got overridden.
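For anyone else checking theirs, the value is easy to spot in the generated output (the _output/<dnsprefix> path is the acs-engine generate default; adjust if yours lives elsewhere):

```sh
grep -n "firstConsecutiveStaticIP" _output/<dnsprefix>/apimodel.json
#   "firstConsecutiveStaticIP": "10.240.0.4",
```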

I can't be 100% sure, sadly, but I'm quite confident that we used version 0.12.4 when building the cluster (is this recorded anywhere inside the cluster? If not, it might be a good idea.)

@CecileRobertMichon
Contributor

Hi @carlpett, what you're seeing is a side effect of PR #1966 (part of the 0.13.0 release), which was reverted by #2315 in v0.13.1. #1966 changed the default firstConsecutiveStaticIP, which broke backwards compatibility for all previously built clusters. Reverting to the previous default of "10.240.255.5" fixed upgrade and scale for previously built clusters, but unfortunately upgrading/scaling clusters built with version 0.13.0 requires some manual tweaking. If building a new cluster is not an option for you, you might be able to get around this issue by switching all references to "10.240.0.4" in your apimodel and in your kubeconfig to "10.240.255.5".
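Something along these lines should do it (an untested sketch: back up the files first, and the kubeconfig path depends on where yours lives):

```sh
cp apimodel.json apimodel.json.bak
sed -i 's/10\.240\.0\.4/10.240.255.5/g' apimodel.json
sed -i 's/10\.240\.0\.4/10.240.255.5/g' <path-to-your-kubeconfig>
```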

FYI I just opened #2703, I agree that it would be useful to check the acs-engine version.

@JunSun17 might be able to help with this issue since he worked on reverting the change.

@carlpett
Contributor Author

Thanks for the explanation @CecileRobertMichon!
We will discuss internally whether we can rebuild with new clusters or should try to go the tweaking route.

@carlpett carlpett changed the title acs-engine scale reboots masters and deletes network config acs-engine scale reboots masters and rewrites network config May 1, 2018
@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated; see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019