This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

acs-engine scale reboots masters and rewrites network config #2662

Closed
carlpett opened this issue Apr 12, 2018 · 4 comments

@carlpett
Contributor

Is this a request for help?:
Yes

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
v0.15.1

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.2

What happened:
We issued an acs-engine scale command to add more nodes to our running cluster, since we were getting low on free resources.

After the new nodes were created, the following unexpected things happened:

  • All three masters were rebooted simultaneously
  • The masters got new IP addresses, which breaks at least the etcd config (the IP addresses are hard-coded from the initial cloud-init), so etcd refuses to start. A quick check is sketched below.
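A quick way to confirm the mismatch on a master (a sketch; /etc/default/etcd is where the etcd settings end up on our masters and eth0 is the primary NIC, adjust if yours differ):

```sh
# address etcd was configured with at provisioning time (from cloud-init)
grep -n "10.240.0.4" /etc/default/etcd
# address the NIC actually has after the reboot
ip -4 addr show eth0
```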

What you expected to happen:

  • New nodes added
  • No effect on existing nodes/masters

How to reproduce it (as minimally and precisely as possible):
From what we know so far, just acs-engine scale
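For reference, the invocation looked roughly like the sketch below (values are placeholders rather than our real names, and exact flag names may differ between acs-engine versions):

```sh
acs-engine scale \
  --subscription-id <subscription-id> \
  --resource-group <resource-group> \
  --location <location> \
  --deployment-dir _output/<dnsprefix> \
  --node-pool <agent-pool-name> \
  --new-node-count <desired-count> \
  --master-FQDN <dnsprefix>.<location>.cloudapp.azure.com
```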

Anything else we need to know:
This happened ~15 minutes ago, investigations and fixing still ongoing.

@carlpett
Contributor Author

Following up on this, here's a summarized list of events and steps taken:

  1. Run acs-engine scale. Deployment reports success after ~10 minutes
  2. Try to validate new nodes being added with kubectl get nodes, but get an auth error
  3. ssh into master-0, notice it has been rebooted. Check the deployment report, note that all masters were rebooted
  4. systemctl status etcd shows it is failing to start; starting it manually reveals it cannot bind to the address specified in /etc/default/etcd, 10.240.0.4
  5. Checking the IP address of master-0 reveals it now has 10.255.255.5. The other masters' IPs have also changed, to .6 and .7 respectively
  6. Change the IP address of the node in the Azure portal, reboot => etcd starts, and kubectl starts working
  7. Here we note that all the old agent nodes are marked as NotReady (new nodes still haven't registered)
  8. Kubelet logs show that it cannot reach the load-balanced Kubernetes API. We notice that the load balancer has a frontend configuration specifying IP 10.255.255.15, while the nodes are trying to reach it at 10.240.0.14.
  9. Revert the load balancer IP change => Old nodes start reporting. New ones still aren't registered.
  10. ssh to one of the new nodes and check the kubelet logs; it is trying to reach the Kubernetes API on the address we just reverted
  11. On each of the new nodes, change the server address in /var/lib/kubelet/kubeconfig and restart kubelet => New nodes are registered (see the sketch after this list)
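Condensed, the manual recovery in steps 6, 8 and 11 looked roughly like this (a sketch; the addresses are the ones from our cluster and the sed pattern is illustrative, not a general fix):

```sh
# On each master, after restoring its original IP in the Azure portal:
# confirm etcd still points at the original 10.240.0.x address, then restart it
grep -n "10.240.0" /etc/default/etcd
sudo systemctl restart etcd

# On each new agent node: point kubelet back at the API server address the
# load balancer actually serves (10.240.0.14 in our case), then restart kubelet
sudo sed -i 's/10\.255\.255\.15/10.240.0.14/' /var/lib/kubelet/kubeconfig
sudo systemctl restart kubelet
```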

Checking the original JSON files we used when we provisioned the cluster in late January, we did not explicitly set any IP-related options in our input file, but the generated apimodel.json has "firstConsecutiveStaticIP": "10.240.0.4" in the masterProfile section. I'm guessing this is what got overridden.
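For anyone else checking theirs, the value is easy to spot in the generated output (the _output/<dnsprefix> path is the acs-engine generate default; adjust if yours lives elsewhere):

```sh
grep -n "firstConsecutiveStaticIP" _output/<dnsprefix>/apimodel.json
#   "firstConsecutiveStaticIP": "10.240.0.4",
```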

I can't be 100% sure, sadly, but I'm quite confident that we used version 0.12.4 when building the cluster (is this recorded anywhere inside the cluster? If not, it might be a good idea.)

@CecileRobertMichon
Contributor

Hi @carlpett, what you're seeing is a side effect of PR #1966 (part of the 0.13.0 release), which was reverted by #2315 in v0.13.1. #1966 changed the default firstConsecutiveStaticIP, which broke backwards compatibility for all previously built clusters. Reverting to the previous default of "10.240.255.5" fixed upgrade and scale for previously built clusters, but unfortunately upgrading/scaling clusters built with version 0.13.0 requires some manual tweaking. If building a new cluster is not an option for you, you might be able to get around this issue by switching all references to "10.240.0.4" in your apimodel and in your kubeconfig to "10.240.255.5".
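Something along these lines should do it (an untested sketch: back up the files first, and the kubeconfig path depends on where yours lives):

```sh
cp apimodel.json apimodel.json.bak
sed -i 's/10\.240\.0\.4/10.240.255.5/g' apimodel.json
sed -i 's/10\.240\.0\.4/10.240.255.5/g' <path-to-your-kubeconfig>
```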

FYI I just opened #2703, I agree that it would be useful to check the acs-engine version.

@JunSun17 might be able to help with this issue since he worked on reverting the change.

@carlpett
Contributor Author

Thanks for the explanation @CecileRobertMichon!
We will discuss internally whether we can rebuild with new clusters or should try to go the tweaking route.

@carlpett carlpett changed the title acs-engine scale reboots masters and deletes network config acs-engine scale reboots masters and rewrites network config May 1, 2018
@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated; see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019