Testing failing nodes does not restore the cluster.... #6

Open

ajohnstone opened this issue Jan 20, 2017 · 3 comments
ajohnstone commented Jan 20, 2017

Testing failing nodes does not restore the cluster....

$ kubectl delete pods consul-2 consul-1;

HTTP error code from Consul: 500 Internal Server Error

This is an error page for the Consul web UI. You may have visited a URL that is loading an unknown resource, so you can try going back to the root.

Otherwise, please report any unexpected issues on the GitHub page.
$ kubectl  exec --tty -i consul-0 -- consul members
Node      Address           Status  Type    Build  Protocol  DC
consul-0  100.96.4.13:8301  alive   server  0.7.2  2         dc1
consul-1  100.96.7.6:8301   alive   server  0.7.2  2         dc1
consul-2  100.96.6.12:8301  alive   server  0.7.2  2         dc1

$ kubectl get pods -o wide
NAME           READY     STATUS    RESTARTS   AGE       IP            NODE
consul-0       1/1       Running   0          7h        100.96.4.13   ip-10-117-89-126.eu-west-1.compute.internal
consul-1       1/1       Running   0          8m        100.96.7.6    ip-10-117-97-131.eu-west-1.compute.internal
consul-2       1/1       Running   0          8h        100.96.6.12   ip-10-117-37-128.eu-west-1.compute.internal
docker-debug   1/1       Running   0          10h       100.96.6.2    ip-10-117-37-128.eu-west-1.compute.internal

$ kubectl  exec --tty -i consul-0 -- consul operator raft -list-peers
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)

$ kubectl  exec --tty -i consul-0 -- consul members
Node      Address           Status  Type    Build  Protocol  DC
consul-0  100.96.4.13:8301  alive   server  0.7.2  2         dc1
consul-1  100.96.7.6:8301   alive   server  0.7.2  2         dc1
consul-2  100.96.6.12:8301  alive   server  0.7.2  2         dc1

$ kubectl  exec --tty -i consul-0 -- consul monitor
...
2017/01/20 10:50:59 [WARN] raft: Election timeout reached, restarting election
2017/01/20 10:50:59 [INFO] raft: Node at 100.96.4.13:8300 [Candidate] entering Candidate state in term 4324
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.10:8300 100.96.6.10:8300}: dial tcp 100.96.6.10:8300: getsockopt: no route to host
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.11:8300 100.96.6.11:8300}: dial tcp 100.96.6.11:8300: getsockopt: no route to host
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.4.7:8300 100.96.4.7:8300}: dial tcp 100.96.4.7:8300: getsockopt: no route to host
2017/01/20 10:51:01 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.7.5:8300 100.96.7.5:8300}: dial tcp 100.96.7.5:8300: getsockopt: no route to host
2017/01/20 10:51:05 [WARN] raft: Election timeout reached, restarting election
2017/01/20 10:51:05 [INFO] raft: Node at 100.96.4.13:8300 [Candidate] entering Candidate state in term 4325
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.4.7:8300 100.96.4.7:8300}: dial tcp 100.96.4.7:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.10:8300 100.96.6.10:8300}: dial tcp 100.96.6.10:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.11:8300 100.96.6.11:8300}: dial tcp 100.96.6.11:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.7.5:8300 100.96.7.5:8300}: dial tcp 100.96.7.5:8300: getsockopt: no route to host
2017/01/20 10:51:12 [INFO] agent.rpc: Accepted client: 127.0.0.1:42080
...
@santinoncs

Hi,

This happened to me during a GKE cluster version upgrade.
I had a Consul cluster deployed and performed the available GKE upgrade from 1.5.2 to 1.5.3.
Because the upgrade restarts the nodes one by one, two Consul pods ended up on the same node.
The consensus was broken and I got the same error:

HTTP error code from Consul: 500 Internal Server Error

@santinoncs

To avoid downtime in the Consul cluster when performing a version upgrade in GKE,
I modified the StatefulSet with this:

  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - consul leave

With this, if a pod is evicted from a node, it will leave the cluster gracefully.

I also added a PodDisruptionBudget with a minAvailable of 2, so a drain will wait until that
constraint is satisfied.
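
A minimal sketch of such a PodDisruptionBudget, assuming the era-appropriate policy/v1beta1 API and an app: consul label on the server pods (both are assumptions, since the thread does not show the StatefulSet's labels):

  # Hypothetical PodDisruptionBudget; the name and the app: consul selector
  # are assumptions about how the consul server pods are labelled.
  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: consul-pdb
  spec:
    # Voluntary disruptions (e.g. a node drain) may never take the cluster
    # below two running consul servers.
    minAvailable: 2
    selector:
      matchLabels:
        app: consul

With only three servers, minAvailable: 2 means a drain can evict at most one consul pod at a time, which preserves raft quorum.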

@combatpoodle

Just ran a quick test on GKE off of PR #34, which is pretty close to mainline, just Consul 1.2 instead of 0.9.1. kill -9 on all the agents results in them getting brought back up: different hosts, but alive and synced just the same.
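
A rough way to script an equivalent check (this substitutes a forced, zero-grace-period pod deletion for the in-container kill -9, and assumes the three server pods are named consul-0 through consul-2 as above):

  # Force-delete all consul server pods at once; skipping the grace period also
  # skips the preStop hook, so this approximates the agents being killed outright.
  kubectl delete pod consul-0 consul-1 consul-2 --grace-period=0 --force

  # Once the StatefulSet has recreated the pods, confirm a leader was elected and
  # that the raft peer set points at the new pod IPs.
  kubectl exec --tty -i consul-0 -- consul operator raft -list-peers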
