Testing failing nodes does not restore the cluster.... #6

Open

ajohnstone opened this issue Jan 20, 2017 · 3 comments
ajohnstone commented Jan 20, 2017

Testing failing nodes does not restore the cluster....

$ kubectl delete pods consul-2 consul-1;

HTTP error code from Consul: 500 Internal Server Error

This is an error page for the Consul web UI. You may have visited a URL that is loading an unknown resource, so you can try going back to the root.

Otherwise, please report any unexpected issues on the GitHub page.
$ kubectl  exec --tty -i consul-0 -- consul members
Node      Address           Status  Type    Build  Protocol  DC
consul-0  100.96.4.13:8301  alive   server  0.7.2  2         dc1
consul-1  100.96.7.6:8301   alive   server  0.7.2  2         dc1
consul-2  100.96.6.12:8301  alive   server  0.7.2  2         dc1

$ kubectl get pods -o wide
NAME           READY     STATUS    RESTARTS   AGE       IP            NODE
consul-0       1/1       Running   0          7h        100.96.4.13   ip-10-117-89-126.eu-west-1.compute.internal
consul-1       1/1       Running   0          8m        100.96.7.6    ip-10-117-97-131.eu-west-1.compute.internal
consul-2       1/1       Running   0          8h        100.96.6.12   ip-10-117-37-128.eu-west-1.compute.internal
docker-debug   1/1       Running   0          10h       100.96.6.2    ip-10-117-37-128.eu-west-1.compute.internal

$ kubectl  exec --tty -i consul-0 -- consul operator raft -list-peers
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)

$ kubectl  exec --tty -i consul-0 -- consul members
Node      Address           Status  Type    Build  Protocol  DC
consul-0  100.96.4.13:8301  alive   server  0.7.2  2         dc1
consul-1  100.96.7.6:8301   alive   server  0.7.2  2         dc1
consul-2  100.96.6.12:8301  alive   server  0.7.2  2         dc1

$ kubectl  exec --tty -i consul-0 -- consul monitor
...
2017/01/20 10:50:59 [WARN] raft: Election timeout reached, restarting election
2017/01/20 10:50:59 [INFO] raft: Node at 100.96.4.13:8300 [Candidate] entering Candidate state in term 4324
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.10:8300 100.96.6.10:8300}: dial tcp 100.96.6.10:8300: getsockopt: no route to host
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.11:8300 100.96.6.11:8300}: dial tcp 100.96.6.11:8300: getsockopt: no route to host
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.4.7:8300 100.96.4.7:8300}: dial tcp 100.96.4.7:8300: getsockopt: no route to host
2017/01/20 10:51:01 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.7.5:8300 100.96.7.5:8300}: dial tcp 100.96.7.5:8300: getsockopt: no route to host
2017/01/20 10:51:05 [WARN] raft: Election timeout reached, restarting election
2017/01/20 10:51:05 [INFO] raft: Node at 100.96.4.13:8300 [Candidate] entering Candidate state in term 4325
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.4.7:8300 100.96.4.7:8300}: dial tcp 100.96.4.7:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.10:8300 100.96.6.10:8300}: dial tcp 100.96.6.10:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.11:8300 100.96.6.11:8300}: dial tcp 100.96.6.11:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.7.5:8300 100.96.7.5:8300}: dial tcp 100.96.7.5:8300: getsockopt: no route to host
2017/01/20 10:51:12 [INFO] agent.rpc: Accepted client: 127.0.0.1:42080
...
@santinoncs

Hi,

This happened to me during a GKE cluster version upgrade.
I had a Consul cluster deployed and performed the available GKE upgrade from 1.5.2 to 1.5.3.
Because the upgrade restarts the nodes one by one, two Consul pods ended up on the same node.
The consensus was broken and I got the same error:

HTTP error code from Consul: 500 Internal Server Error

@santinoncs

To avoid downtime in the Consul cluster when performing a version upgrade in GKE,
I modified the StatefulSet with this:

  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - consul leave

With this, if a pod is evicted from a node, it will leave the cluster gracefully.

I also added a PodDisruptionBudget with a minAvailable of 2, so a drain will wait until that
constraint is satisfied.
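
A minimal sketch of such a PodDisruptionBudget, assuming the era-appropriate policy/v1beta1 API and an app: consul label on the server pods (both are assumptions, since the thread does not show the StatefulSet's labels):

  # Hypothetical PodDisruptionBudget; the name and the app: consul selector
  # are assumptions about how the consul server pods are labelled.
  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: consul-pdb
  spec:
    # Voluntary disruptions (e.g. a node drain) may never take the cluster
    # below two running consul servers.
    minAvailable: 2
    selector:
      matchLabels:
        app: consul

With only three servers, minAvailable: 2 means a drain can evict at most one consul pod at a time, which preserves raft quorum.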

@combatpoodle

Just ran a quick test on GKE off of PR #34, which is pretty close to mainline, just Consul 1.2 instead of 0.9.1. kill -9 on all the agents results in them getting brought back up: different hosts, but alive and synced just the same.
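
A rough way to script an equivalent check (this substitutes a forced, zero-grace-period pod deletion for the in-container kill -9, and assumes the three server pods are named consul-0 through consul-2 as above):

  # Force-delete all consul server pods at once; skipping the grace period also
  # skips the preStop hook, so this approximates the agents being killed outright.
  kubectl delete pod consul-0 consul-1 consul-2 --grace-period=0 --force

  # Once the StatefulSet has recreated the pods, confirm a leader was elected and
  # that the raft peer set points at the new pod IPs.
  kubectl exec --tty -i consul-0 -- consul operator raft -list-peers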
