New dev #1193

Merged
merged 4 commits into from
Nov 24, 2022

itswl commented Nov 16, 2022

Update etcd restore to match: use {{ ETCD_DATA_DIR }} as the path for the restore files, and no longer create the /etcd_backup path.

itswl commented Nov 16, 2022

Backup and restore have been run several times so far, and no problems were found.
Node information
ansible : 10.0.0.14

etcd : 10.0.0.41, 10.0.0.56, 10.0.0.219

Details

# dk ezctl  backup k8s-test
ansible-playbook -i clusters/k8s-test/hosts -e @clusters/k8s-test/config.yml playbooks/94.backup.yml
2022-11-16 12:47:52 INFO cluster:k8s-test backup begins in 5s, press any key to abort:


PLAY [localhost] ******************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ************************************************************************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [set NODE_IPS of the etcd cluster] *******************************************************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [get etcd cluster status] ****************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [debug] **********************************************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
    "ETCD_CLUSTER_STATUS": {
        "changed": true,
        "cmd": "for ip in 10.0.0.56 10.0.0.41 10.0.0.219 ;do ETCDCTL_API=3 /etc/kubeasz/bin/etcdctl --endpoints=https://\"$ip\":2379 --cacert=/etc/kubeasz/clusters/k8s-test/ssl/ca.pem --cert=/etc/kubeasz/clusters/k8s-test/ssl/etcd.pem --key=/etc/kubeasz/clusters/k8s-test/ssl/etcd-key.pem endpoint health; done",
        "delta": "0:00:00.120765",
        "end": "2022-11-16 12:48:01.223679",
        "failed": false,
        "msg": "",
        "rc": 0,
        "start": "2022-11-16 12:48:01.102914",
        "stderr": "",
        "stderr_lines": [],
        "stdout": "https://10.0.0.56:2379 is healthy: successfully committed proposal: took = 19.226564ms\nhttps://10.0.0.41:2379 is healthy: successfully committed proposal: took = 17.817878ms\nhttps://10.0.0.219:2379 is healthy: successfully committed proposal: took = 17.321196ms",
        "stdout_lines": [
            "https://10.0.0.56:2379 is healthy: successfully committed proposal: took = 19.226564ms",
            "https://10.0.0.41:2379 is healthy: successfully committed proposal: took = 17.817878ms",
            "https://10.0.0.219:2379 is healthy: successfully committed proposal: took = 17.321196ms"
        ]
    }
}

TASK [get a running ectd node] ****************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [debug] **********************************************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
    "RUNNING_NODE.stdout": "10.0.0.56"
}

TASK [get current time] ***********************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [make a backup on the etcd node] *********************************************************************************************************************************************************************************************************************************************************
changed: [localhost -> 10.0.0.56]

TASK [fetch the backup data] ******************************************************************************************************************************************************************************************************************************************************************
changed: [localhost -> 10.0.0.56]

TASK [update the latest backup] ***************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

PLAY RECAP ************************************************************************************************************************************************************************************************************************************************************************************
localhost                  : ok=10   changed=6    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
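
The backup play above checks endpoint health and then picks a running node (10.0.0.56 in this run). As a rough illustration only, that selection could be sketched as below. `first_healthy_ip` is a hypothetical helper name, not the playbook's actual implementation, and the sample lines are copied from the `stdout_lines` in the log above.

```shell
# Sketch: pick the first healthy endpoint from `etcdctl endpoint health` output.
# first_healthy_ip is a hypothetical helper, not part of kubeasz.
first_healthy_ip() {
  # Take the first line that reports "is healthy", then strip the
  # "https://" scheme and the ":port" suffix from the endpoint field.
  awk '/is healthy/ {
    sub(/^https:\/\//, "", $1)
    sub(/:.*/, "", $1)
    print $1
    exit
  }'
}

# Sample health lines copied from the log above.
sample='https://10.0.0.56:2379 is healthy: successfully committed proposal: took = 19.226564ms
https://10.0.0.41:2379 is healthy: successfully committed proposal: took = 17.817878ms
https://10.0.0.219:2379 is healthy: successfully committed proposal: took = 17.321196ms'

printf '%s\n' "$sample" | first_healthy_ip   # prints 10.0.0.56
```

Lines reporting "is unhealthy" do not match the pattern, so an unhealthy first endpoint is skipped.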

# dk ezctl restore k8s-test   # excerpt
ansible-playbook -i clusters/k8s-test/hosts -e @clusters/k8s-test/config.yml playbooks/95.restore.yml
2022-11-16 12:48:11 INFO cluster:k8s-test restore begins in 5s, press any key to abort:


PLAY [etcd] ***********************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.56]
ok: [10.0.0.219]
ok: [10.0.0.41]

TASK [cluster-restore : 停止ectd 服务] ************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.41]
changed: [10.0.0.219]

TASK [cluster-restore : 清除etcd 数据目录] **********************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 准备指定的备份etcd 数据] *******************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.41]
changed: [10.0.0.56]
changed: [10.0.0.219]

TASK [cluster-restore : 清理上次备份恢复数据] ***********************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.41]
changed: [10.0.0.219]

TASK [cluster-restore : etcd 数据恢复] ************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 恢复数据至etcd 数据目录] *******************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.41]
changed: [10.0.0.219]

TASK [cluster-restore : 重启etcd 服务] ************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 以轮询的方式等待服务同步完成] *******************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.219]
changed: [10.0.0.41]

PLAY RECAP ************************************************************************************************************************************************************************************************************************************************************************************
10.0.0.219                 : ok=9    changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
10.0.0.41                  : ok=9    changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
10.0.0.56                  : ok=9    changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

After repeated runs, on any one etcd node:

# tree
.
|-- etcd-10.0.0.56.etcd
|   `-- member
|       |-- snap
|       |   |-- 0000000000000001-0000000000000003.snap
|       |   `-- db
|       `-- wal
|           `-- 0000000000000000-0000000000000000.wal
|-- member
|   |-- snap
|   |   |-- 0000000000000001-0000000000000003.snap
|   |   `-- db
|   `-- wal
|       |-- 0000000000000000-0000000000000000.wal
|       `-- 0.tmp
`-- snapshot.db

7 directories, 8 files
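
The tree above shows `snapshot.db` and an `etcd-10.0.0.56.etcd` directory sitting next to `member`. Assuming the listing was taken inside the etcd data directory, a small check for such restore leftovers might look like the sketch below. `list_restore_leftovers` is a hypothetical helper, and the demo builds a throwaway directory rather than touching a real node.

```shell
# Sketch: list restore leftovers (etcd-*.etcd directories and snapshot.db)
# found directly inside a given directory. Hypothetical helper, for illustration.
list_restore_leftovers() {
  dir=$1
  for p in "$dir"/etcd-*.etcd "$dir"/snapshot.db; do
    # An unmatched glob stays literal, so test for existence before printing.
    if [ -e "$p" ]; then
      printf '%s\n' "${p##*/}"
    fi
  done
  return 0
}

# Demo on a temporary directory shaped like the tree above.
demo=$(mktemp -d)
mkdir -p "$demo/member/snap" "$demo/member/wal" "$demo/etcd-10.0.0.56.etcd/member"
: > "$demo/snapshot.db"
list_restore_leftovers "$demo"
rm -rf "$demo"
```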

itswl commented Nov 16, 2022

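Reworked the logic: the restore files are now generated on the Ansible control node and then distributed to each etcd node. No extra directories or files are created on the etcd nodes.

Tested with no problems.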

# dk ezctl restore k8s-test
ansible-playbook -i clusters/k8s-test/hosts -e @clusters/k8s-test/config.yml playbooks/95.restore.yml
2022-11-16 17:27:18 INFO cluster:k8s-test restore begins in 5s, press any key to abort:


PLAY [etcd] **************************************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.56]
ok: [10.0.0.219]
ok: [10.0.0.41]

TASK [cluster-restore : 停止ectd 服务] ***************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.219]
ok: [10.0.0.41]
ok: [10.0.0.56]

TASK [cluster-restore : 清除etcd 数据目录] *************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.41]
ok: [10.0.0.56]
ok: [10.0.0.219]

TASK [cluster-restore : 清除 etcd 备份目录] ************************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56 -> 127.0.0.1]

TASK [cluster-restore : etcd 数据恢复] ***************************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56 -> 127.0.0.1]

TASK [cluster-restore : 分发备份文件到 etcd 各个节点] *******************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 重启etcd 服务] ***************************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.41]
changed: [10.0.0.56]

TASK [cluster-restore : 以轮询的方式等待服务同步完成] **********************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.219]
changed: [10.0.0.41]

PLAY RECAP ***************************************************************************************************************************************************************************************************************************************************************************************************
10.0.0.219                 : ok=6    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
10.0.0.41                  : ok=6    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
10.0.0.56                  : ok=8    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

gjmzj merged commit 13439cb into easzlab:master on Nov 24, 2022
kubeasz pushed a commit that referenced this pull request Jan 7, 2023
Optimize etcd restore logic
gjmzj commented Apr 16, 2023

The restore script has a problem: restoring a 3-node etcd cluster with it leaves the cluster with 3 leaders.

for ip in ${NODE_IPS}; do ETCDCTL_API=3 etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --write-out=table endpoint status; done
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.96:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5261 |               5261 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.97:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5323 |               5323 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.98:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5270 |               5270 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Reverting to the original script makes it work normally again.
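
In the table above, every endpoint reports `IS LEADER` as true and all three rows carry the same member ID, which suggests each node came up as its own single-member cluster. A minimal sketch for spotting this from `endpoint status --write-out=table` output follows. `count_leaders` and `distinct_member_ids` are hypothetical helper names, and the column positions are assumed from the table layout shown in this thread.

```shell
# Sketch: sanity-check `etcdctl endpoint status --write-out=table` output.
# Both helpers are hypothetical and rely on the column order shown above:
# ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | ...

# Count data rows whose IS LEADER column is "true".
count_leaders() {
  awk -F'|' '$2 ~ /https?:\/\// {
    gsub(/ /, "", $6)
    if ($6 == "true") n++
  } END { print n + 0 }'
}

# Count how many different member IDs appear in the data rows.
distinct_member_ids() {
  awk -F'|' '$2 ~ /https?:\/\// { gsub(/ /, "", $3); print $3 }' \
    | sort -u | wc -l | tr -d ' '
}

# Data rows copied from the three tables above.
broken='| https://192.168.0.96:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5261 |               5261 |        |
| https://192.168.0.97:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5323 |               5323 |        |
| https://192.168.0.98:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5270 |               5270 |        |'

printf '%s\n' "$broken" | count_leaders         # prints 3
printf '%s\n' "$broken" | distinct_member_ids   # prints 1
```

A healthy 3-node cluster would be expected to show exactly one leader and three distinct member IDs.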

kubeasz pushed a commit that referenced this pull request Apr 16, 2023
itswl commented Apr 16, 2023

Sorry, it looks like it can't be done that way.

# Stop the etcd service
- name: 停止ectd 服务
  service: name=etcd state=stopped

# Remove the etcd data directory
- name: 清除etcd 数据目录
  file: name={{ ETCD_DATA_DIR }}/member state=absent

# Remove the old etcd backup file
- name: 清除etcd 备份文件
  file: name={{ ETCD_DATA_DIR }}/snapshot.db state=absent

# Remove files left over from a previous restore
- name: 清除历史恢复文件
  file: name={{ ETCD_DATA_DIR }}/etcd-{{ inventory_hostname }}.etcd state=absent

# Copy the backup file to every node
- name: 拷贝备份文件到各节点
  copy:
    src: "{{ cluster_dir }}/backup/snapshot.db"
    dest: "{{ ETCD_DATA_DIR }}/snapshot.db"

# Restore the snapshot on each node, using that node's own name and peer URL
- name: etcd 数据恢复
  shell: "cd {{ ETCD_DATA_DIR }} && \
    ETCDCTL_API=3 {{ bin_dir }}/etcdctl snapshot restore snapshot.db \
    --name etcd-{{ inventory_hostname }} \
    --initial-cluster {{ ETCD_NODES }} \
    --initial-cluster-token etcd-cluster-0 \
    --initial-advertise-peer-urls https://{{ inventory_hostname }}:2380"

# Copy the restored data into the etcd data directory
- name: 恢复数据至etcd 数据目录
  shell: "cp -rf {{ ETCD_DATA_DIR }}/etcd-{{ inventory_hostname }}.etcd/member {{ ETCD_DATA_DIR }}/"

# Restart the etcd service
- name: 重启etcd 服务
  service: name=etcd state=restarted

# Poll until the service reports active
- name: 以轮询的方式等待服务同步完成
  shell: "systemctl is-active etcd.service"
  register: etcd_status
  until: '"active" in etcd_status.stdout'
  retries: 8
  delay: 8

I've changed it back like this, with the directories adjusted.

ETCDCTL_API=3 etcdctl   -w table  --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://172.20.19.17:2379,https://172.20.19.14:2379,https://172.20.19.9:2379 endpoint status
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.20.19.17:2379 | 3d0481e94aabd34d |   3.5.5 |   56 MB |     false |      false |         2 |       6369 |               6369 |        |
| https://172.20.19.14:2379 | e7b3523af07db303 |   3.5.5 |   56 MB |     false |      false |         2 |       6369 |               6369 |        |
|  https://172.20.19.9:2379 | 232989330c375192 |   3.5.5 |   56 MB |      true |      false |         2 |       6369 |               6369 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
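
The restore task above runs `etcdctl snapshot restore` on every node with that node's own `--name` and `--initial-advertise-peer-urls`, plus a shared `--initial-cluster` list. The `--initial-cluster` flag takes comma-separated `name=peerURL` pairs. The sketch below shows how such a list could be assembled from node IPs. `build_initial_cluster` is a hypothetical helper, and it is an assumption that `ETCD_NODES` follows the same `etcd-<ip>=https://<ip>:2380` pattern as the `--name` and peer URL in the task.

```shell
# Sketch: build an --initial-cluster value from a list of node IPs.
# Hypothetical helper; assumes member names of the form etcd-<ip> and
# peer URLs of the form https://<ip>:2380, as used in the task above.
build_initial_cluster() {
  out=""
  for ip in "$@"; do
    # Append a comma before every pair except the first.
    out="${out:+$out,}etcd-${ip}=https://${ip}:2380"
  done
  printf '%s\n' "$out"
}

# Node IPs taken from the test cluster earlier in this thread.
build_initial_cluster 10.0.0.56 10.0.0.41 10.0.0.219
```

Each node receives the same list but a different `--name`, so the restored members end up with distinct identities within one cluster.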

kubeasz pushed a commit that referenced this pull request Apr 16, 2023
kubeasz pushed a commit that referenced this pull request Apr 16, 2023