diskpool stuck after force remove, cannot create a new pool with same spec.node #1656

Closed · todeb opened this issue May 14, 2024 · 7 comments
Labels: kind/bug

todeb commented May 14, 2024

Describe the bug
After force-removing a diskpool on a failed node, the diskpool is still present in kubectl mayastor get pools, and a new pool cannot be created on a new node reinitialized with the same node name.

To Reproduce

* kubectl delete diskpool pool-on-node (the finalizer has to be removed to force the delete)
* kubectl mayastor get pools (the pool is still listed)
* kubectl apply -f diskpool.yaml (creating a new disk pool fails even with a different pool name, as long as it uses the same spec.node; a sketch of such a manifest follows the list)
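
For reference, a minimal sketch of the kind of manifest applied above (names and device path are taken from this issue; the apiVersion is an assumption and may differ between releases):

apiVersion: "openebs.io/v1beta2"
kind: DiskPool
metadata:
  name: pool-on-node
  namespace: openebs
spec:
  node: node
  disks: ["/dev/sdb"]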

Mayastor sees the old pool with Unknown status:

kubectl mayastor get pools -n openebs
 ID             DISKS                                                     MANAGED  NODE           STATUS   CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-on-node  /dev/sdb                                                  true     node  Unknown  0 B       0 B        0 B        <none>

The new pool (created with a different name) is not provisioned:

NAME             NODE            STATE     POOL_STATUS   CAPACITY       USED         AVAILABLE
pool-on-nodea   node

If recreated with the same pool name:

NAME             NODE            STATE     POOL_STATUS   CAPACITY       USED         AVAILABLE
pool-on-node    node   Created   Unknown       0              0            0

Expected behavior
There should be a way to remove stuck pools so that new pools can be created.

Current Behavior
Pools are stuck and new pools cannot be created on a node with the same name.


OS info:

  • openebs.io/version: 2.6.1


tiagolobocastro (Contributor) commented:

There's currently no way of removing failed pools gracefully (this is on the roadmap), so for now you'll need to do some manual work, sorry about that.
First, find the pool spec in etcd, for example:

kubectl -n openebs exec -it openebs-etcd-0 -c etcd -- etcdctl get --prefix "" | grep PoolSpec -A1
/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/PoolSpec/pool-ksnode-2
{"node":"ksnode-2","id":"pool-ksnode-2","disks":["/dev/sdb"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}

Then delete the PoolSpec:

etcdctl del '/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/PoolSpec/pool-ksnode-2'
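
As with the get above, the del can be run inside the etcd pod, for example (same pod name and key as in this example):

kubectl -n openebs exec -it openebs-etcd-0 -c etcd -- etcdctl del '/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/PoolSpec/pool-ksnode-2'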

Then delete all ReplicaSpec entries pertaining to the same pool, for example:

/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/ReplicaSpec/88bdc7fe-80ea-4c3e-bd61-3ac4760e213c
{"name":"88bdc7fe-80ea-4c3e-bd61-3ac4760e213c","uuid":"88bdc7fe-80ea-4c3e-bd61-3ac4760e213c","size":1073741824,"pool":"pool-ksnode-2","share":"none","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":null,"disown_all":false},"operation":null,"allowed_hosts":[]}

Then restart the agent-core, after which you can delete the DiskPool CR.
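
A minimal sketch of those two steps (the agent-core deployment name is an assumption based on a default install; substitute the names from your cluster):

# Restart the core agent so it reloads specs from etcd, then remove the stuck DiskPool CR.
kubectl -n openebs rollout restart deployment openebs-agent-core
kubectl -n openebs delete diskpool pool-ksnode-2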
Hope this helps

tiagolobocastro added the kind/bug and waiting on user input labels on May 14, 2024
todeb (Author) commented May 15, 2024

Works, thank you!

cmontemuino commented:

@tiagolobocastro -- I couldn't find this in the milestones (4.2 or 4.3), so perhaps there's another place where the roadmap is defined?

We're experiencing failed pools very frequently and using the hack from above. The problem is that our clusters are too large and this process just does not scale. I'd love to follow the issue, and even contribute to it.

tiagolobocastro (Contributor) commented:

You're correct, it's on the ROADMAP but there's no tracking issue for it yet.
We've been going through all the tickets on all the repos and haven't finished yet. Once we're done, expect to see some GitHub projects and all roadmap items as issues. (We're moving to projects because milestones are not very flexible, being repo-specific.)
Meanwhile, if you'd like to create an OEP for this on openebs/openebs, please do so :)

alan9-1 commented Jan 8, 2025


I followed the steps. kubectl get diskpool -n openebs no longer shows the deleted diskpool, but kubectl mayastor get pools -n openebs still shows the pool, with managed set to false:

 ID                   DISKS                                                                                                                          MANAGED  NODE        STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-node03-nvme4n1  aio:///dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNY0N101141?blk_size=4096&uuid=f669e9c5-2795-432b-81e6-9d0f72a0fb9f  false    k8s-node03  Online  7TiB      600GiB     6.4TiB     600GiB
kubectl-mayastor get block-devices k8s-node03 -n openebs shows an error:
thread 'main' panicked at 'called Option::unwrap() on a None value', dependencies/control-plane/control-plane/plugin/./src/resources/blockdevice.rs:139:26
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
@tiagolobocastro

tiagolobocastro (Contributor) commented:

Ah, sorry, there's a bug in the plugin; it seems it doesn't handle unmanaged pools!

If managed is false, then you need to delete the unmanaged pool directly via the io-engine, for example:
kubectl -n openebs exec mayastor-io-engine-xxxxxx -- io-engine-client pool destroy pool-node03-nvme4n1
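
To find the right io-engine pod (the one running on the affected node), something like this should work, assuming the pods live in the openebs namespace as above:

kubectl -n openebs get pods -o wide | grep io-engine | grep k8s-node03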

@tiagolobocastro
Copy link
Contributor

I got a fix for the crash @alan9-1: openebs/mayastor-control-plane#914
