diskpool stuck after force remove, cannot create a new pool with same spec.node #1656

Closed · todeb opened this issue May 14, 2024 · 7 comments
Labels: kind/bug

todeb commented May 14, 2024

Describe the bug
After force-removing a diskpool on a failed node, the diskpool is still present in kubectl mayastor get pools, and a new pool cannot be created on a new node reinitialized with the same node name.

To Reproduce

* kubectl delete diskpool pool-on-node (the finalizer has to be removed to force the delete)
* kubectl mayastor get pools (the pool is still listed)
* kubectl apply -f diskpool.yaml (creating a new disk pool fails even with a different pool name, as long as it uses the same spec.node; a sketch of such a manifest follows the list)
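
For reference, a minimal sketch of the kind of manifest applied above (names and device path are taken from this issue; the apiVersion is an assumption and may differ between releases):

apiVersion: "openebs.io/v1beta2"
kind: DiskPool
metadata:
  name: pool-on-node
  namespace: openebs
spec:
  node: node
  disks: ["/dev/sdb"]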

Mayastor sees the old pool with Unknown status:

kubectl mayastor get pools -n openebs
 ID             DISKS                                                     MANAGED  NODE           STATUS   CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-on-node  /dev/sdb                                                  true     node  Unknown  0 B       0 B        0 B        <none>

The new pool (created with a different name) is not provisioned:

NAME             NODE            STATE     POOL_STATUS   CAPACITY       USED         AVAILABLE
pool-on-nodea   node

If recreated with the same pool name:

NAME             NODE            STATE     POOL_STATUS   CAPACITY       USED         AVAILABLE
pool-on-node    node   Created   Unknown       0              0            0

Expected behavior
There should be a way to remove stuck pools so that new pools can be created.

Current Behavior
Pools are stuck and new pools cannot be created on a node with the same name.


OS info:

  • openebs.io/version: 2.6.1


tiagolobocastro (Contributor) commented:

There's currently no way of removing failed pools gracefully (this is on the roadmap), so for now you'll need to do some manual work, sorry about that.
First, find the pool spec in etcd, for example:

kubectl -n openebs exec -it openebs-etcd-0 -c etcd -- etcdctl get --prefix "" | grep PoolSpec -A1
/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/PoolSpec/pool-ksnode-2
{"node":"ksnode-2","id":"pool-ksnode-2","disks":["/dev/sdb"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}

Then delete the PoolSpec:

etcdctl del '/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/PoolSpec/pool-ksnode-2'
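
As with the get above, the del can be run inside the etcd pod, for example (same pod name and key as in this example):

kubectl -n openebs exec -it openebs-etcd-0 -c etcd -- etcdctl del '/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/PoolSpec/pool-ksnode-2'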

Then delete all ReplicaSpec entries pertaining to the same pool, for example:

/openebs.io/mayastor/apis/v0/clusters/68ea9180-d58c-40cd-8aba-2dfbdd6ba9d9/namespaces/openebs/ReplicaSpec/88bdc7fe-80ea-4c3e-bd61-3ac4760e213c
{"name":"88bdc7fe-80ea-4c3e-bd61-3ac4760e213c","uuid":"88bdc7fe-80ea-4c3e-bd61-3ac4760e213c","size":1073741824,"pool":"pool-ksnode-2","share":"none","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":null,"disown_all":false},"operation":null,"allowed_hosts":[]}

Then restart the agent-core, after which you can delete the DiskPool CR.
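
A minimal sketch of those two steps (the agent-core deployment name is an assumption based on a default install; substitute the names from your cluster):

# Restart the core agent so it reloads specs from etcd, then remove the stuck DiskPool CR.
kubectl -n openebs rollout restart deployment openebs-agent-core
kubectl -n openebs delete diskpool pool-ksnode-2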
Hope this helps

tiagolobocastro added the kind/bug and waiting on user input labels on May 14, 2024
todeb (Author) commented May 15, 2024

Works, thank you!

cmontemuino commented:

@tiagolobocastro -- I couldn't find this in the milestones (4.2 or 4.3), so perhaps there's another place where the roadmap is defined?

We're experiencing failed pools very frequently and using the hack from above. The problem is that our clusters are too large and this process just does not scale. I'd love to follow the issue, and even contribute to it.

tiagolobocastro (Contributor) commented:

You're correct, it's on the ROADMAP but there's no tracking issue for it yet.
We've been going through all the tickets on all the repos and haven't finished yet. Once we're done, expect to see some GitHub projects and all roadmap items as issues. (We're moving to projects because milestones are not very flexible, being repo-specific.)
Meanwhile, if you'd like to create an OEP for this on openebs/openebs, please do so :)

alan9-1 commented Jan 8, 2025


I followed the steps. kubectl get diskpool -n openebs no longer shows the deleted diskpool, but kubectl mayastor get pools -n openebs still shows the pool, with managed set to false:

 ID                   DISKS                                                                                                                          MANAGED  NODE        STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-node03-nvme4n1  aio:///dev/disk/by-id/nvme-SAMSUNG_MZQLB7T6HMLA-00007_S4BGNY0N101141?blk_size=4096&uuid=f669e9c5-2795-432b-81e6-9d0f72a0fb9f  false    k8s-node03  Online  7TiB      600GiB     6.4TiB     600GiB
kubectl-mayastor get block-devices k8s-node03 -n openebs shows an error:
thread 'main' panicked at 'called Option::unwrap() on a None value', dependencies/control-plane/control-plane/plugin/./src/resources/blockdevice.rs:139:26
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
@tiagolobocastro

tiagolobocastro (Contributor) commented:

Ah, sorry, there's a bug in the plugin; it seems it doesn't handle unmanaged pools!

If managed is false, then you need to delete the unmanaged pool directly via the io-engine, for example:
kubectl -n openebs exec mayastor-io-engine-xxxxxx -- io-engine-client pool destroy pool-node03-nvme4n1
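
To find the right io-engine pod (the one running on the affected node), something like this should work, assuming the pods live in the openebs namespace as above:

kubectl -n openebs get pods -o wide | grep io-engine | grep k8s-node03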

@tiagolobocastro
Copy link
Contributor

I got a fix for the crash @alan9-1: openebs/mayastor-control-plane#914
