ZFS Kernel Panic at boot on Arch Linux #11480
Comments
Since I didn't want to waste my disks' write cycles, I decided to create a backup and manually copy the data from one ZFS pool to another. Edit: clarified ZFS versions.
I don't usually like posts like this, but "me too": https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tc8e82b0bb2e8bc0f/3-days-of-constant-txgsync-100-blocking-io
There are more examples of this above, but see the zfs-discuss thread for more details of the issues I'm seeing, including constant random writes by ZFS (happening with no other services running, and in fact happening whilst no datasets are even mounted).
Hi @TheDome! Thanks for taking the time to file the issue. I've got a couple of questions for you. The livelist code used to have two bugs: one related to device removal (which only existed while the PR was open; the fix was merged together with the initial upstream commit), and one related to clone promotions (#10652). The former is, I think, irrelevant to you, as the config printed by zdb shows you don't have any removed/indirect vdevs. The latter also seems unlikely, looking at the stack trace that you've posted. cc: @ahrens
Hi @sdimitro! Thanks for taking the time to resolve this issue. I upgraded to ZFS v2.0.1. Additionally, I'd like to point out that I am using swap on a ZFS vdev, but swap is never used on my system, which has enough RAM to avoid it. Also, I am not experiencing a deadlock in ZFS resulting in a complete kernel crash, but several
I am also observing identical symptoms on a ZFS pool with 3x NVMe SSDs (all three as single vdevs). The pool is used for fast temporary storage of large builds, with clones and occasional promotions of clones. The version is "zfs-2.0.1-0york0~18.04" on Ubuntu 18.04.2 (via https://launchpad.net/~jonathonf/+archive/ubuntu/zfs) with kernel 5.3.0-59-generic. I have fully imaged the pool drives for archival on another system and can replicate the issue every time in a virtual machine with the same distro/debs + kernel 5.4.0-64-generic. The panic happens on pool import, where there is an instant dmesg error, attached below. At first I thought it was a normal part of operations with clones/dataset deletions, but it had not stopped after 20 hours of continuous writes, at which point I force-stopped the whole thing because I noticed these were real writes using up my SSDs' NAND for no reason. The pool disks were imaged to files on another machine and snapshotted in separate datasets. Then I started a VM with a clean install of 18.04.2 + the 2.0.1 ZFS backport and the image files mounted as virtual drives. The random IO workload does seem to change things on disk, as I am seeing steadily increasing space usage compared to the original dataset snapshots. I will try to test a master branch build and some 2.0.0 RCs, but I can probably only go back so far because of the new pool features in 2.0.0.
Tested with master branch builds and a few RCs back: same results. It seems the problem is embedded in the pool itself rather than the code, but I have no idea how to replicate the corruption from a good pool. In the meantime, the pool's replicated datasets (sent with zfs send to a clean pool) show no obvious signs of corruption.
Same here: #11603
I have seen the same behaviour since ZFS 2.0.0 or 2.0.1 (I can't remember exactly, sorry) with the latest Arch Linux kernel, on an upgraded pool with a Docker dataset; dedup was active on some datasets and is currently still active on one. On my pool this generates constant writes of ~200-300 MB/s (thanks to this being an SSD-only pool 😬). While a scrub is running the writes disappear; the scrub finds no errors; after scrubbing the writes immediately reappear. Sometimes there are a few seconds without the massive writes; I suspect that's when a "real"/larger synchronous write happens or the dirty data is flushed to disk. Things I've tried that did not help:
Running
I have a hunch that something in how Docker handles cloning/snapshotting/etc., together with enabled dedup, may be the culprit here, but I have no strong proof other than the fact that "enabling dedup", "upgrading to 2.0.0"/"upgrading the pool" and "massive writes occurring" all happened around the same time, which could also be a coincidence. So my questions are now:
Given the above, does it make sense to skip enabling the livelist feature on zpool upgrades for now, so that those who haven't upgraded yet won't run into the issue? Also, since livelists seem to be an optimization: could a pool property disable their use when they already exist (to work around broken data structures like the one here)?
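For anyone wondering whether their own pool already uses the feature, it can be checked with a one-liner; the pool name below is a placeholder:

```sh
# Show the state of the livelist pool feature on a pool named "tank":
# "disabled" = never enabled, "enabled" = turned on but not in use,
# "active"   = livelists currently exist on disk.
zpool get feature@livelist tank
```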
I have the same issue. I migrated to a new pool yesterday.
A very stupid workaround, without checking for all the potential side effects (that is, it might eat all your data; therefore deliberately no patch or commit): in module/zfs/dsl_deadlist.c, wrap the node insertion so that a duplicate entry is skipped instead of tripping the assertion. That gets around the panic and seems to avoid the sustained txg_sync write traffic that drowns out everything else. It merely means that a given node isn't added twice: if ZFS tries to do that, the on-disk data is probably already bad (after all, there's an unexpected duplicate addition of a node), but there's no need to add massive amounts of unknown write traffic on top of that. (After running this patch for a week, including a scrub: the dedup ratio is 11258999068423.99x, which seems... improbable, but otherwise the pool, a collection of mirrors providing 23 TB net space, seems healthy.)
@mherkazandjian anyhow, I really tried to avoid migrating ~35 TB to a new pool, and for a moment I had a glimpse that it would be possible to downgrade back to 0.8 (not the pool feature set, but RW access with the 0.8 module). zstd and also livelist are features which can flip back to "enabled", so far so good (I was thinking)... so I just had to replicate all filesystems where zstd was used, without the -c option. An easy task; unfortunately the pool had already dropped into that stage of killing threads from the deferred thread pool. After destroying the old filesystems, all that stuff was just stuck there in the "freeing" stage without any progress (a few hundred GB). At this moment it was not yet trashing the disks with the never-ending IO, and I remembered some recent issues (and solutions) here about unmount -> mount back. And because the freeing stuff had already been stuck for more than 24h, I switched into single-user runlevel, unmounted, exported, rebooted... long story short, I hacked the avl_find() assertion (as @pgeorgi did) and rebooted with the patched module.
Again checkpoint, then zdb (walking blocks, metadata, leak detection): 0 issues, no extra activity, dbgmsg nice and clean, pool back at full workload (and I didn't even remove the clones; still more than 300 alive). It still didn't feel right somehow, so, full of expectations (because livelist and zstd were back to "enabled"), I rebooted into 0.8, only to realize that the log_spacemap feature would stop me for good. So for now I still run 2.x. Anyhow, what is most interesting to me about this story is the speed at which the issues progressed. I was waiting for some minor updates on 2.x, updated to 2.0.3 on the 4th of March, and didn't upgrade the pool for about a week, so we are at around the 12th of March; then came the fast downfall described a few lines back. All this being said: no leaks left behind, scrub shows no checksum errors, and I even took some backups out of the vault and did a diff on random samples, also fully identical, at least for the cold archive data. It's harder to say for the hot stuff being changed hour after hour, but that is DBs and other things with complex structures and their own checksum/integrity checks, so it would have popped up by now. That means it's still the same ZFS and its data: at peace, healthy, solid and sound. btw: I have already seen stupid discussions around this issue, comparing it to early btrfs etc. Here I would just quote Luke 23:34: "Father, forgive them, for they do not know what they are doing." mk
After reading @mk01's comment, I noticed that my pool also had a few GBs stuck in the freeing state. I basically just applied @pgeorgi's patch, rebooted, and opened terminals with some monitoring. Only one weirdness remained:
Then I reverted the patch, rebooted, and now my pool behaves normally again! Thank you both very much! ❤️ For good measure I also ran a scrub. So it would probably be a good idea to disable the livelist feature if you ever want to build Docker containers on your pool :D
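For reference, the kind of waiting/monitoring described above can be done with standard tools; a small sketch, with the pool name "tank" as a placeholder:

```sh
# In one terminal: per-vdev write traffic, refreshed every 5 seconds.
zpool iostat -v tank 5

# In another terminal: space still queued for freeing (should shrink to 0
# once the deferred destroys have been processed).
watch -n 5 'zpool get -H -o value freeing tank'
```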
Sigh... me too. Will there be an official solution/fix for this? I'm a bit afraid of hacking around with my data.
I hit this kernel panic without running "zfs destroy". I think it's due to one of my cloned zvols used for iSCSI.
We would like to get to the bottom of this. Does someone have a pool that's in this state and is able to work with us to reproduce and diagnose it? We will probably start by running zdb -y on the damaged pool.
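For anyone wanting to run the same check themselves, a minimal sketch; the pool name is a placeholder, and on an affected pool this is expected to hit the same assertion and dump core rather than complete cleanly:

```sh
# Livelist verification with zdb on a pool named "tank".
# On a damaged pool this walks the clone livelists and will likely trip
# the same assertion, leaving a core file that is useful for debugging.
sudo zdb -y tank
```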
I will try to reproduce it in a VM in the next few days (my affected system has already been migrated to a new pool).
Yes, still constant heavy IO, for months now.
Created a test VM but still no success in corrupting it. Still trying...
@sblive could you get the stack trace for where we're hitting that assertion? You should be able to load the core file from zdb into gdb.
I'm (also) a software dev, but I've never really used gdb much. Can you provide the commands?
@sblive - once you have the core file from the zdb crash, load it into gdb along with the zdb binary. Then once you're in gdb you can run the backtrace command.
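For reference, a sketch of that workflow; the binary and core-file paths below are placeholders:

```sh
# Load the zdb binary together with its core dump.
gdb /usr/sbin/zdb /var/crash/zdb.core

# Then, at the (gdb) prompt:
#   bt                    # backtrace of the crashing thread
#   thread apply all bt   # backtraces of every thread, if needed
```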
I have also tried corrupting a pool again using
@TheDome here are some questions out of curiosity:
@stooone Could you detail how you accomplished that? I still have no luck reproducing this error, even though I used the VM with the same workload I used when corrupting two pools in one day.
@yottabit42 and @stooone we would definitely like to work with you both on debugging this. Would you be interested in joining the OpenZFS Slack so that we can communicate in real time? If so, send me your email address (to [email protected]) or join using this temporary link.
I was just playing with Docker and then I mass-deleted some containers with: docker rm $(docker ps -a -q) I think that was what triggered something for me (not 100% sure, because I was playing with Docker without intending to reproduce the bug, after having given up on that ;)).
Sounds great! Then I need to try more. But I can confirm that this is likely the trigger, since that is exactly what I did to destroy my ZFS. :)
@ahrens, I have joined the Slack. Which channel should I join to discuss further? I am preparing to zfs send all of my datasets and zvols to a backup provider, which will take ~5-6 days. After that, I am prepared to let you poke around on the zpool, even if it's destructive.
@stooone I just added you to the OpenZFS Slack; you should have gotten an email. Feel free to DM me personally there once you join. Some context on the issue for those who want to try to reproduce it:
That would explain why I first encountered this right after building Docker containers with a lot of build steps 🤔
Hi folks! I'd like to provide a critical update on this issue. @ahrens and I believe that we have root-caused the issue, and I'm currently validating our hypothesis. We believe the problem to be a bug in the interaction between the livelist code and dedup. Regardless of whether you have dedup currently enabled, the panic from this bug can still show up later if you had dedup enabled in the past while working on a ZFS clone. Assuming my validation verifies our hypothesis, the fix for the bug should be simple and the recovery of your pools should be straightforward. There won't be any need to destroy or run any other administrative commands on your datasets, nor will there be any on-disk format changes/updates to your pools. You'll only need to update your software (the ZFS module that contains the fix) and everything should work like nothing happened. I'll keep this thread posted with my verification progress and the subsequent PRs that fix the issue. Special shoutouts go to @yottabit42 and @stooone for engaging with us; your pool data and debugging output were crucial to this effort.
Update: I was able to verify our hypothesis and reproduce the issue on my machine. Here is a reproducer, together with some comments for those who are curious:
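A rough sketch of that sequence follows; it is only an illustration of the steps described here and in the fix's commit message (dedup-enabled clone, duplicate data freed inside the clone, then a clone destroy), not the exact ZTS script. Pool, device and dataset names and sizes are placeholders, and on an unpatched 2.0.x module it may panic the host, so run it only in a throwaway VM.

```sh
#!/bin/sh
# Sketch of the described reproduction sequence; all names are placeholders.

zpool create -O dedup=on tank /dev/vdb       # dedup inherited by all datasets

# Origin filesystem with some data, so the clone stays mostly shared with
# its origin (livelists are only kept while most blocks remain shared).
zfs create tank/fs
dd if=/dev/urandom of=/tank/fs/base bs=1M count=256
zfs snapshot tank/fs@snap
zfs clone tank/fs@snap tank/clone            # clone tracks its own blocks in a livelist

# Inside the clone, write new data and duplicate it so the second copy
# dedups against the first (same block pointers, DDT refcount of 2).
dd if=/dev/urandom of=/tank/clone/a bs=1M count=8
cp /tank/clone/a /tank/clone/b
sync

# Remove both copies: with dedup this can record consecutive FREE entries
# for the same block pointer in the clone's livelist, with no ALLOC between.
rm /tank/clone/a /tank/clone/b
sync

# Destroying the clone (the last command) processes the livelist and trips
# the assertion on unpatched modules.
zfs destroy tank/clone
```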
Running the above shell script panics the system while the last command is executed. The stack trace of this panic matches what others have posted in this issue:
With this verification of our root cause out of the way, I'll be posting a review soon that fixes the issue. A variant of the above reproducer will also be added to our test suite.
= Problem
When the livelist logic was designed, it didn't take into account that when dedup is enabled the sublivelists can have consecutive FREE entries for the same block without an ALLOC entry for it in between them. This caused panics in systems that were deleting/condensing clones with dedup enabled.
= This patch
Update the logic to handle the dedup case of consecutive FREEs in the livelist code. The logic still ensures that all the FREE entries are matched up with a respective ALLOC, by keeping a refcount for each FREE blkptr that we encounter and ensuring that this refcount gets to zero by the time we are done processing the livelist.
= Testing
After I reproduced the issue with a shell script, I added a variant of that shell script to ZTS. After ensuring that this new test panics the system the same way as the original reproducer, I tried it against the updated logic in this patch and verified that the system no longer panics.
= Side Fixes
* zdb -y no longer panics when encountering double frees
Reviewed-by: Serapheim Dimitropoulos <[email protected]> Closes openzfs#11480
Update the logic to handle the dedup-case of consecutive FREEs in the livelist code. The logic still ensures that all the FREE entries are matched up with a respective ALLOC by keeping a refcount for each FREE blkptr that we encounter and ensuring that this refcount gets to zero by the time we are done processing the livelist. zdb -y no longer panics when encountering double frees Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #11480 Closes #12177
Is there a release or branch I can build from that fixes this, or a patch to apply to 2.0.5? I previously had dedup enabled and am running Docker. I built and installed the 2.0.5 release but the issue hasn't gone away. I'm very eager to get this system to cool off: txg_sync is pegged, I'm still seeing these: "INFO: task z_livelist_dest:1896 blocked...", and I still have data stuck in freeing.
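A few generic checks that may help confirm what is actually running and whether the symptoms persist; the pool name is a placeholder:

```sh
# Confirm which ZFS version the kernel is actually running; a freshly built
# package does not help if the old module is still loaded.
zpool version
cat /sys/module/zfs/version

# The symptoms mentioned above: hung livelist-destroy tasks and data
# still queued for freeing on a pool named "tank".
dmesg | grep -i z_livelist
zpool get freeing tank
```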
I am still having the same problem with 2.1.0, even though the patch is supposed to be merged there.
I am having a probably related problem with 2.1.2 (FreeBSD stable/13 as of January 2022). I was using a tool (poudriere) that does a lot of snapshotting, on a pool with dedup=verify. Now zfs destroy tends to cause a crash in livelist_compare. The problem persists across reboots.
I have checked that the patch meant to fix this problem is present in my kernel sources. Should this assertion be removed from dsl_deadlist.c?
I found https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261538 which points to this issue. I have the same crash on FreeBSD 15 (built from source on the 23rd of December 2023), so I think it's still an issue.
@opsec, @VoxSciurorum, #15732 should fix it. |
Hi there,
System info
I am using the latest stable ZFS DKMS release on Artix Linux with s6-init:
zfs-2.0.1-1
Current behavior
Currently, when I boot the system, the virtual console shows a log message regarding the ZFS AVL tree.
Additionally, the kernel kills a ZFS thread every now and then, resulting in the following stack trace:
Since I am using ZFS on a mirrored rotational vdev, I can hear that the ZFS driver is issuing a huge number of random write requests to the disks. As far as I can tell, the following iotop output shows over 3K write requests queued. But I am not doing anything and have not opened any program that writes to the system, so there should be no write requests going to the disk.
Additionally, iotop shows heavy disk utilization by the txg_sync kthread, again with no other processes accessing the disk at that time.
Expected behavior
ZFS should not be issuing these random write requests.
Additional info
I am currently using Docker (20.10.1) with the zfs storage driver, so Docker performs ZFS clones and writes. Those datasets seem to produce ZFS errors which can't be recovered by removing them. After running docker system prune -af, zdb shows:
Things I've tried already
Since this looks to me like a problem in ZFS's on-disk structures, I searched through the existing issues and some old Solaris documentation. I have already run a zpool scrub to verify data integrity; it completes successfully with no errors. Additionally, I ran the zdb -bcsvL -AAA command as I read in this article. However, I am not completely sure whether I have set the zfs:zfs_recover=1 flag correctly (I set it using /etc/system, as described in this post). The command also exits with a panic report, which is why I had to append the error correction flag -AAA. The output looks like this:
I have no other ideas how to fix the panics in the driver. The last time this happened to me on another vdev, I had to manually copy all the files from the corrupt zpool to a newly created one, and I would like to avoid doing that again.
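As a side note on the zfs_recover question: /etc/system is a Solaris mechanism, whereas on Linux the tunable is exposed as a kernel module parameter. A sketch of the usual way to set it (run as root):

```sh
# Set zfs_recover at runtime for the currently loaded module.
echo 1 > /sys/module/zfs/parameters/zfs_recover
cat /sys/module/zfs/parameters/zfs_recover   # verify the current value

# Or persistently, applied whenever the zfs module loads:
echo "options zfs zfs_recover=1" >> /etc/modprobe.d/zfs.conf
```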
Is there any way to force ZFS to check its AVL tree for errors and correct them by itself?
I would be glad for any hint on how to fix this corrupt vdev!