ZFS Kernel Panic at boot on Arch Linux #11480
Comments
Since I didn't want to waste my disks' write cycles, I decided to create a backup and manually copy the data from one ZFS pool to another. Edit: clarified ZFS versions.
I don't usually like posts like this, but "me too": https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tc8e82b0bb2e8bc0f/3-days-of-constant-txgsync-100-blocking-io
There are more examples of this above, but see the zfs-discuss thread for more details of the issues I'm seeing, including constant random writes by ZFS (happening with no other services running, and in fact happening whilst no datasets are even mounted).
Hi @TheDome! Thanks for taking the time to file the issue. I've got a couple of questions for you. The livelist code used to have two bugs: one related to device removal (which only existed while the PR was open; the fix was merged together with the initial upstream commit), and one related to clone promotions (#10652). The former is, I think, irrelevant to you, as the config printed by zdb shows you don't have any removed/indirect vdevs. The latter also seems unlikely, looking at the stack trace that you've posted. cc: @ahrens
Hi @sdimitro! Thanks for taking the time to resolve this issue. I upgraded to ZFS v2.0.1. Additionally, I'd like to point out that I am using swap on a ZFS vdev, but swap is never used on my system, which has enough RAM to avoid it. Also, I am not experiencing a deadlock in ZFS resulting in a complete kernel crash, but several
I am also observing identical symptoms on a ZFS pool with 3x NVMe SSDs (all three as single vdevs). The pool is used for fast temporary storage of large builds, with clones and occasional promotions of clones. The version is "zfs-2.0.1-0york0~18.04" on Ubuntu 18.04.2 (via https://launchpad.net/~jonathonf/+archive/ubuntu/zfs) with kernel 5.3.0-59-generic. I have fully imaged the pool drives for archival on another system and can replicate the issue every time in a virtual machine with the same distro/debs + kernel 5.4.0-64-generic. The panic happens on pool import, where there is an instant dmesg error, attached below. At first I thought it was a normal part of operations with clones/dataset deletions, but it had not stopped after 20 hours of continuous writes, at which point I force-stopped the whole thing because I noticed these were real writes using up my SSDs' NAND for no reason. The pool disks were imaged to files on another machine and snapshotted in separate datasets. Then I started a VM with a clean install of 18.04.2 + the 2.0.1 ZFS backport and the image files mounted as virtual drives. The random IO workload does seem to change things on disk, as I am seeing steadily increasing space usage compared to the original dataset snapshots. I will try to test a master branch build and some 2.0.0 RCs, but I can probably only go back so far because of the new pool features in 2.0.0.
Tested with master branch builds and a few RCs back: same results. It seems the problem is embedded in the pool itself rather than the code, but I have no idea how to replicate the corruption from a good pool. In the meantime, the pool's replicated datasets (sent with zfs send to a clean pool) show no obvious signs of corruption.
Same here: #11603
I have seen the same behaviour since ZFS 2.0.0 or 2.0.1 (I can't remember exactly, sorry) with the latest Arch Linux kernel, on an upgraded pool with a Docker dataset; dedup was active on some datasets and is currently still active on one. On my pool this generates constant writes of ~200-300 MB/s (thanks to this being an SSD-only pool 😬). While a scrub is running the writes disappear; the scrub finds no errors; after scrubbing the writes immediately reappear. Sometimes there are a few seconds without the massive writes; I suspect that's when a "real"/larger synchronous write happens or the dirty data is flushed to disk. Things I've tried that did not help:
Running
I have a hunch that something in how Docker handles cloning/snapshotting/etc., together with enabled dedup, may be the culprit here, but I have no strong proof other than the fact that "enabling dedup", "upgrading to 2.0.0"/"upgrading the pool" and "massive writes occurring" all happened around the same time, which could also be a coincidence. So my questions are now:
Given the above, does it make sense to skip enabling the livelist feature on zpool upgrades for now, so that those who haven't upgraded yet won't run into the issue? Also, since livelists seem to be an optimization: could a pool property disable their use when they already exist (to work around broken data structures like the one here)?
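For anyone wondering whether their own pool already uses the feature, it can be checked with a one-liner; the pool name below is a placeholder:

```sh
# Show the state of the livelist pool feature on a pool named "tank":
# "disabled" = never enabled, "enabled" = turned on but not in use,
# "active"   = livelists currently exist on disk.
zpool get feature@livelist tank
```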
I have the same issue. I migrated to a new pool yesterday.
A very stupid workaround, without checking for all the potential side effects (that is, it might eat all your data; therefore deliberately no patch or commit): in module/zfs/dsl_deadlist.c, wrap the node insertion so that a duplicate entry is skipped instead of tripping the assertion. That gets around the panic and seems to avoid the sustained txg_sync write traffic that drowns out everything else. It merely means that a given node isn't added twice: if ZFS tries to do that, the on-disk data is probably already bad (after all, there's an unexpected duplicate addition of a node), but there's no need to add massive amounts of unknown write traffic on top of that. (After running this patch for a week, including a scrub: the dedup ratio is 11258999068423.99x, which seems... improbable, but otherwise the pool, a collection of mirrors providing 23 TB net space, seems healthy.)
@mherkazandjian anyhow, I really tried to avoid migrating ~35 TB to a new pool, and for a moment I had a glimpse that it would be possible to downgrade back to 0.8 (not the pool feature set, but RW access with the 0.8 module). zstd and also livelist are features which can flip back to "enabled", so far so good (I was thinking)... so I just had to replicate all filesystems where zstd was used, without the -c option. An easy task; unfortunately the pool had already dropped into that stage of killing threads from the deferred thread pool. After destroying the old filesystems, all that stuff was just stuck there in the "freeing" stage without any progress (a few hundred GB). At this moment it was not yet trashing the disks with the never-ending IO, and I remembered some recent issues (and solutions) here about unmount -> mount back. And because the freeing stuff had already been stuck for more than 24h, I switched into single-user runlevel, unmounted, exported, rebooted... long story short, I hacked the avl_find() assertion (as @pgeorgi did) and rebooted with the patched module.
Again checkpoint, then zdb (walking blocks, metadata, leak detection): 0 issues, no extra activity, dbgmsg nice and clean, pool back at full workload (and I didn't even remove the clones; still more than 300 alive). It still didn't feel right somehow, so, full of expectations (because livelist and zstd were back to "enabled"), I rebooted into 0.8, only to realize that the log_spacemap feature would stop me for good. So for now I still run 2.x. Anyhow, what is most interesting to me about this story is the speed at which the issues progressed. I was waiting for some minor updates on 2.x, updated to 2.0.3 on the 4th of March, and didn't upgrade the pool for about a week, so we are at around the 12th of March; then came the fast downfall described a few lines back. All this being said: no leaks left behind, scrub shows no checksum errors, and I even took some backups out of the vault and did a diff on random samples, also fully identical, at least for the cold archive data. It's harder to say for the hot stuff being changed hour after hour, but that is DBs and other things with complex structures and their own checksum/integrity checks, so it would have popped up by now. That means it's still the same ZFS and its data: at peace, healthy, solid and sound. btw: I have already seen stupid discussions around this issue, comparing it to early btrfs etc. Here I would just quote Luke 23:34: "Father, forgive them, for they do not know what they are doing." mk
After reading @mk01's comment, I noticed that my pool also had a few GBs stuck in the freeing state. I basically just applied @pgeorgi's patch, rebooted, and opened terminals with some monitoring. Only one weirdness remained:
Then I reverted the patch, rebooted, and now my pool behaves normally again! Thank you both very much! ❤️ For good measure I also ran a scrub. So it would probably be a good idea to disable the livelist feature if you ever want to build Docker containers on your pool :D
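For reference, the kind of waiting/monitoring described above can be done with standard tools; a small sketch, with the pool name "tank" as a placeholder:

```sh
# In one terminal: per-vdev write traffic, refreshed every 5 seconds.
zpool iostat -v tank 5

# In another terminal: space still queued for freeing (should shrink to 0
# once the deferred destroys have been processed).
watch -n 5 'zpool get -H -o value freeing tank'
```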
Sigh... me too. Will there be an official solution/fix for this? I'm a bit afraid of hacking around with my data.
I hit this kernel panic without running "zfs destroy". I think it's due to one of my cloned zvols used for iSCSI.
We would like to get to the bottom of this. Does someone have a pool that's in this state and is able to work with us to reproduce and diagnose it? We will probably start by running zdb -y on the damaged pool.
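For anyone wanting to run the same check themselves, a minimal sketch; the pool name is a placeholder, and on an affected pool this is expected to hit the same assertion and dump core rather than complete cleanly:

```sh
# Livelist verification with zdb on a pool named "tank".
# On a damaged pool this walks the clone livelists and will likely trip
# the same assertion, leaving a core file that is useful for debugging.
sudo zdb -y tank
```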
I will try to reproduce it in a VM in the next few days (my affected system has already been migrated to a new pool).
Yes, still constant heavy IO, for months now.
Created a test VM but still no success in corrupting it. Still trying...
@sblive could you get the stack trace for where we're hitting that assertion? You should be able to load the core file from zdb into gdb.
I'm (also) a software dev, but I've never really used gdb much. Can you provide the commands?
@sblive - once you have the core file from the zdb crash, load it into gdb along with the zdb binary. Then once you're in gdb you can run the backtrace command.
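For reference, a sketch of that workflow; the binary and core-file paths below are placeholders:

```sh
# Load the zdb binary together with its core dump.
gdb /usr/sbin/zdb /var/crash/zdb.core

# Then, at the (gdb) prompt:
#   bt                    # backtrace of the crashing thread
#   thread apply all bt   # backtraces of every thread, if needed
```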
I have also tried corrupting a pool again using
@TheDome here are some questions out of curiosity:
@stooone Could you detail how you accomplished that? I still have no luck reproducing this error, even though I used the VM with the same workload I used when corrupting two pools in one day.
@yottabit42 and @stooone we would definitely like to work with you both on debugging this. Would you be interested in joining the OpenZFS Slack so that we can communicate in real time? If so, send me your email address (to [email protected]) or join using this temporary link.
I was just playing with Docker and then I mass-deleted some containers with: docker rm $(docker ps -a -q) I think that was what triggered something for me (not 100% sure, because I was playing with Docker without intending to reproduce the bug, after having given up on that ;)).
Sounds great! Then I need to try more. But I can confirm that this is likely the trigger, since that is exactly what I did to destroy my ZFS. :)
@ahrens, I have joined the Slack. Which channel should I join to discuss further? I am preparing to zfs send all of my datasets and zvols to a backup provider, which will take ~5-6 days. After that, I am prepared to let you poke around on the zpool, even if it's destructive.
@stooone I just added you to the OpenZFS Slack; you should have gotten an email. Feel free to DM me personally there once you join. Some context on the issue for those who want to try to reproduce it:
That would explain why I first encountered this right after building Docker containers with a lot of build steps 🤔
Hi folks! I'd like to provide a critical update on this issue. @ahrens and I believe that we have root-caused the issue, and I'm currently validating our hypothesis. We believe the problem to be a bug in the interaction between the livelist code and dedup. Regardless of whether you have dedup currently enabled, the panic from this bug can still show up later if you had dedup enabled in the past while working on a ZFS clone. Assuming my validation verifies our hypothesis, the fix for the bug should be simple and the recovery of your pools should be straightforward. There won't be any need to destroy or run any other administrative commands on your datasets, nor will there be any on-disk format changes/updates to your pools. You'll only need to update your software (the ZFS module that contains the fix) and everything should work like nothing happened. I'll keep this thread posted with my verification progress and the subsequent PRs that fix the issue. Special shoutouts go to @yottabit42 and @stooone for engaging with us; your pool data and debugging output were crucial to this effort.
Update: I was able to verify our hypothesis and reproduce the issue on my machine. Here is a reproducer, together with some comments for those who are curious:
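A rough sketch of that sequence follows; it is only an illustration of the steps described here and in the fix's commit message (dedup-enabled clone, duplicate data freed inside the clone, then a clone destroy), not the exact ZTS script. Pool, device and dataset names and sizes are placeholders, and on an unpatched 2.0.x module it may panic the host, so run it only in a throwaway VM.

```sh
#!/bin/sh
# Sketch of the described reproduction sequence; all names are placeholders.

zpool create -O dedup=on tank /dev/vdb       # dedup inherited by all datasets

# Origin filesystem with some data, so the clone stays mostly shared with
# its origin (livelists are only kept while most blocks remain shared).
zfs create tank/fs
dd if=/dev/urandom of=/tank/fs/base bs=1M count=256
zfs snapshot tank/fs@snap
zfs clone tank/fs@snap tank/clone            # clone tracks its own blocks in a livelist

# Inside the clone, write new data and duplicate it so the second copy
# dedups against the first (same block pointers, DDT refcount of 2).
dd if=/dev/urandom of=/tank/clone/a bs=1M count=8
cp /tank/clone/a /tank/clone/b
sync

# Remove both copies: with dedup this can record consecutive FREE entries
# for the same block pointer in the clone's livelist, with no ALLOC between.
rm /tank/clone/a /tank/clone/b
sync

# Destroying the clone (the last command) processes the livelist and trips
# the assertion on unpatched modules.
zfs destroy tank/clone
```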
Running the above shell script panics the system while the last command is executed. The stack trace of this panic matches what others have posted in this issue:
With this verification of our root cause out of the way, I'll be posting a review soon that fixes the issue. A variant of the above reproducer will also be added to our test suite.
= Problem
When the livelist logic was designed, it didn't take into account that when dedup is enabled the sublivelists can have consecutive FREE entries for the same block without an ALLOC entry for it in between them. This caused panics in systems that were deleting/condensing clones with dedup enabled.
= This patch
Update the logic to handle the dedup case of consecutive FREEs in the livelist code. The logic still ensures that all the FREE entries are matched up with a respective ALLOC, by keeping a refcount for each FREE blkptr that we encounter and ensuring that this refcount gets to zero by the time we are done processing the livelist.
= Testing
After I reproduced the issue with a shell script, I added a variant of that shell script to ZTS. After ensuring that this new test panics the system the same way as the original reproducer, I tried it against the updated logic in this patch and verified that the system no longer panics.
= Side Fixes
* zdb -y no longer panics when encountering double frees
Reviewed-by: Serapheim Dimitropoulos <[email protected]> Closes openzfs#11480
Update the logic to handle the dedup-case of consecutive FREEs in the livelist code. The logic still ensures that all the FREE entries are matched up with a respective ALLOC by keeping a refcount for each FREE blkptr that we encounter and ensuring that this refcount gets to zero by the time we are done processing the livelist. zdb -y no longer panics when encountering double frees Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes #11480 Closes #12177
Is there a release or branch I can build from that fixes this, or a patch to apply to 2.0.5? I previously had dedup enabled and am running Docker. I built and installed the 2.0.5 release but the issue hasn't gone away. I'm very eager to get this system to cool off: txg_sync is pegged, I'm still seeing these: "INFO: task z_livelist_dest:1896 blocked...", and I still have data stuck in freeing.
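A few generic checks that may help confirm what is actually running and whether the symptoms persist; the pool name is a placeholder:

```sh
# Confirm which ZFS version the kernel is actually running; a freshly built
# package does not help if the old module is still loaded.
zpool version
cat /sys/module/zfs/version

# The symptoms mentioned above: hung livelist-destroy tasks and data
# still queued for freeing on a pool named "tank".
dmesg | grep -i z_livelist
zpool get freeing tank
```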
I am still having the same problem with 2.1.0, even though the patch is supposed to be merged there.
I am having a probably related problem with 2.1.2 (FreeBSD stable/13 as of January 2022). I was using a tool (poudriere) that does a lot of snapshotting, on a pool with dedup=verify. Now zfs destroy tends to cause a crash in livelist_compare. The problem persists across reboots.
I have checked that the patch meant to fix this problem is present in my kernel sources. Should this assertion be removed from dsl_deadlist.c?
I found https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261538 which points to this issue. I have the same crash on FreeBSD 15 (built from source on the 23rd of December 2023), so I think it's still an issue.
@opsec, @VoxSciurorum, #15732 should fix it. |
Hi there,
System info
I am using the latest stable ZFS DKMS release on Artix Linux with s6-init:
zfs-2.0.1-1
Current behavior
Currently, when I boot the system, the virtual console shows a log message regarding the ZFS AVL tree.
Additionally, the kernel kills a ZFS thread every now and then, resulting in the following stack trace:
Since I am using ZFS on a mirrored rotational vdev, I can hear that the ZFS driver is issuing a huge number of random write requests to the disks. As far as I can tell, the following iotop output shows over 3K write requests queued. But I am not doing anything and have not opened any program that writes to the system, so there should be no write requests going to the disk.
Additionally, iotop shows heavy disk utilization by the txg_sync kthread, again with no other processes accessing the disk at that time.
Expected behavior
ZFS should not be issuing these random write requests.
Additional info
I am currently using Docker (20.10.1) with the zfs storage driver, so Docker performs ZFS clones and writes. Those datasets seem to produce ZFS errors which can't be recovered by removing them. After running docker system prune -af, zdb shows:
Things I've tried already
Since this looks to me like a problem in ZFS's on-disk structures, I searched through the existing issues and some old Solaris documentation. I have already run a zpool scrub to verify data integrity; it completes successfully with no errors. Additionally, I ran the zdb -bcsvL -AAA command as I read in this article. However, I am not completely sure whether I have set the zfs:zfs_recover=1 flag correctly (I set it using /etc/system, as described in this post). The command also exits with a panic report, which is why I had to append the error correction flag -AAA. The output looks like this:
I have no other ideas how to fix the panics in the driver. The last time this happened to me on another vdev, I had to manually copy all the files from the corrupt zpool to a newly created one, and I would like to avoid doing that again.
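As a side note on the zfs_recover question: /etc/system is a Solaris mechanism, whereas on Linux the tunable is exposed as a kernel module parameter. A sketch of the usual way to set it (run as root):

```sh
# Set zfs_recover at runtime for the currently loaded module.
echo 1 > /sys/module/zfs/parameters/zfs_recover
cat /sys/module/zfs/parameters/zfs_recover   # verify the current value

# Or persistently, applied whenever the zfs module loads:
echo "options zfs zfs_recover=1" >> /etc/modprobe.d/zfs.conf
```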
Is there any way to force ZFS to check its AVL tree for errors and correct them by itself?
I would be glad for any hint on how to fix this corrupt vdev!