segmentation faults / memory corruption using zfs git with init_on_alloc=0 init_on_free=0 #16689
Comments
can you try to run a debug build? (configure with would be great if you could try - if you see "Kernel panic - not syncing: buffer modified while frozen!" - then it's probably the same problem. FWIW it seems to be related to how ZFS works on 6.10 and newer kernels; older ones don't hit it. This bug is also already present in the OpenZFS 2.2 stable release. |
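The panic message quoted in the comment above can be searched for in a saved kernel log. A minimal sketch in Python, assuming the log has already been captured as text; the function name `find_panic_lines` is hypothetical and only the marker string comes from the thread:

```python
# Marker string quoted verbatim from the comment above.
PANIC_MARKER = "Kernel panic - not syncing: buffer modified while frozen!"


def find_panic_lines(log_text: str, marker: str = PANIC_MARKER) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for every log line containing the marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if marker in line
    ]
```

An empty result only means the marker is absent from that particular log; it does not rule out the bug, since the message is only expected from a debug build.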
when you say 2.2.6 is fine, are you sure that is also with 6.11 kernel series? |
If it ends up being the same issue, it's also worth noting that we've tried disabling block cloning, direct IO and tried to run only with |
@snajpa hi, thanks for your reply. I didn't have time yet to bisect this on another machine and I'm a little scared to test on my work-notebook at the moment before doing backups...
block cloning was not enabled and is unrelated I think. I think it happened before directio hit master. 2.3.0-rc2 also runs fine here. So I'm still not 100% sure if this is a zfs issue or some general issue. Lots of words to say: No hard data yet, I need to reproduce it and bisect the commits. I have another machine for that but it can take a few days :/ |
I think it's somewhere in the impedance mismatch between new folios APIs and current ZFS code, I need to dig into it way deeper. It's the VFS which is now allocating pages, it seems to also be freeing them on other occasions than just migration, dunno. When I sugar the code with printks it won't reproduce :D so I'm stuck with going through crashdumps... originally I thought this has to be a bug with DMU, but now I think it'll be about a buf loaned to arc which gets freed by the kernel, or something on that note... tricky |
okay some progress in pinning it down - using arch and zfs git ( it does happen on my dell notebook but only with So using |
how big is the memory capacity + usage difference between the two machines? it only happens on memory reclaim is why I'm asking |
notebook has 32gb memory, the elitedesk 48gb memory, both run some incus containers, docker but no hard 100% usage. I'm closing this as false-positive - I don't have the capacity to debug this in detail at the moment and it looks like it was specific to that machine and never happened elsewhere. |
here we go (with debug enabled I think)
|
it happens with:
I can reliably trigger it by doing a
|
it's unrelated to zfs_abd_scatter_enabled and happens with zfs-2.3.99-90_g38c0324c0f |
here is the one with abd_scatter=1
hope this is somehow useful. |
in that case that's probably already fixed in master, can you verify? |
I'm running current master (git as of today) zfs-2.3.99-90_g38c0324c0f all the backtraces above are also using that zfs version. |
can you please share the |
while we're at it, can you please retry also with most current 6.11 kernel, 6.11.10? I've been hitting some weird stuff with older than 6.11.6... |
docker-compose: I've managed to trigger it using this docker-compose - the software is from the company so I can't share it, but it looks like mysql is causing this. Without the environment variables mysql bails out and everything is fine; with the environment supplied it starts up and then the segfaults happen. For completeness I've kept the other containers:
running kernel is 6.11.9-zen1-1-zen - I can try testing with a newer kernel later. It's hitting again; some more data for debugging:
|
awesome, thank you! one more question, how much memory does the machine where you're running the compose have? |
machine has 32gb of memory. It's a kde plasma desktop but it happens reliably here on starting mysql regardless of usage. I've seen anything from everything segfaulting (basically the desktop dying after this) to some things still working but failing quickly. Logs are also full of processes crashing and spewing backtraces. |
on the other machine (elitedesk) mysql also fails but no segfaults:
disk has 1.4tb free space and 48gb memory. sorry for the initial confusion. it must be somehow related to mysql doing something that zfs doesn't like. |
OK, reproduced, thank you! will ping you when I have a patch to test |
got sidetracked by some interesting networking issues, will be back at this in a few days |
no worries please. thanks for looking into it. I'm wondering if it's related to direct io and this might be an issue that could hit 2.3.0 release? |
AFAIK you're right, my bet is also on DIO, but it might also be the GPL-only change of zero page or a change related to |
@mtippmann could you please try #16812? |
looks good using zfs-kmod-2.3.99-94_g0ffa6f3464 (with patch) and 6.12.1-arch1-1 mysql starts up without complaints. have to test lts and -zen kernel but so far the problem did not appear. |
6.6.63-1-lts and 6.12.1.zen1-1 are also fine with init_on_alloc=0 thank you so much! |
The intent here is to replace the zero page pointer in the array of pointers to pages in the struct.
Reviewed-by: Alexander Motin
Reviewed-by: Brian Behlendorf
Signed-off-by: Pavel Snajdr
Closes openzfs#16812
Closes openzfs#16689
Closes openzfs#16642
System information
Describe the problem you're observing
I'm seeing segmentation faults when using zfs git (zfs 2.2.6 is fine) with init_on_alloc=0 init_on_free=0 in cmdline - nothing in dmesg - I can trigger that using a docker compose up with a few containers (rails, mysql) - after that the system crashes and most commands fail. Shortly after it first appears the whole system is crashing, including plasmashell and so on.
It's a system I need for work, so I went back to 2.2.6 where everything is fine and stable. Not using init_on_alloc=0 init_on_free=0 might help but I'm not 100% sure here. I'm not using zvols.
System passes a bios memory test just fine. Dell Latitude E5470 / i7-6820HQ
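Since the report hinges on whether init_on_alloc=0 init_on_free=0 is on the kernel command line, it can help to confirm what the running kernel was actually booted with. A minimal sketch in Python, assuming a Linux system where /proc/cmdline is readable; the helper names `parse_cmdline` and `init_hardening_state` are hypothetical, and the whitespace split is naive (it does not handle quoted values containing spaces):

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict[str, str]:
    """Split a kernel command line into key -> value (bare flags map to '')."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so values such as root=ZFS=pool/ds survive.
        key, _, value = token.partition("=")
        params[key] = value
    return params


def init_hardening_state(cmdline: str) -> dict[str, str]:
    """Return the explicit init_on_alloc / init_on_free settings, or 'unset'."""
    params = parse_cmdline(cmdline)
    return {name: params.get(name, "unset") for name in ("init_on_alloc", "init_on_free")}


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running Linux kernel was booted with.
    proc = Path("/proc/cmdline")
    if proc.exists():
        print(init_hardening_state(proc.read_text()))
```

A value of "unset" means the parameter was not passed explicitly, in which case the kernel's build-time default applies; the script cannot tell what that default is.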
Describe how to reproduce the problem
Good question. Maybe it reproduces using the kmod options listed here and the cmdline - for me it's triggered by a docker compose up so it could be related to overlayfs. At least that's when I was noticing it.
I assume it's a problem related to my kmod config settings or the cmdline settings, otherwise it would have already been found. Noticed a similar behavior a few weeks ago and tried pinning it down but failed. So I thought I'd put that here.
Include any warning/errors/backtraces from the system logs
there is nothing in dmesg. Below some random journalctl logfile entries about crashes (it all looks pretty random)