panic: VERIFY3(rc->rc_count == number) failed (FreeBSD 15-CURRENT, ZFS 2.1.14) #15802
Comments
On a different FreeBSD system running 2.2.3:
This may be the same bug. ZFS did not crash because assertions were not enabled. The OS noticed memory had not been freed. (The message came from the kernel, not kldunload. I was typing the command on the system console.) |
Old ticket, but came across something similar at openzfsonwindows#364 (comment) I wonder if |
I set
This is happening in a block of code in Looking at the remaining references in ARC_anon.arcs_size (it's always anonymous buffers that are leaked), I can see it's two 128KB buffers. (Unsurprisingly, I could see 256KB of "Anon" ARC memory lingering on the idle system in
I'm out of time to look at this for the moment, but perhaps this is enough for someone smarter than me to spot the problem? |
@markjdb thanks for this. Nothing is jumping out at me, but when I did a code read a couple of days ago I didn't go through that codepath, so it gives me something else to go on. I'll start looking later today. Is there a semi-reliable repro here? From the descriptions, it looks like inducing a bunch of ARC churn and then unloading the module is probably enough.
I messaged you out-of-band with some info. If anyone else has steps to reproduce this reliably, please feel free to add them. |
Right, I think I understand the dbuf refcount bug, and I'll send a patch through for that soon (tonight, I hope). I don't think it's related to the original issue; it's a straight refcounting bug (actually, we're not even miscounting, just holding the refcount API a little bit wrong). It might be related at a distance, but I think not, because the memory leak reported is real, and is from the ARC, not the dbuf cache.
#16191 should fix the dbuf refcounting thing. I doubt it'll do much for the original issue, but should at least let |
I tried modifying the refcount code to save a stack trace in the tracker structure at the point where the reference is acquired. If I dump the stacks, I see:
Both 128KB buffers have the same stack. Reading the relevant code, I see that commit 9b1677f introduced some new cases, and it was committed around the time that this bug was first noticed. Maybe @amotin has some idea? |
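For readers who want to picture the debugging approach described above (recording where each reference was acquired so that leftovers can be dumped later), here is a minimal sketch. It is not the OpenZFS refcount code: every `toy_` name is hypothetical, and a `__FILE__`/`__LINE__` pair stands in for the saved stack trace.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REFS 16	/* hypothetical fixed capacity for the sketch */

/* One outstanding reference and where it was acquired. */
struct toy_ref {
	const void *holder;	/* tag identifying who holds the reference */
	const char *file;	/* acquisition site; stands in for a stack trace */
	int line;
	int in_use;
};

struct toy_tracked_rc {
	long count;
	struct toy_ref refs[TOY_MAX_REFS];
};

/* Take a reference and remember the acquisition site. Returns 0 or -1. */
static int
toy_rc_add_at(struct toy_tracked_rc *rc, const void *holder,
    const char *file, int line)
{
	for (size_t i = 0; i < TOY_MAX_REFS; i++) {
		if (!rc->refs[i].in_use) {
			rc->refs[i] = (struct toy_ref){ holder, file, line, 1 };
			rc->count++;
			return (0);
		}
	}
	return (-1);	/* tracker full */
}
#define toy_rc_add(rc, holder) \
	toy_rc_add_at((rc), (holder), __FILE__, __LINE__)

/* Drop the reference held by 'holder'. Returns -1 if none is recorded. */
static int
toy_rc_remove(struct toy_tracked_rc *rc, const void *holder)
{
	for (size_t i = 0; i < TOY_MAX_REFS; i++) {
		if (rc->refs[i].in_use && rc->refs[i].holder == holder) {
			rc->refs[i].in_use = 0;
			rc->count--;
			return (0);
		}
	}
	return (-1);
}

/* Copy up to 'max' outstanding references into 'out'; return how many remain. */
static size_t
toy_rc_outstanding(const struct toy_tracked_rc *rc, struct toy_ref *out,
    size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_MAX_REFS; i++) {
		if (!rc->refs[i].in_use)
			continue;
		if (n < max)
			out[n] = rc->refs[i];
		n++;
	}
	return (n);
}
```

Dumping the result of `toy_rc_outstanding()` just before the counter is destroyed shows which acquisition sites never released their reference. That is the kind of evidence that showed both leaked buffers came from the same place.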
I don't have any quick ideas, but at some point Brian noticed FreeBSD CI failures, supposedly a kernel panic after the tests completed, possibly on module unload. With some hope, it may be this assertion. If we could reproduce it on a proper system with a console and find out what it is and which test triggers it, that could help.
I have access to such a system. To reproduce it, I run |
I hoped for something more specific/synthetic, like a CI test. I'll think about it. |
@amotin Suppose we're in |
Here's a reproducer program. Give it a path to a non-existent file on a ZFS filesystem. If run in a loop, I see the number of anon ARC pages grow quickly. Something like
@markjdb Thank you for the reproduction, it works. Addition of |
In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Fixes: openzfs#15665 Closes: openzfs#15802 Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc.
In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15665 Closes #15802 Closes #16216
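To make the commit message above concrete, here is a toy model of the leak, written as a sketch under stated assumptions rather than as the actual patch. The `toy_` names, the three-state enum, and the `free_on_error` switch are all hypothetical. `malloc()` stands in for the ARC buffer that `dbuf_noread()` is described as allocating, and a plain byte counter stands in for the anonymous ARC size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_BUFSIZE ((size_t)128 * 1024)

/* Hypothetical, simplified states; the real dbuf has more. */
enum toy_state { TOY_UNCACHED, TOY_FILL, TOY_CACHED };

struct toy_dbuf {
	enum toy_state state;
	void *buf;		/* stands in for the ARC buffer */
};

static size_t toy_anon_bytes;	/* stands in for the anonymous ARC size */

/* UNCACHED -> FILL: a buffer is allocated on the way in. */
static void
toy_will_fill(struct toy_dbuf *db)
{
	assert(db->state == TOY_UNCACHED);
	db->buf = malloc(TOY_BUFSIZE);
	assert(db->buf != NULL);
	toy_anon_bytes += TOY_BUFSIZE;
	db->state = TOY_FILL;
}

/*
 * FILL -> CACHED on success, FILL -> UNCACHED on failure.
 * With free_on_error == false this models the leak: the buffer stays
 * allocated and the next toy_will_fill() overwrites the only pointer to it.
 */
static void
toy_fill_done(struct toy_dbuf *db, bool failed, bool free_on_error)
{
	assert(db->state == TOY_FILL);
	if (!failed) {
		db->state = TOY_CACHED;
		return;
	}
	if (free_on_error) {
		free(db->buf);
		db->buf = NULL;
		toy_anon_bytes -= TOY_BUFSIZE;
	}
	db->state = TOY_UNCACHED;
}

/* Run n failed fills and report how many bytes are still accounted for. */
static size_t
toy_leaked_after_failed_fills(int n, bool free_on_error)
{
	struct toy_dbuf db = { TOY_UNCACHED, NULL };

	toy_anon_bytes = 0;
	for (int i = 0; i < n; i++) {
		toy_will_fill(&db);
		toy_fill_done(&db, true, free_on_error);
	}
	return (toy_anon_bytes);
}
```

In this model, two failed fills without the free leave two 128KB buffers accounted for, which has the same shape as the 256KB of lingering anonymous memory reported earlier in the thread. With the free in place, the counter returns to zero.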
@behlendorf Worth pointing out that the PR did fix the leak problem, but it does not fix the problem in the title of the ticket.
Thanks, I overlooked that. Reopening.
@lundman Are you talking about failmode=continue, or do you have some other reproduction? If that is it, then I would call it a different issue.
For me, I can trigger the assert But it can be triggered with default ZFS timeout and zstd19. My guess is that IO just takes "too long" - for whatever reason, zstd19/failing-device etc, and the code that is supposed to restart IO does so incorrectly. Just a guess. This ticket does say FreeBSD, but it is the same alert - I do not know if the FreeBSD bug had The memory leak issue was guessed to be related to this ticket, but probably should have been a separate ticket. |
This assertion IS about memory leaks. The fact that you see it in other scenarios just means there are other leaks, probably unrelated to the one I fixed.
Probably not, given the number of people upstream who have reported the problem.
It wasn't guessed. I hit the problem quite a few times and finally debugged it. Based on the timing of the upstream bug report and the commit which introduced the leak (within a couple of weeks of each other), the fact that the folks reporting the problem track the FreeBSD development branch (where the bug would have first appeared), and the fact that this apparently wasn't observed on Linux (which avoids the code path in question because it prefaults data to be written), I'm quite sure that the original report is about the problem that is now fixed. |
OK that sounds good. Let's go with that and close it. |
An issue originally reported for v2.1.14 got fixed in some version after 2.2.3? :) |
I don't believe 2.1.14 is actually affected, looking at the code. In particular, |
On Monday July 22 2024 06:45:06 Mark Johnston wrote:
I don't believe 2.1.14 is actually affected, looking at the code. In particular, `dmu_buf_fill_done()` is missing the third parameter, `failed`. The bug came in with commit 9b1677f.
I did notice that the function in question is a lot simpler...
In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15665 Closes openzfs#15802 Closes openzfs#16216 (cherry picked from commit 02c5aa9)
System information
Describe the problem you're observing
Sometimes the system panics during shutdown with an assertion failure.
Four examples from different systems:
Describe how to reproduce the problem
Unload ZFS and get unlucky. It doesn't always happen, but it happens to more
people than just me. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276341
Include any warning/errors/backtraces from the system logs
The call that failed is
I printed out the object being destroyed and nothing seemed wrong with it
except the count that should have been zero was not zero.
Note that
sx_lock=1
is the correct value for a lock of this type when it is destroyed. The links in the list point to the list object, as I expect for an empty list.
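As a rough illustration of what an assertion like the one in the title checks, here is a minimal sketch. It assumes the counter tracks bytes held and that teardown expects it to equal a caller-supplied number (zero here). The `toy_` names are hypothetical, and the check returns a boolean instead of panicking so that it can be exercised in user space.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference counter that counts bytes. */
struct toy_rc {
	long count;
};

static void
toy_rc_hold(struct toy_rc *rc, long n)
{
	rc->count += n;
}

static void
toy_rc_rele(struct toy_rc *rc, long n)
{
	assert(rc->count >= n);
	rc->count -= n;
}

/*
 * Models the destroy-time check: teardown is only valid if the counter
 * has dropped to the expected number. A kernel assertion would panic
 * where this returns false.
 */
static bool
toy_rc_can_destroy(const struct toy_rc *rc, long number)
{
	return (rc->count == number);
}
```

A single hold that is never released is enough to make the check fail at unload time. That fits the observation above that nothing else about the object looked wrong.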