Consider dnode_t allocations in dbuf cache size accounting #15511
Conversation
Entries in the dbuf cache contribute only the size of the dbuf data to the cache size. Attached "user" data is not counted. This can lead to the data currently "owned" by the cache consuming more memory than accounting appears to show. In some cases (eg a metadnode data block with all child dnode_t slots allocated), the actual size can be as much as 3x what the cache believes it to be.

This is arguably correct behaviour, as the cache is only tracking the size of the dbuf data, not even the overhead of the dbuf_t. On the other hand, in the above case of dnodes, evicting cached metadnode dbufs is the only current way to reclaim the dnode objects, and can lead to the situation where the dbuf cache appears to be comfortably within its target memory window and yet is holding enormous amounts of slab memory that cannot be reclaimed.

This commit adds a facility for a dbuf user to artificially inflate the apparent size of the dbuf for caching purposes. This at least allows for cache tuning to be adjusted to match something closer to the real memory overhead.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
metadnode dbufs carry a >1KiB allocation per dnode in their user data. This informs the dbuf cache machinery of that fact, allowing it to make better decisions when evicting dbufs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Been running with this patch for a few days. Nothing exploded. Guess I don't have a particularly bad local repro for what it fixes, but the changes make sense IMHO.
The lack of user buffer accounting always concerned me for exactly the reasons you mentioned. This change will cause us to keep fewer dnodes cached, but we can always bump up the maximum cache size if needed to account for that. Fully reflecting the actual memory size is definitely a good thing.
Looks good to me. Thanks. You could possibly also account for the dnode_children_t/dnode_handle_t allocations. They are not huge, but IIRC still ~64 bytes per potential dnode on FreeBSD.
I had been thinking about this accounting since I was working on MicroZAPs. Originally they could take up almost as much user data as is in the dbuf itself, though they should take less now. But in that case the process of reconstructing the user data is not cheap and can hurt performance, so dbuf caching is really helpful and practically required; otherwise it would make no sense to reconstruct the B-tree in user data for each access. I was actually wondering whether we could somehow delay eviction of dbufs with user data.
Entries in the dbuf cache contribute only the size of the dbuf data to the cache size. Attached "user" data is not counted. This can lead to the data currently "owned" by the cache consuming more memory than accounting appears to show. In some cases (eg a metadnode data block with all child dnode_t slots allocated), the actual size can be as much as 3x what the cache believes it to be.

This is arguably correct behaviour, as the cache is only tracking the size of the dbuf data, not even the overhead of the dbuf_t. On the other hand, in the above case of dnodes, evicting cached metadnode dbufs is the only current way to reclaim the dnode objects, and can lead to the situation where the dbuf cache appears to be comfortably within its target memory window and yet is holding enormous amounts of slab memory that cannot be reclaimed.

This commit adds a facility for a dbuf user to artificially inflate the apparent size of the dbuf for caching purposes. This at least allows for cache tuning to be adjusted to match something closer to the real memory overhead.

metadnode dbufs carry a >1KiB allocation per dnode in their user data. This informs the dbuf cache machinery of that fact, allowing it to make better decisions when evicting dbufs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#15511
Motivation and Context
A customer application used a bunch of memory, and got killed by the OOM-killer. This led to the obvious next question: who has all the memory? There are multiple parts to that answer; this PR is about one of them: the `dnode_t` kmem cache.

A graph of the total allocation of the `dnode_t` kmem_cache showed that under this workload, `dnode_t`s were basically never being reclaimed. This was in contrast to the `zfs_znode_cache`, which does get reclaimed from time to time.

The OOM happened at ~03:00, but there had been other "softer" memory pressure events earlier; these are the ones where we see the znode cache size dropping. The lifetimes of `dnode_t` and `znode_t` aren't entirely related, but the contrast in the graph was interesting. For the cache to not be reclaimed, those must be live allocations, so where are they?

Coming from cold, when `dnode_hold_*` is called, `dbuf_hold_*` is called for the data block from the metadnode (object 0) that covers that dnode. The dbuf is loaded, a `dnode_t` is allocated, and added to the `dnode_children_t` which is attached to the dbuf as "user data".

Additional `dnode_hold_*` calls for dnodes in that block are serviced using the existing dbuf. If there's already a `dnode_t` allocated for the dnode, its refcount is bumped and it's returned. If not, a new one is allocated and attached.

In this way, the dbuf would accumulate some number of attached `dnode_t` allocations. This is fine in theory, as when the last hold on the dbuf is released (that is, when the last dnode is released), the dbuf is freed, and the attached dnodes with it.
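To make that attachment concrete, here is a much-simplified model of the structures involved. This is not the actual OpenZFS code: the type and function names are illustrative stand-ins, refcounting and locking are omitted, and the 1224-byte figure is just the `sizeof (dnode_t)` quoted later in this description.

```c
/*
 * Much-simplified, hypothetical model of dnode_t allocations accumulating
 * as "user data" on a metadnode dbuf. Not the real OpenZFS structures.
 */
#include <stddef.h>
#include <stdlib.h>

struct dnode_model;				/* stand-in for dnode_t */

typedef struct dnode_handle_model {
	struct dnode_model *dnh_dnode;		/* NULL until first hold on this slot */
} dnode_handle_model_t;

/* Attached to the metadnode dbuf as "user data"; lives as long as the dbuf. */
typedef struct dnode_children_model {
	size_t			dnc_count;	/* dnode slots covered by this block */
	dnode_handle_model_t	dnc_children[];
} dnode_children_model_t;

/*
 * First hold on a dnode in this block allocates a dnode_t and parks it in
 * its slot; later holds reuse the existing allocation. Nothing here frees
 * it: the allocation stays attached until the dbuf itself is destroyed.
 */
struct dnode_model *
dnode_slot_hold(dnode_children_model_t *dnc, size_t slot)
{
	if (slot >= dnc->dnc_count)
		return (NULL);
	if (dnc->dnc_children[slot].dnh_dnode == NULL)
		dnc->dnc_children[slot].dnh_dnode =
		    calloc(1, 1224);		/* ~sizeof (dnode_t) per the PR */
	return (dnc->dnc_children[slot].dnh_dnode);
}
```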
This changed in d3c2ae1, when ARC compression was introduced. To remove the need to repeatedly decompress hot blocks, the dbuf cache was added to keep recently used dbufs around. Instead of being freed when the last hold is released, they are then sent to the cache.
This is where the problem comes from. The cache keeps a size count. When a dbuf is added to or removed from the cache, the size of the data buffer within is added to or removed from the total cache size. The evict thread uses this total size to decide when to evict something.
The problem is that for metadnode dbufs, there's a whole bunch of other memory allocated that isn't accounted for. For a regular 16K metadnode L0 with all dnode slots filled, that's around an extra ~39K (32 x 1224, `sizeof (dnode_t)`) on top of the 16K that is counted: the real footprint is roughly 3x what the cache believes it to be.
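As a quick, purely illustrative check of that arithmetic (actual struct and block sizes vary by platform and build):

```c
/* Quick check of the numbers above; the sizes are illustrative only. */
#include <stdio.h>

int
main(void)
{
	unsigned long counted = 16UL * 1024;	/* dbuf data: what the cache counts */
	unsigned long attached = 32UL * 1224;	/* 32 dnode_t's hanging off the dbuf */

	printf("counted %lu B, actually pinned %lu B (%.1fx)\n",
	    counted, counted + attached,
	    (double)(counted + attached) / counted);
	/* prints: counted 16384 B, actually pinned 55552 B (3.4x) */
	return (0);
}
```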
The dbuf cache calibrates its ideal size based on the ARC's ideal size, which is in part derived from the amount of free memory in the system. It will stop evicting if the total cache size is near the ideal size.
And this is the problem. The dbuf cache can believe that it's well within its limits, when actually it's wildly over. So unused `dnode_t` allocations are just sitting there, not being evicted, and not reclaimable when the system is under memory pressure.

This PR addresses the first part of this, by making sure the contents of the dbuf are accounted for.
Description
This PR adds a `dbu_size` field to `dmu_buf_user_t`, which is added to the dbuf size when adjusting the overall cache size. This makes the cache size much closer to the amount of memory that would be freed if the cached entry were evicted, and so makes the computed cache size far more realistic.

This is optional; dbuf users do not need to set it, and it will just be zero.
There's also a new `usize` field in the dbuf stats, making it easier to see what's happening.

For metadnode dbufs, the size then gets updated as `dnode_children_t` adds or removes `dnode_t`s.

Note that this PR is only addressing the apparent size of object 0 dnodes on the cache, allowing the evict thread to do a better job. I've more to do to assist with the OOM case, when we need to evict a lot in a hurry. But regardless, the start of freeing up memory is knowing where it is, so better accounting is necessary.
How Has This Been Tested?
A full test suite run mostly passed locally. There were a handful of failures but I don’t really trust my local test environment, so we’ll see what the CI says.
Here’s a test that shows the effect:
Running this on master a160c15, we see that after the files are created, there are a bunch of `dnode_t`s allocated, and a bunch of dbufs with holds, not on the cache. This is because there are a bunch of znodes (read: inodes) in the inode and dentry caches, each with a dnode hold.

Then we instruct the kernel to drop slab objects, which runs the inode reclaim ("superblock shrinker") and the kmem cache reclaims. The kernel evicts the znodes, returning them to the cache, and then reclaims the kmem caches, so the alloc count drops right off. Note that the dnode allocations did not change; the dbufs have no holds, so they have been put on the dbuf cache (`dbc=1`). Note that `cache_count` has gone up by 7, and `cache_count_bytes` by 7x16K: only the dbuf size is counted.

Now we force the dbuf cache max to 48K and give it a few seconds to evict things, before reclaiming the cache. Note that the two remaining dbufs on the cache have an apparent size of 32K, under the cache limit; however, we know (from the first output above) that `54/0/0/2` and `54/0/0/3` had 32 holds each at one point, so they have an additional 2x39K=78K attached at this point. There is no way for that to be reclaimed unless and until those dbufs are evicted.

Running again with this change in place, this is the same output as above, except for the new `usize` column on the dbuf stats.

After the inode drop and reclaim, we're in the same position, but `cache_size_bytes` is higher than it was before. The difference between the previous run and this one is 1212792-969216=243576, which is 199*1224: one `dnode_t` for each of the holds that have been released from the dbufs now in the cache.

And then lowering the cache limit again. This time, an additional dbuf is evicted.
Further discussion
I don't have a clear sense of how this changes the overall cache efficiency. It does of course make metadnode dbufs much heavier, and so means we'll be doing more evicting than before, but that's the point! We still evict least-recently-used first, and evictions don't go very far anyway (they're still in the ARC), so my sense is that it's probably not going to be a big deal, at least not for the normal steady background evict churn.
I haven’t plumbed through other dbuf userdata clients in the same way, simply because they weren’t obviously part of the problem. I could, but I’d want to check first that there isn’t anything that would change the size sufficiently that the weight actually is a problem. A big part of the problem with the dnodes is that their memory lifetimes are very entangled with znode (inode, dentry) lifetimes, and with the specialness of object 0. That seems like it might not be as much of an issue for other userdata types.
There's a case to be made that the dbuf cache should retain its original purpose, and only be used to avoid the cost of repeatedly decompressing a compressed ARC buffer. In that case, an alternate approach might be to evict the userdata before putting the dbuf on the cache. Reinflating the dnodes should have minimal overhead; it's unlikely that all or even most of the dnodes are needed, and allocations are coming from the kmem cache, so it's mostly some copying and housekeeping. On the other hand, we could say the whole point of the dbuf cache is to reduce the work of getting the needed data back, and so it should be carrying more, not less. In that case, perhaps we also want to include the size of the intermediate objects (`dnode_children_t`, `dmu_buf_user_t`, even `dmu_buf_impl_t` proper).

There are also arguments for removing the dbuf cache entirely and keeping the decompressed version alive for longer in the ARC. I have not pursued this, but you might say that that's a more worthwhile direction to explore.