Consider dnode_t allocations in dbuf cache size accounting #15511
Conversation
Entries in the dbuf cache contribute only the size of the dbuf data to the cache size. Attached "user" data is not counted. This can lead to the data currently "owned" by the cache consuming more memory than accounting appears to show. In some cases (eg a metadnode data block with all child dnode_t slots allocated), the actual size can be as much as 3x what the cache believes it to be.

This is arguably correct behaviour, as the cache is only tracking the size of the dbuf data, not even the overhead of the dbuf_t. On the other hand, in the above case of dnodes, evicting cached metadnode dbufs is the only current way to reclaim the dnode objects, and can lead to the situation where the dbuf cache appears to be comfortably within its target memory window and yet is holding enormous amounts of slab memory that cannot be reclaimed.

This commit adds a facility for a dbuf user to artificially inflate the apparent size of the dbuf for caching purposes. This at least allows for cache tuning to be adjusted to match something closer to the real memory overhead.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
metadnode dbufs carry a >1KiB allocation per dnode in their user data. This informs the dbuf cache machinery of that fact, allowing it to make better decisions when evicting dbufs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Been running with this patch for a few days. Nothing exploded. Guess I don't have a particularly bad local repro for what it fixes, but the changes make sense IMHO.
The lack of user buffer accounting always concerned me for exactly the reasons you mentioned. This change will cause us to keep fewer dnodes cached, but we can always bump up the maximum cache size if needed to account for that. Fully reflecting the actual memory size is definitely a good thing.
Looks good to me. Thanks. You could possibly also account for the dnode_children_t/dnode_handle_t allocations. They are not huge, but IIRC still ~64 bytes per potential dnode on FreeBSD.
I had been thinking about this accounting since I was working on MicroZAPs. Originally they could take up almost as much user data as is in the dbuf itself, though they should take less now. But in that case the process of reconstructing the user data is not cheap and can hurt performance, so dbuf caching is really helpful and practically required; otherwise it would make no sense to reconstruct the B-tree in user data for each access. I was actually wondering whether we could somehow delay eviction of dbufs with user data.
Entries in the dbuf cache contribute only the size of the dbuf data to the cache size. Attached "user" data is not counted. This can lead to the data currently "owned" by the cache consuming more memory than accounting appears to show. In some cases (eg a metadnode data block with all child dnode_t slots allocated), the actual size can be as much as 3x what the cache believes it to be.

This is arguably correct behaviour, as the cache is only tracking the size of the dbuf data, not even the overhead of the dbuf_t. On the other hand, in the above case of dnodes, evicting cached metadnode dbufs is the only current way to reclaim the dnode objects, and can lead to the situation where the dbuf cache appears to be comfortably within its target memory window and yet is holding enormous amounts of slab memory that cannot be reclaimed.

This commit adds a facility for a dbuf user to artificially inflate the apparent size of the dbuf for caching purposes. This at least allows for cache tuning to be adjusted to match something closer to the real memory overhead.

metadnode dbufs carry a >1KiB allocation per dnode in their user data. This informs the dbuf cache machinery of that fact, allowing it to make better decisions when evicting dbufs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#15511
Motivation and Context
A customer application used a bunch of memory, and got killed by the OOM-killer. This led to the obvious next question: who has all the memory? There are multiple parts to that answer; this PR is about one of them: the `dnode_t` kmem cache.

A graph of the total allocation of the `dnode_t` kmem_cache showed that under this workload, `dnode_t`s were basically never being reclaimed. This was in contrast to the `zfs_znode_cache`, which does get reclaimed from time to time.

The OOM happened at ~03:00, but there had been other "softer" memory pressure events earlier; these are the ones where we see the znode cache size dropping. The lifetimes of `dnode_t` and `znode_t` aren't entirely related, but the contrast in the graph was interesting. For the cache to not be reclaimed, those must be live allocations, so where are they?

Coming from cold, when `dnode_hold_*` is called, `dbuf_hold_*` is called for the data block from the metadnode (object 0) that covers that dnode. The dbuf is loaded, a `dnode_t` is allocated, and added to the `dnode_children_t` which is attached to the dbuf as "user data".

Additional `dnode_hold_*` calls for dnodes in that block are serviced using the existing dbuf. If there's already a `dnode_t` allocated for the dnode, its refcount is bumped and it's returned. If not, a new one is allocated and attached.

In this way, the dbuf would accumulate some number of attached `dnode_t` allocations. This is fine in theory, as when the last hold on the dbuf is released (that is, when the last dnode is released), the dbuf is freed, and the attached dnodes with it.
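To make that attachment concrete, here is a much-simplified model of the structures involved. This is not the actual OpenZFS code: the type and function names are illustrative stand-ins, refcounting and locking are omitted, and the 1224-byte figure is just the `sizeof (dnode_t)` quoted later in this description.

```c
/*
 * Much-simplified, hypothetical model of dnode_t allocations accumulating
 * as "user data" on a metadnode dbuf. Not the real OpenZFS structures.
 */
#include <stddef.h>
#include <stdlib.h>

struct dnode_model;				/* stand-in for dnode_t */

typedef struct dnode_handle_model {
	struct dnode_model *dnh_dnode;		/* NULL until first hold on this slot */
} dnode_handle_model_t;

/* Attached to the metadnode dbuf as "user data"; lives as long as the dbuf. */
typedef struct dnode_children_model {
	size_t			dnc_count;	/* dnode slots covered by this block */
	dnode_handle_model_t	dnc_children[];
} dnode_children_model_t;

/*
 * First hold on a dnode in this block allocates a dnode_t and parks it in
 * its slot; later holds reuse the existing allocation. Nothing here frees
 * it: the allocation stays attached until the dbuf itself is destroyed.
 */
struct dnode_model *
dnode_slot_hold(dnode_children_model_t *dnc, size_t slot)
{
	if (slot >= dnc->dnc_count)
		return (NULL);
	if (dnc->dnc_children[slot].dnh_dnode == NULL)
		dnc->dnc_children[slot].dnh_dnode =
		    calloc(1, 1224);		/* ~sizeof (dnode_t) per the PR */
	return (dnc->dnc_children[slot].dnh_dnode);
}
```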
This changed in d3c2ae1, when ARC compression was introduced. To remove the need to repeatedly decompress hot blocks, the dbuf cache was added to keep recently used dbufs around. Instead of being freed when the last hold is released, they are then sent to the cache.
This is where the problem comes from. The cache keeps a size count. When a dbuf is added to or removed from the cache, the size of the data buffer within is added to or removed from the total cache size. The evict thread uses this total size to decide when to evict something.
The problem is that for metadnode dbufs, there's a whole bunch of other memory allocated that isn't accounted for. For a regular 16K metadnode L0 with all dnode slots filled, that's around an extra ~39K (32 x 1224, `sizeof (dnode_t)`) on top of the 16K that is counted: the real footprint is roughly 3x what the cache believes it to be.
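As a quick, purely illustrative check of that arithmetic (actual struct and block sizes vary by platform and build):

```c
/* Quick check of the numbers above; the sizes are illustrative only. */
#include <stdio.h>

int
main(void)
{
	unsigned long counted = 16UL * 1024;	/* dbuf data: what the cache counts */
	unsigned long attached = 32UL * 1224;	/* 32 dnode_t's hanging off the dbuf */

	printf("counted %lu B, actually pinned %lu B (%.1fx)\n",
	    counted, counted + attached,
	    (double)(counted + attached) / counted);
	/* prints: counted 16384 B, actually pinned 55552 B (3.4x) */
	return (0);
}
```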
The dbuf cache calibrates its ideal size based on the ARC's ideal size, which is in part derived from the amount of free memory in the system. It will stop evicting if the total cache size is near the ideal size.
And this is the problem. The dbuf cache can believe that it's well within its limits, when actually it's wildly over. So unused `dnode_t` allocations are just sitting there, not being evicted, and not reclaimable when the system is under memory pressure.

This PR addresses the first part of this, by making sure the contents of the dbuf are accounted for.
Description
This PR adds a `dbu_size` field to `dmu_buf_user_t`, which is added to the dbuf size when adjusting the overall cache size. This makes the cache size much closer to the amount of memory that would be freed if the cached entry were evicted, and so makes the computed cache size far more realistic.

This is optional; dbuf users do not need to set it, and it will just be zero.
There's also a new `usize` field in the dbuf stats, making it easier to see what's happening.

For metadnode dbufs, the size then gets updated as `dnode_children_t` adds or removes `dnode_t`s.

Note that this PR is only addressing the apparent size of object 0 dnodes on the cache, allowing the evict thread to do a better job. I've more to do to assist with the OOM case, when we need to evict a lot in a hurry. But regardless, the start of freeing up memory is knowing where it is, so better accounting is necessary.
How Has This Been Tested?
A full test suite run mostly passed locally. There were a handful of failures but I don’t really trust my local test environment, so we’ll see what the CI says.
Here’s a test that shows the effect:
Running this on master a160c15, we see that after the files are created, there are a bunch of `dnode_t`s allocated, and a bunch of dbufs with holds, not on the cache. This is because there are a bunch of znodes (read: inodes) in the inode and dentry caches, each with a dnode hold.

Then we instruct the kernel to drop slab objects, which runs the inode reclaim ("superblock shrinker") and the kmem cache reclaims. The kernel evicts the znodes, returning them to the cache, and then reclaims the kmem caches, so the alloc count drops right off. Note that the dnode allocations did not change; the dbufs have no holds, so they have been put on the dbuf cache (`dbc=1`). Note that `cache_count` has gone up by 7, and `cache_count_bytes` by 7x16K: only the dbuf size is counted.

Now we force the dbuf cache max to 48K and give it a few seconds to evict things, before reclaiming the cache. Note that the two remaining dbufs on the cache have an apparent size of 32K, under the cache limit; however, we know (from the first output above) that `54/0/0/2` and `54/0/0/3` had 32 holds each at one point, so they have an additional 2x39K=78K attached at this point. There is no way for that to be reclaimed unless and until those dbufs are evicted.

Running again with this change in place, this is the same output as above, except for the new `usize` column on the dbuf stats.

After the inode drop and reclaim, we're in the same position, but `cache_size_bytes` is higher than it was before. The difference between the previous run and this one is 1212792-969216=243576, which is 199*1224: one `dnode_t` for each of the holds that have been released from the dbufs now in the cache.

And then lowering the cache limit again. This time, an additional dbuf is evicted.
Further discussion
I don't have a clear sense of how this changes the overall cache efficiency. It does of course make metadnode dbufs much heavier, and so means we'll be doing more evicting than before, but that's the point! We still evict least-recently-used first, and evictions don't go very far anyway (they're still in the ARC), so my sense is that it's probably not going to be a big deal, at least not for the normal steady background evict churn.
I haven’t plumbed through other dbuf userdata clients in the same way, simply because they weren’t obviously part of the problem. I could, but I’d want to check first that there isn’t anything that would change the size sufficiently that the weight actually is a problem. A big part of the problem with the dnodes is that their memory lifetimes are very entangled with znode (inode, dentry) lifetimes, and with the specialness of object 0. That seems like it might not be as much of an issue for other userdata types.
There's a case to be made that the dbuf cache should retain its original purpose, and only be used to avoid the cost of repeatedly decompressing a compressed ARC buffer. In that case, an alternate approach might be to evict the userdata before putting the dbuf on the cache. Reinflating the dnodes should have minimal overhead; it's unlikely that all or even most of the dnodes are needed, and allocations are coming from the kmem cache, so it's mostly some copying and housekeeping. On the other hand, we could say the whole point of the dbuf cache is to reduce the work of getting the needed data back, and so it should be carrying more, not less. In that case, perhaps we also want to include the size of the intermediate objects (`dnode_children_t`, `dmu_buf_user_t`, even `dmu_buf_impl_t` proper).

There are also arguments for removing the dbuf cache entirely and keeping the decompressed version alive for longer in the ARC. I have not pursued this, but you might say that that's a more worthwhile direction to explore.