This PR condenses the FDT dedup log syncing into a single sync pass. This reduces the overhead of modifying indirect blocks for the dedup table multiple times per txg. In addition, the formula for how much to sync per txg has changed: we now also consider the backlog we have to clear, to prevent it from growing too large or remaining large on an idle system.
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Authored-by: Don Brady [email protected]
Authored-by: Paul Dagnelie [email protected]
Motivation and Context
Currently, flushing the DDT log takes place over multiple sync passes. This means that the same indirect blocks can be updated several times during one sync, which partly defeats the purpose the DDT log was meant to serve in the first place. In addition, there is no mechanism in place to reduce the size of the DDT log; we try to keep up with the ingest rate, but that's it. If the log ever does grow large, we may never make progress in shrinking it again, which can result in increased import times.
Description
There are two main changes included in this patch. The first is condensing all the syncing into a single sync pass. We do this by removing the code that divided the flush targets by the number of passes, and by generally not doing any flushing work beyond the first sync pass.
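As a rough, hypothetical sketch of that rule (the function and parameter names here are illustrative stand-ins, not the actual OpenZFS symbols), the flush path simply declines to do anything past the first sync pass:

```c
#include <stdbool.h>

/*
 * Illustrative only: in the real code the current pass would come from the
 * SPA, and the flush entry point has a different shape entirely.
 */
static bool
ddt_log_should_flush(int sync_pass)
{
	/*
	 * Flush the DDT log only on the first sync pass, so the dedup
	 * table's indirect blocks are dirtied at most once per txg.
	 */
	return (sync_pass == 1);
}
```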
The second is a modification to the flush targets for each txg. The basic algorithm has changed: rather than directly targeting the ingest rate, the primary mechanism for determining how much to flush is to take the size of the backlog and divide it by a target turnover rate (measured in txgs). The idea is that this smooths out the noise in the ingest rate and, over time, the flush rate still ends up matching the ingest rate. This follows from the differential equation

`dbacklog/dt = ingest_rate - backlog/C`

which describes the change in backlog over time. This results in the backlog tending towards `C * ingest_rate`, where `C` is the turnover rate. The flush rate is then `backlog / C = C * ingest_rate / C`, which is just the ingest rate.
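For a constant ingest rate this is a standard first-order linear ODE, and writing out its solution makes the steady state explicit (using `B` for the backlog, `r` for the ingest rate, and `C` for the turnover rate, matching the symbols above):

```math
\frac{dB}{dt} = r - \frac{B}{C}
\quad\Longrightarrow\quad
B(t) = rC + \bigl(B(0) - rC\bigr)\,e^{-t/C}
```

As `t` grows the exponential term dies off, `B` settles at `r * C`, and the per-txg flush target `B / C` settles at `r`; in other words, the flush rate converges to the ingest rate with a time constant of `C` txgs.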
However, one potential issue with this algorithm is that the backlog size is now proportional to the ingest rate. Whenever we import the pool, we have to read through the whole DDT log to build up the in-memory state, so if a user has hard requirements on import time, a large DDT log backlog can cause problems for them. As a result, there is a separate pressure-based system to keep the backlog from rising above a cap, when that cap is set. The pressure system works as follows, evaluated every txg:

- If the backlog is above the cap and increasing, pressure is added; the amount added is proportional to the backlog divided by the cap, which helps us catch up to rapid spikes.
- If the backlog is above the cap but not increasing, we maintain the pressure; either it was a brief spike, or we have already added enough pressure to bring the size down.
- If the backlog is below the cap, we release some of the pressure. The amount released is based on how far below the cap we are; that way, we quickly release pressure if the increased ingest rate abates, and we return to normal behavior.

Here are a few charts to help demonstrate the behavior of this cap system:
In this example, we start with an ingest rate of 2k entries per second. We have a cap of 50k set, and the target turnover rate is 20 (a low value, to make the changes happen more quickly and be easier to see). At txg 10, the ingest rate increases by a factor of 3, and then at txg 100 it decreases back to the baseline. As you can see, the un-capped backlog quickly grows as the flush rate slowly rises to match the new ingest rate. Meanwhile, the capped backlog's flush rate climbs quickly to bring the backlog down near the cap, and then stabilizes to keep it there. Similarly, when the ingest rate drops, the un-capped backlog quickly starts falling as the flush rate slowly drops to the new baseline. Meanwhile, the cap-based system starts to flush below the cap size and then corrects, levelling off quickly near the previous baseline.
![Charts of backlog size and flush rate over time for the capped and un-capped systems](https://private-user-images.githubusercontent.com/142840/411073258-7482a485-276c-48c1-90e7-773dcbcf55ad.png)
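To make the pacing behavior easier to experiment with, here is a small, self-contained toy model of the scheme described above. It is not the actual OpenZFS implementation: the constants, names, and in particular the way the accumulated pressure feeds back into the flush target (a simple `1 + pressure` multiplier here) are assumptions made purely for illustration. It mirrors the scenario from the charts: baseline ingest, a 3x spike between txg 10 and txg 100, a cap of 50k entries, and a turnover rate of 20 txgs, with the ingest folded into per-txg units for simplicity.

```c
#include <stdio.h>
#include <stdbool.h>

/*
 * Toy model of backlog-based flush pacing with a pressure cap.  All of the
 * arithmetic here is an approximation of the behavior described in the PR
 * text, not the code in the patch itself.
 */
#define	TURNOVER_TXGS	20.0	/* target turnover rate C, in txgs */
#define	BACKLOG_CAP	50000.0	/* cap on the backlog size, in entries */

int
main(void)
{
	double backlog = 0, prev_backlog = 0, pressure = 0;

	for (int txg = 0; txg < 200; txg++) {
		/* Baseline ingest per txg, tripled between txg 10 and 100. */
		double ingest = (txg >= 10 && txg < 100) ? 6000 : 2000;

		backlog += ingest;
		bool increasing = backlog > prev_backlog;
		prev_backlog = backlog;

		/* Base target: clear 1/C of the current backlog. */
		double target = backlog / TURNOVER_TXGS;

		if (backlog > BACKLOG_CAP && increasing) {
			/* Above the cap and growing: add pressure. */
			pressure += backlog / BACKLOG_CAP;
		} else if (backlog <= BACKLOG_CAP && pressure > 0) {
			/*
			 * Below the cap: release pressure in proportion to
			 * how far below it we are.
			 */
			pressure -= pressure *
			    (BACKLOG_CAP - backlog) / BACKLOG_CAP;
		}
		/* Above the cap but not growing: pressure is left alone. */

		target *= 1.0 + pressure;	/* assumed feedback */
		if (target > backlog)
			target = backlog;
		backlog -= target;

		printf("txg %3d ingest %5.0f flush %8.0f backlog %8.0f "
		    "pressure %6.2f\n", txg, ingest, target, backlog,
		    pressure);
	}
	return (0);
}
```

Compiling and running this prints one line per txg; plotting the backlog and flush columns gives curves with roughly the same shape as the capped line in the charts above, while setting the cap very high (or removing the pressure branches) approximates the un-capped behavior.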
Finally, in addition to these changes, I added a new test to the ZTS to verify that pacing works as expected.
How Has This Been Tested?
In addition to the zfs test suite, I ran several tests where I simulated various ingestion patterns into the DDT, and verified that the backlog behaved as expected with and without the cap set.