This PR condenses the FDT dedup log syncing into a single sync pass. This reduces the overhead of modifying indirect blocks for the dedup table multiple times per txg. In addition, the formula for how much to sync per txg has changed: we now also consider the backlog we have to clear, to prevent it from growing too large or remaining large on an idle system.
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Authored-by: Don Brady [email protected]
Authored-by: Paul Dagnelie [email protected]
Motivation and Context
Currently, flushing the DDT log takes place over multiple sync passes. This means that the same indirect blocks can be updated several times during one sync, which partly defeats the purpose the DDT log was meant to serve in the first place. In addition, there is no mechanism in place to reduce the size of the DDT log; we try to keep up with the ingest rate, but that's it. If the log ever does grow large, we may never make progress in shrinking it again, which can result in increased import times.
Description
There are two main changes included in this patch. The first is condensing all the syncing into a single sync pass. We do this by removing the code that divided the flush targets by the number of passes, and by generally not doing any flushing work beyond the first sync pass.
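As a rough, hypothetical sketch of that rule (the function and parameter names here are illustrative stand-ins, not the actual OpenZFS symbols), the flush path simply declines to do anything past the first sync pass:

```c
#include <stdbool.h>

/*
 * Illustrative only: in the real code the current pass would come from the
 * SPA, and the flush entry point has a different shape entirely.
 */
static bool
ddt_log_should_flush(int sync_pass)
{
	/*
	 * Flush the DDT log only on the first sync pass, so the dedup
	 * table's indirect blocks are dirtied at most once per txg.
	 */
	return (sync_pass == 1);
}
```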
The second is a modification to the flush targets for each txg. The basic algorithm has changed: rather than directly targeting the ingest rate, the primary mechanism for determining how much to flush is to take the size of the backlog and divide it by a target turnover rate (measured in txgs). The idea is that this smooths out the noise in the ingest rate and, over time, the flush rate still ends up matching the ingest rate. This follows from the differential equation

`dbacklog/dt = ingest_rate - backlog/C`

which describes the change in backlog over time. This results in the backlog tending towards `C * ingest_rate`, where `C` is the turnover rate. The flush rate is then `backlog / C = C * ingest_rate / C`, which is just the ingest rate.
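For a constant ingest rate this is a standard first-order linear ODE, and writing out its solution makes the steady state explicit (using `B` for the backlog, `r` for the ingest rate, and `C` for the turnover rate, matching the symbols above):

```math
\frac{dB}{dt} = r - \frac{B}{C}
\quad\Longrightarrow\quad
B(t) = rC + \bigl(B(0) - rC\bigr)\,e^{-t/C}
```

As `t` grows the exponential term dies off, `B` settles at `r * C`, and the per-txg flush target `B / C` settles at `r`; in other words, the flush rate converges to the ingest rate with a time constant of `C` txgs.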
However, one potential issue with this algorithm is that the backlog size is now proportional to the ingest rate. Whenever we import the pool, we have to read through the whole DDT log to build up the in-memory state, so if a user has hard requirements on import time, a large DDT log backlog can cause problems for them. As a result, there is a separate pressure-based system to keep the backlog from rising above a cap, when that cap is set. The pressure system works as follows, evaluated every txg:

- If the backlog is above the cap and increasing, pressure is added; the amount added is proportional to the backlog divided by the cap, which helps us catch up to rapid spikes.
- If the backlog is above the cap but not increasing, we maintain the pressure; either it was a brief spike, or we have already added enough pressure to bring the size down.
- If the backlog is below the cap, we release some of the pressure. The amount released is based on how far below the cap we are; that way, we quickly release pressure if the increased ingest rate abates, and we return to normal behavior.

Here are a few charts to help demonstrate the behavior of this cap system:
In this example, we start with an ingest rate of 2k entries per second. We have a cap of 50k set, and the target turnover rate is 20 (a low value, to make the changes happen more quickly and be easier to see). At txg 10, the ingest rate increases by a factor of 3, and then at txg 100 it decreases back to the baseline. As you can see, the un-capped backlog quickly grows as the flush rate slowly rises to match the new ingest rate. Meanwhile, the capped backlog's flush rate climbs quickly to bring the backlog down near the cap, and then stabilizes to keep it there. Similarly, when the ingest rate drops, the un-capped backlog quickly starts falling as the flush rate slowly drops to the new baseline. Meanwhile, the cap-based system starts to flush below the cap size and then corrects, levelling off quickly near the previous baseline.
![Charts of backlog size and flush rate over time for the capped and un-capped systems](https://private-user-images.githubusercontent.com/142840/411073258-7482a485-276c-48c1-90e7-773dcbcf55ad.png)
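To make the pacing behavior easier to experiment with, here is a small, self-contained toy model of the scheme described above. It is not the actual OpenZFS implementation: the constants, names, and in particular the way the accumulated pressure feeds back into the flush target (a simple `1 + pressure` multiplier here) are assumptions made purely for illustration. It mirrors the scenario from the charts: baseline ingest, a 3x spike between txg 10 and txg 100, a cap of 50k entries, and a turnover rate of 20 txgs, with the ingest folded into per-txg units for simplicity.

```c
#include <stdio.h>
#include <stdbool.h>

/*
 * Toy model of backlog-based flush pacing with a pressure cap.  All of the
 * arithmetic here is an approximation of the behavior described in the PR
 * text, not the code in the patch itself.
 */
#define	TURNOVER_TXGS	20.0	/* target turnover rate C, in txgs */
#define	BACKLOG_CAP	50000.0	/* cap on the backlog size, in entries */

int
main(void)
{
	double backlog = 0, prev_backlog = 0, pressure = 0;

	for (int txg = 0; txg < 200; txg++) {
		/* Baseline ingest per txg, tripled between txg 10 and 100. */
		double ingest = (txg >= 10 && txg < 100) ? 6000 : 2000;

		backlog += ingest;
		bool increasing = backlog > prev_backlog;
		prev_backlog = backlog;

		/* Base target: clear 1/C of the current backlog. */
		double target = backlog / TURNOVER_TXGS;

		if (backlog > BACKLOG_CAP && increasing) {
			/* Above the cap and growing: add pressure. */
			pressure += backlog / BACKLOG_CAP;
		} else if (backlog <= BACKLOG_CAP && pressure > 0) {
			/*
			 * Below the cap: release pressure in proportion to
			 * how far below it we are.
			 */
			pressure -= pressure *
			    (BACKLOG_CAP - backlog) / BACKLOG_CAP;
		}
		/* Above the cap but not growing: pressure is left alone. */

		target *= 1.0 + pressure;	/* assumed feedback */
		if (target > backlog)
			target = backlog;
		backlog -= target;

		printf("txg %3d ingest %5.0f flush %8.0f backlog %8.0f "
		    "pressure %6.2f\n", txg, ingest, target, backlog,
		    pressure);
	}
	return (0);
}
```

Compiling and running this prints one line per txg; plotting the backlog and flush columns gives curves with roughly the same shape as the capped line in the charts above, while setting the cap very high (or removing the pressure branches) approximates the un-capped behavior.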
Finally, in addition to these changes, I added a new test to the ZTS to verify that pacing works as expected.
How Has This Been Tested?
In addition to the zfs test suite, I ran several tests where I simulated various ingestion patterns into the DDT, and verified that the backlog behaved as expected with and without the cap set.