Prime arc to reduce zil_replay & import times #17044
Draft · +246 −20
Motivation and Context
The time it takes to import a zpool is dominated by the time it takes to replay the ZFS intent log (ZIL) when the ZIL is large.
The ZIL is replayed serially, and some operations require a read-modify-write against the pool itself as part of replay, for example TX_WRITE and TX_LINK records.
The proposed change uses a taskq to issue arc_read requests to the pool in parallel before executing zil_replay, in order to reduce zil_replay latency. The arc-prime behavior is gated behind a module parameter that is disabled by default, so OpenZFS users who do not opt in see no change in behavior.
Open issue: #17043
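To make the two-phase structure concrete, here is a minimal userland sketch of the pattern, not the patch itself: a small pool of POSIX threads stands in for the taskq and "prefetches" log records into an in-memory cache, after which a single thread replays them serially and mostly finds them already cached. All names here (prime_worker, replay_all, slow_read, the record and worker counts) are hypothetical; the real change operates on ZIL records inside the kernel via taskq and arc_read.

```c
/*
 * Userland model of "prime then replay": N worker threads prefetch
 * records into an in-memory cache, then one thread replays serially.
 * Hypothetical stand-in for the taskq + arc_read priming in the patch.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define	NRECORDS	64
#define	NWORKERS	8

static int cache[NRECORDS];		/* 1 = record primed ("in ARC") */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_record;			/* next record index to prefetch */

/* Simulate a slow pool read (e.g. a high-latency vdev). */
static void
slow_read(int rec)
{
	(void) rec;
	usleep(10000);			/* 10 ms per uncached read */
}

/* Worker: pull record indexes and prefetch them in parallel. */
static void *
prime_worker(void *arg)
{
	(void) arg;
	for (;;) {
		pthread_mutex_lock(&cache_lock);
		int rec = next_record++;
		pthread_mutex_unlock(&cache_lock);
		if (rec >= NRECORDS)
			return (NULL);
		slow_read(rec);		/* pay the read cost now... */
		pthread_mutex_lock(&cache_lock);
		cache[rec] = 1;		/* ...so replay finds it cached */
		pthread_mutex_unlock(&cache_lock);
	}
}

/* Serial replay: cached records are cheap, uncached ones pay slow_read. */
static void
replay_all(void)
{
	for (int rec = 0; rec < NRECORDS; rec++) {
		pthread_mutex_lock(&cache_lock);
		int hit = cache[rec];
		pthread_mutex_unlock(&cache_lock);
		if (!hit)
			slow_read(rec);
		/* apply the record here */
	}
}

int
main(void)
{
	pthread_t workers[NWORKERS];

	/* Phase 1: prime the cache in parallel (the new, opt-in step). */
	for (int i = 0; i < NWORKERS; i++)
		pthread_create(&workers[i], NULL, prime_worker, NULL);
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(workers[i], NULL);

	/* Phase 2: the existing serial replay, now mostly cache hits. */
	replay_all();
	printf("replayed %d records\n", NRECORDS);
	return (0);
}
```

Running the model with and without the priming phase shows the same shape of win the PR reports: the serial loop's latency collapses once the reads have been issued in parallel ahead of it.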
Description
This commit reduces zil_replay times by reading the ZIL and issuing arc_read requests in parallel with a taskq prior to performing the serial zil_replay. This converts read-modify-writes against the pool during serial zil_replay into ARC cache hits, improving performance by more than 20x (from hours to single-digit minutes) in extreme cases of large ZILs with many small unaligned TX_WRITE IOs and TX_LINK (hardlink) creations. The benefit is particularly acute when the primary pool is stored on high-latency devices, which raises the cost of each pool read-modify-write during serial zil_replay.
The arc-prime behavior is behind a module parameter that is disabled by default, so OpenZFS users who do not opt in experience no change in behavior.
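Since the opt-in gating is what keeps default behavior unchanged, a sketch of that part may help too. The snippet below uses a hypothetical ZIL_PRIME_ARC environment variable as a stand-in for the real module parameter (whose actual name is defined by the patch and not reproduced here); unset or "0" means the priming phase is skipped entirely.

```c
/*
 * Gate the priming phase behind a default-off switch, mirroring the
 * module-parameter approach: unset or "0" means today's behavior.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
prime_arc_enabled(void)
{
	/* Stand-in for the real zfs module parameter (name hypothetical). */
	const char *v = getenv("ZIL_PRIME_ARC");
	return (v != NULL && strcmp(v, "0") != 0);
}

int
main(void)
{
	if (prime_arc_enabled()) {
		/* Phase 1 (opt-in): dispatch prefetch work and wait for it. */
		printf("priming ARC before replay\n");
	}
	/* Phase 2 (always runs): serial zil_replay, unchanged by default. */
	printf("replaying ZIL serially\n");
	return (0);
}
```

In the kernel the same check would simply guard the taskq dispatch, so a disabled tunable leaves the existing import path untouched.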
How Has This Been Tested?
These changes have been tested on machines ranging from 2 to 128 CPU cores by running hardlink and heavy random-write IO workloads, panicking the kernel, and measuring the latency of zpool import. For TX_WRITE-heavy workloads, the measured improvements show higher zil_replay records-per-second rates after priming the ARC and correspondingly lower zpool import times.
The change helps most in the TX_LINK case: hardlink entries in the ZIL take up very little space, so there can be hundreds of thousands, or even more than a million, entries to replay. In one such case, a production file system's zpool import time was reduced from 6 hours to 15 minutes, a more than 20x improvement.
This change is in use in production today by FSx for FSx Intelligent Tiering file systems, which use S3 storage-backed vdevs.