Prime arc to reduce zil_replay & import times #17044

Draft
markroper wants to merge 1 commit into master from improve-zil-replay-latency
Conversation

@markroper (Contributor) commented Feb 11, 2025

Motivation and Context

The time it takes to import a zpool is dominated by the time it takes to replay the zfs intent log (zil) when the zil is large.
The zil is replayed serially, and some operations require read-modify-write against the pool itself as part of replaying the zil, for example TX_WRITE and TX_LINK records.

The proposed change uses a taskq to issue arc_read requests to the pool in parallel prior to executing zil_replay, in order to reduce zil_replay latency. The arc-prime behavior is gated behind a module parameter that is disabled by default, so OpenZFS users who do not opt in see no change in behavior.
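
For readers unfamiliar with the mechanism, the pattern is roughly: walk the on-disk zil once to collect the block pointers that replay will touch, then prefetch them in parallel. Below is a minimal sketch of the collection pass using the existing `zil_parse()` walker; the `zil_prime_*` names and the list-based bookkeeping are illustrative assumptions for this sketch, not the PR's actual code.

```c
#include <sys/zil.h>
#include <sys/zil_impl.h>
#include <sys/spa.h>

/*
 * Hypothetical callbacks for zil_parse(), which walks the ZIL chain and
 * invokes one callback per log block and one per log record.
 */
static int
zil_prime_parse_blk(zilog_t *zilog, const blkptr_t *bp, void *arg,
    uint64_t first_txg)
{
	/* The ZIL chain blocks themselves are already being read here. */
	return (0);
}

static int
zil_prime_parse_lr(zilog_t *zilog, const lr_t *lrc, void *arg,
    uint64_t first_txg)
{
	list_t *bps = arg;	/* caller-owned list of blkptrs to prime */

	if (lrc->lrc_txtype == TX_WRITE) {
		const lr_write_t *lrw = (const lr_write_t *)lrc;

		/* WR_INDIRECT records carry a block pointer worth priming. */
		if (!BP_IS_HOLE(&lrw->lr_blkptr))
			/* append a copy of lrw->lr_blkptr to *bps */;
	}
	return (0);
}

/*
 * One possible call site, before zil_replay() runs:
 *
 *	(void) zil_parse(zilog, zil_prime_parse_blk, zil_prime_parse_lr,
 *	    &bps, first_txg, B_FALSE);
 */
```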

Open issue: #17043

Description

This commit reduces zil_replay times by reading the zil and issuing arc_read requests in parallel with a taskq prior to performing the serial zil_replay. This converts read-modify-writes against the pool during serial zil_replay into ARC cache hits, improving performance by more than 20x (from hours to single-digit minutes) in extreme cases of large ZILs with many small, unaligned TX_WRITE IOs and TX_LINK (hardlink) creations. The benefit is particularly acute when the primary pool is stored on high-latency devices, which raises the cost of each read-modify-write during serial zil_replay.
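
To make the parallel prefetch pass concrete, here is a minimal sketch assuming the block pointers have already been collected from the zil (as in the sketch above). The `zil_prime_*` names, the per-task struct, and the thread count are illustrative assumptions; `taskq_create`/`taskq_dispatch`/`taskq_wait` and `arc_read` are the underlying OpenZFS primitives.

```c
#include <sys/taskq.h>
#include <sys/arc.h>
#include <sys/spa.h>

/* Hypothetical per-task state: one block pointer to warm into the ARC. */
typedef struct zil_prime_task {
	spa_t		*zpt_spa;
	blkptr_t	zpt_bp;
} zil_prime_task_t;

static void
zil_prime_task_func(void *arg)
{
	zil_prime_task_t *zpt = arg;
	arc_flags_t aflags = ARC_FLAG_PREFETCH | ARC_FLAG_NOWAIT;
	zbookmark_phys_t zb = { 0 };

	/*
	 * Fire-and-forget read: no done callback, speculative flags, so a
	 * failed prefetch is harmless and replay simply falls back to a
	 * synchronous pool read.
	 */
	(void) arc_read(NULL, zpt->zpt_spa, &zpt->zpt_bp, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
	    &aflags, &zb);
	kmem_free(zpt, sizeof (*zpt));
}

/* Dispatch one task per collected block pointer, then wait for them all. */
static void
zil_prime_arc(spa_t *spa, const blkptr_t *bps, int nbps)
{
	taskq_t *tq = taskq_create("zil_prime", boot_ncpus, minclsyspri,
	    boot_ncpus, INT_MAX, TASKQ_PREPOPULATE);

	for (int i = 0; i < nbps; i++) {
		zil_prime_task_t *zpt = kmem_alloc(sizeof (*zpt), KM_SLEEP);
		zpt->zpt_spa = spa;
		zpt->zpt_bp = bps[i];
		(void) taskq_dispatch(tq, zil_prime_task_func, zpt, TQ_SLEEP);
	}
	taskq_wait(tq);		/* all reads issued; ARC is now warm */
	taskq_destroy(tq);
}
```

Because each arc_read is independent and the ARC is the common cache consulted by the later serial replay, the only coordination point is the final taskq_wait before zil_replay begins.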

As noted above, the arc-prime behavior is gated behind a module parameter that is disabled by default, so OpenZFS users who do not opt in see no change in behavior; a sketch of such a declaration follows.
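
For illustration, an opt-in tunable of this kind would be declared with the usual `ZFS_MODULE_PARAM` machinery; the parameter name below is a hypothetical stand-in, since the PR does not state it in this thread.

```c
/*
 * Hypothetical tunable: 0 (default) preserves existing import behavior,
 * nonzero enables the parallel ARC-priming pass before zil_replay.
 */
static int zil_prime_arc_enabled = 0;

ZFS_MODULE_PARAM(zfs_zil, zil_, prime_arc_enabled, INT, ZMOD_RW,
	"Prefetch blocks referenced by the ZIL into the ARC before replay");
```

On Linux such a parameter surfaces under /sys/module/zfs/parameters/, so operators can opt in at runtime without recompiling.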

How Has This Been Tested?

These changes have been tested on machines ranging from 2 to 128 CPU cores by running hardlink-heavy and heavy random-write IO workloads, panicking the kernel, and measuring the latency of zpool import. For TX_WRITE workloads, the following import latency improvements were measured, showing higher zil_replay record-per-second rates after priming the ARC and lower zpool import times:

[Chart: TX_WRITE_pool_import]

[Chart: TX_WRITE_import_latency]

The change helps even more in the TX_LINK case, where hardlink entries in the zil take up very little space and there can be hundreds of thousands, or even over a million, entries to replay. In this case, one production file system's zpool import time was reduced from 6 hours to 15 minutes, a more than 20x improvement.

This change is in use by FSx in production today for FSx Intelligent Tiering file systems, which use S3 storage-backed vdevs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@github-actions bot added the "Status: Work in Progress" (not yet ready for general review) label on Feb 11, 2025
@markroper force-pushed the improve-zil-replay-latency branch 3 times, most recently from 776e3b3 to a5909ac on February 11, 2025 15:01
The time it takes to import a zpool is dominated
by the time it takes to replay the zfs intent log (zil)
when the zil is large. The zil is replayed serially,
and some operations require read-modify-write to occur,
for example TX_WRITE and TX_LINK entries. This commit
reduces zil_replay times by reading the zil and issuing
arc_read requests in parallel using a taskq prior to
performing the serial zil_replay. Doing so can reduce pool
import times from hours to minutes in cases where the zil
has many TX_WRITE and TX_LINK entries. The benefit is
particularly acute when the primary pool is stored on
high-latency devices, which increases the cost of pool
read-modify-write in serial zil_replay.

Signed-off-by: Mark Roper <[email protected]>
@markroper force-pushed the improve-zil-replay-latency branch from a5909ac to ce155cd on February 11, 2025 15:43
@pcd1193182 (Contributor) commented Feb 11, 2025

Did you investigate how much time the check and claim stages of import take? Those seem like they would 1) have similar issues, and 2) pull most of the relevant data into the ARC already? EDIT: For TX_WRITE, anyway. TX_LINK wouldn't get any data cached by (or significantly slow down the operation of) the earlier phases, from what I can see.
