Prime arc to reduce zil_replay & import times #17044

Draft
markroper wants to merge 1 commit into master from improve-zil-replay-latency
Conversation

@markroper (Contributor) commented Feb 11, 2025

Motivation and Context

The time it takes to import a zpool is dominated by the time it takes to replay the zfs intent log (zil) when the zil is large.
The zil is replayed serially, and some operations require read-modify-write against the pool itself as part of replaying the zil, for example TX_WRITE and TX_LINK records.

The proposed change uses a taskq to issue arc_read requests to the pool in parallel prior to executing zil_replay, in order to reduce zil_replay latency. The arc-prime behavior is gated behind a module parameter that is disabled by default, so OpenZFS users who do not opt in see no change in behavior.
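
For readers unfamiliar with the mechanism, the pattern is roughly: walk the on-disk zil once to collect the block pointers that replay will touch, then prefetch them in parallel. Below is a minimal sketch of the collection pass using the existing `zil_parse()` walker; the `zil_prime_*` names and the list-based bookkeeping are illustrative assumptions for this sketch, not the PR's actual code.

```c
#include <sys/zil.h>
#include <sys/zil_impl.h>
#include <sys/spa.h>

/*
 * Hypothetical callbacks for zil_parse(), which walks the ZIL chain and
 * invokes one callback per log block and one per log record.
 */
static int
zil_prime_parse_blk(zilog_t *zilog, const blkptr_t *bp, void *arg,
    uint64_t first_txg)
{
	/* The ZIL chain blocks themselves are already being read here. */
	return (0);
}

static int
zil_prime_parse_lr(zilog_t *zilog, const lr_t *lrc, void *arg,
    uint64_t first_txg)
{
	list_t *bps = arg;	/* caller-owned list of blkptrs to prime */

	if (lrc->lrc_txtype == TX_WRITE) {
		const lr_write_t *lrw = (const lr_write_t *)lrc;

		/* WR_INDIRECT records carry a block pointer worth priming. */
		if (!BP_IS_HOLE(&lrw->lr_blkptr))
			/* append a copy of lrw->lr_blkptr to *bps */;
	}
	return (0);
}

/*
 * One possible call site, before zil_replay() runs:
 *
 *	(void) zil_parse(zilog, zil_prime_parse_blk, zil_prime_parse_lr,
 *	    &bps, first_txg, B_FALSE);
 */
```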

Open issue: #17043

Description

This commit reduces zil_replay times by reading the zil and issuing arc_read requests in parallel with a taskq prior to performing the serial zil_replay. This converts read-modify-writes against the pool during serial zil_replay into ARC cache hits, improving performance by more than 20x (from hours to single-digit minutes) in extreme cases of large ZILs with many small, unaligned TX_WRITE IOs and TX_LINK (hardlink) creations. The benefit is particularly acute when the primary pool is stored on high-latency devices, which raises the cost of each read-modify-write during serial zil_replay.
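
To make the parallel prefetch pass concrete, here is a minimal sketch assuming the block pointers have already been collected from the zil (as in the sketch above). The `zil_prime_*` names, the per-task struct, and the thread count are illustrative assumptions; `taskq_create`/`taskq_dispatch`/`taskq_wait` and `arc_read` are the underlying OpenZFS primitives.

```c
#include <sys/taskq.h>
#include <sys/arc.h>
#include <sys/spa.h>

/* Hypothetical per-task state: one block pointer to warm into the ARC. */
typedef struct zil_prime_task {
	spa_t		*zpt_spa;
	blkptr_t	zpt_bp;
} zil_prime_task_t;

static void
zil_prime_task_func(void *arg)
{
	zil_prime_task_t *zpt = arg;
	arc_flags_t aflags = ARC_FLAG_PREFETCH | ARC_FLAG_NOWAIT;
	zbookmark_phys_t zb = { 0 };

	/*
	 * Fire-and-forget read: no done callback, speculative flags, so a
	 * failed prefetch is harmless and replay simply falls back to a
	 * synchronous pool read.
	 */
	(void) arc_read(NULL, zpt->zpt_spa, &zpt->zpt_bp, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
	    &aflags, &zb);
	kmem_free(zpt, sizeof (*zpt));
}

/* Dispatch one task per collected block pointer, then wait for them all. */
static void
zil_prime_arc(spa_t *spa, const blkptr_t *bps, int nbps)
{
	taskq_t *tq = taskq_create("zil_prime", boot_ncpus, minclsyspri,
	    boot_ncpus, INT_MAX, TASKQ_PREPOPULATE);

	for (int i = 0; i < nbps; i++) {
		zil_prime_task_t *zpt = kmem_alloc(sizeof (*zpt), KM_SLEEP);
		zpt->zpt_spa = spa;
		zpt->zpt_bp = bps[i];
		(void) taskq_dispatch(tq, zil_prime_task_func, zpt, TQ_SLEEP);
	}
	taskq_wait(tq);		/* all reads issued; ARC is now warm */
	taskq_destroy(tq);
}
```

Because each arc_read is independent and the ARC is the common cache consulted by the later serial replay, the only coordination point is the final taskq_wait before zil_replay begins.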

As noted above, the arc-prime behavior is gated behind a module parameter that is disabled by default, so OpenZFS users who do not opt in see no change in behavior; a sketch of such a declaration follows.
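
For illustration, an opt-in tunable of this kind would be declared with the usual `ZFS_MODULE_PARAM` machinery; the parameter name below is a hypothetical stand-in, since the PR does not state it in this thread.

```c
/*
 * Hypothetical tunable: 0 (default) preserves existing import behavior,
 * nonzero enables the parallel ARC-priming pass before zil_replay.
 */
static int zil_prime_arc_enabled = 0;

ZFS_MODULE_PARAM(zfs_zil, zil_, prime_arc_enabled, INT, ZMOD_RW,
	"Prefetch blocks referenced by the ZIL into the ARC before replay");
```

On Linux such a parameter surfaces under /sys/module/zfs/parameters/, so operators can opt in at runtime without recompiling.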

How Has This Been Tested?

These changes have been tested on machines ranging from 2 to 128 CPU cores by running hardlink-heavy and heavy random-write IO workloads, panicking the kernel, and measuring the latency of zpool import. For TX_WRITE workloads, the following import latency improvements were measured, showing higher zil_replay record-per-second rates after priming the ARC and lower zpool import times:

[Chart: TX_WRITE_pool_import]

[Chart: TX_WRITE_import_latency]

The change helps even more in the TX_LINK case, where hardlink entries in the zil take up very little space and there can be hundreds of thousands, or even over a million, entries to replay. In this case, one production file system's zpool import time was reduced from 6 hours to 15 minutes, a more than 20x improvement.

This change is in use by FSx in production today for FSx Intelligent Tiering file systems, which use S3 storage-backed vdevs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@github-actions bot added the "Status: Work in Progress" (not yet ready for general review) label on Feb 11, 2025
@markroper force-pushed the improve-zil-replay-latency branch 3 times, most recently from 776e3b3 to a5909ac on February 11, 2025 15:01
The time it takes to import a zpool is dominated
by the time it takes to replay the zfs intent log (zil)
when the zil is large. The zil is replayed serially,
and some operations require read-modify-write to occur,
for example TX_WRITE and TX_LINK entries. This commit
reduces zil_replay times by reading the zil and issuing
arc_read requests in parallel using a taskq prior to
performing the serial zil_replay. Doing so can reduce pool
import times from hours to minutes in cases where the zil
has many TX_WRITE and TX_LINK entries. The benefit is
particularly acute when the primary pool is stored on
high-latency devices, which increases the cost of pool
read-modify-write in serial zil_replay.

Signed-off-by: Mark Roper <[email protected]>
@markroper force-pushed the improve-zil-replay-latency branch from a5909ac to ce155cd on February 11, 2025 15:43
@pcd1193182 (Contributor) commented Feb 11, 2025

Did you investigate how much time the check and claim stages of import take? Those seem like they would 1) have similar issues, and 2) pull most of the relevant data into the ARC already? EDIT: For TX_WRITE, anyway. TX_LINK wouldn't get any data cached by (or significantly slow down the operation of) the earlier phases, from what I can see.
