
Add TXG timestamp database #16853

Open · oshogbo wants to merge 1 commit into master from oshogbo/scrub_data_range
Conversation

@oshogbo (Contributor) commented Dec 11, 2024

Motivation and Context

This feature enables tracking of when TXGs are committed to disk, providing an estimated timestamp for each TXG.

With this information, it becomes possible to perform scrubs based on specific date ranges, improving the granularity of data management and recovery operations.

Description

To achieve this, we implemented a round-robin database that keeps track of time at three resolutions: minutes, days, and months. We believe this provides the best resolution for time management. This feature does not track the exact time of each transaction group (txg) but provides an estimate. The txg database can also be used in other scenarios where mapping dates to transaction groups is required.
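A minimal sketch of the round-robin idea, with hypothetical names (the PR's actual structures in module/zfs/zfs_crrd.c differ): a fixed-size ring per resolution records the newest TXG seen in each time bucket, so the oldest history is overwritten first and older periods survive only at coarser resolutions.

```c
#include <stdint.h>
#include <string.h>

#define RRD_SLOTS 8 /* tiny for illustration; a real database is larger */

/* One ring per resolution (minutes, days, months). */
typedef struct {
	uint64_t time[RRD_SLOTS]; /* bucket start, seconds since epoch (UTC) */
	uint64_t txg[RRD_SLOTS];  /* last TXG synced in that bucket */
	uint64_t head;            /* index of the newest entry */
	uint64_t length;          /* number of valid entries */
} rrd_t;

static void
rrd_add(rrd_t *rrd, uint64_t bucket, uint64_t txg)
{
	if (rrd->length > 0 && rrd->time[rrd->head] == bucket) {
		rrd->txg[rrd->head] = txg; /* same bucket: keep newest TXG */
		return;
	}
	rrd->head = (rrd->head + 1) % RRD_SLOTS; /* overwrite the oldest */
	rrd->time[rrd->head] = bucket;
	rrd->txg[rrd->head] = txg;
	if (rrd->length < RRD_SLOTS)
		rrd->length++;
}

/* Newest TXG whose bucket is at or before the requested time, or 0. */
static uint64_t
rrd_query(const rrd_t *rrd, uint64_t when)
{
	uint64_t best = 0, best_time = 0;

	for (uint64_t i = 0; i < rrd->length; i++) {
		uint64_t idx = (rrd->head + RRD_SLOTS - i) % RRD_SLOTS;
		if (rrd->time[idx] <= when && rrd->time[idx] >= best_time) {
			best = rrd->txg[idx];
			best_time = rrd->time[idx];
		}
	}
	return (best);
}
```

Querying a date range for a scrub then reduces to two lookups: the TXG at the range start and the TXG at the range end.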

How Has This Been Tested?

  • Create a pool
  • Write data
  • Wait some time
  • Write data
  • Wait some time
  • Scrub different time ranges

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@oshogbo force-pushed the oshogbo/scrub_data_range branch 7 times, most recently from 2a20b11 to 364f813 on December 11, 2024 14:01
@amotin added the "Status: Code Review Needed" label Dec 11, 2024
@oshogbo force-pushed the oshogbo/scrub_data_range branch from 364f813 to 891c8f2 on December 11, 2024 15:50
@amotin (Member) commented Dec 12, 2024

It crashes on VERIFY(!dmu_objset_is_dirty(dp->dp_meta_objset, txg)).

@amotin (Member) commented Dec 12, 2024

This reminds me that we recently added ddp_class_start to the new dedup table entry format to be able to prune the DDT based on time. I wonder if we could have saved some space had we had this mechanism back then.

Comment on lines +8441 to +8452
ret = sscanf(timestr, "%4d-%2d-%2d %2d:%2d", &tm.tm_year, &tm.tm_mon,
&tm.tm_mday, &tm.tm_hour, &tm.tm_min);
if (ret < 3) {
fprintf(stderr, gettext("Failed to parse the date.\n"));
usage(B_FALSE);
}

/* Adjust struct tm: tm_year is years since 1900, tm_mon is 0-based. */
tm.tm_year -= 1900;
tm.tm_mon -= 1;

return (timegm(&tm));
Member:

I wonder if strptime() or something else specialized would be better.
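A sketch of what the strptime() variant might look like (illustrative names, not the PR's code; timegm() is nonstandard but available on Linux and FreeBSD, and strptime() is POSIX):

```c
#define _GNU_SOURCE /* strptime() and timegm() on glibc */
#include <string.h>
#include <time.h>

/*
 * Parse "YYYY-MM-DD" or "YYYY-MM-DD HH:MM" into UTC seconds.
 * Returns (time_t)-1 on parse failure.  Unlike the sscanf() version,
 * strptime() validates field ranges and fills struct tm directly,
 * so no manual -1900 / -1 adjustments are needed.
 */
static time_t
parse_date(const char *timestr)
{
	struct tm tm;

	memset(&tm, 0, sizeof (tm));
	if (strptime(timestr, "%Y-%m-%d %H:%M", &tm) == NULL) {
		/* Retry date-only; re-zero tm since the first try may
		 * have partially filled it. */
		memset(&tm, 0, sizeof (tm));
		if (strptime(timestr, "%Y-%m-%d", &tm) == NULL)
			return ((time_t)-1);
	}
	return (timegm(&tm));
}
```

A portable fallback for timegm() is temporarily setting TZ=UTC and calling mktime(), at the cost of mutating process state.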

module/zcommon/zfeature_common.c (outdated review comment, resolved)
Comment on lines +4826 to +4832
/* Load time log */
error = spa_load_txg_log_time(spa);
if (error != 0)
return (spa_vdev_err(rvd, VDEV_AUX_CORRUPT_DATA, EIO));
Member:

I think we could instead delete it and start from scratch. Not a big deal.

Contributor Author:

Hmm, maybe. But I feel that if we can't load part of it, then something funky is happening, and it should be investigated.

Member:

If the question is between inability to access the data at all vs. losing this data, I choose the first. I was thinking about the same for BRT on read-only imports, but haven't gotten to it yet.

module/zfs/spa.c (outdated review comment, resolved)
include/zfs_crrd.h (outdated review comment, resolved)
module/zfs/zfs_crrd.c (outdated review comment, resolved)
module/zfs/spa.c (outdated review comment, resolved)
@amotin added the "Status: Revision Needed" label Dec 12, 2024
@oshogbo force-pushed the oshogbo/scrub_data_range branch from 891c8f2 to ba5ee33 on January 31, 2025 10:31
@github-actions bot removed the "Status: Revision Needed" label Jan 31, 2025
@oshogbo force-pushed the oshogbo/scrub_data_range branch 8 times, most recently from 963a5a3 to 33a7c27 on January 31, 2025 14:08
cmd/zpool/zpool_main.c (outdated review comment, resolved)
module/zfs/spa.c (outdated):
Comment on lines 3392 to 3397
VERIFY0(spa_rrd_write(spa, tx, &spa->spa_txg_log_time.dbr_minutes,
DMU_POOL_TXG_LOG_TIME_MINUTES, insert));
VERIFY0(spa_rrd_write(spa, tx, &spa->spa_txg_log_time.dbr_days,
DMU_POOL_TXG_LOG_TIME_DAYS, insert));
VERIFY0(spa_rrd_write(spa, tx, &spa->spa_txg_log_time.dbr_days,
DMU_POOL_TXG_LOG_TIME_MONTHS, insert));
Contributor:

Would it be simpler to store a single unix timestamp for the log time, rather than discrete minutes/days/months?

Also, you're storing dbr_days into DMU_POOL_TXG_LOG_TIME_MONTHS. Typo?

Contributor Author:

Good catch!

The idea was to have different resolutions: the longer the ZFS pool runs, the coarser the resolution at which we keep historic TXGs.
We are also limited by the size of the ZAP entry.

Contributor:

I'm concerned you're going to run into timezone issues if you don't use Unix timestamps. If you write a timestamp at 10am EST in New York, and then later import the pool in San Francisco and write a timestamp at 10am PST, will there be two timestamps at "10am"?
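The concern can be demonstrated in a few lines (a standalone sketch, not PR code): the same "10am" wall-clock time converts to different Unix timestamps depending on the TZ in effect, which is exactly the ambiguity that storing UTC seconds avoids.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Interpret "2024-06-01 10:00" as local wall-clock time in the given zone. */
static time_t
wallclock_in_zone(const char *tz)
{
	struct tm tm;

	memset(&tm, 0, sizeof (tm));
	tm.tm_year = 2024 - 1900;
	tm.tm_mon = 5; /* June, 0-based */
	tm.tm_mday = 1;
	tm.tm_hour = 10;
	tm.tm_isdst = -1; /* let mktime decide DST */

	setenv("TZ", tz, 1);
	tzset();
	return (mktime(&tm)); /* result depends on TZ */
}
```

With fixed-offset POSIX zones, wallclock_in_zone("PST8") lands three hours later on the Unix timeline than wallclock_in_zone("EST5"), even though both say "10:00".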

Contributor Author:

Oh... You are right. My goal was to store time in UTC; zpool scrub converts the given time to UTC.
I was sure that gethrtime returned a UTC timestamp, but you are right, it doesn't.
I have to fix that.

Contributor Author @oshogbo commented Feb 3, 2025:

OK, I have double-checked, and I think that the implementation is correct.
The gethrestime_sec(void) function uses ktime_get_coarse_real_ts64(), which, from my understanding, returns UTC time.

module/zfs/zfs_crrd.c (outdated review comment, resolved)
include/zfs_crrd.h (outdated review comment, resolved)
module/zfs/spa.c (outdated review comment, resolved)
module/zfs/spa.c (outdated review comment, resolved)
This feature enables tracking of when TXGs are committed to disk,
providing an estimated timestamp for each TXG.

With this information, it becomes possible to perform scrubs based
on specific date ranges, improving the granularity of
data management and recovery operations.

Signed-off-by: Mariusz Zaborski <[email protected]>
@oshogbo force-pushed the oshogbo/scrub_data_range branch from 33a7c27 to 7797e3f on January 31, 2025 20:52
@tonyhutter (Contributor) commented:

Forgot to mention this earlier - can you add a test case to exercise zpool scrub -S|-E? Please include all weird edge cases, like invalid dates/ranges, setting timezones forward/backwards, and testing -S|-E against pools where the feature isn't enabled.

@oshogbo (Contributor Author) commented Feb 3, 2025

> Forgot to mention this earlier - can you add a test case to exercise zpool scrub -S|-E? Please include all weird edge cases, like invalid dates/ranges, setting timezones forward/backwards, and testing -S|-E against pools where the feature isn't enabled.

Unfortunately, I don't have an idea of how to add such a test: to exercise it, we would need to wait for the rrd to be created, which would make for a very long test. Do you have any suggestions?

@tonyhutter (Contributor) commented:

> This will create very long test. Do you have some suggestions?

The test case could temporarily set the system clock forward to simulate the passage of time.
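A hypothetical sketch of such a test case (not from the PR): jump the system clock forward between writes, then scrub the window. Setting the clock requires root, so that step is shown commented out; here we only compute a -S/-E window using GNU date's -d option.

```shell
#!/bin/sh
# Record the window start in UTC, matching the on-disk UTC timestamps.
NOW=$(date -u '+%Y-%m-%d %H:%M')

# date -s '2 hours'                          # jump the clock forward (root only)
# dd if=/dev/urandom of=/testpool/f2 bs=1M count=4
# date -s '2 hours ago'                      # restore the clock afterwards

# Compute the window end and show the scrub invocation under test.
LATER=$(date -u -d '2 hours' '+%Y-%m-%d %H:%M')
echo "zpool scrub -S '$NOW' -E '$LATER' testpool"
```

Edge cases from the review (invalid dates, reversed ranges, pools without the feature) would each be an expected-failure variant of the echoed command.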

if (!spa_writeable(spa)) {
return;
}
if (txg == spa->spa_last_noted_txg) {
Member:

How can this be true if it is called only once per TXG sync?

} dbrrd_rounding_t;

typedef struct {
hrtime_t rrdd_time;
Member:

I am not sure how compatible this is, but in general it should not be hrtime_t here. On Illumos and FreeBSD, gethrestime_sec() returns time_t, while on Linux it returns uint64_t. I think uint64_t would be better here in every respect.

{
rrd->rrd_head = ntohll(rrd->rrd_head);
rrd->rrd_tail = ntohll(rrd->rrd_tail);
rrd->rrd_length = ntohll(rrd->rrd_length);
Member:

While not incorrect, I can't recall any other place where ZFS stores things in network order. The DDT does it per-record; for example, see ddt_zap_decompress(). I don't insist, just proposing to think wider.

Also, it seems like this will be a single > 4KB ZAP entry. I hope the MOS ZAP has a sufficiently big block size to store it efficiently.


rrd_ntoh(&spa->spa_txg_log_time.dbr_minutes);
rrd_ntoh(&spa->spa_txg_log_time.dbr_days);
rrd_ntoh(&spa->spa_txg_log_time.dbr_months);
Member @amotin commented Feb 5, 2025:

ZAP has native support for byte swapping if you specify that you store 8-byte integers (same as for 2- and 4-byte). But instead you are storing arrays of 1-byte integers, creating a problem for yourself. An array of 1-byte integers makes sense for opaque data blobs, but integers should be stored as integers. Actually, since all fields in the RRD value are uint64_t, I think you could store it as an array of 8-byte integers and ZAP would do all the byte swapping for you. It would be much cleaner than your manual byte swapping!
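To illustrate the point with a toy sketch (hypothetical names, not ZFS code): when every field of a record is a uint64_t, the whole record can be viewed as an array of 8-byte integers, and one generic swap loop, which is essentially what ZAP performs automatically for entries stored with an 8-byte integer size, replaces the manual per-field ntohll() calls.

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang builtin; on other compilers use a shift-based bswap. */
#define BSWAP_64(x) __builtin_bswap64(x)

/* A record whose fields are all uint64_t, like the rrd header. */
typedef struct {
	uint64_t head;
	uint64_t tail;
	uint64_t length;
} record_t;

/*
 * One generic pass over the record viewed as uint64_t[].  Byte
 * swapping is an involution, so applying it twice restores the
 * original values; the storage layer only swaps when the on-disk
 * endianness differs from the host's.
 */
static void
record_bswap(record_t *r)
{
	uint64_t *v = (uint64_t *)r;

	for (size_t i = 0; i < sizeof (*r) / sizeof (uint64_t); i++)
		v[i] = BSWAP_64(v[i]);
}
```

The design win is that the swap logic no longer needs to know the record's field names at all; adding a field to the struct requires no change to the (de)serialization path.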

@amotin added the "Status: Revision Needed" label Feb 5, 2025
Labels: Status: Code Review Needed · Status: Revision Needed
Projects: None yet
3 participants