Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs promote should delete livelist of origin #10652

Merged
merged 1 commit into from
Jul 31, 2020
Merged

Conversation

ahrens
Copy link
Member

@ahrens ahrens commented Jul 31, 2020

Motivation and Context

When a clone is promoted, its livelist is no longer accurate, so it is
discarded. If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.

Consider a pool with:

  • Filesystem A
  • Clone B, a clone of A
  • Clone C, a clone of B

If we promote C, it discards C's livelist. It should discard B's
livelist, but that is not happening. The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C. The incorrectly-freed blocks can be reallocated causing
checksum errors. And when C is destroyed it can double-free the
incorrectly-freed blocks.

Description

The problem is that we remove the livelist of origin_ds->ds_dir, but
the origin snapshot has already been moved to the promoted dsl_dir. So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed. As explained in a comment in the beginning
of dsl_dataset_promote_sync(), we need to use the saved odd for the
origin's dsl_dir.

Note: this problem is not present in any release (including 0.8.x), because the livelist feature is only present in the main branch.

How Has This Been Tested?

new test case

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

When a clone is promoted, its livelist is no longer accurate, so it is
discarded.  If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.

Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B

If we promote C, it discards C's livelist.  It should discard B's
livelist, but that is not happening.  The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C.  The incorrectly-freed blocks can be reallocated causing
checksum errors.  And when C is destroyed it can double-free the
incorrectly-freed blocks.

The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir.  So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed.  As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.

Signed-off-by: Matthew Ahrens <[email protected]>
@ahrens ahrens added the Status: Code Review Needed Ready for review and testing label Jul 31, 2020
@ahrens ahrens changed the title zfs promote does not delete livelist of origin zfs promote should delete livelist of origin Jul 31, 2020
@codecov
Copy link

codecov bot commented Jul 31, 2020

Codecov Report

Merging #10652 into master will decrease coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10652      +/-   ##
==========================================
- Coverage   79.52%   79.49%   -0.04%     
==========================================
  Files         394      394              
  Lines      124631   124631              
==========================================
- Hits        99118    99076      -42     
- Misses      25513    25555      +42     
Flag Coverage Δ
#kernel 80.23% <100.00%> (-0.01%) ⬇️
#user 65.05% <0.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
module/zfs/dsl_dataset.c 92.18% <100.00%> (-0.05%) ⬇️
module/lua/lmem.c 83.33% <0.00%> (-4.17%) ⬇️
module/zcommon/zfs_fletcher.c 75.65% <0.00%> (-2.64%) ⬇️
module/zfs/dsl_synctask.c 92.40% <0.00%> (-2.54%) ⬇️
module/icp/algs/modes/gcm.c 79.25% <0.00%> (-2.22%) ⬇️
cmd/zed/agents/fmd_api.c 88.61% <0.00%> (-1.78%) ⬇️
cmd/ztest/ztest.c 73.71% <0.00%> (-1.19%) ⬇️
cmd/zed/agents/zfs_diagnosis.c 77.55% <0.00%> (-1.17%) ⬇️
module/zfs/bptree.c 92.70% <0.00%> (-1.05%) ⬇️
module/zfs/vdev_indirect.c 73.66% <0.00%> (-1.00%) ⬇️
... and 49 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a15c6f3...850603d. Read the comment docs.

Copy link
Contributor

@shartse shartse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, what a long-lived, simple bug! Thank you for writing a new test to cover this as well.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 31, 2020
@behlendorf behlendorf merged commit 948423a into openzfs:master Jul 31, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
When a clone is promoted, its livelist is no longer accurate, so it is
discarded.  If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.

Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B

If we promote C, it discards C's livelist.  It should discard B's
livelist, but that is not happening.  The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C.  The incorrectly-freed blocks can be reallocated causing
checksum errors.  And when C is destroyed it can double-free the
incorrectly-freed blocks.

The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir.  So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed.  As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.

Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed by: Sara Hartse <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#10652
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
When a clone is promoted, its livelist is no longer accurate, so it is
discarded.  If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.

Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B

If we promote C, it discards C's livelist.  It should discard B's
livelist, but that is not happening.  The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C.  The incorrectly-freed blocks can be reallocated causing
checksum errors.  And when C is destroyed it can double-free the
incorrectly-freed blocks.

The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir.  So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed.  As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.

Reviewed-by: Pavel Zakharov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed by: Sara Hartse <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#10652
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants