zvols: prevent overflow of minor device numbers #16006
currently, the linux kernel allows 2^20 minor devices per major device number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol itself, the other 15 for the first partitions of that zvol. as a result, only 2^16 such blocks are available for use. there are no checks in place to avoid overflowing into the major device number when more than 2^16 zvols are allocated (with volmode=dev or default). instead of ignoring this limit, which comes with all sorts of weird knock-on effects, detect this situation and simply fail allocating the zvol block device early on. without this safeguard, the kernel will reject the attempt to create an already existing block device, but ZFS doesn't handle this error and gets confused about which zvol occupies which minor slot, potentially resulting in kernel NULL derefs and other issues later on. Signed-off-by: Fabian Grünbichler <[email protected]>
FWIW - I'd still like the questions I raised in the Discussions thread answered (mostly to provide guidance for myself and other people in similar situations in the future - which kind of issues do you want to be reported privately, which are okay to file publicly by default, what's the process like in general, ..)
@Fabian-Gruenbichler I don't have any particular guidance for you regarding the discussions threads, but I will try to include this commit in the next 2.1.x and 2.2.x release.
thanks! maybe it's something that could be discussed at one of the leadership meetings? I wasn't sure if it's okay to just put stuff on the agenda there as a "sometimes drive-by contributor" ;) especially since the timeslots are almost always unattendable for me, so it feels a bit like dumping without contributing.
currently, the linux kernel allows 2^20 minor devices per major device number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol itself, the other 15 for the first partitions of that zvol. as a result, only 2^16 such blocks are available for use. there are no checks in place to avoid overflowing into the major device number when more than 2^16 zvols are allocated (with volmode=dev or default). instead of ignoring this limit, which comes with all sorts of weird knock-on effects, detect this situation and simply fail allocating the zvol block device early on. without this safeguard, the kernel will reject the attempt to create an already existing block device, but ZFS doesn't handle this error and gets confused about which zvol occupies which minor slot, potentially resulting in kernel NULL derefs and other issues later on. Reviewed-by: Tony Hutter <[email protected]> Reviewed by: Brian Behlendorf <[email protected]> Signed-off-by: Fabian Grünbichler <[email protected]> Closes #16006
Motivation and Context
Linux allows at most 2^20 (about 1 million) minor devices per major number. ZFS uses a single major number for all zvols (including snapshots, if they are exposed as block devices). ZFS also reserves 15 slots per zvol for exposing partitions of a zvol, so effectively ZFS can expose at most 2^16 zvols (or zvol snapshots) as block devices at a time.
This limit is not enforced in the ZFS module. As a result, the minor device number will overflow, and ZFS will attempt to register a second block device for an already in-use minor device number, which the kernel (rightfully) rejects. ZFS doesn't handle the resulting error either, corrupting its internal housekeeping.
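For readers who want to check the arithmetic, here is a minimal user-space sketch of the limit. The two constants are redefined locally with the values quoted in this PR, and `zvol_slot_count` is a hypothetical helper name, not something from the ZFS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in this PR, redefined so the sketch builds outside the kernel. */
#define MINORBITS       20  /* Linux: bits available for the minor number */
#define ZVOL_MINOR_BITS 4   /* ZFS: 2^4 minors per zvol (1 zvol + 15 partitions) */

static_assert(ZVOL_MINOR_BITS < MINORBITS, "per-zvol block must fit in minor space");

/* Hypothetical helper: number of zvol blocks that fit under one major number. */
static inline uint32_t
zvol_slot_count(void)
{
	return (UINT32_C(1) << (MINORBITS - ZVOL_MINOR_BITS));
}
```

Under these assumptions the helper yields 2^16 = 65536 slots, so valid zvol indices are 0 through 65535.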
There are three symptoms of this issue:
I stumbled upon this while investigating the behaviour described in #15904. I did initially try to report this privately both via the GH "Report a security vulnerability" feature, and by asking via a new Discussion thread. Since I received no ack on either channel, and given that this already requires permissions to allocate (a lot of) zvols to cause problems, I decided to publish it as a regular PR now.
Description
ZVOL_MINOR_BITS is 4 (first minor is used by the zvol itself, the other 15 by partitions depending on volmode)
MINORBITS in the Linux kernel is 20
In zvol_os_create_minor ( https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zvol_os.c#L1313-L1316 ), the next free index is assigned to the zvol. It's then shifted by 4 (to also reserve the 15 other slots for the partitions mentioned above).
These indices start at 0 and are released again when the zvol's block device goes away (destroy, or re- or de-initializing if volmode is changed, or at zpool export time).
The resulting minor device numbers are just the zvol's index shifted by 4, so index 0 maps to minor 0, index 1 to minor 16, index 2 to minor 32, and so on, with any gaps being recycled as soon as the next zvol bdev is initialized.
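The index-to-minor mapping described above can be sketched in user space as follows. This is only an illustration: `zvol_first_minor` is a hypothetical helper, and the constant is redefined locally with the value quoted in this PR.

```c
#include <assert.h>
#include <stdint.h>

#define ZVOL_MINOR_BITS 4  /* value quoted in this PR */

static_assert((1u << ZVOL_MINOR_BITS) == 16,
    "one minor for the zvol plus 15 for its partitions");

/* Hypothetical helper: first minor of the block reserved for a zvol index. */
static inline uint32_t
zvol_first_minor(uint32_t idx)
{
	return (idx << ZVOL_MINOR_BITS);
}
```

Consecutive indices are therefore 16 minors apart, which leaves room for the partitions of each zvol.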
A bit further down in the same function we have the following code:
zvol_major is the module parameter, defaulting to 230. minor is the idx derived minor device number from above. MKDEV just combines the two into a single int by shifting major by MINORBITS (20 atm) and OR-ing minor.
Since there are no safeguards implemented here, this means that if our index (which is basically just a counter of "currently 'mapped' zvols") shifted by 4 reaches 2^20 or more, OR-ing the minor value actually overflows into the part of the device number that represents the major device.
zvol_alloc itself extracts a minor again from this device number via masking, but in case of an overflow, this is not the original (too big) minor, but one colliding with an already existing zvol/block device ( https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zvol_os.c#L1221-L1224 ):
So both the zvol's first_minor (which is used for partition block device creation by the kernel, among other things) as well as the device name itself (zdXX) are wrong and collide with a different zvol.
The major part is discarded entirely and set to zvol_major in any case, undoing the spillage of the overflow and OR-ing:
zso->zvo_disk->major = zvol_major;
but the combined dev value is also stored:
zso->zvo_dev = dev;
This last assignment causes confusion when destroying the zvol that caused the overflow, because in zvol_os_free:
the wrong minor value is removed, removing the assignment of a different zvol that is still in use. The next zvol initialization that gets assigned this slot will again fail, even if not causing an overflow itself, since the index it will be assigned is already taken in practice.
Destroying a zvol might also lead to traces such as these being printed if the kernel is confused about the mapping of zvols to block devices:
Back to the original flow of creating a zvol block device: at the end of zvol_os_create_minor, the zvol is actually passed to the kernel for device creation:
In case of the overflow, this will result in an error like this:
since zd0 already exists and represents a different zvol than the one we are currently handling. This error is ignored by ZFS; in particular, the zvol->index->minor assignment done earlier is not removed again.
If the admin now notices something is fishy, and exports the pool, they might be greeted with messages like this:
or, if unlucky:
The latter causes the whole system to crash/become unresponsive (as expected from a kernel NULL pointer deref).
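To make the overflow concrete, here is a small user-space model of the device number arithmetic walked through in this section. The three macros-as-functions mirror the shape of the kernel's MKDEV, MAJOR and MINOR; `ZVOL_MAJOR_DEFAULT` and `index_released_on_free` are hypothetical names, and the latter assumes (as the description above implies) that the release path derives the index from the stored dev value by masking and shifting back.

```c
#include <assert.h>
#include <stdint.h>

#define MINORBITS          20
#define MINORMASK          ((UINT32_C(1) << MINORBITS) - 1)
#define ZVOL_MINOR_BITS    4
#define ZVOL_MAJOR_DEFAULT 230  /* default of the zvol_major parameter per this PR */

static_assert(ZVOL_MINOR_BITS < MINORBITS, "per-zvol block must fit in minor space");

/* Same shape as the kernel's MKDEV(): no range check on minor. */
static inline uint32_t
mkdev(uint32_t major, uint32_t minor)
{
	return ((major << MINORBITS) | minor);
}

/* Same shape as the kernel's MAJOR() and MINOR(). */
static inline uint32_t dev_major(uint32_t dev) { return (dev >> MINORBITS); }
static inline uint32_t dev_minor(uint32_t dev) { return (dev & MINORMASK); }

/* Assumed model of the release path: index recovered from the stored dev. */
static inline uint32_t
index_released_on_free(uint32_t idx)
{
	uint32_t dev = mkdev(ZVOL_MAJOR_DEFAULT, idx << ZVOL_MINOR_BITS);

	return (dev_minor(dev) >> ZVOL_MINOR_BITS);
}
```

In this model, index 65536 produces a device number whose major part reads 231 and whose minor part reads 0, so freeing that zvol releases index 0 rather than its own index. Indices below 65536 round-trip unchanged.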
PoC
Fairly easy reproducer:
volmode other than none
It seems fairly likely that snapshots (with snapdev) and/or zfs recv can also be used to trigger this. I have not investigated interactions with zvol_inhibit_dev or runtime changing of zvol_major. The latter (if the docs are right and it is indeed runtime-changeable) might cause additional issues.
It's a bit easier to see what's going on with a few additional debug prints like this:
Impact
Obviously the kind of impact this can have really depends on whether an attacker can cause zvol datasets to be created. If they can (especially if they are thin provisioned, i.e. without a (ref)reservation) it's really easy to create a lot of them without running afoul of quotas.
Properly handling this and allowing actually using more than 64k block devices backed by ZFS probably requires reworking the whole assignment of major/minor device numbers.
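As a rough illustration of the "fail early" approach this PR takes, a range check before deriving the minor could look like the sketch below. It is not the merged patch: `zvol_index_to_minor` is a hypothetical helper, the EINVAL return value is a placeholder chosen for illustration, and the real code may express the condition differently and must also release the allocated index on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MINORBITS       20
#define ZVOL_MINOR_BITS 4

static_assert(ZVOL_MINOR_BITS < MINORBITS, "per-zvol block must fit in minor space");

/*
 * Hypothetical helper: derive the first minor for a zvol index, refusing
 * indices whose minor would spill into the major part of the device number.
 * Returns 0 on success or a negative errno value (placeholder choice).
 */
static inline int
zvol_index_to_minor(uint32_t idx, uint32_t *minor_out)
{
	if (idx >= (UINT32_C(1) << (MINORBITS - ZVOL_MINOR_BITS)))
		return (-EINVAL);

	*minor_out = idx << ZVOL_MINOR_BITS;
	return (0);
}
```

The check only rejects the zvols that do not fit; lifting the 64k limit itself would still need the rework mentioned above.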
How Has This Been Tested?
See PoC above.