zvols: prevent overflow of minor device numbers #16006

Merged (1 commit) on Mar 29, 2024

Fabian-Gruenbichler
Contributor

Motivation and Context

Linux allows at most 2^20 (about 1 million) minor devices per major number. ZFS uses a single major number for all zvols (including snapshots, if they are exposed as block devices). ZFS also reserves a block of 16 minors per zvol (one for the zvol itself plus 15 for its partitions), so effectively ZFS can expose at most 2^16 zvols (or zvol snapshots) as block devices at a time.
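
The arithmetic behind that limit can be sketched in a few lines of userspace C. This is only an illustration, not kernel code: the `DEMO_`-prefixed names are hypothetical stand-ins whose values mirror the kernel's MINORBITS (20) and ZFS's ZVOL_MINOR_BITS (4) as described in this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring MINORBITS and ZVOL_MINOR_BITS. */
#define	DEMO_MINORBITS		20
#define	DEMO_ZVOL_MINOR_BITS	4

static_assert(DEMO_ZVOL_MINOR_BITS < DEMO_MINORBITS,
    "a per-zvol block of minors must fit into the minor space");

/* Minor numbers available under one major number: 2^20. */
static inline uint32_t
demo_minors_per_major(void)
{
	return (UINT32_C(1) << DEMO_MINORBITS);
}

/* Minors reserved per zvol: 1 for the zvol itself + 15 for partitions. */
static inline uint32_t
demo_minors_per_zvol(void)
{
	return (UINT32_C(1) << DEMO_ZVOL_MINOR_BITS);
}

/* Number of zvols that fit under one major: 2^20 / 2^4 = 2^16. */
static inline uint32_t
demo_max_zvols(void)
{
	return (demo_minors_per_major() / demo_minors_per_zvol());
}
```

Under these assumptions `demo_max_zvols()` evaluates to 65536, which is the 2^16 limit mentioned above.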

This limit is not enforced in the ZFS module. As a result, the minor device number overflows and ZFS attempts to register a second block device for a minor device number that is already in use, which the kernel (rightfully) rejects. ZFS doesn't handle the resulting error either, which corrupts its internal bookkeeping.

There are three symptoms of this issue:

  • inability to use subsequent zvols as block devices after the limit has been exhausted
  • failure to create usable new zvols even after the situation has been remedied (zvol count < limit)
  • possibility of a kernel NULL pointer dereference and crash on pool export(!)

I stumbled upon this while investigating the behaviour described in #15904 . I did initially try to report this privately both via the GH "Report a security vulnerability" feature, and by asking via a new Discussion thread. Since I received no ack on either channel, and given that this already requires permissions to allocate (a lot of) zvols to cause problems, I decided to publish it as a regular PR now.

Description

ZVOL_MINOR_BITS is 4 (first minor is used by the zvol itself, the other 15 by partitions depending on volmode)
MINORBITS in the Linux kernel is 20

In zvol_os_create_minor ( https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zvol_os.c#L1313-L1316 ), the next free index is assigned to the zvol. It is then shifted left by 4 bits (to also reserve the 15 other slots for the partitions mentioned above):

	idx = ida_simple_get(&zvol_ida, 0, 0, kmem_flags_convert(KM_SLEEP));
	if (idx < 0)
		return (SET_ERROR(-idx));
	minor = idx << ZVOL_MINOR_BITS;

These indices:

  • start at 0
  • are registered when a zvol-backed block device is initialized by the zfs module
  • are freed/removed again in error handling or when the block device is torn down again (as a result of destroy, or re- or de-initializing if volmode is changed, or at zpool export time)
  • represent a single, unique zvol

The resulting minor device numbers are just the zvol's index shifted by 4, so

  • 0
  • 16
  • 32
  • 48
  • ..

with any gaps being recycled as soon as the next zvol bdev is initialized.
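
As a small illustration of this mapping, here is a userspace sketch of the index-to-minor conversion and its inverse. The function names and the `DEMO_ZVOL_MINOR_BITS` macro are hypothetical; only the shift by 4 comes from the code quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring ZVOL_MINOR_BITS. */
#define	DEMO_ZVOL_MINOR_BITS	4

/* First minor of the block of 16 minors reserved for zvol index idx. */
static inline uint32_t
demo_idx_to_minor(uint32_t idx)
{
	/* Keep the shift from dropping bits in this 32-bit demo. */
	assert(idx <= (UINT32_MAX >> DEMO_ZVOL_MINOR_BITS));
	return (idx << DEMO_ZVOL_MINOR_BITS);
}

/* Inverse: the zvol index owning a minor (the zvol or one of its partitions). */
static inline uint32_t
demo_minor_to_idx(uint32_t minor)
{
	return (minor >> DEMO_ZVOL_MINOR_BITS);
}
```

With this mapping, minors 32 through 47 all belong to index 2, and the last index that still fits into 20 minor bits is 65535 (minor 1048560).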

A bit further down in the same function we have the following code:

	zv = zvol_alloc(MKDEV(zvol_major, minor), name);

zvol_major is the module parameter, defaulting to 230. minor is the idx derived minor device number from above. MKDEV just combines the two into a single int by shifting major by MINORBITS (20 atm) and OR-ing minor.

Since there are no safeguards implemented here, this means that if our index (which is basically just a counter of "currently 'mapped' zvols") shifted by 4 is 2^20 or larger (i.e. the index itself is 2^16 or larger), the minor value no longer fits into MINORBITS, and OR-ing it in spills into the part of the device number that represents the major device.
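
The spill into the major bits can be reproduced in userspace. The sketch below assumes `DEMO_` macros shaped like the kernel's MKDEV/MAJOR/MINOR (shift by 20, mask with 2^20 - 1); the names and the helper `demo_dev_for_idx` are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring MINORBITS, MINORMASK, ZVOL_MINOR_BITS. */
#define	DEMO_MINORBITS		20
#define	DEMO_MINORMASK		((UINT32_C(1) << DEMO_MINORBITS) - 1)
#define	DEMO_ZVOL_MINOR_BITS	4

/* Same shape as the kernel macros: shift the major, OR in the minor. */
#define	DEMO_MKDEV(ma, mi)	\
	(((uint32_t)(ma) << DEMO_MINORBITS) | (uint32_t)(mi))
#define	DEMO_MAJOR(dev)		((uint32_t)(dev) >> DEMO_MINORBITS)
#define	DEMO_MINOR(dev)		((uint32_t)(dev) & DEMO_MINORMASK)

/* Device number built the unchecked way: no test that the minor fits. */
static uint32_t
demo_dev_for_idx(uint32_t major, uint32_t idx)
{
	/* The major itself must fit into the remaining 12 bits. */
	assert(major < (UINT32_C(1) << (32 - DEMO_MINORBITS)));
	return (DEMO_MKDEV(major, idx << DEMO_ZVOL_MINOR_BITS));
}
```

With major 230, index 65535 still decodes as (230, 1048560), while index 65536 decodes as (231, 0): the extra bit has been OR-ed into the major, and the minor has wrapped around to 0.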

zvol_alloc itself extracts a minor again from this device number via masking, but in case of an overflow, this is not the original (too big) minor, but one colliding with an already existing zvol/block device (https://github.com/openzfs/zfs/blob/master/module/os/linux/zfs/zvol_os.c#L1221-L1224):

	zso->zvo_disk->first_minor = (dev & MINORMASK);
	zso->zvo_disk->private_data = zv;
	snprintf(zso->zvo_disk->disk_name, DISK_NAME_LEN, "%s%d",
	    ZVOL_DEV_NAME, (dev & MINORMASK));

So both the zvol's first_minor (which is used for partition block device creation by the kernel, among other things) and the device name itself (zdXX) are wrong and collide with those of a different zvol.
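
The name collision follows directly from the masking. A minimal userspace sketch, assuming the device name prefix is "zd" (as seen in the kernel log further down) and using hypothetical `DEMO_`/`demo_` names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins mirroring the constants discussed above. */
#define	DEMO_MINORBITS		20
#define	DEMO_MINORMASK		((UINT32_C(1) << DEMO_MINORBITS) - 1)
#define	DEMO_ZVOL_MINOR_BITS	4
#define	DEMO_MAJOR_NR		230
#define	DEMO_NAME_LEN		32

/* Build the disk name the way zvol_alloc does: prefix + masked minor. */
static void
demo_disk_name(uint32_t idx, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	uint32_t dev = ((uint32_t)DEMO_MAJOR_NR << DEMO_MINORBITS) |
	    (idx << DEMO_ZVOL_MINOR_BITS);
	snprintf(buf, len, "zd%u", (unsigned int)(dev & DEMO_MINORMASK));
}

/* True if two zvol indices end up with the same block device name. */
static bool
demo_names_collide(uint32_t idx_a, uint32_t idx_b)
{
	char a[DEMO_NAME_LEN], b[DEMO_NAME_LEN];

	demo_disk_name(idx_a, a, sizeof (a));
	demo_disk_name(idx_b, b, sizeof (b));
	return (strcmp(a, b) == 0);
}
```

In this sketch indices 0 and 65536 both produce the name zd0, and indices 1 and 65537 both produce zd16, whereas indices below 2^16 never collide with each other.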

The major part is discarded entirely and set to zvol_major in any case, undoing the spillage of the overflow and OR-ing:

	zso->zvo_disk->major = zvol_major;

but the combined dev value is also stored:

	zso->zvo_dev = dev;

This last assignment causes confusion when destroying the zvol that caused the overflow, because in zvol_os_free:

	ida_simple_remove(&zvol_ida,
	    MINOR(zv->zv_zso->zvo_dev) >> ZVOL_MINOR_BITS);

the wrong index is released, removing the assignment of a different zvol that is still in use. The next zvol initialization that gets assigned this slot will again fail, even if it does not cause an overflow itself, since the index it is assigned is in practice already taken.
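
To make the bookkeeping confusion concrete, the following userspace sketch simulates it with a toy lowest-free-index allocator. The allocator is a hypothetical stand-in, not the kernel's IDA implementation; only the derivation of the released index (MINOR of the stored device number, shifted right by 4) follows the code quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins mirroring the constants discussed above. */
#define	DEMO_MINORBITS		20
#define	DEMO_MINORMASK		((UINT32_C(1) << DEMO_MINORBITS) - 1)
#define	DEMO_ZVOL_MINOR_BITS	4
#define	DEMO_MAJOR_NR		230
#define	DEMO_SLOTS		70000	/* more than 2^16 slots */

static bool demo_used[DEMO_SLOTS];
static int demo_lowest_free;	/* scan hint: no free index below this */

/* Hand out the lowest free index, like an ID allocator would. */
static int
demo_ida_get(void)
{
	for (int i = demo_lowest_free; i < DEMO_SLOTS; i++) {
		if (!demo_used[i]) {
			demo_used[i] = true;
			demo_lowest_free = i + 1;
			return (i);
		}
	}
	return (-1);
}

static void
demo_ida_remove(uint32_t idx)
{
	assert(idx < DEMO_SLOTS);
	demo_used[idx] = false;
	if ((int)idx < demo_lowest_free)
		demo_lowest_free = (int)idx;
}

/*
 * Allocate 2^16 + 1 indices, then tear down the last zvol the way the
 * unchecked code does. Returns the index that actually gets released.
 */
static uint32_t
demo_free_overflowed_zvol(void)
{
	int idx = -1;

	memset(demo_used, 0, sizeof (demo_used));
	demo_lowest_free = 0;
	for (int i = 0; i <= 65536; i++)
		idx = demo_ida_get();	/* the last call returns 65536 */

	uint32_t dev = ((uint32_t)DEMO_MAJOR_NR << DEMO_MINORBITS) |
	    ((uint32_t)idx << DEMO_ZVOL_MINOR_BITS);
	uint32_t freed = (dev & DEMO_MINORMASK) >> DEMO_ZVOL_MINOR_BITS;

	demo_ida_remove(freed);
	return (freed);
}
```

After `demo_free_overflowed_zvol()` returns, index 0 is marked free although the first zvol still uses it, index 65536 stays allocated forever, and the next `demo_ida_get()` hands out index 0 a second time.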

Destroying a zvol might also lead to traces such as these being printed if the kernel is confused about the mapping of zvols to block devices:

Feb 27 05:26:26 debian kernel: ------------[ cut here ]------------
Feb 27 05:26:26 debian kernel: WARNING: CPU: 4 PID: 1096 at block/genhd.c:621 del_gendisk+0x2ac/0x2f0
Feb 27 05:26:26 debian kernel: Modules linked in: zfs(POE) spl(OE) binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd ccp kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd virtio_console pcspkr joydev evdev virtio_balloon button sg serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse dm_mod loop efi_pstore configfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 bochs crc_t10dif crct10dif_generic drm_vram_helper sr_mod drm_kms_helper cdrom virtio_net net_failover virtio_scsi failover ata_generic drm_ttm_helper ttm ata_piix crct10dif_pclmul crct10dif_common crc32_pclmul libata drm uhci_hcd scsi_mod ehci_hcd psmouse scsi_common usbcore crc32c_intel i2c_piix4 virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev
Feb 27 05:26:26 debian kernel:  virtio virtio_ring usb_common floppy
Feb 27 05:26:26 debian kernel: CPU: 4 PID: 1096 Comm: spl_system_task Tainted: P           OE      6.1.0-17-amd64 #1  Debian 6.1.69-1
Feb 27 05:26:26 debian kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Feb 27 05:26:26 debian kernel: RIP: 0010:del_gendisk+0x2ac/0x2f0
Feb 27 05:26:26 debian kernel: Code: ff 48 89 ef 5b be 01 00 00 00 5d 41 5c 41 5d e9 3a 56 ff ff 48 8b 70 48 e9 11 ff ff ff f6 83 e8 01 00 00 02 0f 85 81 fd ff ff <0f> 0b 5b 5d 41 5c 41 5d e9 57 11 92 00 48 8b 43 40 48 c7 c6 9f 96
Feb 27 05:26:26 debian kernel: RSP: 0018:ffffb2ac80477db8 EFLAGS: 00010246
Feb 27 05:26:26 debian kernel: RAX: ffff944c80a36798 RBX: ffff944b3a047000 RCX: ffff944c008b3a28
Feb 27 05:26:26 debian kernel: RDX: 0000000080000000 RSI: 0000000000000246 RDI: ffff944b3a047000
Feb 27 05:26:26 debian kernel: RBP: ffff944c0f2336d8 R08: 0000000000000000 R09: 0000000000000000
Feb 27 05:26:26 debian kernel: R10: 0000000000000004 R11: 0000000000000001 R12: ffff944c10afba30
Feb 27 05:26:26 debian kernel: R13: ffff944c10afba20 R14: 0000000000000000 R15: ffff944c008b3a00
Feb 27 05:26:26 debian kernel: FS:  0000000000000000(0000) GS:ffff944d37d00000(0000) knlGS:0000000000000000
Feb 27 05:26:26 debian kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 27 05:26:26 debian kernel: CR2: 00007fd17e1a43d8 CR3: 00000001f7410000 CR4: 0000000000750ee0
Feb 27 05:26:26 debian kernel: PKRU: 55555554
Feb 27 05:26:26 debian kernel: Call Trace:
Feb 27 05:26:26 debian kernel:  <TASK>
Feb 27 05:26:26 debian kernel:  ? __warn+0x7d/0xc0
Feb 27 05:26:26 debian kernel:  ? del_gendisk+0x2ac/0x2f0
Feb 27 05:26:26 debian kernel:  ? report_bug+0xe2/0x150
Feb 27 05:26:26 debian kernel:  ? handle_bug+0x41/0x70
Feb 27 05:26:26 debian kernel:  ? exc_invalid_op+0x13/0x60
Feb 27 05:26:26 debian kernel:  ? asm_exc_invalid_op+0x16/0x20
Feb 27 05:26:26 debian kernel:  ? del_gendisk+0x2ac/0x2f0
Feb 27 05:26:26 debian kernel:  ? del_gendisk+0x17/0x2f0
Feb 27 05:26:26 debian kernel:  zvol_os_free+0x74/0x1d0 [zfs]
Feb 27 05:26:26 debian kernel:  taskq_thread+0x2ff/0x6c0 [spl]
Feb 27 05:26:26 debian kernel:  ? wake_up_q+0x90/0x90
Feb 27 05:26:26 debian kernel:  ? taskq_thread_spawn+0x60/0x60 [spl]
Feb 27 05:26:26 debian kernel:  kthread+0xda/0x100
Feb 27 05:26:26 debian kernel:  ? kthread_complete_and_exit+0x20/0x20
Feb 27 05:26:26 debian kernel:  ret_from_fork+0x22/0x30
Feb 27 05:26:26 debian kernel:  </TASK>
Feb 27 05:26:26 debian kernel: ---[ end trace 0000000000000000 ]---

Back to the original flow of creating a zvol block device: at the end of zvol_os_create_minor, the zvol is actually passed to the kernel for device creation:

#ifdef HAVE_ADD_DISK_RET
		error = add_disk(zv->zv_zso->zvo_disk);
#else
		add_disk(zv->zv_zso->zvo_disk);
#endif

In case of the overflow, this will result in an error like this:

Feb 27 05:28:08 debian kernel: sysfs: cannot create duplicate filename '/devices/virtual/block/zd0'
Feb 27 05:28:08 debian kernel: CPU: 3 PID: 201410 Comm: zfs Tainted: P        W  OE      6.1.0-17-amd64 #1  Debian 6.1.69-1
Feb 27 05:28:08 debian kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Feb 27 05:28:08 debian kernel: Call Trace:
Feb 27 05:28:08 debian kernel:  <TASK>
Feb 27 05:28:08 debian kernel:  dump_stack_lvl+0x44/0x5c
Feb 27 05:28:08 debian kernel:  sysfs_warn_dup.cold+0x17/0x23
Feb 27 05:28:08 debian kernel:  sysfs_create_dir_ns+0xca/0xe0
Feb 27 05:28:08 debian kernel:  kobject_add_internal+0xba/0x260
Feb 27 05:28:08 debian kernel:  kobject_add+0x9b/0xd0
Feb 27 05:28:08 debian kernel:  device_add+0xe0/0x8b0
Feb 27 05:28:08 debian kernel:  device_add_disk+0xd6/0x3c0
Feb 27 05:28:08 debian kernel:  zvol_os_create_minor+0xa39/0xe00 [zfs]
Feb 27 05:28:08 debian kernel:  ? zvol_find_by_name_hash+0x4b0/0x4b0 [zfs]
Feb 27 05:28:08 debian kernel:  dmu_objset_create+0xe0/0xf0 [zfs]
Feb 27 05:28:08 debian kernel:  ? zvol_find_by_name_hash+0x4b0/0x4b0 [zfs]
Feb 27 05:28:08 debian kernel:  zfs_ioc_create+0x14e/0x400 [zfs]
Feb 27 05:28:08 debian kernel:  zfsdev_ioctl_common+0x5c4/0xb10 [zfs]
Feb 27 05:28:08 debian kernel:  zfsdev_ioctl+0x4f/0xd0 [zfs]
Feb 27 05:28:08 debian kernel:  __x64_sys_ioctl+0x90/0xd0
Feb 27 05:28:08 debian kernel:  do_syscall_64+0x5b/0xc0
Feb 27 05:28:08 debian kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
Feb 27 05:28:08 debian kernel:  ? syscall_exit_to_user_mode+0x27/0x40
Feb 27 05:28:08 debian kernel:  ? do_syscall_64+0x67/0xc0
Feb 27 05:28:08 debian kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
Feb 27 05:28:08 debian kernel:  ? syscall_exit_to_user_mode+0x27/0x40
Feb 27 05:28:08 debian kernel:  ? do_syscall_64+0x67/0xc0
Feb 27 05:28:08 debian kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
Feb 27 05:28:08 debian kernel:  entry_SYSCALL_64_after_hwframe+0x64/0xce
Feb 27 05:28:08 debian kernel: RIP: 0033:0x7f921de12c5b
Feb 27 05:28:08 debian kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Feb 27 05:28:08 debian kernel: RSP: 002b:00007ffe61a3bba0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Feb 27 05:28:08 debian kernel: RAX: ffffffffffffffda RBX: 0000000000005a17 RCX: 00007f921de12c5b
Feb 27 05:28:08 debian kernel: RDX: 00007ffe61a3bc20 RSI: 0000000000005a17 RDI: 0000000000000004
Feb 27 05:28:08 debian kernel: RBP: 00007ffe61a3f200 R08: 0000000000000000 R09: 000055848e25a360
Feb 27 05:28:08 debian kernel: R10: 00007f921dd2c358 R11: 0000000000000246 R12: 00007ffe61a3bc20
Feb 27 05:28:08 debian kernel: R13: 0000000000005a17 R14: 000055848e251f00 R15: 0000000000000000
Feb 27 05:28:08 debian kernel:  </TASK>
Feb 27 05:28:08 debian kernel: kobject_add_internal failed for zd0 with -EEXIST, don't try to register things with the same name in the same directory.

This happens because zd0 already exists and represents a different zvol than the one we are currently handling. ZFS ignores this error; in particular, the zvol->index->minor assignment done earlier is not removed again.
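
For illustration only, this is the general shape of the rollback that is missing here: if registration fails, the previously claimed index is released again. Everything in this sketch is hypothetical (`demo_add_disk` merely imitates a registration call that rejects duplicate names); it is not the change made by this PR, which instead avoids the overflow in the first place.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define	DEMO_SLOTS	8

static bool demo_idx_used[DEMO_SLOTS];		/* index allocator state */
static bool demo_name_taken[DEMO_SLOTS];	/* names already registered */

/* Hypothetical stand-in for a registration call that rejects duplicates. */
static int
demo_add_disk(int name_id)
{
	assert(name_id >= 0 && name_id < DEMO_SLOTS);
	if (demo_name_taken[name_id])
		return (-EEXIST);
	demo_name_taken[name_id] = true;
	return (0);
}

/* Claim idx, try to register, and undo the claim if registration fails. */
static int
demo_create(int idx, int name_id)
{
	assert(idx >= 0 && idx < DEMO_SLOTS);
	demo_idx_used[idx] = true;

	int error = demo_add_disk(name_id);
	if (error != 0)
		demo_idx_used[idx] = false;	/* roll back the index claim */
	return (error);
}
```

In this toy model, a second `demo_create()` call that resolves to an already registered name fails with -EEXIST and leaves no stale index behind.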

If the admin now notices something is fishy, and exports the pool, they might be greeted with messages like this:

Feb 27 07:18:00 debian kernel: ------------[ cut here ]------------
Feb 27 07:18:00 debian kernel: ida_free called for id=0 which is not allocated.
Feb 27 07:18:00 debian kernel: WARNING: CPU: 0 PID: 211633 at lib/idr.c:525 ida_free+0x127/0x130
Feb 27 07:18:00 debian kernel: Modules linked in: zfs(POE) spl(OE) binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd ccp kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd virtio_console pcspkr joydev evdev virtio_balloon button sg serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse dm_mod loop efi_pstore configfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 bochs crc_t10dif crct10dif_generic drm_vram_helper sr_mod drm_kms_helper cdrom virtio_net net_failover virtio_scsi failover ata_generic drm_ttm_helper ttm ata_piix crct10dif_pclmul crct10dif_common crc32_pclmul libata drm uhci_hcd scsi_mod ehci_hcd psmouse scsi_common usbcore crc32c_intel i2c_piix4 virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev
Feb 27 07:18:00 debian kernel:  virtio virtio_ring usb_common floppy
Feb 27 07:18:00 debian kernel: CPU: 0 PID: 211633 Comm: spl_system_task Tainted: P        W  OE      6.1.0-17-amd64 #1  Debian 6.1.69-1
Feb 27 07:18:00 debian kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Feb 27 07:18:00 debian kernel: RIP: 0010:ida_free+0x127/0x130
Feb 27 07:18:00 debian kernel: Code: 0f ad 8f ff 31 f6 48 89 e7 e8 45 df 01 00 eb 98 48 8b 3c 24 4c 89 e6 e8 07 3c 08 00 89 de 48 c7 c7 60 95 a0 ac e8 59 80 6e ff <0f> 0b eb 86 e8 90 20 07 00 41 57 41 56 41 55 41 54 41 89 f4 55 41
Feb 27 07:18:00 debian kernel: RSP: 0018:ffffb2ac8d13bd78 EFLAGS: 00010286
Feb 27 07:18:00 debian kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
Feb 27 07:18:00 debian kernel: RDX: ffff944d37c203a8 RSI: 0000000000000001 RDI: ffff944d37c203a0
Feb 27 07:18:00 debian kernel: RBP: ffff944c02330e00 R08: 0000000000000000 R09: ffffb2ac8d13bbf0
Feb 27 07:18:00 debian kernel: R10: 0000000000000003 R11: ffffffffad0d4428 R12: 0000000000000246
Feb 27 07:18:00 debian kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff944c008b3a00
Feb 27 07:18:00 debian kernel: FS:  0000000000000000(0000) GS:ffff944d37c00000(0000) knlGS:0000000000000000
Feb 27 07:18:00 debian kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 27 07:18:00 debian kernel: CR2: 000056357bba0808 CR3: 00000001f7410000 CR4: 0000000000750ef0
Feb 27 07:18:00 debian kernel: PKRU: 55555554
Feb 27 07:18:00 debian kernel: Call Trace:
Feb 27 07:18:00 debian kernel:  <TASK>
Feb 27 07:18:00 debian kernel:  ? __warn+0x7d/0xc0
Feb 27 07:18:00 debian kernel:  ? ida_free+0x127/0x130
Feb 27 07:18:00 debian kernel:  ? report_bug+0xe2/0x150
Feb 27 07:18:00 debian kernel:  ? handle_bug+0x41/0x70
Feb 27 07:18:00 debian kernel:  ? exc_invalid_op+0x13/0x60
Feb 27 07:18:00 debian kernel:  ? asm_exc_invalid_op+0x16/0x20
Feb 27 07:18:00 debian kernel:  ? ida_free+0x127/0x130
Feb 27 07:18:00 debian kernel:  ? ida_free+0x127/0x130
Feb 27 07:18:00 debian kernel:  zvol_os_free+0xa9/0x1d0 [zfs]
Feb 27 07:18:00 debian kernel:  taskq_thread+0x2ff/0x6c0 [spl]
Feb 27 07:18:00 debian kernel:  ? wake_up_q+0x90/0x90
Feb 27 07:18:00 debian kernel:  ? taskq_thread_spawn+0x60/0x60 [spl]
Feb 27 07:18:00 debian kernel:  kthread+0xda/0x100
Feb 27 07:18:00 debian kernel:  ? kthread_complete_and_exit+0x20/0x20
Feb 27 07:18:00 debian kernel:  ret_from_fork+0x22/0x30
Feb 27 07:18:00 debian kernel:  </TASK>
Feb 27 07:18:00 debian kernel: ---[ end trace 0000000000000000 ]---

or, if unlucky:

Feb 27 07:45:43 debian kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Feb 27 07:45:43 debian kernel: #PF: supervisor read access in kernel mode
Feb 27 07:45:43 debian kernel: #PF: error_code(0x0000) - not-present page
Feb 27 07:45:43 debian kernel: PGD 0 P4D 0
Feb 27 07:45:43 debian kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Feb 27 07:45:43 debian kernel: CPU: 1 PID: 745726 Comm: spl_system_task Tainted: P        W  OE      6.1.0-17-amd64 #1  Debian 6.1.69-1
Feb 27 07:45:43 debian kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Feb 27 07:45:43 debian kernel: RIP: 0010:ida_free+0xd0/0x130
Feb 27 07:45:43 debian kernel: Code: 8b 3c 24 4c 89 e6 e8 6f 3c 08 00 48 8b 44 24 38 65 48 2b 04 25 28 00 00 00 75 6a 48 83 c4 40 5b 5d 41 5c 41 5d e9 f0 f3 44 00 <4c> 0f a3 28 73 37 4c 0f b3 28 31 f6 48 89 e7 e8 4c c9 01 00 be 00
Feb 27 07:45:43 debian kernel: RSP: 0018:ffffb2ac8ebebd78 EFLAGS: 00010046
Feb 27 07:45:43 debian kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
Feb 27 07:45:43 debian kernel: RDX: 0000000000000000 RSI: ffff944c71acb908 RDI: ffffb2ac8ebebd78
Feb 27 07:45:43 debian kernel: RBP: 0000000000000000 R08: ffffe34448072dc8 R09: ffffe34447f20280
Feb 27 07:45:43 debian kernel: R10: 00000000000379c0 R11: ffff944d3ffd5000 R12: 0000000000000206
Feb 27 07:45:43 debian kernel: R13: 0000000000000003 R14: 00000000000003f8 R15: ffff944c008b3a00
Feb 27 07:45:43 debian kernel: FS:  0000000000000000(0000) GS:ffff944d37c40000(0000) knlGS:0000000000000000
Feb 27 07:45:43 debian kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 27 07:45:43 debian kernel: CR2: 0000000000000000 CR3: 000000011240a000 CR4: 0000000000750ee0
Feb 27 07:45:43 debian kernel: PKRU: 55555554
Feb 27 07:45:43 debian kernel: Call Trace:
Feb 27 07:45:43 debian kernel:  <TASK>
Feb 27 07:45:43 debian kernel:  ? __die_body.cold+0x1a/0x1f
Feb 27 07:45:43 debian kernel:  ? page_fault_oops+0xd2/0x2b0
Feb 27 07:45:43 debian kernel:  ? exc_page_fault+0x70/0x170
Feb 27 07:45:43 debian kernel:  ? asm_exc_page_fault+0x22/0x30
Feb 27 07:45:43 debian kernel:  ? ida_free+0xd0/0x130
Feb 27 07:45:43 debian kernel:  ? ida_free+0x75/0x130
Feb 27 07:45:43 debian kernel:  zvol_os_free+0xa9/0x1d0 [zfs]
Feb 27 07:45:43 debian kernel:  taskq_thread+0x2ff/0x6c0 [spl]
Feb 27 07:45:43 debian kernel:  ? wake_up_q+0x90/0x90
Feb 27 07:45:43 debian kernel:  ? taskq_thread_spawn+0x60/0x60 [spl]
Feb 27 07:45:43 debian kernel:  kthread+0xda/0x100
Feb 27 07:45:43 debian kernel:  ? kthread_complete_and_exit+0x20/0x20
Feb 27 07:45:43 debian kernel:  ret_from_fork+0x22/0x30
Feb 27 07:45:43 debian kernel:  </TASK>
Feb 27 07:45:43 debian kernel: Modules linked in: zfs(POE) spl(OE) binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd ccp kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd virtio_console pcspkr joydev evdev virtio_balloon button sg serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse dm_mod loop efi_pstore configfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 bochs crc_t10dif crct10dif_generic drm_vram_helper sr_mod drm_kms_helper cdrom virtio_net net_failover virtio_scsi failover ata_generic drm_ttm_helper ttm ata_piix crct10dif_pclmul crct10dif_common crc32_pclmul libata drm uhci_hcd scsi_mod ehci_hcd psmouse scsi_common usbcore crc32c_intel i2c_piix4 virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev
Feb 27 07:45:43 debian kernel:  virtio virtio_ring usb_common floppy
Feb 27 07:45:43 debian kernel: CR2: 0000000000000000
Feb 27 07:45:43 debian kernel: ---[ end trace 0000000000000000 ]---
Feb 27 07:45:43 debian kernel: RIP: 0010:ida_free+0xd0/0x130
Feb 27 07:45:43 debian kernel: Code: 8b 3c 24 4c 89 e6 e8 6f 3c 08 00 48 8b 44 24 38 65 48 2b 04 25 28 00 00 00 75 6a 48 83 c4 40 5b 5d 41 5c 41 5d e9 f0 f3 44 00 <4c> 0f a3 28 73 37 4c 0f b3 28 31 f6 48 89 e7 e8 4c c9 01 00 be 00
Feb 27 07:45:43 debian kernel: RSP: 0018:ffffb2ac8ebebd78 EFLAGS: 00010046
Feb 27 07:45:43 debian kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
Feb 27 07:45:43 debian kernel: RDX: 0000000000000000 RSI: ffff944c71acb908 RDI: ffffb2ac8ebebd78
Feb 27 07:45:43 debian kernel: RBP: 0000000000000000 R08: ffffe34448072dc8 R09: ffffe34447f20280
Feb 27 07:45:43 debian kernel: R10: 00000000000379c0 R11: ffff944d3ffd5000 R12: 0000000000000206
Feb 27 07:45:43 debian kernel: R13: 0000000000000003 R14: 00000000000003f8 R15: ffff944c008b3a00
Feb 27 07:45:43 debian kernel: FS:  0000000000000000(0000) GS:ffff944d37c40000(0000) knlGS:0000000000000000
Feb 27 07:45:43 debian kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 27 07:45:43 debian kernel: CR2: 0000000000000000 CR3: 000000011240a000 CR4: 0000000000750ee0
Feb 27 07:45:43 debian kernel: PKRU: 55555554
Feb 27 07:45:43 debian kernel: note: spl_system_task[745726] exited with irqs disabled
Feb 27 07:45:43 debian kernel: note: spl_system_task[745726] exited with preempt_count 1

The latter causes the whole system to crash or become unresponsive (as expected from a kernel NULL pointer dereference).

PoC

Fairly easy reproducer:

  • create more than 2^16 zvols with volmode other than none
  • observe the errors and missing zvol block devices

It seems fairly likely that snapshots (with snapdev) and/or zfs recv can also be used to trigger this. I have not investigated interactions with zvol_inhibit_dev or with changing zvol_major at runtime. The latter (if the docs are right and it is indeed changeable at runtime) might cause additional issues.

It's a bit easier to see what's going on with a few additional debug prints like this:

diff --git a/module/os/linux/zfs/zvol_os.c b/module/os/linux/zfs/zvol_os.c
index 8d5d1f06f..efb141ca2 100644
--- a/module/os/linux/zfs/zvol_os.c
+++ b/module/os/linux/zfs/zvol_os.c
@@ -1222,6 +1222,7 @@ zvol_alloc(dev_t dev, const char *name)
 	zso->zvo_disk->private_data = zv;
 	snprintf(zso->zvo_disk->disk_name, DISK_NAME_LEN, "%s%d",
 	    ZVOL_DEV_NAME, (dev & MINORMASK));
+	zfs_dbgmsg("zvol_alloc %s %s!", name, zso->zvo_disk->disk_name);
 
 	return (zv);
 
@@ -1307,16 +1308,29 @@ zvol_os_create_minor(const char *name)
 	uint64_t volthreading;
 	bool replayed_zil = B_FALSE;
 
+	zfs_dbgmsg("create minor: %s", name);
+
 	if (zvol_inhibit_dev)
 		return (0);
 
 	idx = ida_simple_get(&zvol_ida, 0, 0, kmem_flags_convert(KM_SLEEP));
 	if (idx < 0)
 		return (SET_ERROR(-idx));
+	zfs_dbgmsg("create minor: %s, idx %d", name, idx);
 	minor = idx << ZVOL_MINOR_BITS;
+	zfs_dbgmsg("create minor: %s, minor %u", name, minor);
+	if (MINOR(minor) != minor) {
+		/* too many zvols can cause an overflow */
+		zfs_dbgmsg("create minor OVERFLOW ERROR: %s, minor %u/%u", name, minor, MINOR(minor));
+		/*
+		ida_simple_remove(&zvol_ida, idx);
+		return (SET_ERROR(EINVAL));
+		*/
+	}
 
 	zv = zvol_find_by_name_hash(name, hash, RW_NONE);
 	if (zv) {
+		zfs_dbgmsg("create minor %s already exists!", name);
 		ASSERT(MUTEX_HELD(&zv->zv_state_lock));
 		mutex_exit(&zv->zv_state_lock);
 		ida_simple_remove(&zvol_ida, idx);
@@ -1337,8 +1351,10 @@ zvol_os_create_minor(const char *name)
 	if (error)
 		goto out_dmu_objset_disown;
 
+	zfs_dbgmsg("create minor %s allocating zvol!", name);
 	zv = zvol_alloc(MKDEV(zvol_major, minor), name);
 	if (zv == NULL) {
+		zfs_dbgmsg("create minor %s allocating zvol FAILED!", name);
 		error = SET_ERROR(EAGAIN);
 		goto out_dmu_objset_disown;
 	}
@@ -1488,15 +1504,18 @@ out_doi:
 	 * directly as well.
 	 */
 	if (error == 0) {
+		zfs_dbgmsg("creating minor %s OK, adding disk!", name);
 		rw_enter(&zvol_state_lock, RW_WRITER);
 		zvol_insert(zv);
 		rw_exit(&zvol_state_lock);
 #ifdef HAVE_ADD_DISK_RET
 		error = add_disk(zv->zv_zso->zvo_disk);
+		zfs_dbgmsg("creating minor %s add_disk returned %d!", name, error);
 #else
 		add_disk(zv->zv_zso->zvo_disk);
 #endif
 	} else {
+		zfs_dbgmsg("creating minor %s failed, error: %d!", name, error);
 		ida_simple_remove(&zvol_ida, idx);
 	}
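
The commented-out lines in the overflow check of the debug patch hint at what a guard could look like: refuse the index when the shifted value no longer fits into the minor bits. Below is a minimal userspace sketch of such a check, assuming hypothetical `DEMO_` constants and a hypothetical helper `demo_checked_minor`; it computes in 64 bits so the shift itself cannot wrap, and it is not a copy of the merged change.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring MINORBITS, MINORMASK, ZVOL_MINOR_BITS. */
#define	DEMO_MINORBITS		20
#define	DEMO_MINORMASK		((UINT64_C(1) << DEMO_MINORBITS) - 1)
#define	DEMO_ZVOL_MINOR_BITS	4

static_assert(DEMO_ZVOL_MINOR_BITS < DEMO_MINORBITS,
    "a per-zvol block of minors must fit into the minor space");

/*
 * Returns the minor for idx, or -EINVAL if the shifted index would not fit
 * into the minor bits (i.e. would spill into the major part).
 */
static int64_t
demo_checked_minor(uint32_t idx)
{
	uint64_t minor = (uint64_t)idx << DEMO_ZVOL_MINOR_BITS;

	if (minor > DEMO_MINORMASK)
		return (-EINVAL);
	return ((int64_t)minor);
}
```

A caller would release the index again and fail the block device creation when the helper returns a negative value, so that no colliding device is ever handed to the kernel.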

Impact

Obviously the kind of impact this can have really depends on whether an attacker can cause zvol datasets to be created. If they can (especially if the zvols are thin provisioned, i.e. without a (ref)reservation), it's really easy to create a lot of them without running afoul of quotas.

Properly handling this and allowing actually using more than 64k block devices backed by ZFS probably requires reworking the whole assignment of major/minor device numbers.

How Has This Been Tested?

See PoC above.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

currently, the linux kernel allows 2^20 minor devices per major device
number.  ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol
itself, the other 15 for the first partitions of that zvol. as a result,
only 2^16 such blocks are available for use.

there are no checks in place to avoid overflowing into the major device
number when more than 2^16 zvols are allocated (with volmode=dev or
default). instead of ignoring this limit, which comes with all sorts of
weird knock-on effects, detect this situation and simply fail allocating
the zvol block device early on.

without this safeguard, the kernel will reject the attempt to create an
already existing block device, but ZFS doesn't handle this error and
gets confused about which zvol occupies which minor slot, potentially
resulting in kernel NULL derefs and other issues later on.

Signed-off-by: Fabian Grünbichler <[email protected]>
@Fabian-Gruenbichler
Contributor Author

FWIW - I'd still like the questions I raised in the Discussions thread answered (mostly to provide guidance for myself and other people in similar situations in the future - which kind of issues do you want to be reported privately, which are okay to file publicly by default, what's the process like in general, ..)

@tonyhutter
Contributor

@Fabian-Gruenbichler I don't have any particular guidance for you regarding the discussions threads, but I will try to include this commit in the next 2.1.x and 2.2.x release.

@Fabian-Gruenbichler
Contributor Author

@Fabian-Gruenbichler I don't have any particular guidance for you regarding the discussions threads, but I will try to include this commit in the next 2.1.x and 2.2.x release.

thanks!

maybe it's something that could be discussed at one of the leadership meetings? I wasn't sure if it's okay to just put stuff on the agenda there as a "sometimes drive-by contributor" ;) especially since the timeslots are almost always unattendable for me, so it feels a bit like dumping without contributing.

behlendorf added the "Status: Accepted" label (ready to integrate: reviewed, tested) on Mar 29, 2024
behlendorf merged commit c0aab8b into openzfs:master on Mar 29, 2024
23 of 25 checks passed
tonyhutter pushed a commit that referenced this pull request May 2, 2024
Reviewed-by: Tony Hutter <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Fabian Grünbichler <[email protected]>
Closes #16006
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024