(probably already fixed) BUG: Bad page state #16594

Closed
snajpa opened this issue Oct 2, 2024 · 2 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@snajpa
Contributor

snajpa commented Oct 2, 2024

System information

Type                  Version/Name
Distribution Name     vpsAdminOS
Distribution Version  staging
Kernel Version        6.10.11
Architecture          x86_64
OpenZFS Version       b2ca510

Describe the problem you're observing

We got a crash a little under 24 hours after deploying. It's worth noting that we happened to pull from master just before b052035 was merged, but judging by the stack traces I don't think it's the same issue; I suspect there is a bug lurking in the new Direct IO paths. I'd be glad to be proven wrong (we're going to pull in the b052035 fix now).

Describe how to reproduce the problem

N/A at the moment

Include any warning/errors/backtraces from the system logs

BUG: Bad page state in process mongod  pfn:4bd34a4
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4bd34a4
flags: 0x6ffff8000002000(reserved|node=1|zone=2|lastcpupid=0x1ffff)
raw: 06ffff8000002000 ffffe47e6f4d2908 ffffe47e6f4d2908 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
CPU: 113 PID: 614202 Comm: mongod Not tainted 6.10.11 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7515/07PXPY, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.825/ct.9274/user-owned/lxc.payload.9274
Call Trace:
 <TASK>
 dump_stack_lvl+0x4f/0x70
 bad_page+0x70/0x100
 free_unref_page+0x2b5/0x4b0
 zfs_uio_free_dio_pages+0x62/0x110 [zfs]
 zfs_write+0xb46/0xd40 [zfs]
 zpl_iter_write+0x12c/0x1b0 [zfs]
 vfs_write+0x292/0x460
 ksys_write+0x6b/0xf0
 do_syscall_64+0x9a/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb7d14104bd
Code: bf 20 00 00 75 10 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae fc ff ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 f7 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffca958bb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007fb7d14104bd
RDX: 0000000000002000 RSI: 000000000b558000 RDI: 0000000000000006
RBP: 000000000b558000 R08: 0000000000040000 R09: 0000000000000000
R10: 00007ffca958bc10 R11: 0000000000000293 R12: 0000000000002000
R13: 000000000b558000 R14: 0000000000000000 R15: 00007ffca958bdd0
 </TASK>
Disabling lock debugging due to kernel taint
------------[ cut here ]------------
WARNING: CPU: 113 PID: 614202 at mm/gup.c:142 try_grab_folio+0x6e/0x90
CPU: 113 PID: 614202 Comm: mongod Tainted: G    B              6.10.11 #1-vpsAdminOS
Hardware name: Dell Inc. PowerEdge R7515/07PXPY, BIOS 2.14.1 12/17/2023
In memory cgroup /osctl/pool.tank/group.default/user.825/ct.9274/user-owned/lxc.payload.9274
RIP: 0010:try_grab_folio+0x6e/0x90
Code: 77 34 f0 01 77 5c 48 8b 07 48 63 d6 be 23 00 00 00 48 c1 e8 3a 48 8b 3c c5 a0 43 82 b4 e8 fa 9a fe ff 31 c0 e9 fe d2 a0 00 90 <0f> 0b 90 b8 f4 ff ff ff e9 f0 d2 a0 00 89 f0 c1 e0 0a f0 01 47 34
RSP: 0018:ffffa6ce696637e0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffe47e6f4d2900 RCX: 8000000000000225
RDX: 0000000000210002 RSI: 0000000000000001 RDI: ffffe47e6f4d2900
RBP: 0000000000210002 R08: ffff8c3750f09ac0 R09: ffff8c5c582a1a40
R10: 0000000000000000 R11: ffff8d364676500c R12: ffff8caef8f764d0
R13: 000000000b558000 R14: 8000004bd34a4225 R15: 0000000000000002
FS:  00007fb7d439ab80(0000) GS:ffff8d5abec40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000088b090 CR3: 0000009f41518006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x7c/0x120
 ? try_grab_folio+0x6e/0x90
 ? report_bug+0x160/0x190
 ? handle_bug+0x3b/0x70
 ? exc_invalid_op+0x13/0x70
 ? asm_exc_invalid_op+0x16/0x20
 ? try_grab_folio+0x6e/0x90
 follow_page_pte+0x11a/0x620
 follow_page_mask+0x1d8/0xcc0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? check_vma_flags+0xaa/0x150
 __get_user_pages+0x135/0x740
 __gup_longterm_locked+0xd6/0xd20
 gup_fast_fallback+0x136/0x1000
 get_user_pages_fast+0x43/0x60
 __iov_iter_get_pages_alloc+0xda/0x4f0
 iov_iter_get_pages2+0x19/0x30
 zfs_uio_get_dio_pages_alloc+0xe3/0x4b0 [zfs]
 zfs_setup_direct+0xc0/0x160 [zfs]
 zfs_write+0x231/0xd40 [zfs]
 zpl_iter_write+0x12c/0x1b0 [zfs]
 vfs_write+0x292/0x460
 ksys_write+0x6b/0xf0
 do_syscall_64+0x9a/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb7d14104bd
Code: bf 20 00 00 75 10 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae fc ff ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 f7 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffca958bb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007fb7d14104bd
RDX: 0000000000002000 RSI: 000000000b558000 RDI: 0000000000000006
RBP: 000000000b558000 R08: 0000000000040000 R09: 0000000000000000
R10: 00007ffca958bc10 R11: 0000000000000293 R12: 0000000000002000
R13: 000000000b558000 R14: 0000000000002000 R15: 00007ffca958bdd0
 </TASK>
---[ end trace 0000000000000000 ]---
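
For anyone triaging a similar trace, here is a minimal Python sketch (not part of OpenZFS or the kernel) that lists the reliable frames tagged with a given module, which makes it quicker to see where a module such as zfs appears in a call trace. The names `FRAME_RE` and `module_frames` are hypothetical, and the regex assumes the `func+0xoff/0xsize [module]` frame format seen in the log above. A frame showing up here only tells you where the problem was detected, not which code caused it.

```python
import re

# Matches frames such as " zfs_write+0xb46/0xd40 [zfs]". A leading "? " marks
# a frame the unwinder considers unreliable; it is captured separately.
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def module_frames(log_text, module="zfs"):
    """Return function names of reliable frames that belong to `module`."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        if m and not m.group("unreliable") and m.group("module") == module:
            frames.append(m.group("func"))
    return frames


# A few frames copied from the first call trace in this report.
sample = """\
 free_unref_page+0x2b5/0x4b0
 zfs_uio_free_dio_pages+0x62/0x110 [zfs]
 zfs_write+0xb46/0xd40 [zfs]
 zpl_iter_write+0x12c/0x1b0 [zfs]
 vfs_write+0x292/0x460
"""

print(module_frames(sample))
```

Frames are returned in the order they appear in the log, so the first entry is the innermost frame of that module.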


@snajpa snajpa added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 2, 2024
@snajpa
Contributor Author

snajpa commented Oct 2, 2024

Looking at it a bit more, I think it does look like a problem that b052035 would fix. It was just unfortunate that the bad page state was hit while freeing pages in the Direct IO path, which made Direct IO look like the first culprit.

@snajpa snajpa changed the title from "(likely) Direct IO: BUG: Bad page state" to "(probably already fixed) BUG: Bad page state" Oct 2, 2024
@snajpa
Contributor Author

snajpa commented Oct 4, 2024

So far (judging by the uptime of the node) it does look like b052035 is the fix. Sorry for pointing the finger at the Direct IO implementation; it's probably completely innocent here.
