Stacked Borrows violation in macOS RwLock #121626
Looking at the code, what seems to be happening is that on one iteration through the loop we create a shared reference here: rust/library/std/src/sys/locks/rwlock/queue.rs, lines 348 to 350 in b58f647.
I am surprised that the diagnostic says "created by a SharedReadOnly retag at offsets [0x0..0x8]". This type does have interior mutability, so parts of the retag should be SharedReadWrite. This is the type in question: rust/library/std/src/sys/locks/rwlock/queue.rs, lines 179 to 187 in b58f647.
Everything except the miri-test-libstd job uses …
So with random layout, I guess the … Then apparently later it got invalidated at offsets 8 to 16 here.
So that pointer must be the second field. I think this is morally equivalent to the following:

```rust
use std::cell::UnsafeCell;

fn main() {
    let mut c = UnsafeCell::new(42);
    let ptr = c.get();
    c = UnsafeCell::new(13);
    unsafe { ptr.read(); }
}
```

This is rejected by Miri with Stacked Borrows but accepted with Tree Borrows. The "blessed" way to do this is to never mix direct accesses to a local with accesses through derived pointers. Fixing that in queue.rs shouldn't be too hard.
tree borrows: add a test to sb_fails This is something that happens in the wild (rust-lang/rust#121626), so TB accepting this is good. Let's make sure we notice if this ever changes.
From what @joboet said, the actual bug is that anyone is even still using the old reference when we do the write that does the invalidation. According to the Miri log, we have these events in this order:
1. A reference gets created and turned into a raw pointer (rust/library/std/src/sys/locks/rwlock/queue.rs, lines 348 to 350 in b58f647).
2. The memory that reference points to is written to via direct access to the local variable.
3. Someone uses the pointer derived from the reference.

We have a full backtrace only for the last event.
AFAIK T-libs doesn't use the I-prioritize label, so it's probably fine to remove it. @rustbot label -I-prioritize
I did manage to reproduce this on my machine once with that flag (with seed 2852 on Miri bbabee144ea12b210a7be9488f0eeed9efe3e24d), but when I run the testcase again with that seed it doesn't happen again. This is the test:

```rust
use rand::Rng;
use std::sync::mpsc::channel;
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    const N: u32 = 10;
    const M: usize = if cfg!(miri) { 100 } else { 1000 };
    let r = Arc::new(RwLock::new(()));
    let (tx, rx) = channel::<()>();
    for _ in 0..N {
        let tx = tx.clone();
        let r = r.clone();
        thread::spawn(move || {
            let mut rng = rand::thread_rng();
            for _ in 0..M {
                if rng.gen_bool(1.0 / (N as f64)) {
                    drop(r.write().unwrap());
                } else {
                    drop(r.read().unwrap());
                }
            }
            drop(tx);
        });
    }
    drop(tx);
    let _ = rx.recv();
}
```

Full command I used:
So... it uses actual randomness, and thus of course it does not reproduce even with the same seed. :/ (Disabling isolation means we give the program access to host randomness, which will be used to initialize the thread-local RNG.)
The backtrace was a bit different this time:
So the entry point of the bad use of the pointer is …
I finally found some seeds. :) 2998, 5308, 7680. On Miri bbabee144ea12b210a7be9488f0eeed9efe3e24d:
Turns out this diff fixes all 3 seeds that I found above:

```diff
--- a/library/std/src/sys/locks/rwlock/queue.rs
+++ b/library/std/src/sys/locks/rwlock/queue.rs
@@ -317,7 +317,7 @@ pub fn write(&self) {
     fn lock_contended(&self, write: bool) {
         let update = if write { write_lock } else { read_lock };
         let mut node = Node::new(write);
-        let mut state = self.state.load(Relaxed);
+        let mut state = self.state.load(Acquire);
         let mut count = 0;
         loop {
             if let Some(next) = update(state) {
```

Alternatively, this diff also does it:

```diff
--- a/library/std/src/sys/locks/rwlock/queue.rs
+++ b/library/std/src/sys/locks/rwlock/queue.rs
@@ -286,7 +286,7 @@ pub const fn new() -> RwLock {
     #[inline]
     pub fn try_read(&self) -> bool {
-        self.state.fetch_update(Acquire, Relaxed, read_lock).is_ok()
+        self.state.fetch_update(Acquire, Acquire, read_lock).is_ok()
     }

     #[inline]
```

So I think something is wrong with the memory orderings used somewhere. I can't tell which of these is the proper fix, though. I also found some other suspicious orderings, namely RMW operations where the success case has a weaker ordering than the failure case:
rust/library/std/src/sys/locks/rwlock/queue.rs Lines 492 to 497 in b58f647
IMO these justify a comment explaining why the orderings are so surprising, but they don't seem involved in this bug, at least not in the cases we have found so far.
print thread name in miri error backtraces; add option to track read/write accesses This came up while debugging rust-lang/rust#121626. It didn't end up being useful there but still seems like good tools to have around.
Turns out I went down the wrong rabbit hole. Changing these orderings means the program took a different path somewhere earlier in the execution, so we didn't hit the bad interleaving any more. I found a seed that reproduces this issue even without weak memory emulation, so the orderings can't be the problem: Miri 9272474b755dca7dfea4f96fd8344dd12e223ae8, …
What makes this particularly strange is that everything relevant seems to happen on the same thread... not sure what to make of that.
I found the bug! It's an ABA problem. When a new node is added, its thread basically executes:

```rust
let next = state.load();
node.next = next;
state.compare_exchange(next, &node);
```

In the failing executions, the thread gets interrupted for a long time, long enough that the thread at the head of the queue gets woken up (invalidating all shared references to it) and gets re-added to the queue (with a new shared reference). The CAS succeeds because the pointer has the same bit pattern, but once we try to perform an access through the pointer, we obviously incur UB. This is very hard to fix, since the bug can occur any time a …
Nice catch! Closing in favor of #121950.
Today's run of the std test suite in Miri failed on macOS:
The diff to the last successful run is this, but since this is a concurrency test, there's a high chance that the bug already existed before and only surfaces randomly. So #110211 is the most likely culprit.
So far I haven't managed to reproduce this outside the context of the standard library test suite.