Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in rayon_core::latch::TickleLatch::set #768

Closed
rocallahan opened this issue Jun 15, 2020 · 8 comments
Closed

Crash in rayon_core::latch::TickleLatch::set #768

rocallahan opened this issue Jun 15, 2020 · 8 comments

Comments

@rocallahan
Copy link
Contributor

We're getting a SIGSEGV in rayon under load with cross-ThreadPool dispatch.

   3: <rayon_core::latch::TickleLatch<L> as rayon_core::latch::Latch>::set
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/latch.rs:195
      <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/job.rs:123
   4: rayon_core::job::JobRef::execute
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/job.rs:59
      rayon_core::registry::WorkerThread::execute
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/registry.rs:681
      rayon_core::registry::WorkerThread::wait_until_cold
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/registry.rs:665
   5: rayon_core::registry::WorkerThread::wait_until
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/registry.rs:639
      rayon_core::registry::main_loop
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/registry.rs:759
      rayon_core::registry::ThreadBuilder::run
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/registry.rs:56
   6: <rayon_core::registry::DefaultSpawn as rayon_core::registry::ThreadSpawn>::spawn::{{closure}}
             at root/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.7.0/src/registry.rs:101
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at rustc/49cae55760da0a43428eba73abcb659bb70cf2e4/src/libstd/sys_common/backtrace.rs:130

The faulting RIP is 11cd2c0 here:

00000000011cd2c0 <_ZN10rayon_core5sleep5Sleep6tickle17h2e956c4db5bbecdaE>:
 11cd2c0:       48 8b 07                mov    (%rdi),%rax
 11cd2c3:       48 85 c0                test   %rax,%rax
 11cd2c6:       74 05                   je     11cd2cd <_ZN10rayon_core5sleep5Sleep6tickle17h2e956c4db5bbecdaE+0xd>
 11cd2c8:       e9 03 00 00 00          jmpq   11cd2d0 <_ZN10rayon_core5sleep5Sleep11tickle_cold17ha65d63499113cf3aE.llvm.14863800856586354016>
 11cd2cd:       c3                      retq   

The fault address (RDI) is zero.

@rocallahan
Copy link
Contributor Author

Looking at rayon code it looks like there might be a data race that could cause this crash.

Registry::in_worker_cross does

        current_thread.wait_until(&job.latch);

and then returns. Meanwhile TickleLatch::set does

        self.inner.set();
        self.sleep.tickle(usize::MAX);

Is it possible that TickleLatch::set sets the inner latch, which releases the in_worker_cross thread's wait_until, so that thread returns, destroying the StackJob, after which the TickleLatch::set continues and dereferences garbage to get self.sleep? I can't see anything that would prevent such a race...

@rocallahan
Copy link
Contributor Author

Needless to say this crash is highly intermittent and we have no way to reliably reproduce it.

@rocallahan
Copy link
Contributor Author

Assuming this analysis is correct, I guess one way to fix it would be to change TickleLatch::sleep to TickleLatch::registry with type &Arc<Registry> and clone it in TickleLatch::set before setting the inner latch.

@cuviper
Copy link
Member

cuviper commented Jun 15, 2020

Please see #740, and test that if you're able.

That was forgotten because we hoped to land Niko's scheduler rewrite, which had a similar problem and fix already. But we should probably just do a patch release with #740 while we finish the rewrite...

@rocallahan
Copy link
Contributor Author

I think it probably would be a good idea to ship a fix given that the UAF has been known for over two months now.

@rocallahan
Copy link
Contributor Author

rocallahan commented Jun 15, 2020

If there are other known unfixed UAF/safety bugs then it would be helpful if you list them somewhere for future reference.

@cuviper
Copy link
Member

cuviper commented Jun 15, 2020

I guess we could use an "unsound" label or something, but really we just shouldn't have let this fix linger at all. I'll merge that PR and prepare a point release.

I'm not aware of any other such bugs. There are deadlock hazards like #592, but that's a different category of problem.

@cuviper
Copy link
Member

cuviper commented Jun 15, 2020

I've released the fix in rayon-core 1.7.1, and rayon 1.3.1 also bumped its requirement.

Assuming this analysis is correct, I guess one way to fix it would be to change TickleLatch::sleep to TickleLatch::registry with type &Arc<Registry> and clone it in TickleLatch::set before setting the inner latch.

Kudos for coming to the same analysis and fix as I had -- that's nice validation!

@cuviper cuviper closed this as completed Jun 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants