Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax atomic orderings + fix parking/unparking UB #17

Merged
merged 13 commits into from
Aug 20, 2022

Conversation

Cassy343
Copy link
Contributor

I found this crate when looking for a simple oneshot implementation and decided to try to make some improvements. I removed the UB caused by taking multiple mutable references to the channel concurrently, and fixed a couple UAF errors which could crop up due to unlucky thread interleaving (specifically if the sending thread yielded while unparking the receiving thread).

I'm making this a draft pull request since I'm not very satisfied with the implementation of Future::poll. I think there are some improvements that could be made there and I have not spent as much mental energy verifying its correctness. I noticed that the receiver itself currently implements Future, but I feel like it would be a lot more ergonomic to add a method called into_future which would then return something you could await. This would be a breaking API change, but would give you a compile-time guarantee that the receiver could never be used in a sync context once polled asynchronously.

I'd love to hear your thoughts.

Should close #10

@faern
Copy link
Owner

faern commented Jun 23, 2022

Oh, there is UB here? I have too look closer at that, but this is a pretty large PR. Could you possibly elaborate further on how the UB can be triggered? Thanks for analyzing this and sending this improvement!

I like the idea of moving the impl Future to a separate type to guarantee it's not being synchronously received after async polling started!

@Cassy343
Copy link
Contributor Author

Cassy343 commented Jun 23, 2022

The first form of UB here is the multiple mutable reference issue. This is triggered if one thread calls recv or some other method on the receiver at the same time that another thread calls send on the sender. When this happens each thread has a mutable reference to the same memory location which is defined to be instant UB. This PR fixes that by only using shared references and going through UnsafeCell.

The second issue was extremely subtle, and occurs under the following scenario:

  • Receiving thread begins receiving on an empty channel
  • Sending thread writes the message to the channel, sets the state to MESSAGE, and gets pre-empted before waking the receiving thread
  • Receiving thread is spuriously unparked, sees the state is set to MESSAGE, takes the message and deallocates the channel in memory.
  • Sending thread resumes execution, and attempts to read the waker from the freed channel

This issue was resolved by adding the intermediate UNPARKING state. I was able to reproduce this in valgrind by inserting a liberal amount of std::thread::yield_now()s in inconvenient places.

@faern
Copy link
Owner

faern commented Jul 4, 2022

Thanks for the explanation. I'll try to take the time to dive deeper into this PR later in the week, and next week!

I noticed that the receiver itself currently implements Future, but I feel like it would be a lot more ergonomic to add a method called into_future which would then return something you could await

Yes, this sounds like a very good idea! And the soon-to-be-stable IntoFuture trait seems to solve this beautifully. Then the Receiver can be awaited directly, but it's still a separate type. Until then, and maybe as a long term alternative to relying on this trait, we can provide an explicit method. IMO Receiver<T>::recv_async(self) -> AsyncReceiver<T> sounds like good names. This is a great API improvement as it can remove possible panics from the existing sync recv* calls 🥳

This PR touches on many more or less unrelated subjects at once. Such as deduplicating the crate documentation and a few other things. If possible, I would prefer to have this split into multiple PRs. Mostly to simplify reviewing, and get stuff merged in bite size chunks.

@Cassy343
Copy link
Contributor Author

Cassy343 commented Jul 5, 2022

Sounds good. Once you get a feel for the changes made (to be honest most of them are just de-duplicating code, changing code to use more up-to-date unsafe patterns that are the same at run-time, and comments), let me know what chunks you'd like to see this split into and I can get on that when I have time. Once that is done I'll tackle the async API change.

@faern
Copy link
Owner

faern commented Jul 7, 2022

I would love if this PR could be reviewed commit for commit. I started reading the first commit, but I hit the following issue: e1d5fbd#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759R735, which is supposed to be Err(DISCONNECTED) as you change it to later.

Would it be possible for you to rewrite the history a bit here to split it up into multiple steps? Preferably I'd want the UB fix merged first. I see you also fix a lot of documentation etc. But that is less critical, so I'd prefer that to be a separate commit at least, if not separate PR.

@faern
Copy link
Owner

faern commented Jul 7, 2022

I have now reviewed the entire first commit in this PR (e1d5fbd). I think it looks correct, except the miss with Ok vs Err on the conversion to compare_exchange in a bunch of places.

Since the compare_and_swap -> compare_exchange conversion is not needed to fix the UB that the commit is otherwise dealing with, may I suggest that you move that conversion to a separate commit?

Thank you a lot for this PR! Even if I have some feedback on the PR format, I deeply appreciate that you have found and fixed these UBs!

@Cassy343
Copy link
Contributor Author

Cassy343 commented Jul 8, 2022

I'm working on rewriting the history a bit right now to make it more clear. The first commit is basically the same but with that Ok/Err issue fixed. I changed the chunky commit that fixes the UB to just contain code changes, and I removed the documentation changes since that can be handled after the code changes are settled.

@Cassy343
Copy link
Contributor Author

Cassy343 commented Jul 8, 2022

Okay I think I did that correctly. I'm not super great with git so if that really screwed things up let me know and I'll do my best to fix it!

Thank you a lot for this PR! Even if I have some feedback on the PR format, I deeply appreciate that you have found and fixed these UBs!

It's my pleasure! :)

@faern
Copy link
Owner

faern commented Jul 8, 2022

Thanks. However, I see that you put the change to UnsafeCell and NonNull, which I had reviewed, together with all the new atomic operations, which I have not reviewed. The former are way simpler code changes and are fixing the first and most glaring UB you found. Can we focus on having just that in one commit, so I can merge that to master before we start looking at the intricate atomics changes?

A reason for having them separate, apart from being easier to review, is that it becomes easier to bisect bugs. If a single commit changes all of this at once and we later find a bug, it's going to be harder to pinpoint which code change caused it, if it's all in one commit.

src/errors.rs Outdated Show resolved Hide resolved
src/errors.rs Outdated Show resolved Hide resolved
src/errors.rs Outdated Show resolved Hide resolved
@Cassy343
Copy link
Contributor Author

Cassy343 commented Jul 8, 2022

Thanks. However, I see that you put the change to UnsafeCell and NonNull, which I had reviewed, together with all the new atomic operations, which I have not reviewed.

I see that, that was a mistake on my part. When I'm done with work today I'll get that sorted out. I'd like to put those simple changes before the atomics since my ordering/state changes are kind of built off of those prior changes. Sorry for the inconvenience!

@faern
Copy link
Owner

faern commented Jul 10, 2022

Thanks for the improved commits! I had forgotten to approve the CI to run on this PR. Now when I did that I see that there are both rustfmt formatting to fix. And also, you use code that was stabilized after the current MSRV (1.56). It's fine to bump the MSRV, but it should be noted in the changelog, and the github actions workflow file needs to be updated also.

I added an extra CI job to also run Clippy. The CI on this repository was a bit lacking, and a bit outdated. Could you please cater to the current complaints of the CI and rebase on latest master where I have added Clippy?

@faern
Copy link
Owner

faern commented Jul 10, 2022

I cherry-picked your commit on the migration to compare_exchange. The commit was pretty trivial, it was your first commit, and it made Clippy happy. So that part of this PR is now "merged" already :) You can drop that commit when you rebase on latest master. Thanks for the first contribution! 🥳

@Cassy343
Copy link
Contributor Author

I fixed the formatting and clippy issue. Also it's definitely not needed to bump the MSRV for MaybeUninit::assume_init_drop since that can be pretty trivially re-implemented, so that's what I did. I also rebased on the latest master and dropped that first commit.

As always don't hesitate to point things out as you go through and review these changes!

@faern
Copy link
Owner

faern commented Jul 10, 2022

Can you maybe squash Implement latest feedback and Deduplicate dealloc code into Improve unsafe hygiene ? As pointed out earlier, I would like to be able to get the UB fix regarding multiple mutable references merged in one go, without all the changes to the atomic orderings and other fixes. That commit will also need a rustfmt run. And it also has Clippy errors.

src/errors.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@Cassy343
Copy link
Contributor Author

Yeah I'll take care of that and implement your latest feedback in the next couple of days - work/life got a little busy in the interim.

Copy link
Owner

@faern faern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to merge all bugfix changes that does not change the public API in any way first. And release a patch release of the crate. Then have a separate PR that makes any API changes, for example the fact that async receiving is moved to a separate type.

src/lib.rs Show resolved Hide resolved
@Cassy343
Copy link
Contributor Author

So just to clarify, you still want me to squash "Implement latest feedback," "Deduplicate dealloc code," and "Improve unsafe hygiene," and then we'll work on merging the rest, or did you just want to all the changes into the patch then make API/docs changes?

@faern
Copy link
Owner

faern commented Jul 13, 2022

I still want the somewhat unrelated changes to be different PRs. So squashing is not strictly necessary, but they have to come sorted one feature/fix after another, and preferably different PRs. I think the stuff already done in this PR and discussed in this PR should be at least three PRs:

  1. Change *mut into NonNull and wrapping fields in UnsafeCell in order to mitigate one UB. Including all code comment updates, SAFETY documentation and everything related to fixing just this UB.
  2. Make the changes to the updates of the Channel::state field. Including relaxing the memory ordering, changing the values of the states to allow the fetch_add trick.
  3. Move the async receiving into a separate type (This breaks the API and will be a semver breaking release)

And your unrelated changes to the documentation could also be a separate PR. Such as moving everything out from lib.rs into README.md and including it in the code with #[doc ..include..]. In general, keep one PR to one bugfix or feature addition. Or there can be multiple fixes in one PR if they are very related. But here I feel like we are changing a bunch of unrelated things.

@Cassy343
Copy link
Contributor Author

Okay so "Improve unsafe hygiene" through the first "cargo fmt" (409c498 through 3df9227) contain the unsafe hygiene improvements but none of the algorithm/state changes, basically item 1 in your list. All the commits after that are item 2, and I'll make a separate PR for item 3.

@Cassy343
Copy link
Contributor Author

That CI failure is a little strange...I'm not able to replicate that locally

@Cassy343
Copy link
Contributor Author

Looks like that CI failure I mentioned was a fluke. Edited the history to fix a couple of safety comments and CI passed this time. For reference, the recv_deadline_and_timeout_time_should_elapse test failed, but I couldn't work out which flags were on for it since the log was kind of messy.

@faern
Copy link
Owner

faern commented Jul 14, 2022

Thanks. Would you like to submit a separate PR on step 1 (unsafe code hygiene)? Or do you want me to just pull those commits into master? I see that you still have some changes to that code in the later commits. If I pull it in manually I will backport those changes, so they end up where they are supposed to be. For example stuff like this: 1906dda#diff-6ff19a1ff611f3835b5a025663cf4dec901ed914626f78847414836439bd1a79R41 and 1906dda#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759R232-R233.

If I do it, it means that you also have to deal with this collision when rebasing. So it's going to be in total less work you do you want to do it.

src/lib.rs Outdated Show resolved Hide resolved
@Cassy343
Copy link
Contributor Author

Hi, I'll get to it as soon as I can, I've been really busy these last couple days but hope to have some time this week.

Copy link
Owner

@faern faern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow reviewing! I want this merged and released as much as you do, I have just been busy with other stuff.

I have now reviewed everything down to (not including) recv_ref. From there on and down it seemed to be some extra trickery, so I'll save that for tomorrow.

The feedback I left now is almost only just documentation/comment updates. Everything so far seems sound and really good!

src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
Copy link
Owner

@faern faern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now reviewed the rest! I could not stop myself. Too exciting reading material ;)

This is great work! I have some questions and suggestions, but overall I think this is very close to being merged.

src/lib.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@Cassy343
Copy link
Contributor Author

Thanks for putting in so much time to look over all of that! I am also really busy right now so I will get to that over the course of this week.

@Cassy343
Copy link
Contributor Author

All feedback should be addressed except that last review there. Once that's finalized I think this is pretty good to go. IntoFuture is still on track to land in 1.64 AFAIK, so I think what I might do is just draft up a PR using the beta feature and I don't think it's unreasonable to just wait until September to land that.

Copy link
Owner

@faern faern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aim to review this again/go over the changes during this weekend. Is there anything left you want to do or do you feel it's ready for a (hopefully) final review?

Regarding IntoFuture: It does not hurt to have an explicit fn recv_async(&self) -> AsyncReceiver. This works on stable today and can be implemented and merged without consideration for MSRV or anything else. Then we can add an IntoFuture impl which does the same thing, but implicitly. Depending on when that happens we can decide if it should be behind a feature gate or not.

src/lib.rs Show resolved Hide resolved
mem::drop(receiver);
});
t1.join().unwrap();
t2.join().unwrap();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to test the race condition that the library had before this PR? I can't make this test fail even when I backport it to current master. Of course a normal run won't trigger it very often at all. But if it really was a thread interleaving issue loom should pick it up, no?

Would be nice to understand the situation and have a test that actually catch the issue. Otherwise it won't really protect against regressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was to test the interleaving of recv fails, send happens, receiver is dropped. I think I added this test while I was working out how memory reclamation worked in this crate since I didn't see a test that tested that ordering.

Copy link
Owner

@faern faern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm about to merge this. There is obviously some work left on the Future impl before we can publish. But I think we can do that in a separate PR. Mostly because this one is large and can easily have bitrot.

@faern faern merged commit 305d105 into faern:master Aug 20, 2022
@Cassy343
Copy link
Contributor Author

Thanks! I'm very glad to see this merged. I'll figure out async receiving and draft something up over the next week or two. I'm moving back to college soon so it's a busy time for me

@faern
Copy link
Owner

faern commented Aug 20, 2022

@Cassy343 Thanks for all your hard work and clever solutions! ❤️

I'm currently looking at the Future impl to see if there are any low hanging fruit I can fix right now. Just so we are on the same page. The current impl is unsound, right? It's not only that it has stricter orderings than necessary, right?

@Cassy343
Copy link
Contributor Author

I don't remember off the top of my head if it's unsound or not. If memory serves me correctly then I think there's issues with the writing/unparking of the waker being racy. I think a correct implementation might involve using something like atomic-waker, but we might be able to avoid depending on/recreating that since we already keep track of channel state atomically. The tricky part is that we generally need to write the new waker to memory before observing the state, otherwise between the time we observed the state and wrote the waker, the state may have changed causing a race or causing us to write the new waker after the sending party woke the old waker.

@faern
Copy link
Owner

faern commented Aug 20, 2022

I guess upon polling, if we need to store a waker but one is already written, we might need to steal the waker in a way that guarantees the Sender is not also trying to read/drop that memory. First resetting the state to one where the other entity (in this case the Sender) is not allowed to read/drop the waker. If we succeed with that state transition we take out the old waker and drop it properly, then we write the new waker and set the state to RECEIVING again.

It's going to be quite the dance.

@faern
Copy link
Owner

faern commented Aug 20, 2022

I created a follow up PR that fixes the Future implementation as well as some other improvements. See #20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Relax atomic memory ordering constraints
2 participants