Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Support for automatically shimming dynamic libraries #813

Closed
wants to merge 1 commit into from

Conversation

Aaron1011
Copy link
Member

@Aaron1011 Aaron1011 commented Jun 30, 2019

I've opened this PR to get some feedback on my approach for automatically shimming dynamic libraries (as per #11)

The current implementation uses libffi to generate a call to the underlying function, using the type information provided by the arguments. Using ripgrep as a test, this code successfully shimmed a call to 'libc::readlink'

The major unresolved question is how to deal with pointers. There are two cases we need to handle:

  • 'Input pointers' - raw pointers created in Miri and passed into foreign code, either directly or through a struct/wrapper
  • 'Output pointers' - raw pointers created in foreign code and returned out to Miri, either as a return value or by writing into a structure.

To handle 'output pointers', we could add a new variant to rustc_mir::interpret::place::Place (e.g. `Place::Raw') to handle ffi pointers. We wouldn't be able to track any extra information for these kinds of pointers, since they would be a raw address pointing somewhere into the Miri process.

A Place::Raw would be created for any pointers returned by a foreign function (either directly or wrapped in another type). All reads/writes to a Place::Raw would be implemented by directly writing/reading from the underlying pointer, since we can't know when C code might write/read from it.

'Input pointers' will be much tricker. We need to be able to support C code writing directly into Miri-controlled memory. Obviously, this will need to bypass all of the normal StackedBorrows code, since we can't control how the pointer is written to. We might want to create a separate struct in Miri to hold the backing allocations for all such pointers - this would reduce the chance of C code corrupting internal Miri state, and would make it easier to reason about the lifetimes of the raw pointers we create into such allocations.

However, there are tricky scenarios that we need to handle. Consider code like this:

#[repr(C)]
#[derive(Debug)]
struct MyStruct {
	my_field: libc::c_int
}

extern "C" {
	fn extern_one(ptr: *mut MyStruct);
	fn extern_two();
}

fn foo() {
	let mut my_struct = MyStruct { my_field: 27 };
	unsafe { extern_one(&mut my_struct as *mut MyStruct); }
	println!("My struct: {:?}", my_struct);
	unsafe { extern_two() };
	println!("My struct: {:?}", my_struct);
}

What can we say about this code, especially w.r.t how it should interact with stacked borrows?

As soon as we call 'extern_one', we have to assume that my_struct can be modified by C code at any point, for the rest of its lifetime. That is, we need to assume that the raw pointer it receives might be stored somewhere, and later used by extern_two. This means that any reads/writes from normal Miri will need to go through the same memory used by C code, so that we properly observe/propogate modifications to it. This reminds me a little of intptrcast - when we expose a pointer to foreign code, we force it to take on a 'definite' representation for the rest of its existence.

Notes:

  • Currently, my shimming code doesn't handle aggregate types. However, it shouldb't be too difficult to support shimming arbitrary structures.
  • The rust libffi crate has a high-level API for making function calls. However, it requires that the return type be specfified at compile-time, which makes it useless for our purposes. As a result, I use a combination of the higher-level API and the raw bindings to the C libffi library to create and call the shims.
  • This PR currently only supports Linux. However, it should be possible to port this to Windows and OS X without too much effort.
  • I make no attempt to dlopen any dynamic libraries that the target application links to. We should be able to extract this information from the arguments passed to miri by cargo
  • On Linux, we could attempt to support staticly linked C libraries as follows: link all of the required static libraries into a single archive, and then build a shared object which exports all of the symbols from that static library. However, I have no idea if this would actually work for any real-world applications.

@RalfJung
Copy link
Member

RalfJung commented Jul 1, 2019

Wow, that's pretty impressive! I called @oli-obk crazy when he said this might even be possible. :D

In terms of "how to do this", I don't have many things to say as I haven't tought about this very much. It's also not currently a priority for me, so I am basically fine with anything that works. ;) However, there's the usual caveat that this should have as little impact as possible on the remaining code. An interpreter is a subtle thing to write, getting good test coverage is hard, and we defend against that by striving for maximal code clarity. We don't always achieve that though.^^ But good abstractions are key, and I feel our place/operand system is currently working very well.

I think this is a feature that I'd like to see off-by-default; by allowing the interpreted code to do FFI calls it can cause arbitrary UB, that seems like something that the user should opt-in to.

'Output pointers'

Why does this need a new Place variant? If this is really one-way communication, it could just take the result that it got back, turn it into a byte slice, and then create a regular Miri allocation for that byte slice.

'Input pointers'

Yeah, this is hard. My feeling is it is better to consider this out-of-scope for now.

And as you noted, for such allocation there is basically no hope of doing any checks -- I am not just talking about Stacked Borrows, even uninitialized memory cannot be tracked any more. That seems in direct contradiction with Miri's goals as a tool for detecting UB. If you need that kind of FFI, maybe a sanitizer or valgrind are better tools?

@RalfJung RalfJung added the S-waiting-on-author Status: Waiting for the PR author to address review comments label Jul 1, 2019
@Aaron1011
Copy link
Member Author

Why does this need a new Place variant? If this is really one-way communication, it could just take the result that it got back, turn it into a byte slice, and then create a regular Miri allocation for that byte slice.

That prevents us from observing any other writes made by C code. As a real-world example, consider libc::__errno_location. We need code like this to work:

let errno_ptr = unsafe { libc::__errno_location(); }
unsafe { libc::read(/* something */); }
let read_errno = unsafe { *errno_ptr };
unsafe { libc::write(/* something else */); }
let write_errno = unsafe { *errno_ptr };

With your approach, read_errno would always equal write_errno, which is incorrect.

@Aaron1011
Copy link
Member Author

Aaron1011 commented Jul 1, 2019

That seems in direct contradiction with Miri's goals as a tool for detecting UB. If you need that kind of FFI, maybe a sanitizer or valgrind are better tools?

I agree that this should be considered out of scope for now. However, I think it would make sense to consider adding it at some point in the future. Given that Miri is currently the only way of detecting UB in rust code, I think it would be useful to eventually be able to run Miri on arbitary rust code (even if the overhead can be very high).

Specifically, I think suporting 'input pointers' would still be compatible with detecting UB in the rest of a Rust program. While FFI usage necessary 'infects' (in the sense of preventing us from doing UB checks) anything we take a pointer to, it need not propagate further through the program. If we have code like:

#[repr(C)]
struct MyFFIType { ... }

struct MyWrapper {
	normal_field: u8,
	ffi_field: MyFFIType
}

We won't be able to run proper UB checks on ffi_field, assuming that a pointer to it gets passed to C code. However, we can certainly enable stacked borrows for normal_field - as well as any struct containg a MyWrapper.

From what I've seen, most FFI code in libstd and other crates tends to already be written in this fashion - normal repr(Rust) types are exposed through the API, while dedicated #[repr(C)] structs are used to communicate with C. Miri could still check the vast majority of the program, assuming that FFI usage is properly encapsulated (as it really should be).

Tl;dr - I think it could be very useful to have Miri support 'input' pointers - even if we can never check everything, we can still check a lot of things. This would enable usages like running a Tokio echo server and client through Miri - while it would probably be incredibly slow, it would allow us to check an end-to-end test of the whole async io stack.

EDIT: This would also allow for us to incrementally improve what FFI could Miri can check. We can continue to add exlicit shims for well-known C functions (e.g. libc::getrandom), while using the automatic shimming as a fallback. While we would still need automatic shims in the general case, we could gradually expand the fraction of libc that Miri is explicitly aware of, allowing stacked borrows and other checks to be run on an increasingly large fraction of a target program.

@RalfJung
Copy link
Member

RalfJung commented Jul 2, 2019

That prevents us from observing any other writes made by C code. As a real-world example, consider libc::__errno_location. We need code like this to work:

Oh, so this is two-way then. But once it's two-way, how does that not have all the same difficulties as 'input pointers'? It's C writing into memory that Miri "controls". Once you have that mechanism, you can implement 'input pointers' by enabling Miri to turn a "normal" allocation into a Raw one. Since all our pointers in Miri are "symbolic", existing pointers just keep working.

Your proposal suggested to have Place::Raw, but wouldn't it make more sense to have "raw allocations"? Otherwise you'll also need Pointer::Raw and I don't know what else.

And I think raw allocations could be implemented with just a few extra machine hooks in librustc_mir, so the complexity would be very manageable.

Given that Miri is currently the only way of detecting UB in rust code

That's not true, there's also valgrind and various sanitizers.

I think it would be useful to eventually be able to run Miri on arbitary rust code (even if the overhead can be very high).

The plan we (Niko Matsakis and me) had during my internship was that eventually there would be a valgrind mode that implements Stacked Borrows, and that should be able to handle FFI. That sounds more realistic to me than, say, being able to run the part of Firefox that is written in Rust in Miri to check it for UB...

Tl;dr - I think it could be very useful to have Miri support 'input' pointers - even if we can never check everything, we can still check a lot of things.

Agreed!

@RalfJung
Copy link
Member

@Aaron1011 are you still working on this?

My thinking is that some more design work is needed before implementation can start in earnest. In that case, I'd prefer to close the PR (to keep the PR list clean) and discuss the design on the issue.

@RalfJung
Copy link
Member

Closing for now to keep the PR list clean; @Aaron1011 feel free to reopen if you want to pick this up again.

@RalfJung RalfJung closed this Sep 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: Waiting for the PR author to address review comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants