Flush Icache on AArch64 Windows #4997
Conversation
This was previously done on bytecodealliance#3426 for linux.
This allows us to keep the icache flushing code self-contained and not leak implementation details. It also changes the Windows icache flushing code to only flush pages that were previously unflushed.
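To illustrate the "only flush previously unflushed pages" idea, here is a minimal sketch of the bookkeeping involved. All names (`IcacheFlusher`, `register`, `flush`) are invented for illustration, and the actual platform flush call is elided so the sketch stays portable:

```rust
// Hypothetical bookkeeping for flushing only freshly written code regions.
struct IcacheFlusher {
    /// (start, len) of code regions written since the last flush.
    unflushed: Vec<(usize, usize)>,
}

impl IcacheFlusher {
    fn new() -> Self {
        Self { unflushed: Vec::new() }
    }

    /// Record a freshly written code region.
    fn register(&mut self, start: usize, len: usize) {
        self.unflushed.push((start, len));
    }

    /// Flush all pending regions, returning how many were flushed. The
    /// platform call (e.g. FlushInstructionCache on Windows) would go where
    /// the comment is.
    fn flush(&mut self) -> usize {
        let n = self.unflushed.len();
        for &(_start, _len) in &self.unflushed {
            // platform-specific icache flush of [_start, _start + _len) here
        }
        self.unflushed.clear();
        n
    }
}
```

Already-executable pages never re-enter the `unflushed` list, so repeated flush calls do no redundant work.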
Overall this looks good - you might not have realized it, but you are fixing #3310 on Windows.
BTW the purpose of the `membarrier()` call is not cache maintenance, but rather flushing the processor pipeline (refer to the discussion in #3426 for more details), so this means that there is still a gap in the implementation on Windows. Looking at Microsoft's documentation, `FlushProcessWriteBuffers()` seems to have the necessary semantics.
The documentation for FlushInstructionCache says:
So it seems that `FlushInstructionCache` must always be called, no matter whether `FlushProcessWriteBuffers` is called or not.
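To make the two distinct operations concrete, here is a hedged sketch using raw Win32 declarations (`FlushInstructionCache`, `FlushProcessWriteBuffers`, `GetCurrentProcess` are the real kernel32 entry points; the `publish_code` wrapper is an invented name), with a no-op stand-in so the sketch also compiles off Windows:

```rust
#[cfg(windows)]
mod sys {
    use std::ffi::c_void;

    // Raw Win32 declarations; on Windows these are provided by kernel32.
    extern "system" {
        fn GetCurrentProcess() -> *mut c_void;
        fn FlushInstructionCache(process: *mut c_void, base: *const c_void, len: usize) -> i32;
        fn FlushProcessWriteBuffers();
    }

    /// Publish freshly written JIT code: flush the icache for the region,
    /// then flush the pipelines of the other threads in the process.
    pub fn publish_code(base: *const c_void, len: usize) -> bool {
        unsafe {
            let ok = FlushInstructionCache(GetCurrentProcess(), base, len) != 0;
            FlushProcessWriteBuffers();
            ok
        }
    }
}

#[cfg(not(windows))]
mod sys {
    use std::ffi::c_void;

    /// Off-Windows stand-in: the equivalent pair would be a cache flush
    /// (e.g. __clear_cache) plus a membarrier-style pipeline flush.
    pub fn publish_code(_base: *const c_void, _len: usize) -> bool {
        true
    }
}
```

This is a sketch of the ordering, not the PR's actual implementation: the icache flush covers only the written range, while the pipeline flush is process-wide.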
Yes, I am not claiming that it does. P.S. The same considerations about single-threaded applications apply here as well.
Thanks for reviewing this! ❤️
So, if I understand #3310 correctly, we need to both flush the icache and flush the pipeline. Is that right? If that is the case, I might as well just clean up the terminology and add a separate pipeline flush operation.
Yes, your understanding is correct. Technically the pipeline flush is always mandatory.

The ideal solution, which I had in mind when I opened #3310, would be to have a generic operation for this.

Note that it might appear that on Linux we don't flush instruction caches, so there might be a correctness issue. Actually we are relying on an implementation detail (as discussed in #3426), but I'd rather see an explicit operation, hence I have kept #3310 open. However, I would advise against implementing the cache flushing instruction sequence inside Cranelift and/or Wasmtime because it is a bit involved, and a solution that is easily accessible to all Rust users on AArch64 is a much better option. In the meantime we can continue relying on an implementation detail; any other cleanup would be appreciated, of course.
It should rather be in `libcore`, as `compiler-builtins` is an implementation detail of rustc to be used by the compiler backend and never directly called by the user.
Alright I think I understand this a little bit more! Thank you for your patience dealing with this.
Yeah I also doubt I could write anything like that correctly 😆 So I'm not very inclined to go that route. What I think we could do is something along the lines of what @cfallin suggested on zulip create a For What do you guys think about this?
Yeah, that confused me for a while! This whole thing is not easy to understand. I'd really like to get this in a central place and write all of these details in comments around this code.
That is certainly an option, but my impression is that we are trying to move away from relying on C files as much as possible (refer to the refactoring that has been done elsewhere in the repository). As for creating a crate to be used by both projects, that sounds reasonable to me.
Mutually exclusive features cause issues.
We now use it via jit-icache-coherence
I've implemented a new crate, `jit-icache-coherence`, that both sides now use. It does change our calls to the cache maintenance functions slightly. It also includes the changes made by @cfallin in #4987, since I thought it was just easier to do it here instead of having to deal with the ensuing merge conflicts.

I've tried to explain the whole cache coherency thing as best as I could in the documentation; hopefully I didn't miss anything!
This is so much nicer to have the abstracted crate shared in both places -- thanks for doing this work (and subsuming my PR too)!
One thought on crate naming, and a thought on RISC-V; and let's make sure @akirilov-arm is happy with the final version here too before merging.
```rust
//! This crate provides utilities for instruction cache maintenance for JIT authors.
//!
//! In self modifying codes such as when writing a JIT, special care must be taken when marking the
//! code as ready for execution. On fully coherent architectures (X86, S390X) the data cache (D-Cache)
```
Not strictly necessary for this PR, but I wonder how RISC-V fits into this -- it looks like at the ISA level it has a `fence.i` instruction, so it is closer to AArch64 in this regard (weaker coherence by default). Is it enough to do the same `membarrier` calls as on aarch64? (cc @yuyang-ok)

In the absence of any other information, perhaps we could perform the same `membarrier` calls on RISC-V as we do on aarch64?
I agree that we do need to do something; from what I've read, RISC-V is allowed to have incoherent I and D caches. From the kernel documentation, it looks like `CORE_SYNC` is not yet implemented for RISC-V. I'm not sure they support `GLOBAL` either.

I've tried to read the kernel a bit, and from what I understand they have a custom syscall that does sort of what we want? But it looks like it does not guarantee anything regarding pipelines.

Edit: That syscall ends up doing something very similar to AArch64, where they execute a `fence.i` on all cores. (link)
This is an architectural detail - I am not familiar with RISC-V at all, but it is possible that the architecture specifies that if instruction caches are flushed, then the pipeline might be flushed as well if necessary, hence no need to do anything in addition; on AArch64 these actions are decoupled. Or to put it another way - an architecture having incoherent data and instruction caches does not imply that it behaves in exactly the same way as the 64-bit Arm architecture (and hence requiring exactly the same sequence of actions); possibly there are nuances.
BTW the system call you have linked to says that it can be made to apply to all threads in the process, not just the caller, which might be what you are looking for.
> Or to put it another way - an architecture having incoherent data and instruction caches does not imply that it behaves in exactly the same way as the 64-bit Arm architecture (and hence requiring exactly the same sequence of actions); possibly there are nuances.
Yeah, that's right, we should go and double check that!
I've opened #5033 to track this, but I'm going to look at the ISA manual to check if they guarantee anything like that.
👍
```rust
#[cfg(all(not(target_os = "windows"), not(feature = "rustix")))]
mod libc;
#[cfg(all(not(target_os = "windows"), feature = "rustix"))]
mod rustix;
#[cfg(target_os = "windows")]
mod win;
```
One way I'd recommend writing this to make this more easily maintainable over time is:

```rust
cfg_if::cfg_if! {
    if #[cfg(target_os = "windows")] {
        mod win;
        use win as imp;
    } else if #[cfg(feature = "rustix")] {
        mod rustix;
        use rustix as imp;
    } else {
        mod libc;
        use libc as imp;
    }
}
```

and below just use `imp::the_method()` instead of duplicating the `#[cfg]`s.
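A small sketch of how the alias keeps the public surface free of per-function cfgs. The stub `imp` module and the `clear_cache` signature here are invented for illustration; only the forwarding pattern matches the suggestion above:

```rust
// Stand-in for whichever backend module (win / rustix / libc) cfg_if selects;
// the body and signature are illustrative only.
mod imp {
    pub fn clear_cache(_ptr: *const u8, _len: usize) -> std::io::Result<()> {
        Ok(())
    }
}

/// Public entry point: a single definition forwarding to the selected
/// backend, with no #[cfg] duplication at the call sites.
pub fn clear_cache(ptr: *const u8, len: usize) -> std::io::Result<()> {
    imp::clear_cache(ptr, len)
}
```

Adding a new platform then only touches the `cfg_if` block, not every public function.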
Also, how come there's a `rustix` and a `libc` implementation? Would it be reasonable to pick one as the only non-windows implementation?
Well, for `cranelift-jit` we don't want to import rustix, since that adds a bunch of dependencies for a couple of membarriers (#3395). For wasmtime I think we do want the safety guarantees of rustix.

So we ended up with both: one implementation in `wasmtime-jit` and another in `cranelift-jit`.
> Would it be reasonable to pick one as the only non-windows implementation?
Sure, I'm happy to go with whatever people choose, I just wanted to minimize the changes here.
From a maintainability point of view, though, I don't think it makes sense to have two different versions of this code. If `cranelift-jit` doesn't want to use `rustix` because it's too big of a dependency, then it seems like an equivalent argument could be made for wasmtime dropping rustix; but the same arguments for why wasmtime uses rustix can, I feel, be used in reverse to motivate the usage of `rustix`.

Overall I assume the actual compiled-down code is basically the same modulo which function executes the `syscall` instruction, so at least from my perspective I would prefer to only have one implementation to maintain rather than two.
Also, a bit more broadly, I feel that the `cranelift-jit` crate doesn't really fit well with this repository right now. Nothing in Wasmtime uses it and it does not see heavy usage in tests I believe, but it's quite a complicated and nontrivial crate at the same time. Basically the support/maintenance story for it seems somewhat unclear, and I ideally don't want it to place further burdens on other code elsewhere.
Cranelift-jit is used and tested by cg_clif.
> Overall I assume the actual compiled-down code is basically the same modulo what function does the syscall instruction so at least from my perspective I would prefer to only have one implementation to maintain rather than two.
I hope so too! Or else it's probably a bug. 😄 I don't really have an opinion on what we should do here; I'm happy with whatever.
> Also a bit more broadly I feel that the cranelift-jit crate doesn't really fit well with this repository right now. Nothing in Wasmtime uses it and it does not see heavy usage in tests I believe, but it's quite a complicated and nontrivial crate at the same time. Basically the support/maintenance story for it seems somewhat unclear but I ideally don't want it to place further burdens on other code elsewhere.
We do use the JIT for all runtests in cranelift (in the filetest suite), and we also use the JIT when fuzzing the `cranelift-fuzzgen` target.
I do think we could go the other way and try to use `cranelift-jit` in wasmtime instead of `wasmtime-jit`. That would probably be a big project, and I'm not sure if there is anything that would be fundamentally incompatible. But I think it would make sense from a code sharing perspective, and as a bonus all cranelift users would get a better JIT!
I think we could share the core for handling mapping as executable and relocating, but the user facing interface of cranelift-module is designed for the C linkage model and is not compatible with the wasm linkage model.
Force-pushed from 3d77de4 to eb57bee
This is implied by the `target_os = "windows"` above.
Forgot to post a general comment with my review - the overall code structure looks great and is definitely cleaner than what we had before; my remarks are mostly about improving the code comments. Also, as I have said before, I am fine with just implementing the libc-based version.
This is redundant, as it is done in `non_protected_allocations_iter`.
Thanks @akirilov-arm!
This reverts commit 21165d8.
I think I got all of them, let me know if I missed any of the changes that you requested. And thanks for your patience in dealing with this! I'm still very much learning about all of this stuff.
I'm okay with libc too, but would like an ack from someone on the wasmtime side if we want to do that.
Looks good to me - thanks for tackling this! I have a final set of nits, but none of them are showstoppers.
Technically the `clear_cache` operation is a lie on AArch64, so move the pipeline flush after the `mprotect` calls so that it benefits from the implicit cache cleaning done by them.
I had to add a flags arg to `membarrier` on the libc path, since on my aarch64 machine it wasn't being passed properly; somehow cranelift never triggered this. I've confirmed that it works now.
Is there a way to test the GLOBAL path on modern kernels? i.e. fail `MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE` with `EINVAL`?
> Is there a way to test the GLOBAL path on modern kernels? i.e. fail `MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE` with `EINVAL`?
I am not sure if I understand you correctly, but AFAIK if the `membarrier()` system call is supported at all, then `MEMBARRIER_CMD_GLOBAL` (AKA `MEMBARRIER_CMD_SHARED`) is supported as well.
Right, I just wanted to check if there was an easy way to add a test that forces us to use GLOBAL so that we at least have some CI coverage on that branch. But I tested that manually and it worked, so I guess that's ok. |
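For reference, a sketch of the fallback choice being tested here. The command constants are the bitmask values from the Linux UAPI header `linux/membarrier.h`; the helper name is invented, and a real implementation would first obtain the supported mask via `membarrier(MEMBARRIER_CMD_QUERY, 0, 0)` and register before using SYNC_CORE:

```rust
// Bitmask values from linux/membarrier.h (kernel UAPI).
const MEMBARRIER_CMD_GLOBAL: u32 = 1 << 0; // AKA MEMBARRIER_CMD_SHARED
const MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE: u32 = 1 << 5;

/// Given the bitmask returned by the QUERY command, pick the strongest
/// available pipeline-flush command: SYNC_CORE on newer kernels (which also
/// requires prior registration), GLOBAL otherwise.
fn pick_membarrier_cmd(supported: u32) -> u32 {
    if supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE != 0 {
        MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
    } else {
        MEMBARRIER_CMD_GLOBAL
    }
}
```

Forcing the GLOBAL branch in CI would mean masking SYNC_CORE out of the query result, which is why there is no easy automated test for it.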
Hey, @alexcrichton @sunfishcode, we discussed this in the cranelift meeting today, and @cfallin mentioned that you might be interested in this PR, as (in its current form) it changes wasmtime to use libc for this particular code path.

As a summary: we got to this situation when I merged this particular piece of code from both projects into one crate. wasmtime already has libc in its dependency tree, but cranelift does not have rustix yet, so to avoid adding more stuff to the compilation path, I decided to keep libc. I'd like to know if this is okay with you guys.
Yes, that makes sense!
Indeed seems reasonable to me!
👋 Hey,

I tried to run the cranelift filetest suite on an `aarch64-pc-windows-msvc` machine, and it crashes with `STATUS_ILLEGAL_INSTRUCTION`. All of the tests pass individually, and if I run it on a single core, it sometimes passes the entire test suite.

I think this is due to us not clearing the icache after writing the new code, as required by Arm. This PR adds a call to `FlushInstructionCache` where we already have a `membarrier` on linux (see #3426). I'm not too knowledgeable about this, but it was what I saw recommended in an Arm Community blog post, although I would really appreciate it if someone could double-check that this is the correct approach. This also seems to be what Firefox does for their JIT.

With this patch we can now pass the entire filetest suite without crashing!

I applied the same solution to the wasmtime side of things, but it's worth noting that I was never able to get a `STATUS_ILLEGAL_INSTRUCTION` there! I tested with `cargo test -p wasmtime-cli wast::Cranelift::spec::simd_i` and all 48 tests pass, no matter how many times I try to run them. I can't test the entire test suite, since that fails due to #4992 (I think).

cc: @cfallin @akirilov-arm