-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Implement stable symbol-name generation algorithm. #31539
WIP: Implement stable symbol-name generation algorithm. #31539
Conversation
r? @Aatch (rust_highfive has picked a reviewer for you, use r? to override) |
SipHash it is an algorithm designed by cryptographers, but it's not in the same category of hash functions as SHA-2 which is a cryptographic hash function. For example, the SipHash paper says
and
This use case does not use a secret key with SipHash (it always uses the default key), and it wants to have collision resistance on public inputs. Yes, in practice, we can find collisions for either of them if the hash is only 64 bits long, but only SHA-2 is designed for this case. |
@bluss Those are good references. It really sounds like SipHash is not a good candidate for this use case. I'll switch things to SHA-2. |
What's the impact of a collision? |
☔ The latest upstream changes (presumably #31487) made this pull request unmergeable. Please resolve the merge conflicts. |
f6a5588
to
cc8d19d
Compare
Symbol conflicts: Two functions with different signatures could end up with the same name. If it's in the same crate, compilation will probably just fail. If it's in different crates, either linking fails due to a conflict or the wrong symbol might be linked, resulting most probably in untraceable, subtle memory corruption. |
Nice. |
This looks fantastic to me, nice work @michaelwoerister! I wouldn't have too many opinions about SHA2 vs SipHash, I suspect either choice will be suitable for our purposes. I'd personally love to jettison the SHA2 implementation in the compiler, and I highly doubt we can find a SipHash practical collision with the same path, different metadata, and yet something like different type substitutions. We should also be able to change this if we ever need to really as we can only guarantee no recompiles within one version of Rust anyway. |
Thanks for the comment @alexcrichton. It's true, we have been using SipHash for function monomorphization for quite a while now and I've never seen a problem with that. My assumption about the requirements for this hash function (and later also one used for fingerprinting incremental compilation artifacts) is that it should provide good distribution so that collision probability is as low as it can theoretically be for a given number of hash value bits, BUT that we don't care about "security", i.e. whether it's easy to find hash collisions on purpose. Or is there some kind of "attack scenario" that I'm overlooking? For maximum performance (and if the above assumptions hold), we could also use something like CRC64. |
Here's the issue about collisions in type_id, but that's only tangential to this: #10389. As long as we go into this with the knowledge that SipHash is just a regular hash function, I'm fine. |
@bluss Thanks for the link! |
As an aside, I think "-C metadata" is a bit of a misnomer. Something like "-C salt" or "-C disambiguator" would be clearer (although the later one definitely is too long). |
Another side note: Now that we get predictable symbol names for monomorphizations, we might want to start sharing them cross-crate. |
fc2c4d8
to
7d7958b
Compare
Ah as another data point in Cargo we extensively use SipHash for fingerprinting crates and packages. I haven't at least seen anyone run into any problems with that. I also agree that at this point |
☔ The latest upstream changes (presumably #31524) made this pull request unmergeable. Please resolve the merge conflicts. |
What's the problem with the compiler SHA2 implementation? Making a collision with SipHash with a known key only requires some modifiable string on both ends - e.g. an item name, or comment. |
@arielb1 I don't think anything is wrong with the SHA-2 implementation per se. It's just incomplete (only supports SHA256) and it's one more piece of code that needs to be maintained. I'd guess that second point is the main motivation for wanting to get rid of it. |
7d7958b
to
0a2b84a
Compare
So, here's something interesting: Some of the test cases (e.g. The tests now fail with 'duplicate symbol' linker errors, but I guess the sensible thing to do here, is to emit a compilation error when the crate loader discovers two crates with the same name/metadata but different SVHs. |
0a2b84a
to
13a5ea0
Compare
Up until that last commit this passes |
That's the error in question: https://travis-ci.org/rust-lang/rust/builds/109449298#L6591 |
Oh you should just be able to update the test to check for |
@alexcrichton I'm afraid it's something more complicated than that. If you look at the actual backtrace that the subprocess provides (see below), it looks like there is a panic while creating the backtrace (in
|
Hm, that looks normal to me though? It looks like the assertion message originates from here which was testing a panic originating here. The bad output happened because it expected to find In other words we should probably really look into skipping the top few frames when printing a backtrace, but other than that it should be good! |
Yes, you're right of course |
…s with the same crate-name and crate-salt but different SVHs.
…runtime. We want to prevent compiling something against one version of a dynamic library and then, at runtime accidentally using a different version of the dynamic library. With the old symbol-naming scheme this could not happen because every symbol had the SVH in it and you'd get an error by the dynamic linker when using the wrong version of a dylib. With the new naming scheme this isn't the case any more, so this patch adds the "link-guard" to prevent this error case. This is implemented as follows: - In every crate that we compile, we emit a function called "__rustc_link_guard_<crate-name>_<crate-svh>" - The body of this function contains calls to the "__rustc_link_guard" functions of all dependencies. - An executable contains a call to it's own "__rustc_link_guard" function. As a consequence the "__rustc_link_guard" function call graph mirrors the crate graph and the dynamic linker will fail if a wrong dylib is loaded somewhere because its "__rustc_link_guard" function will contain a different SVH in its name.
d534c75
to
f6b0f17
Compare
@bors r=nikomatsakis (after rebase) |
📌 Commit f6b0f17 has been approved by |
WIP: Implement stable symbol-name generation algorithm. This PR changes the way symbol names are generated by the compiler. The new algorithm reflects the current state of the discussion over at rust-lang/rfcs#689. Once it is done, it will also fix issue #30330. I want to add a test case for that before closing it though. I also want to do some performance tests. The new algorithm does a little more work than the previous one due to various reasons, and it might make sense to adapt it in a way that allows it to be implemented more efficiently. @nikomatsakis: It would be nice if there was a way of finding out if a `DefPath` refers to something in the current crate or in an external one. The information is already there, it's just not accessible at the moment. I'll probably propose some minor changes there, together with some facilities to allow for accessing `DefPaths` without allocating a `Vec` for them. **TODO** - ~~Actually "crate qualify" symbols, as promised in the docs.~~ - ~~Add a test case showing that symbol names are deterministic~~. - Maybe add a test case showing that symbol names are stable against small code changes. ~~One thing that might be interesting to the @rust-lang/compiler team: I've used SipHash exclusively now for generating symbol hashes. Previously it was only used for monomorphizations and the rest of the code used a truncated version on SHA256. Is there any benefit to sticking to SHA? I don't really see one since we only used 64 bits of the digest anyway, but maybe I'm missing something?~~ ==> Just switched things back to SHA-2 for now.
💔 Test failed - auto-mac-64-nopt-t |
@michaelwoerister and I chatted a bit over e-mail. I'm going to close this PR and open one of my own that includes his commits plus a few more. |
Revamp symbol names for impls (and make them deterministic, etc) This builds on @michaelwoerister's epic PR #31539 (note that his PR never landed, so I just incorporated it into this one). The main change here is that we remove the "name" from `DefPathData` for impls, since that name is synthetic and not sufficiently predictable for incr comp. However, just doing that would cause bad symbol names since those are based on the `DefPath`. Therefore, I introduce a new mechanism for getting symbol names (and also paths for user display) called `item_path`. This is kind of simplistic for now (based on strings) but I expect to expand it later to support richer types, hopefully generating C++-mangled names that gdb etc can understand. Along the way I cleaned up how we track the path that leads to an extern crate. There is still some cleanup left undone here. Notably, I didn't remove the impl names altogether -- that would probably make sense. I also didn't try to remove the `item_symbols` vector. Mostly I want to unblock my other incr. comp. work. =) r? @eddyb cc @eddyb @alexcrichton @michaelwoerister
…xcrichton Revamp symbol names for impls (and make them deterministic, etc) This builds on @michaelwoerister's epic PR #31539 (note that his PR never landed, so I just incorporated it into this one). The main change here is that we remove the "name" from `DefPathData` for impls, since that name is synthetic and not sufficiently predictable for incr comp. However, just doing that would cause bad symbol names since those are based on the `DefPath`. Therefore, I introduce a new mechanism for getting symbol names (and also paths for user display) called `item_path`. This is kind of simplistic for now (based on strings) but I expect to expand it later to support richer types, hopefully generating C++-mangled names that gdb etc can understand. Along the way I cleaned up how we track the path that leads to an extern crate. There is still some cleanup left undone here. Notably, I didn't remove the impl names altogether -- that would probably make sense. I also didn't try to remove the `item_symbols` vector. Mostly I want to unblock my other incr. comp. work. =) r? @eddyb cc @eddyb @alexcrichton @michaelwoerister
…xcrichton Revamp symbol names for impls (and make them deterministic, etc) This builds on @michaelwoerister's epic PR #31539 (note that his PR never landed, so I just incorporated it into this one). The main change here is that we remove the "name" from `DefPathData` for impls, since that name is synthetic and not sufficiently predictable for incr comp. However, just doing that would cause bad symbol names since those are based on the `DefPath`. Therefore, I introduce a new mechanism for getting symbol names (and also paths for user display) called `item_path`. This is kind of simplistic for now (based on strings) but I expect to expand it later to support richer types, hopefully generating C++-mangled names that gdb etc can understand. Along the way I cleaned up how we track the path that leads to an extern crate. There is still some cleanup left undone here. Notably, I didn't remove the impl names altogether -- that would probably make sense. I also didn't try to remove the `item_symbols` vector. Mostly I want to unblock my other incr. comp. work. =) r? @eddyb cc @eddyb @alexcrichton @michaelwoerister
This PR changes the way symbol names are generated by the compiler. The new algorithm reflects the current state of the discussion over at rust-lang/rfcs#689.
Once it is done, it will also fix issue #30330. I want to add a test case for that before closing it though.
I also want to do some performance tests. The new algorithm does a little more work than the previous one due to various reasons, and it might make sense to adapt it in a way that allows it to be implemented more efficiently.
@nikomatsakis: It would be nice if there was a way of finding out if a
DefPath
refers to something in the current crate or in an external one. The information is already there, it's just not accessible at the moment. I'll probably propose some minor changes there, together with some facilities to allow for accessingDefPaths
without allocating aVec
for them.TODO
Actually "crate qualify" symbols, as promised in the docs.Add a test case showing that symbol names are deterministic.One thing that might be interesting to the @rust-lang/compiler team:==> Just switched things back to SHA-2 for now.I've used SipHash exclusively now for generating symbol hashes. Previously it was only used for monomorphizations and the rest of the code used a truncated version on SHA256. Is there any benefit to sticking to SHA? I don't really see one since we only used 64 bits of the digest anyway, but maybe I'm missing something?