-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify symbol version mangling #689
Comments
There is further reason to change the hashing scheme. In particular, using the entire crate in the name of every symbol is completely incompatible with incremental compilation. To make things work, I think you cannot use the data from other symbols at all (or at least only very, very carefully). That means we can:
@alexcrichton and I talked about this at one point but I sort of have forgotten our conclusion. We should settle on something and update the incr. comp. RFC to reflect the consensus. |
If we just always built the hashes that we plan to use for incremental compilation (that is, hashes that also include the dependencies of a symbol) that would solve the problem without much additional cost in complexity. And it should shield against name clashes when using two versions of the same item. Using the crate version number will also solve this problem though (if somewhat less reliably because it depends on really adhering to proper version numbering) |
Just had a conversation with @alexcrichton on IRC. The synopsis:
Based on my read of the prior comments and things, this still seems to address the main use-cases for hashing. But there is a lot of prior commentary, so if something is being overlooked, please do bring it up. |
Will it? Let's say we have the following function and struct in two versions of the same crate: struct SomeStruct(SomeOtherStruct);
fn foo() -> SomeStruct {
return bar();
} The only difference in the two versions of the crate is the definition of // in version A
struct SomeOtherStruct(u64);
// in version B
struct SomeOtherStruct(u32); This makes the two versions of |
Yes, sorry. I should have written "will likely cause". |
@alexcrichton points out that we could hash not only the path to the struct's name, but also its contents, which would address this issue. Essentially we create a hash for the struct definition, and when you reference that struct, you include that hash (or something like that). |
Because it seems that @alexcrichton didn't know what I was talking about, I just wanted to elaborate on this "dummy symbol" idea I was tossing around. The idea would be to encode the hash of the crate's AST not on EVERY symbol, but just on ONE symbol we make at the end. So for each crate we'd make a symbol like: fn crate$HASH() {
dep1$HASH1();
dep2$HASH2();
dep3$HASH3();
} where The entire purpose of this would be to generate "defensive" link errors where a version is updated in place. You'd still need a hash on every symbol with crate+version-number to allow for multiple versions of a library to co-exist. |
I know, but at that point we would effectively also hash all the dependencies of a type, right? |
That's quite a neat idea |
Is it a viable alternative to just append some random characters (rather than a hash) every time you generate code for a function, and store the name in the output file's AST (and incremental compilation database)? |
This has always been a complicated topic. The requirements and implementation have shifted a lot over the years, without I think anybody really understanding them, myself included. TL;DR I don't think the way we are attempting to encode the entire dag/ast structure into symbols is buying us much and we can do a simpler scheme that is compatible w/ incremental compilation. This is the current description of symbol hashing. From what I can tell the motivations under 'few issues to handle' are still accurate, except that I don't know what
means, and we're definitely violating it. The 'hash' in the filenames is now just an arbitarary string that the build tooling may optionally pass to rustc via the I suspect the technical description under 'here is what we do' is very wrong. Just some quick notes from reviewing what symbol hashing actually does: The symbol hashing function is symbol_hash. It hashes a type, along with same crate-global data called LinkMeta contains the crate name and the Strict Version Hash. The SVH is a hash of the crate contents. Note that it also hashes the command-line provided metadata. It also hashes the crate attributes. The SVH is designed to allow the compile-time crate-loader to detect all upstream changes so it can generate errors and demand a recompile. We did this as a blunt way to fix years of weird errors reading metadata, with the understanding that it prevented any sort of binary upgrading. I believe the primary effect of the SVH today is that, in dependency chain A -> B -> C, if A is rebuilt with modifications, then B is not rebuilt, then C is rebuilt, the build of C will fail because the crate A that B references contains a different SVH than when B was originally built and thus B's deps don't resolve. I'm surprised that I don't see anywhere where the SVH is transitively hashing the SVH of its dependencies. I would expect such transitive hashing to sort of 'lock' the crates together, ensure that two crates that have the same source but resolve two different versions of an upstream lib are different. As it is it looks like all the hashes within a crate only depend on crate-local information. This can't be right... Cargo uses
NB: mangle_exported_name is doing some weird hacks. There are at least three paths in trans that define symbol names:
It's quite an ad-hoc clusterfuck. From my understanding the current requirements here are actually pretty simple:
The old system had a notion of symbol versioning as well, for binary upgrades, but we've never used it and no longer mangle a version into the symbol name. Such a system is moot as long as we're using something SVH-like to force non-binary-compatibility when crates change. I'm very inclined to redesign the whole thing from scratch. Some quick ideas:
cc @graydon You probably remember better than I the original motivations. I do think we've basically given up on some of them. |
One thing that today's Rust no longer does is guarantee anything about whether the crate loaded at runtime is the same as the crate seen at compiletime. We used to have stronger guarantees when the filename was derived from the crate version/contents. I don't recall exactly why we changed this behavior, but it was probably because having the filenames change constantly was annoying, or maybe because cargo couldn't predict them (this could have something to do with the cryptic 'tooling' comments in the old description of version hashing). |
I don't recognize any of the code there; I think it's all been worked over many times since I worked on it. Plus it appears that cargo took over half the things rustc was trying to do in the meantime, and a significant number of design assumptions in linkage and versioning have changed. And I haven't a clue what "a post-cargo world" might refer to. I encourage you to err on the side of hashing both types and (transitive) dependencies into symbols, and to try to ensure that multiple crates with the same self-chosen symbolic name can coexist in a single program. It might also be worth trying to recover the logic underlying separate hashes and version fields in names; I believe it had to do with the ability (indeed, requirement on linuxes) to have a major-number soname pointing to one of many different minor-version dylibs. Beyond that I'm not sure I'll be much help untangling the current technical concerns. (I'm also on vacation presently, so...) |
Hm. Re-reading and reconsidering. I don't get what the motive is for the proposed "Accept that having unique symbol names requires having unique crate names and versions". This seems like going back to a world of a global linkage namesapce, which we worked quite hard to avoid. Am I misunderstanding? If not, why? I would point out that the SVH is the code that seems new to me. The other bits seem, on re-reading, somewhat similar to how things were when I last looked at this (years ago!) ... the "ad-hoc clusterfuck" part seems a bit exaggerated. It has a simple structure: it's a funnel into mangle, either directly or via a couple functions that work out the string-name and hash to provide:
|
It is, and I considered some sort of UUID-based approach, but I'd really prefer if incr. comp. and "from scratch compilation" produce the same, deterministic results. Having deterministic symbol names has other advantages too. |
Yes. But just for types, and it's not the "dependencies" of the type, it's the contents of the type. This is well-defined, whereas dependencies are an approximation that the compiler gathers up internally. This means that the hash would be portable amongst many implementations. It also means we could implement it today before we have a firm notion of dependencies. |
PS, I think that I just recognized a new (potential) requirement I hadn't fully considered before. That is, portability amongst implementations. It seems ideal to me if compiling the same thing twice would give the same result, and even more so if it would give predictable, deterministic results (even if those results are not easily predictable). |
Isn't this somewhat ensured by the SVH? That is, if the name of every symbol reflects the entirety of the crate...? |
This seems like a key point I'd like to dig into a bit more. First off, I think it gets at @brson's questions about a "post cargo" world -- I think the more accurate term is "post-crates.io". Given that we have a central repository of code, most crate names are fairly unique. But it seems clearly better if we can accommodate two crates with same name, but I want to try and understand how similar those crates can be. For example, even the SVH will fail if I copy-and-paste the same code and compile it twice under two distinct crate names, even if there is no logical problem there. It seems like at the extreme end, a random UUID is required. OTOH, the proposal of hashing the signatures (and perhaps we even throw in the local AST contents, but not the contents of any referenced symbols) of every symbol seems pretty likely to succeed. Crates would have to have the same fn with the same impl and the same types in order to have a conflict. Now, this could certainly occur: like, there might be some common crate extern crate foo; // same in both crates
fn frobify() -> foo::Data { // same in both crates
foo::frobify()
} But barring this, I think the two crates would not conflict (assuming no hash collisions, of course). Before I go further, let me check: Am I missing some other scenario? In any case, it seems like we are sort of converging on one potential design here, where we hash the contents of each item along with some set of dependencies. I think it's important that this set of dependencies be pretty easy to specify and not be based on the compiler's internal dependency set. But it could be a bigger set than I specified above: for example, when hashing a function The fact that all this effort still doesn't give a guarantee (unless I'm confused, which is quite possible) does kind of bother me, I have to say. But maybe it's good enough in practice. It also depends I guess on how much importance we plan on shared dylibs and so forth. I'm rapidly coming to the conclusion that binary distributions don't carry their weight, for a variety of reasons. So if we entertain the notion that we are always building from source (perhaps "pre-compiled" source like optimized MIR), then it seems like we might get away with just having user-provided "salt". Cargo gets this from crates.io, but some linux distro might add their own salt when compiling rust packages based on whatever universal database they use. In other words, perhaps just exporting the problem IS the most reliable solution? |
PS @graydon thanks for commenting even on your vacation ;) that was enlightening. |
Just to clarify what I wrote here:
What I mainly mean is that if I am authoring some Rust library, I would not distribute pre-compiled forms. I would post the source somewhere (perhaps in mangled form). This is of course the Python/Ruby model, but it's also the Java model ( |
Yes, that's basically right, it is saying that Cargo package names are the global namespace. It's a simplifying assumption to avoid having do redesign exactly what dependencies get hashed into symbols. I do think it would be better to include dependencies, but also that (effectively) everything will work if we don't. |
Yes, that's probably right. |
Would one still be able to rename a cargo package locally, at the site of use? IOW build two packages that call themselves "foo" as different names in the crate using them, in order to link them together into a composite work? My concern is that we worked hard to ensure this capability -- no global cross-crate namespace -- and it seems valuable in terms of allowing people naming freedom without having to always consult a global registry. (Especially if you're planning on supporting multiple such registries, private registries, or crates developed in house that might otherwise collide with cargo central names) |
The different hashing for monomorphization probably indicates additional requirements that I don't recall offhand. But, for example, when I was working on collapsing identical monomorphizations from different crates, I needed the symbol names to hash to the same value without the two crates having knowledge of each others' existance. |
I agree wholeheartedly with @brson's summary and his action items at the end are exactly what I would propose as well. To add a bit, I think a crate having a "unique name and version" can be generalized to what @nikomatsakis was saying by saying a crate should have a "unique salt". Cargo will calculate this salt by hashing the (name, version, source), and other systems can create the salt differently. Either way it's not talking specifically about names or versions, just a hopefully-unique string going into compilation. I also think that we may not want to dive too much into the area about future-proofing ourselves for strong ABI compatibility just yet. We're so far away from that place that I'd fear we don't even know what to plan for in terms of symbol naming. We can always change the names of the symbols at any time with the way we compile today, so I wouldn't be too worried. Especially if we create a simple system it'll be easier to change!
Within the world of Cargo, yeah this should always be possible. Cargo's unit of deduplication is a triple of (name, version, source) where the source includes the URL of the kind of source in question, and this unit is hashed and passed down to the compiler to be mixed into all symbols (e.g. the metadata @brson was alluding to). Along those lines this use case should be covered already so long as the compiler guarantees that the command-line-given-metadata is mixed into the symbol hashes somehow. |
My assumption would be that this hash is to disambiguate monomorphic instances of the same generic function. It's interesting though that this seems to sidestep the whole SVH business, making monomorphized symbols the only thing not dependent on it. Looks like an oversight more than a conscious decision. |
Gotcha, yeah, the triples you're describing from cargo are serving the same role as pkgids were in the earlier design (and in earlier comments in the source -- some of which have I think been erroneously changed to refer to crate names alone, which is a bit too crude). So long as two symbolic crate names "foo" and "foo" can be segregated by different source strings input to the hash, you should be fine. |
So I want to see this get implemented soon. I just re-read the thread and I think the following proposal seems to be what we all collectively agreed upon:
The major thing which this scheme does not attempt to prevent is user error in terms of guaranteeing correct compilation of dependencies. In particular, if there is a crate A used by a crate B used by a crate C and crate A changes, it may be that one can recompile crate C without noticing that crate B has not been recompiled. Today this (in principle) fails because crate B's SVH would have to change (although as @brson noted this may not be true if the dependencies are not incorporated into the SVH). We could in principle sort of prevent this case with a dummy symbol or something like that, however, we choose not to because:
One consideration not raised here is the "debugger experience". It'd be nice to get these hashes into the symbol name in an unintrusive-way. We also might want to have symbols include the crate name as a prefix, e.g., Does this seem like an accurate summary to everyone? UPDATE: Replaced |
(Note that we can continue to keep the SVH in the metadata if want, so that compiles continue to fail in the same fashion that they do now. We just don't want it to be a part of all symbol names.) |
(Note also that I changed from what I originally wrote, in that I substituted the |
@nikomatsakis sounds like an accurate summary to me! |
I'm happy to try this, but worried about going back to this state. I remember that adding the SVH was the thing that stopped us having bizarre linker errors all the time. You suggest that in a cargo world this doesn't matter, but the problem it solved for us was our own pain in tree. In an incremental compilation world I might expect, instead of a crate-level SVH, that all items hash their upstream item dependencies (not crate dependencies), which can continue to be validated for consistency like the SVH. |
I guess you mean "made those errors fail faster"? That is, it seems like this only helps if our makefiles are buggy...or am I missing something?
Yes, @michaelwoerister suggested something like this earlier. Perhaps we should do this as an intermediate step. I didn't want to because I wanted an algorithm we could actually spec someday, but maybe it's not worth jumping all the way to that, and better to take incremental steps (no pun intended) that preserve some of the properties of the current system. |
Oh, also: if we keep the SVH -- or something like it -- in crate metadata (which I think we can do), then it would presumably also help give clean errors when you forget to compile some intermediate step, just as it does today. But I'm assuming here that the compiler actually checks the SVH's recorded in the transitive crate graph for compatibility... |
I actually wanted to work on this next and hopefully come up with a PR by the end of the week. I think Niko gave an accurate summary of the suggested solution and also Brian's concern about it is valid. Let's take a look at the requirements again:
The suggested solution makes it possible to have (1), (2) and (3), while (4) and (5) are more troublesome. Point (4) falls out of this approach naturally, but only if we pass on (5). That is, if we only use the commandline provided metadata to salt symbol names, it's easy to end up with two incompatible symbols with the same name and the runtime linker has no way of noticing. The more sophisticated approach, with symbol hashes also incorporating dependencies, would mean that we have to very carefully define what gets incorporated. The hash would need to include anything that changes the binary interface of a function but it should not depend on the function body. Otherwise we lose (4) again. But it would lead to incompatible functions always having different symbol names, thus guaranteeing (5). Another thing to note is monomorphizations: The DefPath of a generic function alone is not sufficient to distinguish two instantiations of that function. We'll have to incorporate type arguments too. That's not hard, but it's something where the current algorithm actually is a bit strange and does things differently from 'normal' symbols AFAIR. I think it would be a good idea to implement the simple version described by Niko now, as it will let us proceed with the incremental compilation work. Having good error messages in the compiler should be doable by storing SVHs in crate metadata and checking those when pulling in dependencies. And I'd suggest to punt on binary compatibility for the time being and add an SVH symbol that will make inconsistent linkage setups fail. |
@nikomatsakis wrote:
Niko and I talked about this some more. A particular case where I believe we did indeed get errors to fail faster (which may or may not fall into the category that @brson is referencing) was when our makefiles were not setting up the LD_LIBRARY_PATH correctly, and we would get semi-useful errors during attempts to dynamically link build products from the wrong stage of the boot-strapped compiler. If the symbols didn't have distinct so-called "salt" for different source code origina, those errors could be masked or yield much harder to diagnose issues Niko and I agreed that this may be an important case to consider However, we should be able to retain this rough safe guard if we perhaps do something like include distinct salt in the distinct stages of the bootstrapped compilers |
Yes I think that would be a good idea in any case. Additionally, in the first implementation (PR rust-lang/rust#31539) of the changes proposed here, I'd like to include the even stronger safe-guard mentioned above: Add an |
Yes, I supposed that's right. It may just not be a problem in our build system anymore. |
WIP: Implement stable symbol-name generation algorithm. This PR changes the way symbol names are generated by the compiler. The new algorithm reflects the current state of the discussion over at rust-lang/rfcs#689. Once it is done, it will also fix issue #30330. I want to add a test case for that before closing it though. I also want to do some performance tests. The new algorithm does a little more work than the previous one due to various reasons, and it might make sense to adapt it in a way that allows it to be implemented more efficiently. @nikomatsakis: It would be nice if there was a way of finding out if a `DefPath` refers to something in the current crate or in an external one. The information is already there, it's just not accessible at the moment. I'll probably propose some minor changes there, together with some facilities to allow for accessing `DefPaths` without allocating a `Vec` for them. **TODO** - ~~Actually "crate qualify" symbols, as promised in the docs.~~ - ~~Add a test case showing that symbol names are deterministic~~. - Maybe add a test case showing that symbol names are stable against small code changes. ~~One thing that might be interesting to the @rust-lang/compiler team: I've used SipHash exclusively now for generating symbol hashes. Previously it was only used for monomorphizations and the rest of the code used a truncated version on SHA256. Is there any benefit to sticking to SHA? I don't really see one since we only used 64 bits of the digest anyway, but maybe I'm missing something?~~ ==> Just switched things back to SHA-2 for now.
Closing in favor of https://internals.rust-lang.org/t/pre-rfc-a-new-symbol-mangling-scheme/8501. |
It's happened once or twice before, but in general I don't think we should be closing RFC issues in favor of internals threads -- one of the main ways in which RFC issues are different from internals threads (and thereby can justify their own existence) is that they are less ephemeral, and therefore more useful for tracking and cross-referencing purposes. |
@glaebhoerl To me, the problem is that there's so much clutter here that it obscures what tracks actionable things that we may want to do in the future (instead of 100s of issues that will likely go nowhere); I expect the linked IRLO thread to eventually become a proper RFC. |
Yeah, the "hazard" is just that if for whatever reason it doesn't, the issue wouldn't then get reopened. So in general it seems like a sounder rule to close-in-favor-of once the RFC or whatever has actually happened. (If we apply case-by-case reasoning, I do agree that it seems unlikely in this one.) And yes, reducing clutter is a noble goal, thanks for working on it :) |
@glaebhoerl Indeed it's not a slam dunk either way I think and it has the risks / drawbacks you mention; so I agree on the general idea. That said, it is my experience that this issue tracker is mainly used to file low-effort low-quality requests that are eventually abandoned, never closed, and never revisited. IRLO seems like a better venue for discussing such topics since it is designed for ephemeral things (as you mentioned before). |
Issue by brson
Thursday Oct 31, 2013 at 23:27 GMT
For earlier discussion, see rust-lang/rust#10208
This issue was labelled with: A-linkage, B-RFC, P-low in the Rust repository
In the spirit of stripping down and simplifying our broken versioning scheme (#10188, #10207), this is a suggestion to remove a bunch of complexity from our name mangling.
The description of how we mangle symbols is mostly correct. The most important thing this mangling gets us is that it allows multiple, different crates to define the functions by throwing in some hashes, and that's worth preserving. The other versioning benefits are theoretical and don't work.
Where the current mangling includes a hash of the crate metadata and the type signature and also a version tag, instead we can just add a unique hash that either represents the crate metadata as currently defined or the strict version hash as suggested in #10207.
cc #2166
The text was updated successfully, but these errors were encountered: