Simplify symbol version mangling #689

Closed
steveklabnik opened this issue Jan 21, 2015 · 44 comments
Labels
T-compiler Relevant to the compiler team, which will review and decide on the RFC.

Comments

@steveklabnik
Member

Issue by brson
Thursday Oct 31, 2013 at 23:27 GMT

For earlier discussion, see rust-lang/rust#10208

This issue was labelled with: A-linkage, B-RFC, P-low in the Rust repository


In the spirit of stripping down and simplifying our broken versioning scheme (#10188, #10207), this is a suggestion to remove a bunch of complexity from our name mangling.

The description of how we mangle symbols is mostly correct. The most important thing this mangling gets us is that it allows multiple, different crates to define the same functions without colliding, by throwing in some hashes, and that's worth preserving. The other versioning benefits are theoretical and don't work.

Where the current mangling includes a hash of the crate metadata, a hash of the type signature, and also a version tag, we can instead just append a single unique hash that represents either the crate metadata as currently defined or the strict version hash as suggested in #10207.

cc #2166
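
For concreteness, the proposed end state is a symbol that is just the item's path plus one trailing hash. A minimal sketch of that shape (everything here is illustrative; the length-prefixed _ZN…E encoding loosely mirrors the Itanium-style mangling rustc has used, but the exact format and helper are hypothetical):

// Illustrative only: item path + a single unique hash (crate-metadata hash or SVH),
// instead of metadata hash + type-signature hash + version tag.
fn mangle(path_segments: &[&str], unique_hash: u64) -> String {
    let mut out = String::from("_ZN");
    for seg in path_segments {
        out.push_str(&seg.len().to_string()); // length-prefixed segment, C++-style
        out.push_str(seg);
    }
    let hash = format!("h{:016x}", unique_hash); // one trailing hash segment
    out.push_str(&hash.len().to_string());
    out.push_str(&hash);
    out.push('E');
    out
}

fn main() {
    // e.g. "_ZN3std2io5print17h0123456789abcdefE"
    println!("{}", mangle(&["std", "io", "print"], 0x0123_4567_89ab_cdef));
}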

@nikomatsakis
Contributor

There is further reason to change the hashing scheme. In particular, hashing the entire crate into the name of every symbol is completely incompatible with incremental compilation. To make things work, I think you cannot use data from other symbols at all (or at least only very, very carefully).

That means we can:

  1. use the crate-name and version-number
  2. use the signature of the symbol itself
  3. if we want to get fancy, use the info from things that the symbol depends on, since if those change, the symbol must be rebuilt anyhow

@alexcrichton and I talked about this at one point, but I have sort of forgotten our conclusion. We should settle on something and update the incr. comp. RFC to reflect the consensus.

cc @brson @michaelwoerister

@michaelwoerister
Member

If we just always built the hashes that we plan to use for incremental compilation (that is, hashes that also include the dependencies of a symbol) that would solve the problem without much additional cost in complexity. And it should shield against name clashes when using two versions of the same item.

Using the crate version number would also solve this problem, though somewhat less reliably, because it depends on everyone really adhering to proper version numbering.

@nikomatsakis
Contributor

Just had a conversation with @alexcrichton on IRC. The synopsis:

  1. We should hash the symbols with the combination of:
    • the package id supplied by cargo (or, if using command line, just the crate name);
    • the types of its local symbols (which are persistently hashable, since they are based on DefIds)
  2. It's not worth hashing the types of all dependent symbols, because:
    • publishing two versions of the same crate with different types will still cause conflicts locally
    • in a post-cargo world, it's not clear how realistic this scenario is anyway
  3. It occurs to me that if we REALLY wanted the full crate hash, we could use a single dummy symbol.

Based on my read of the prior comments and things, this still seems to address the main use-cases for hashing. But there is a lot of prior commentary, so if something is being overlooked, please do bring it up.

@michaelwoerister
Member

publishing two versions of the same crate with different types will still cause conflicts locally

Will it? Let's say we have the following function and struct in two versions of the same crate:

struct SomeStruct(SomeOtherStruct);

fn foo() -> SomeStruct {
    return bar();
}

The only difference in the two versions of the crate is the definition of SomeOtherStruct, e.g.:

// in version A
struct SomeOtherStruct(u64);

// in version B
struct SomeOtherStruct(u32);

This makes the two versions of foo() incompatible at the ABI level, but their hashes would be the same: going (solely) through DefIds hides the different binary layouts of SomeStruct, since the DefId of SomeOtherStruct is the same in both versions (unless a correct crate version number is part of the DefId, which we cannot rely on in general).

@nikomatsakis
Contributor

@michaelwoerister

Will it? Let's say we have the following function and struct in two versions of the same crate:

Yes, sorry. I should have written "will likely cause".

@nikomatsakis
Contributor

@alexcrichton points out that we could hash not only the path to the struct's name, but also its contents, which would address this issue. Essentially we create a hash for the struct definition, and when you reference that struct, you include that hash (or something like that).
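
A minimal sketch of that idea, with a made-up TyDef type standing in for the compiler's type representation and std's DefaultHasher standing in for whatever stable hash function would actually be used:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical, simplified view of a type's definition.
enum TyDef {
    Primitive(&'static str), // e.g. "u32" or "u64"
    Struct { path: &'static str, fields: Vec<TyDef> },
}

// Hash a type by its path *and* its contents, so changing a field's type
// (u64 -> u32 in the SomeOtherStruct example above) changes every hash that
// mentions it, even though the path/DefId stays the same.
fn content_hash(ty: &TyDef, h: &mut DefaultHasher) {
    match ty {
        TyDef::Primitive(name) => name.hash(h),
        TyDef::Struct { path, fields } => {
            path.hash(h);
            for field in fields {
                content_hash(field, h);
            }
        }
    }
}

fn main() {
    let some_struct = TyDef::Struct {
        path: "krate::SomeStruct",
        fields: vec![TyDef::Struct {
            path: "krate::SomeOtherStruct",
            fields: vec![TyDef::Primitive("u64")], // version B would say "u32" here
        }],
    };
    let mut h = DefaultHasher::new();
    content_hash(&some_struct, &mut h);
    println!("{:x}", h.finish()); // differs between version A and version B
}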

@nikomatsakis
Contributor

Because it seems that @alexcrichton didn't know what I was talking about, I just wanted to elaborate on this "dummy symbol" idea I was tossing around. The idea would be to encode the hash of the crate's AST not on EVERY symbol, but just on ONE symbol we make at the end. So for each crate we'd make a symbol like:

fn crate$HASH() { 
    dep1$HASH1();
    dep2$HASH2();
    dep3$HASH3();
}

where depN$HASHN are the dummy symbols from our dependencies.

The entire purpose of this would be to generate "defensive" link errors when a version is updated in place. You'd still need a hash on every symbol with crate + version number to allow multiple versions of a library to coexist.
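
Spelled out in plain Rust terms, the mechanism would look roughly like this (all names and hashes below are made up; in practice the compiler would generate these symbols, and this snippet only links once the dependencies provide theirs):

// One guard symbol per crate, with the crate's hash embedded in its name; its
// body references the guard symbols of its direct dependencies. If a dependency
// is rebuilt in place and its hash changes, the reference no longer resolves
// and linking fails loudly instead of producing a silently mismatched binary.
extern "C" {
    fn dep1_guard_8c1f2a(); // emitted by dependency 1 (hypothetical name)
    fn dep2_guard_03b9e7(); // emitted by dependency 2 (hypothetical name)
}

#[no_mangle]
pub extern "C" fn crate_guard_5d44c0() {
    unsafe {
        dep1_guard_8c1f2a();
        dep2_guard_03b9e7();
    }
}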

@michaelwoerister
Member

@alexcrichton points out that we could hash not only the path to the struct's name, but also its contents, which would address this issue. Essentially we create a hash for the struct definition, and when you reference that struct, you include that hash (or something like that).

I know, but at that point we would effectively also hash all the dependencies of a type, right?

@michaelwoerister
Member

The idea would be to encode the hash of the crate's AST not on EVERY symbol, but just on ONE symbol we make at the end

That's quite a neat idea :)

@comex

comex commented Nov 10, 2015

Is it a viable alternative to just append some random characters (rather than a hash) every time you generate code for a function, and store the name in the output file's AST (and incremental compilation database)?

@brson
Contributor

brson commented Nov 11, 2015

This has always been a complicated topic. The requirements and implementation have shifted a lot over the years, without I think anybody really understanding them, myself included.

TL;DR: I don't think the way we are attempting to encode the entire DAG/AST structure into symbols is buying us much, and we can do a simpler scheme that is compatible with incremental compilation.

This is the current description of symbol hashing. From what I can tell the motivations under 'few issues to handle' are still accurate, except that I don't know what

the hash shown in the filename needs to be predictable and stable for build tooling integration. It also needs to be using a hash function which is easy to use from Python, make, etc.

means, and we're definitely violating it. The 'hash' in the filenames is now just an arbitrary string that the build tooling may optionally pass to rustc via the -C extra-filename flag.

I suspect the technical description under 'here is what we do' is very wrong.

Just some quick notes from reviewing what symbol hashing actually does:

The symbol hashing function is symbol_hash. It hashes a type, along with some crate-global data called LinkMeta, as well as the ambiguously-named crate_metadata, which is just an optional vec of strings passed on the command line via -C metadata.

LinkMeta contains the crate name and the Strict Version Hash.

The SVH is a hash of the crate contents. Note that it also hashes the command-line provided metadata. It also hashes the crate attributes.

The SVH is designed to allow the compile-time crate loader to detect all upstream changes so it can generate errors and demand a recompile. We did this as a blunt way to fix years of weird errors reading metadata, with the understanding that it prevented any sort of binary upgrading. I believe the primary effect of the SVH today is that, in a dependency chain A -> B -> C, if A is rebuilt with modifications, B is not rebuilt, and C is then rebuilt, the build of C will fail: the crate A that B references has a different SVH than when B was originally built, and thus B's deps don't resolve.

I'm surprised that I don't see anywhere where the SVH transitively hashes the SVHs of its dependencies. I would expect such transitive hashing to sort of 'lock' the crates together, ensuring that two crates that have the same source but resolve two different versions of an upstream lib hash differently. As it is, it looks like all the hashes within a crate depend only on crate-local information. This can't be right...
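
For reference, the transitive "locking" described here would look roughly like the following sketch (the helper name is hypothetical, and std's DefaultHasher stands in for whatever stable hash function the compiler would actually pin down):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// An SVH that also folds in the SVHs of the crate's direct dependencies, so the
// same source built against different upstream versions hashes differently.
fn transitive_svh(crate_contents: &str, cli_metadata: &[String], dep_svhs: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    crate_contents.hash(&mut h); // stands in for hashing the crate's AST and attributes
    cli_metadata.hash(&mut h);   // the -C metadata strings mentioned above
    for dep_svh in dep_svhs {
        dep_svh.hash(&mut h);    // 'lock' this crate to exact upstream SVHs
    }
    h.finish()
}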

Cargo uses -C metadata to, well, do whatever it wants:

  • I'm not sure where Cargo's 'metadata' originates from - it's specified in the constructors of most kinds of targets.
  • @alexcrichton says it boils down to the crate name, version and the 'source' (to distinguish the exact same source built from e.g. git vs crates.io), and the extra data being mixed in is mostly about changing the filename of test-crates to avoid on-disk conflicts.
  • It seems really suspicious that we're relying on the build tool to do some non-trivial work to prevent symbol collisions. It would seem more robust, for the non-Cargo world, for rustc to understand the possible configurations that produce conflicts and require Cargo to specify them semantically, rather than for Cargo to just throw a blob at rustc to fix the problem. It strikes me as making up for an insufficient definition of the SVH.

NB: mangle_exported_name is doing some weird hacks.

There are at least three paths in trans that define symbol names.

It's quite an ad-hoc clusterfuck.

From my understanding the current requirements here are actually pretty simple:

  • Avoid symbol name collisions globally. Ideally the information that goes into these hashes derives only from the structure of the crate DAG and its ASTs, but we see with crate_metadata that some of the uniquifying information is provided from outside.
  • Detect when dependencies of dependencies have changed and thus the metadata is no longer accurate. This is what the SVH does, and actually has little to do with symbol naming, though the SVH is hashed into symbols.

The old system had a notion of symbol versioning as well, for binary upgrades, but we've never used it and no longer mangle a version into the symbol name. Such a system is moot as long as we're using something SVH-like to force non-binary-compatibility when crates change.

I'm very inclined to redesign the whole thing from scratch.

Some quick ideas:

  • Accept that having unique symbol names requires having unique crate names and versions. That is, two different (or the same) crates named "foo=1.0.0" may have symbol collisions. We don't need to worry about transitively hashing dependencies of things; we're not trying to hash the structure of the entire upstream dag. In a Cargo-based world, Cargo can ensure that crate-identifying information is unique enough.
  • Continue creating an SVH for a crate, but now as the final step of building a crate, encoding its complete structure. Downstream deps will continue to put their upstream SVHs in their metadata, solely for the crate loader to detect inconsistencies. Though I didn't understand it fully, this sounds like what niko was proposing. The change here is that SVHs are no longer encoded into symbol hashes, but just passed around in the metadata for the crate loader.
  • The symbol hashes only encode the crate version (including whatever cargo-derived metadata is deemed necessary), the monomorphized function type, and whatever other additional crate-local info is needed to make it unique (as seen in the mangle_exported_name hack). A rough sketch follows this list.
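
A rough sketch of that last bullet (helper names are hypothetical; DefaultHasher stands in for a stable hash, and the string arguments stand in for whatever the compiler would actually feed in):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// symbol hash = crate-identifying salt + monomorphized type + crate-local disambiguator
fn proposed_symbol_hash(crate_salt: &str, mono_fn_type: &str, local_disambiguator: u32) -> u64 {
    let mut h = DefaultHasher::new();
    crate_salt.hash(&mut h);          // e.g. name + version + cargo-provided metadata
    mono_fn_type.hash(&mut h);        // e.g. "fn(&str) -> foo::Data"
    local_disambiguator.hash(&mut h); // whatever mangle_exported_name needs for uniqueness
    h.finish()
}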

cc @graydon You probably remember better than I the original motivations. I do think we've basically given up on some of them.

@brson
Contributor

brson commented Nov 11, 2015

One thing that today's Rust no longer does is guarantee anything about whether the crate loaded at runtime is the same as the crate seen at compiletime. We used to have stronger guarantees when the filename was derived from the crate version/contents. I don't recall exactly why we changed this behavior, but it was probably because having the filenames change constantly was annoying, or maybe because cargo couldn't predict them (this could have something to do with the cryptic 'tooling' comments in the old description of version hashing).

@graydon

graydon commented Nov 11, 2015

I don't recognize any of the code there; I think it's all been worked over many times since I worked on it. Plus it appears that cargo took over half the things rustc was trying to do in the meantime, and a significant number of design assumptions in linkage and versioning have changed. And I haven't a clue what "a post-cargo world" might refer to.

I encourage you to err on the side of hashing both types and (transitive) dependencies into symbols, and to try to ensure that multiple crates with the same self-chosen symbolic name can coexist in a single program. It might also be worth trying to recover the logic underlying separate hashes and version fields in names; I believe it had to do with the ability (indeed, requirement on linuxes) to have a major-number soname pointing to one of many different minor-version dylibs. Beyond that I'm not sure I'll be much help untangling the current technical concerns.

(I'm also on vacation presently, so...)

@graydon

graydon commented Nov 11, 2015

Hm. Re-reading and reconsidering. I don't get what the motive is for the proposed "Accept that having unique symbol names requires having unique crate names and versions". This seems like going back to a world of a global linkage namespace, which we worked quite hard to avoid. Am I misunderstanding? If not, why?

I would point out that the SVH is the code that seems new to me. The other bits seem, on re-reading, somewhat similar to how things were when I last looked at this (years ago!) ... the "ad-hoc clusterfuck" part seems a bit exaggerated. It has a simple structure: it's a funnel into mangle, either directly or via a couple of functions that work out the string name and hash to provide:

  • When there's a path, use that as the string name; otherwise derive a name from a type and sequence (e.g., for "glue" functions for types; though I doubt we have anything like that anymore and suspect those helpers could be removed)
  • When there's exported linkage, derive hash from global disambiguation material to avoid collisions among linkage objects, otherwise use crate-local type-context material to avoid collisions among internal instantiations of a polymorphic function

@nikomatsakis
Contributor

@comex

Is it a viable alternative to just append some random characters (rather than a hash) every time you generate code for a function, and store the name in the output file's AST (and incremental compilation database)?

It is, and I considered some sort of UUID-based approach, but I'd really prefer if incr. comp. and "from scratch compilation" produce the same, deterministic results. Having deterministic symbol names has other advantages too.

@nikomatsakis
Contributor

@michaelwoerister

I know, but at that point we would effectively also hash all the dependencies of a type, right?

Yes. But just for types, and it's not the "dependencies" of the type, it's the contents of the type. This is well-defined, whereas dependencies are an approximation that the compiler gathers up internally. This means that the hash would be portable amongst many implementations. It also means we could implement it today before we have a firm notion of dependencies.

@nikomatsakis
Contributor

PS, I think that I just recognized a new (potential) requirement I hadn't fully considered before: portability amongst implementations. It seems ideal to me if compiling the same thing twice would give the same result, and even more so if another implementation could reproduce those deterministic results (even if they are not easily predicted by hand).

@nikomatsakis
Contributor

@brson

One thing that today's Rust no longer does is guarantee anything about whether the crate loaded at runtime is the same as the crate seen at compiletime.

Isn't this somewhat ensured by the SVH? That is, if the name of every symbol reflects the entirety of the crate...?

@nikomatsakis
Contributor

@graydon

try to ensure that multiple crates with the same self-chosen symbolic name can coexist in a single program

This seems like a key point I'd like to dig into a bit more. First off, I think it gets at @brson's questions about a "post cargo" world -- I think the more accurate term is "post-crates.io". Given that we have a central repository of code, most crate names are fairly unique. But it seems clearly better if we can accommodate two crates with the same name, though I want to try and understand how similar those crates can be. For example, even the SVH will fail if I copy-and-paste the same code and compile it twice under two distinct crate names, even though there is no logical problem there. It seems like, at the extreme end, a random UUID is required.

OTOH, the proposal of hashing the signatures (and perhaps we even throw in the local AST contents, but not the contents of any referenced symbols) of every symbol seems pretty likely to succeed. Crates would have to have the same fn with the same impl and the same types in order to have a conflict. Now, this could certainly occur: like, there might be some common crate foo and both the conflicting crates might define the same helper:

extern crate foo; // same in both crates
fn frobify() -> foo::Data { // same in both crates
    foo::frobify()
}

But barring this, I think the two crates would not conflict (assuming no hash collisions, of course). Before I go further, let me check: Am I missing some other scenario?

In any case, it seems like we are sort of converging on one potential design here, where we hash the contents of each item along with some set of dependencies. I think it's important that this set of dependencies be pretty easy to specify and not be based on the compiler's internal dependency set. But it could be a bigger set than I specified above: for example, when hashing a function foo, we might also include the hash of any functions that it calls (figuring out something for cycles, of course).
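
One way to read "figuring out something for cycles" is to break back edges by hashing only the callee's name rather than recursing into it. A minimal sketch (the Item type is made up, DefaultHasher stands in for a stable hash, and all callees are assumed to be present in the map):

use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, BTreeSet};
use std::hash::{Hash, Hasher};

// A function's own contents plus the names of the functions it calls.
struct Item {
    body: &'static str,
    calls: Vec<&'static str>,
}

// Hash an item together with the hashes of everything it (transitively) calls.
// When we hit a function that is already on the current DFS stack (a cycle),
// we hash its name instead of recursing.
fn item_hash(name: &str, items: &BTreeMap<&str, Item>, stack: &mut BTreeSet<String>) -> u64 {
    let mut h = DefaultHasher::new();
    let item = &items[name];
    name.hash(&mut h);
    item.body.hash(&mut h);
    stack.insert(name.to_string());
    for callee in &item.calls {
        if stack.contains(*callee) {
            callee.hash(&mut h); // back edge: hash the name only
        } else {
            item_hash(callee, items, stack).hash(&mut h); // include callee's full hash
        }
    }
    stack.remove(name);
    h.finish()
}

fn main() {
    let mut items = BTreeMap::new();
    items.insert("foo", Item { body: "calls bar", calls: vec!["bar"] });
    items.insert("bar", Item { body: "calls foo", calls: vec!["foo"] }); // a cycle
    println!("{:x}", item_hash("foo", &items, &mut BTreeSet::new()));
}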

The fact that all this effort still doesn't give a guarantee (unless I'm confused, which is quite possible) does kind of bother me, I have to say. But maybe it's good enough in practice. It also depends, I guess, on how much importance we place on shared dylibs and so forth. I'm rapidly coming to the conclusion that binary distributions don't carry their weight, for a variety of reasons. So if we entertain the notion that we are always building from source (perhaps "pre-compiled" source like optimized MIR), then it seems like we might get away with just having user-provided "salt". Cargo gets this from crates.io, but some linux distro might add their own salt when compiling rust packages based on whatever universal database they use. In other words, perhaps just exporting the problem IS the most reliable solution?

@nikomatsakis
Contributor

PS @graydon thanks for commenting even on your vacation ;) that was enlightening.

@nikomatsakis
Contributor

Just to clarify what I wrote here:

I'm rapidly coming to the conclusion that binary distributions don't carry their weight, for a variety of reasons. So if we entertain the notion that we are always building from source (perhaps "pre-compiled" source like optimized MIR), then it seems like we might get away with just having user-provided "salt"

What I mainly mean is that if I am authoring some Rust library, I would not distribute pre-compiled forms. I would post the source somewhere (perhaps in mangled form). This is of course the Python/Ruby model, but it's also the Java model (.class files), and it seems to be how most C packages work these days too. It gives us the advantage that we can re-compile the source for a variety of environments and optimization scenarios (think: cross-compilation, -O0 vs -O3, -g, landing pads, LTO, etc). Now a linux distro might still compile a binary version and distribute that, but they would already be introducing their salt or what have you to ensure a lack of overlap. I guess this basically corresponds to how many C packages let you define some prefix to use for disambiguating their symbols, except that it's invisible to the user.

@brson
Contributor

brson commented Nov 11, 2015

Hm. Re-reading and reconsidering. I don't get what the motive is for the proposed "Accept that having unique symbol names requires having unique crate names and versions". This seems like going back to a world of a global linkage namespace, which we worked quite hard to avoid. Am I misunderstanding? If not, why?

Yes, that's basically right: it is saying that Cargo package names are the global namespace. It's a simplifying assumption, to avoid having to redesign exactly what dependencies get hashed into symbols. I do think it would be better to include dependencies, but also that (effectively) everything will work if we don't.

@brson
Contributor

brson commented Nov 11, 2015

@brson

One thing that today's Rust no longer does is guarantee anything about whether the crate loaded at runtime is the same as the crate seen at compiletime.

Isn't this somewhat ensured by the SVH? That is, if the name of every symbol reflects the entirety of the crate...?

Yes, that's probably right.

@graydon

graydon commented Nov 11, 2015

Would one still be able to rename a cargo package locally, at the site of use? IOW, build two packages that both call themselves "foo", give them different names in the crate using them, and link them together into a composite work? My concern is that we worked hard to ensure this capability -- no global cross-crate namespace -- and it seems valuable in terms of allowing people naming freedom without having to always consult a global registry.

(Especially if you're planning on supporting multiple such registries, private registries, or crates developed in house that might otherwise collide with cargo central names)

@brson
Contributor

brson commented Nov 11, 2015

The different hashing for monomorphization probably indicates additional requirements that I don't recall offhand. But, for example, when I was working on collapsing identical monomorphizations from different crates, I needed the symbol names to hash to the same value without the two crates having knowledge of each other's existence.

@alexcrichton
Member

I agree wholeheartedly with @brson's summary, and his action items at the end are exactly what I would propose as well. To add a bit: I think a crate having a "unique name and version" can be generalized, along the lines of what @nikomatsakis was saying, to a crate having a "unique salt". Cargo will calculate this salt by hashing the (name, version, source) triple, and other systems can create the salt differently. Either way it's not about names or versions specifically, just a hopefully-unique string going into compilation.

I also think that we may not want to dive too much into the area about future-proofing ourselves for strong ABI compatibility just yet. We're so far away from that place that I'd fear we don't even know what to plan for in terms of symbol naming. We can always change the names of the symbols at any time with the way we compile today, so I wouldn't be too worried. Especially if we create a simple system it'll be easier to change!


@graydon

Would one still be able to rename a cargo package locally, at the site of use?

Within the world of Cargo, yeah this should always be possible. Cargo's unit of deduplication is a triple of (name, version, source) where the source includes the URL of the kind of source in question, and this unit is hashed and passed down to the compiler to be mixed into all symbols (e.g. the metadata @brson was alluding to). Along those lines this use case should be covered already so long as the compiler guarantees that the command-line-given-metadata is mixed into the symbol hashes somehow.
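
A sketch of that triple being folded into a single salt that a build tool could pass via -C metadata (field names and the hash function are illustrative, not Cargo's actual code):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Cargo's unit of deduplication, per the comment above.
struct PackageId<'a> {
    name: &'a str,    // e.g. "foo"
    version: &'a str, // e.g. "1.0.0"
    source: &'a str,  // e.g. "registry+https://github.com/rust-lang/crates.io-index"
}

// Hash (name, version, source) into one opaque string that the build tool
// passes to rustc, which then mixes it into every symbol hash.
fn package_salt(pkg: &PackageId<'_>) -> String {
    let mut h = DefaultHasher::new();
    pkg.name.hash(&mut h);
    pkg.version.hash(&mut h);
    pkg.source.hash(&mut h);
    format!("{:016x}", h.finish())
}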

@michaelwoerister
Member

The different hashing for monomorphization probably indicates additional requirements that I don't recall offhand.

My assumption would be that this hash is to disambiguate monomorphic instances of the same generic function. It's interesting though that this seems to sidestep the whole SVH business, making monomorphized symbols the only thing not dependent on it. Looks like an oversight more than a conscious decision.

@graydon

graydon commented Nov 11, 2015

Gotcha, yeah, the triples you're describing from cargo are serving the same role as pkgids were in the earlier design (and in earlier comments in the source -- some of which have I think been erroneously changed to refer to crate names alone, which is a bit too crude). So long as two symbolic crate names "foo" and "foo" can be segregated by different source strings input to the hash, you should be fine.

@nikomatsakis
Contributor

So I want to see this get implemented soon. I just re-read the thread and I think the following proposal seems to be what we all collectively agreed upon:

  1. Compiler accepts some metadata on the command line
  2. Each symbol name gains a hash that is computed from:
    • its DefPath, which is a stable, persistent path derived from the module path (but made unique)
      • this is basically the path that the compiler itself uses to link things in incr. comp.
    • the crate name (just because)
    • compiler metadata from the command line
      • when using cargo, this would include the version number, source, and maybe some other stuff
      • if using some alternative build manager, it'll include whatever identifying data that tool provides (a rough sketch follows this list)
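
A rough sketch of the per-symbol hash described above (the DefPath representation and helper are simplified stand-ins; DefaultHasher stands in for whatever stable hash function would actually be specified):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A DefPath, simplified to (segment, disambiguator) pairs.
type DefPath = Vec<(String, u32)>;

// Per-symbol hash over the DefPath, the crate name, and the command-line metadata.
fn symbol_hash(def_path: &DefPath, crate_name: &str, cli_metadata: &[String]) -> u64 {
    let mut h = DefaultHasher::new();
    for (segment, disambiguator) in def_path {
        segment.hash(&mut h);
        disambiguator.hash(&mut h);
    }
    crate_name.hash(&mut h);
    cli_metadata.hash(&mut h); // e.g. the (name, version, source) salt from cargo
    h.finish()
}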

The major thing which this scheme does not attempt to prevent is user error in terms of guaranteeing correct compilation of dependencies. In particular, if there is a crate A used by a crate B used by a crate C and crate A changes, it may be that one can recompile crate C without noticing that crate B has not been recompiled. Today this (in principle) fails because crate B's SVH would have to change (although as @brson noted this may not be true if the dependencies are not incorporated into the SVH).

We could in principle sort of prevent this case with a dummy symbol or something like that; however, we choose not to because:

  1. Cargo makes this scenario far less likely in practice.
  2. If we ever want to support binary compatibility, then we DON'T necessarily want to force B to be recompiled. In particular, we would want to be able to redeploy a version of A that has a bug fix without recompiling B (or even C).

One consideration not raised here is the "debugger experience". It'd be nice to get these hashes into the symbol name in an unintrusive way. We also might want to have symbols include the crate name as a prefix, e.g., crate::foo::bar::zed::hash.

Does this seem like an accurate summary to everyone?

UPDATE: Replaced DefIndex with DefPath, which is what I meant.

@nikomatsakis
Contributor

(Note that we can continue to keep the SVH in the metadata if we want, so that compiles continue to fail in the same fashion that they do now. We just don't want it to be a part of all symbol names.)

@nikomatsakis
Contributor

(Note also that I changed from what I originally wrote, in that I substituted the DefPath in place of hashing types that appear in the signature or anything like that. I don't think that this hashing serves any particular purpose.)

@alexcrichton
Member

@nikomatsakis sounds like an accurate summary to me!

@brson
Contributor

brson commented Feb 8, 2016

In particular, if there is a crate A used by a crate B used by a crate C and crate A changes, it may be that one can recompile crate C without noticing that crate B has not been recompiled.

I'm happy to try this, but worried about going back to this state. I remember that adding the SVH was the thing that stopped us having bizarre linker errors all the time. You suggest that in a cargo world this doesn't matter, but the problem it solved for us was our own pain in tree.

In an incremental compilation world I might expect, instead of a crate-level SVH, that all items hash their upstream item dependencies (not crate dependencies), which can continue to be validated for consistency like the SVH.

@nikomatsakis
Contributor

@brson

I remember that adding the SVH was the thing that stopped us having bizarre linker errors all the time.

I guess you mean "made those errors fail faster"? That is, it seems like this only helps if our makefiles are buggy...or am I missing something?

In an incremental compilation world I might expect, instead of a crate-level SVH, that all items hash their upstream item dependencies (not crate dependencies), which can continue to be re-validated for consistency like the SVH.

Yes, @michaelwoerister suggested something like this earlier. Perhaps we should do this as an intermediate step. I didn't want to because I wanted an algorithm we could actually spec someday, but maybe it's not worth jumping all the way to that, and better to take incremental steps (no pun intended) that preserve some of the properties of the current system.

@nikomatsakis
Contributor

Oh, also: if we keep the SVH -- or something like it -- in crate metadata (which I think we can do), then it would presumably also help give clean errors when you forget to compile some intermediate step, just as it does today. But I'm assuming here that the compiler actually checks the SVH's recorded in the transitive crate graph for compatibility...

@michaelwoerister
Member

I actually wanted to work on this next and hopefully come up with a PR by the end of the week. I think Niko gave an accurate summary of the suggested solution and also Brian's concern about it is valid. Let's take a look at the requirements again:

  1. We want two versions of the same crate to be able to live next to each other without conflict.
  2. We want symbol names to be stable across multiple incremental compilation runs.
  3. We want the compiler to notice if something is wrong with the graph of crates being linked and give an understandable error message instead of some obscure linker failure.
  4. We want, in principle, to support binary compatibility of crates, so one binary can be swapped out for another if that is semantically correct.
  5. We want to prevent unnoticed linkage inconsistencies where the binary interface of a function changes without downstream crates noticing it.

The suggested solution makes it possible to have (1), (2) and (3), while (4) and (5) are more troublesome. Point (4) falls out of this approach naturally, but only if we pass on (5). That is, if we only use the command-line-provided metadata to salt symbol names, it's easy to end up with two incompatible symbols that have the same name, and the runtime linker has no way of noticing.

The more sophisticated approach, with symbol hashes also incorporating dependencies, would mean that we have to define very carefully what gets incorporated. The hash would need to include anything that changes the binary interface of a function, but it should not depend on the function body; otherwise we lose (4) again. In exchange, it would lead to incompatible functions always having different symbol names, thus guaranteeing (5).
Note that by itself this would not prevent linker conflicts: two similar versions of some crate A would have the same symbol names except where they differ (this could probably be solved with something like COMDAT sections or "linkonce" linkage). All in all this approach would be more user-friendly, although some additional work would probably be needed to provide good error messages in case of inconsistencies.

Another thing to note is monomorphizations: the DefPath of a generic function alone is not sufficient to distinguish two instantiations of that function; we'll have to incorporate type arguments too. That's not hard, but it's something where the current algorithm actually is a bit strange and does things differently from 'normal' symbols, AFAIR.
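
In sketch form (names hypothetical, DefaultHasher standing in for a stable hash): two instantiations of the same generic function share a DefPath, so the concrete type arguments have to be part of the hash as well.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a monomorphized instance: the generic function's path plus its type arguments.
fn instance_hash(def_path: &str, type_args: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    def_path.hash(&mut h);  // e.g. "krate::foo" -- the same for every instantiation
    type_args.hash(&mut h); // e.g. ["u32"] vs ["u64"] -> different symbol hashes
    h.finish()
}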

I think it would be a good idea to implement the simple version described by Niko now, as it will let us proceed with the incremental compilation work. Having good error messages in the compiler should be doable by storing SVHs in crate metadata and checking those when pulling in dependencies. And I'd suggest punting on binary compatibility for the time being and adding an SVH symbol that will make inconsistent linkage setups fail.
The second, more sophisticated approach can be done later and would then enable binary compatibility too. The salt could still be incorporated into symbol names then; it just would not be necessary anymore.

@pnkfelix
Member

@nikomatsakis wrote:

I guess you mean "made those errors fail faster"? That is, it seems like this only helps if our makefiles are buggy...or am I missing something?

Niko and I talked about this some more.

A particular case where I believe we did indeed get errors to fail faster (which may or may not fall into the category that @brson is referencing) was when our makefiles were not setting up the LD_LIBRARY_PATH correctly, and we would get semi-useful errors during attempts to dynamically link build products from the wrong stage of the bootstrapped compiler. If the symbols didn't have a distinct so-called "salt" for different source-code origins, those errors could be masked or yield much harder-to-diagnose issues.

Niko and I agreed that this may be an important case to consider.

However, we should be able to retain this rough safeguard if we do something like including a distinct salt in the distinct stages of the bootstrapped compilers.

@michaelwoerister
Member

However, we should be able to retain this rough safeguard if we do something like including a distinct salt in the distinct stages of the bootstrapped compilers.

Yes I think that would be a good idea in any case.

Additionally, in the first implementation (PR rust-lang/rust#31539) of the changes proposed here, I'd like to include the even stronger safeguard mentioned above: add an SVH_safeguard_<actual SVH> symbol to every crate and link to that in every dependent crate. That way the dynamic linker should fail when trying to link to any other version of a dependency (like now), but we'd still have stable symbols for incremental compilation. These safeguard links would just need to be in their own compilation unit, so they don't interfere with regular compilation-unit invalidation.

@brson
Contributor

brson commented Feb 20, 2016

I guess you mean "made those errors fail faster"? That is, it seems like this only helps if our makefiles are buggy...or am I missing something?

Yes, I suppose that's right. It may just not be a problem in our build system anymore.

bors added a commit to rust-lang/rust that referenced this issue Mar 15, 2016
WIP: Implement stable symbol-name generation algorithm.

This PR changes the way symbol names are generated by the compiler. The new algorithm reflects the current state of the discussion over at rust-lang/rfcs#689.

Once it is done, it will also fix issue #30330. I want to add a test case for that before closing it though.

I also want to do some performance tests. The new algorithm does a little more work than the previous one due to various reasons, and it might make sense to adapt it in a way that allows it to be implemented more efficiently.

@nikomatsakis: It would be nice if there was a way of finding out if a `DefPath` refers to something in the current crate or in an external one. The information is already there, it's just not accessible at the moment. I'll probably propose some minor changes there, together with some facilities to allow for accessing `DefPaths` without allocating a `Vec` for them.

**TODO**
 - ~~Actually "crate qualify" symbols, as promised in the docs.~~
 - ~~Add a test case showing that symbol names are deterministic~~.
 - Maybe add a test case showing that symbol names are stable against small code changes.

~~One thing that might be interesting to the @rust-lang/compiler team:
I've used SipHash exclusively now for generating symbol hashes. Previously it was only used for monomorphizations, and the rest of the code used a truncated version of SHA-256. Is there any benefit to sticking to SHA? I don't really see one since we only used 64 bits of the digest anyway, but maybe I'm missing something?~~ ==> Just switched things back to SHA-2 for now.
@Centril added the T-compiler label on Feb 23, 2018
@Centril
Contributor

Centril commented Oct 7, 2018

@Centril closed this as completed on Oct 7, 2018
@glaebhoerl
Contributor

It's happened once or twice before, but in general I don't think we should be closing RFC issues in favor of internals threads -- one of the main ways in which RFC issues are different from internals threads (and thereby can justify their own existence) is that they are less ephemeral, and therefore more useful for tracking and cross-referencing purposes.

@Centril
Contributor

Centril commented Oct 8, 2018

@glaebhoerl To me, the problem is that there's so much clutter here that it obscures the issues tracking actionable things we may want to do in the future (as opposed to the hundreds of issues that will likely go nowhere); I expect the linked IRLO thread to eventually become a proper RFC.

@glaebhoerl
Contributor

Yeah, the "hazard" is just that if for whatever reason it doesn't, the issue wouldn't then get reopened. So in general it seems like a sounder rule to close-in-favor-of once the RFC or whatever has actually happened. (If we apply case-by-case reasoning, I do agree that it seems unlikely in this one.)

And yes, reducing clutter is a noble goal, thanks for working on it :)

@Centril
Contributor

Centril commented Oct 8, 2018

@glaebhoerl Indeed it's not a slam dunk either way I think and it has the risks / drawbacks you mention; so I agree on the general idea. That said, it is my experience that this issue tracker is mainly used to file low-effort low-quality requests that are eventually abandoned, never closed, and never revisited. IRLO seems like a better venue for discussing such topics since it is designed for ephemeral things (as you mentioned before).
