Should char have the same layout as u32? #462

Closed
djkoloski opened this issue Sep 25, 2023 · 15 comments
Labels
A-layout (Topic: Related to data structure layout (`#[repr]`)) · S-not-opsem (Despite being in this repo, this is not primarily a T-opsem question) · T-lang

Comments

@djkoloski

djkoloski commented Sep 25, 2023

I'd like to be able to make the assumption that char is ABI compatible with u32 for some validation code in bytecheck. I have actually been making this assumption for a while and never had problems (including with MIRI), so I think this is de-facto the case. I thought that char and u32 were supposed to have the same layout but when I went looking I realized I couldn't find anything. As far as I know, it doesn't really make sense for chars to be anything other than a u32 with fewer valid bit patterns.

For completeness:

@thomcc
Member

thomcc commented Sep 25, 2023

Yes, I've written plenty of code that relies on this. If we don't write this down anywhere, we should.

saethlin added the A-layout, S-not-opsem, and T-lang labels Sep 25, 2023
@thomcc
Member

thomcc commented Sep 25, 2023

> 32-bit unsigned word ... I don't think this can be interpreted as having the same ABI as u32.

I'm not sure how it couldn't?

@djkoloski
Author

For the purposes of justifying unsafe code, I think it's not tightly worded enough. My interpretation is that char could be 1-aligned but still represent a 32-bit unsigned word. I also looked for examples of transmuting &u32 to &char or vice-versa and didn't find any in the standard library.

@RalfJung
Member

Note that you are asking two different questions:

  • Do they have the same layout (size and alignment)? This seems strongly implied by the docs, but indeed the docs do not talk about alignment.
  • Do they have the same function ABI? That's not discussed at all in the current docs. In particular, given that u32 and i32 are not ABI-compatible on all targets, the docs would at least have to clarify which of the two char should be ABI-compatible with.

However, this is really more of a t-lang thing than a t-opsem thing. It's about making a stable commitment about the layout and ABI of char going forward.
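The layout half of the question (size and alignment) can be checked mechanically on a given target. The following is a minimal sketch; `char_has_u32_layout` is a hypothetical helper name, and a passing check only shows what the current compiler and target do, not what the language guarantees.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: true when `char` and `u32` agree on both
/// size and alignment on the target being compiled for.
fn char_has_u32_layout() -> bool {
    size_of::<char>() == size_of::<u32>() && align_of::<char>() == align_of::<u32>()
}

// The same check at compile time: the build fails if either assertion is false.
const _: () = assert!(size_of::<char>() == size_of::<u32>());
const _: () = assert!(align_of::<char>() == align_of::<u32>());
```

Function call ABI is a separate property and is not covered by either check.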

@thomcc
Member

thomcc commented Sep 25, 2023

> given that u32 and i32 are not ABI-compatible on all targets

Is that true? On which targets is this the case? (Is it just the case due to CFI stuff?)

@RalfJung
Member

I think it was on s390x that u32 gets zero-extended and i32 gets sign-extended, both passed in a larger register. So if a function takes a u32 it can assume that all the high bits are 0, and if the caller passes an i32 then it can violate that assumption, making them not ABI-compatible.

See rust-lang/rust#115476 for which types we can actually guarantee are ABI-compatible.
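The extension mismatch described above can be modelled with ordinary integer casts. This is only an illustration of zero-extension versus sign-extension into a 64-bit value, not the actual calling convention of any target; `widen_unsigned` and `widen_signed` are hypothetical names.

```rust
/// Models a `u32` argument being zero-extended into a 64-bit register.
fn widen_unsigned(x: u32) -> u64 {
    x as u64
}

/// Models an `i32` argument being sign-extended into a 64-bit register.
fn widen_signed(x: i32) -> u64 {
    // Widen as a signed value first so the sign bit is replicated,
    // then reinterpret the 64 bits as unsigned.
    x as i64 as u64
}
```

The same 32 bits (all ones) widen to different 64-bit values, so a callee that assumes the high bits are zero would be wrong if the caller had sign-extended.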

@RalfJung
Member

For this particular issue I would suggest opening a PR which adds a statement to the docs saying explicitly that char is fully layout and ABI-compatible with u32. We then nominate that for t-lang discussion and have them FCP this decision.

I didn't add char in rust-lang/rust#115476 since I wanted to only document what are basically existing guarantees, not add new ones.

> I have actually been making this assumption for a while and never had problems (including with MIRI)

FWIW latest Miri does not consider u32 and char to be ABI-compatible, it will error in such cases.

@djkoloski
Author

Thanks for clarifying @RalfJung, I understand the difference between layout compatibility and ABI compatibility much better now.

I only care about char having the same layout and representation as u32 so I can cast a *const char to a *const u32 to read and validate it (given that the original pointer is aligned for char and points to size_of::<char>() initialized bytes).
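A minimal sketch of that pointer-cast validation pattern might look like the following. `validate_char` is a hypothetical helper, not bytecheck's API, and the code assumes `char` and `u32` have the same size and alignment.

```rust
/// Hypothetical validation helper: reads the bytes behind `ptr` as a `u32`
/// and returns the `char` only if the bit pattern is a valid one.
///
/// # Safety
/// `ptr` must be aligned for `char` and point to `size_of::<char>()`
/// initialized bytes.
unsafe fn validate_char(ptr: *const char) -> Option<char> {
    // Read through a `*const u32` so that an invalid `char` value is never
    // produced. This relies on `char` and `u32` sharing size and alignment.
    let bits = unsafe { ptr.cast::<u32>().read() };
    // `char::from_u32` rejects surrogates and values above 0x10FFFF.
    char::from_u32(bits)
}
```

Reading as `u32` first matters because merely producing a `char` with an invalid bit pattern is already undefined behavior, even if it is checked afterwards.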

@thomcc
Member

thomcc commented Sep 25, 2023

I would like char to effectively be #[repr(transparent)] struct char(u32); (with significant additional magic), so I'd prefer if the ABI were the same as well, as it would be given that definition.
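As a rough model of that idea, a user-defined transparent wrapper could look like the sketch below. `MyChar` is a hypothetical type for illustration only; it is not how `char` is defined, and it lacks the "magic", for example the niche that lets `Option<char>` stay four bytes.

```rust
/// Hypothetical model of `char` as a transparent wrapper around `u32`.
/// `#[repr(transparent)]` gives it the same layout and ABI as `u32`.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct MyChar(u32);

impl MyChar {
    /// Accepts only bit patterns that are valid Unicode scalar values.
    fn new(bits: u32) -> Option<Self> {
        char::from_u32(bits).map(|c| MyChar(c as u32))
    }

    /// Converts back to a real `char`; the constructor upholds the invariant.
    fn get(self) -> char {
        char::from_u32(self.0).expect("MyChar always holds a valid scalar value")
    }
}
```

The validity restriction here is enforced by the constructor alone, whereas for the real `char` it is part of the type's validity invariant and known to the compiler.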

@RalfJung
Member

Yeah that makes sense. For t-opsem there's nothing to do here since the operational semantics already do what you are asking for. You just need to get the lang team to commit to this as a stable guarantee. Filing the PR to make that request should be very easy. :)

@joshlf

joshlf commented Sep 27, 2023

We would also benefit from the alignment guarantee in zerocopy.

IIUC, the only gaps are:

  • char isn't guaranteed to have the same alignment as u32
  • char isn't guaranteed to have all of its bytes initialized (u32 does have this guarantee)

Are there any other facts about char's layout or representation which a) we expect to be equivalent to u32 but, b) aren't guaranteed?

@RalfJung
Member

The function call ABI of char is also completely unspecified. So if a function takes a u32 but you call it with a function pointer that has char type for that argument, you have UB (and vice versa).

See rust-lang/rust#115476.

@zachs18

zachs18 commented Apr 1, 2024

rust-lang/rust#116894 and rust-lang/rust#118032 are in 1.75 and 1.76, respectively, and guarantee layout-compatibility (size and alignment) and function call ABI-compatibility, respectively.

"are all of the bytes of char initialized?" is not explicitly mentioned anywhere that I've found, but IMO it could be considered as following from "all bytes of u32 are initialized" + "char is function call ABI-compatible with u32", since otherwise it would be unsound to pass a char to a function expecting a u32 (this reasoning isn't super solid, since obviously the reverse (passing a u32 to a function expecting a char) is not sound in all cases because of char's bit-validity requirements).
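The sound direction of that argument (passing a `char` where a `u32` is expected) can be sketched with a function pointer transmute. This assumes a toolchain where the function call ABI-compatibility guarantee mentioned above applies (Rust 1.76 or later); `code_point` and `call_with_char` are hypothetical names.

```rust
/// An ordinary function whose parameter type is `u32`.
fn code_point(x: u32) -> u32 {
    x
}

/// Calls `code_point` through a pointer whose parameter type is `char`.
fn call_with_char(c: char) -> u32 {
    let f: fn(u32) -> u32 = code_point;
    // SAFETY (assumption): `char` is function call ABI-compatible with `u32`,
    // and every valid `char` is also a valid `u32`, so the callee only ever
    // sees values that are valid for its declared parameter type.
    let g: fn(char) -> u32 = unsafe { std::mem::transmute(f) };
    g(c)
}
```

The reverse transmute, from `fn(char)` to `fn(u32)`, would only be sound for callers that pass valid scalar values, which is the asymmetry the comment points out.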

@RalfJung
Member

RalfJung commented Apr 1, 2024

Yeah this question is settled, char is exactly like u32 but with a more restrictive validity invariant.
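The "more restrictive validity invariant" can be written out as a plain predicate on `u32` values. This is a sketch with a hypothetical name, `is_valid_char_bits`; it should agree with the standard library's `char::from_u32`, which is the checked conversion to prefer in practice.

```rust
/// Hypothetical predicate: true when `bits` is a valid `char` bit pattern,
/// i.e. a Unicode scalar value. That means at most 0x10FFFF and outside
/// the surrogate range 0xD800..=0xDFFF.
fn is_valid_char_bits(bits: u32) -> bool {
    bits <= 0x10FFFF && !(0xD800..=0xDFFF).contains(&bits)
}
```

Every other `u32` value is a fine `u32` but must never be reinterpreted as a `char`.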

RalfJung closed this as completed Apr 1, 2024
@joshlf

joshlf commented Apr 1, 2024

> rust-lang/rust#116894 and rust-lang/rust#118032 are in 1.75 and 1.76, respectively, and guarantee layout-compatibility (size and alignment) and function call ABI-compatibility, respectively.
>
> "are all of the bytes of char initialized?" is not explicitly mentioned anywhere that I've found, but IMO it could be considered as following from "all bytes of u32 are initialized" + "char is function call ABI-compatible with u32", since otherwise it would be unsound to pass a char to a function expecting a u32 (this reasoning isn't super solid, since obviously the reverse (passing a u32 to a function expecting a char) is not sound in all cases because of char's bit-validity requirements).

It's documented here.
