Should `char` have the same layout as `u32`? #462

djkoloski · 2023-09-25T00:05:52Z

I'd like to be able to make the assumption that char is ABI compatible with u32 for some validation code in bytecheck. I have actually been making this assumption for a while and never had problems (including with MIRI), so I think this is de-facto the case. I thought that char and u32 were supposed to have the same layout but when I went looking I realized I couldn't find anything. As far as I know, it doesn't really make sense for chars to be anything other than a u32 with fewer valid bit patterns.

For completeness:

"The Rust Reference: Type layout": States that both u32 and char are four bytes.
"The Rust Reference: Textual types": States that a char "is a Unicode scalar value (i.e. a code point that is not a surrogate), represented as a 32-bit unsigned word in the 0x0000 to 0xD7FF or 0xE000 to 0x10FFFF range." I don't think this can be interpreted as guaranteeing the same ABI as u32.
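As a rough illustration of the assumption in question, the following sketch compares the size and alignment of char and u32 on whatever target it is compiled for. The helper name `char_matches_u32_layout` is made up for this example. Observing equality on one target shows only de-facto behavior; it is not the documented guarantee being asked about.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: returns true if `char` and `u32` agree on both
/// size and alignment on the current target.
fn char_matches_u32_layout() -> bool {
    size_of::<char>() == size_of::<u32>() && align_of::<char>() == align_of::<u32>()
}

fn main() {
    println!("size_of::<char>()  = {}", size_of::<char>());
    println!("size_of::<u32>()   = {}", size_of::<u32>());
    println!("align_of::<char>() = {}", align_of::<char>());
    println!("align_of::<u32>()  = {}", align_of::<u32>());
    assert!(char_matches_u32_layout());
}
```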


thomcc · 2023-09-25T00:07:16Z

Yes, I've written plenty of code that relies on this. If we don't write this down anywhere, we should.

thomcc · 2023-09-25T01:03:05Z

> 32-bit unsigned word ... I don't think this can be interpreted as having the same ABI as u32.

I'm not sure how it couldn't?

djkoloski · 2023-09-25T01:50:51Z

For the purposes of justifying unsafe code, I think it's not tightly worded enough. My interpretation is that char could be 1-aligned but still represent a 32-bit unsigned word. I also looked for examples of transmuting &u32 to &char or vice-versa and didn't find any in the standard library.

RalfJung · 2023-09-25T06:04:46Z

Note that you are asking two different questions:

Do they have the same layout (size and alignment)? This seems strongly implied by the docs, but indeed the docs do not talk about alignment.
Do they have the same function ABI? That's not discussed at all in the current docs. In particular, given that u32 and i32 are not ABI-compatible on all targets, the docs would at least have to clarify which of the two char should be ABI-compatible with.

However, this is really more of a t-lang thing than a t-opsem thing. It's about making a stable commitment about the layout and ABI of char going forward.

thomcc · 2023-09-25T06:14:46Z

> given that u32 and i32 are not ABI-compatible on all targets

Is that true? On which targets is this the case? (Is it just due to CFI stuff?)

RalfJung · 2023-09-25T06:20:10Z

I think it was on s390x that u32 gets zero-extended and i32 gets sign-extended, both passed in a larger register. So if a function takes a u32 it can assume that all the high bits are 0, and if the caller passes an i32 then it can violate that assumption, making them not ABI-compatible.
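The difference between the two extensions can be modeled in plain Rust. This is only a sketch of the arithmetic, not real calling-convention code for s390x or any other target, and the function names `widen_u32` and `widen_i32` are invented for the illustration.

```rust
/// Models a caller placing a `u32` argument in a 64-bit register:
/// the upper 32 bits are filled with zeros (zero-extension).
fn widen_u32(x: u32) -> u64 {
    x as u64
}

/// Models a caller placing an `i32` argument in a 64-bit register:
/// the upper 32 bits copy the sign bit (sign-extension).
fn widen_i32(x: i32) -> u64 {
    x as i64 as u64
}

fn main() {
    let bits: u32 = 0xFFFF_FFFF;

    // Same 32 low bits, different register contents.
    let as_unsigned = widen_u32(bits);
    let as_signed = widen_i32(bits as i32);

    assert_eq!(as_unsigned, 0x0000_0000_FFFF_FFFF);
    assert_eq!(as_signed, 0xFFFF_FFFF_FFFF_FFFF);

    println!("zero-extended: {as_unsigned:#018x}");
    println!("sign-extended: {as_signed:#018x}");
}
```

A callee that expects a zero-extended u32 and assumes the high bits are clear would see the second value as out of range, which is the incompatibility described above.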

See rust-lang/rust#115476 for which types we can actually guarantee are ABI-compatible.

RalfJung · 2023-09-25T06:23:04Z

For this particular issue I would suggest opening a PR which adds a statement to the docs saying explicitly that char is fully layout and ABI-compatible with u32. We then nominate that for t-lang discussion and have them FCP this decision.

I didn't add char in rust-lang/rust#115476 since I wanted to only document what are basically existing guarantees, not add new ones.

> I have actually been making this assumption for a while and never had problems (including with MIRI)

FWIW latest Miri does not consider u32 and char to be ABI-compatible, it will error in such cases.

djkoloski · 2023-09-25T06:27:21Z

Thanks for clarifying @RalfJung, I understand the difference between layout compatibility and ABI compatibility much better now.

I only care about char having the same layout and representation as u32 so I can cast a *const char to a *const u32 to read and validate it (given that the original pointer is aligned for char and points to size_of::<char>() initialized bytes).
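A minimal sketch of that pattern follows. It is not bytecheck's actual code; `validate_char` is a hypothetical name, and the sketch assumes char and u32 share size and alignment, so that a pointer aligned for char is also aligned for u32.

```rust
/// Hypothetical validation helper: reads the bytes behind `ptr` as a `u32`
/// and returns the `char` only if the bit pattern is a Unicode scalar value.
///
/// # Safety
/// `ptr` must be aligned for `char` and point to `size_of::<char>()`
/// initialized bytes.
unsafe fn validate_char(ptr: *const char) -> Option<char> {
    // Assumption: `char` and `u32` have the same size and alignment,
    // so reading through the cast pointer is in bounds and aligned.
    let raw = unsafe { ptr.cast::<u32>().read() };
    // `char::from_u32` rejects surrogates and values above 0x10FFFF.
    char::from_u32(raw)
}

fn main() {
    let good: u32 = 0x1F600; // a valid scalar value
    let bad: u32 = 0xD800; // a surrogate, never a valid `char`

    // The storage is `u32`, so no invalid `char` value is ever materialized;
    // only raw pointers are cast.
    let good_ptr = (&good as *const u32).cast::<char>();
    let bad_ptr = (&bad as *const u32).cast::<char>();

    assert_eq!(unsafe { validate_char(good_ptr) }, char::from_u32(0x1F600));
    assert_eq!(unsafe { validate_char(bad_ptr) }, None);
}
```

Reading through `*const u32` instead of `*const char` matters here: loading an invalid bit pattern as a char would itself be undefined behavior, while any initialized four bytes form a valid u32.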

thomcc · 2023-09-25T06:29:55Z

I would like char to effectively be #[repr(transparent)] struct char(u32); (with significant additional magic), so I'd prefer if the ABI were the same as well, as it would be given that definition.
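To make that mental model concrete, here is a user-level approximation. `ScalarValue` is a hypothetical type invented for this sketch; it cannot reproduce the "additional magic" (for example, the compiler knowing which bit patterns are invalid), and it only enforces the invariant through its constructor.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for `char`: a transparent wrapper around `u32`.
/// `#[repr(transparent)]` gives it the same layout and ABI as its field.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ScalarValue(u32);

impl ScalarValue {
    /// Accepts only Unicode scalar values, mirroring `char`'s invariant.
    fn new(raw: u32) -> Option<Self> {
        char::from_u32(raw).map(|_| ScalarValue(raw))
    }

    /// Returns the underlying 32-bit value.
    fn get(self) -> u32 {
        self.0
    }
}

fn main() {
    assert_eq!(size_of::<ScalarValue>(), size_of::<u32>());
    assert_eq!(align_of::<ScalarValue>(), align_of::<u32>());

    assert_eq!(ScalarValue::new(0x61).map(ScalarValue::get), Some(0x61));
    assert!(ScalarValue::new(0xD800).is_none()); // surrogate is rejected
}
```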

RalfJung · 2023-09-25T07:01:15Z

Yeah that makes sense. For t-opsem there's nothing to do here since the operational semantics already do what you are asking for. You just need to get the lang team to commit to this as a stable guarantee. Filing the PR to make that request should be very easy. :)

joshlf · 2023-09-27T05:22:01Z

We would also benefit from the alignment guarantee in zerocopy.

IIUC, the only gaps are:

char isn't guaranteed to have the same alignment as u32
char isn't guaranteed to have all of its bytes initialized (u32 does have this guarantee)

Are there any other facts about char's layout or representation which a) we expect to be equivalent to u32 but, b) aren't guaranteed?

RalfJung · 2023-09-27T05:34:11Z

The function call ABI of char is also completely unspecified. So if a function takes a u32 but you call it with a function pointer that has char type for that argument, you have UB (and vice versa).

See rust-lang/rust#115476.

zachs18 · 2024-04-01T18:15:42Z

rust-lang/rust#116894 and rust-lang/rust#118032 are in 1.75 and 1.76, respectively, and guarantee layout-compatibility (size and alignment) and function call ABI-compatibility, respectively.

"Are all of the bytes of char initialized?" is not explicitly answered anywhere that I've found, but IMO it could be considered as following from "all bytes of u32 are initialized" + "char is function call ABI-compatible with u32", since otherwise it would be unsound to pass a char to a function expecting a u32. This reasoning isn't super solid, since obviously the reverse (passing a u32 to a function expecting a char) is not sound in all cases because of char's bit-validity requirements.
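The direction that is argued to be sound, passing a char where a u32 is expected, can be sketched with a function pointer transmute. This relies on the ABI-compatibility guarantee mentioned for 1.76 and would be rejected by a Miri version that predates it; the function names are hypothetical.

```rust
use std::mem::transmute;

/// An ordinary function that takes a `u32`.
fn code_point(x: u32) -> u32 {
    x
}

/// Hypothetical helper: calls `code_point` through a pointer whose
/// argument type is `char` instead of `u32`.
fn call_u32_fn_with_char(c: char) -> u32 {
    // Assumption: `char` is documented as function-call ABI-compatible
    // with `u32`, and every `char` is a valid `u32`, so the callee
    // always receives a value that is valid at its own parameter type.
    let f: fn(char) -> u32 = unsafe { transmute(code_point as fn(u32) -> u32) };
    f(c)
}

fn main() {
    assert_eq!(call_u32_fn_with_char('A'), 65);
    println!("{}", call_u32_fn_with_char('A'));
}
```

The reverse transmute, from `fn(char)` to `fn(u32)`, is only acceptable for calls whose argument happens to be a valid char, which is why the sketch sticks to this direction.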

RalfJung · 2024-04-01T18:21:52Z

Yeah this question is settled, char is exactly like u32 but with a more restrictive validity invariant.
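The "more restrictive validity invariant" can be spelled out using the ranges quoted from the Reference at the top of the thread. The sketch below checks a hand-written range test (the hypothetical `in_char_range`) against the standard library's `char::from_u32` on a few boundary values.

```rust
/// Hypothetical range check, transcribed from the Reference's wording:
/// 0x0000 to 0xD7FF or 0xE000 to 0x10FFFF.
fn in_char_range(raw: u32) -> bool {
    matches!(raw, 0x0000..=0xD7FF | 0xE000..=0x10FFFF)
}

fn main() {
    // Boundary values around the surrogate gap and the upper limit.
    for raw in [0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        assert_eq!(in_char_range(raw), char::from_u32(raw).is_some());
    }

    // char -> u32 never fails: every `char` is a valid `u32`.
    assert_eq!(u32::from('A'), 0x41);

    // u32 -> char needs the check: not every `u32` is a valid `char`.
    assert_eq!(char::from_u32(0xD800), None);
}
```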

joshlf · 2024-04-01T18:46:38Z

> rust-lang/rust#116894 and rust-lang/rust#118032 are in 1.75 and 1.76, respectively, and guarantee layout-compatibility (size and alignment) and function call ABI-compatibility, respectively.
>
> "Are all of the bytes of char initialized?" is not explicitly answered anywhere that I've found, but IMO it could be considered as following from "all bytes of u32 are initialized" + "char is function call ABI-compatible with u32", since otherwise it would be unsound to pass a char to a function expecting a u32. This reasoning isn't super solid, since obviously the reverse (passing a u32 to a function expecting a char) is not sound in all cases because of char's bit-validity requirements.

It's documented here.

saethlin added the A-layout (Topic: Related to data structure layout (`#[repr]`)), S-not-opsem (Despite being in this repo, this is not primarily a T-opsem question), and T-lang labels · Sep 25, 2023

RalfJung closed this as completed Apr 1, 2024
