-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[std::char] Add MAX_UTF8_LEN and MAX_UTF16_LEN #45795
Comments
I don’t think there’s a downside in doing this, but I don’t really think it’s worth spending much time on either. Anyway, you don’t need to convince me. |
Right. A main point here is the educational value of it and being able to be more explicit about these numbers in the documentation of This will guide users to allocate the correct amount of memory (authoring time, vs waiting until runtime error) without them needing to know the details of the encodings. Also, it increases readability for anyone reading such code. Let me add this info to the proposal, to make the value more clear. Thanks, Simon. |
Seems plausible to me! |
Triage: Marking as E-easy since there is an abandon PR with remaining concerns that seems relatively easy to resolve: #98198 |
Thanks, @Dylan-DPC, for preparing the PR! Now that some time is past, here's my take on this improvement, adding to the notes from https://github.com/rust-lang/rust/pull/98198/files#r906356932 : First step) Let's see if we can add the consts limited to Second step) After that, discuss here to see if we really like to take the two new consts public. Also, if we decide to move forward with the second step, we might like to reconsider the naming, to better match the existing public API—like, since we already have |
@rustbot claim |
Background
UTF-8 encoding on any character can take up to 4 bytes (
u8
). UTF-16 encoding can take up to 2 words (u16
). This is a promise from the encoding specs, and an assumption made in many places inside rust libs and applications.Currently, there's lots of magic numbers 4 and 2 everywhere in the code, creating buffer long enough to encode a character into as UTF-8 or UTF-16.
Examples
rust/src/libcore/tests/char.rs
Lines 236 to 239 in b7041bf
rust/src/libcore/tests/char.rs
Lines 253 to 256 in b7041bf
Proposal
Add the followings public definitions to
std::char
andcore::char
to be used inside the rust codebase and publicly.Why should we do this?
This will allow the code to be written like this:
This will guide users—without them knowing too much details of UTF-8/UTF-16 encodings—to allocate the correct amount of memory while writing the code, instead of waiting until some runtime error is raise, which actually may not happen in basic tests and discovered externally. Also, it increases readability for anyone reading such code.
Besides using these max-length values for char-level allocations, user can also use them for pre-allocate memory for encoding some chars list into UTF-8/UTF-16.
How we teach this?
The std/core libs will be updated to use these values wherever possible (see this list), and docs for encoding-related functions in
char
module are updated to evangelize using these values when allocating memory to be used by the encoding functions.Alternatives
1) Only update the docs
We can just update the function docs to talk about these max-length values, but not name them as a
const
value.2) New functions for allocations with max limit
Although this can be handy for some users, it would be limited to only one use-case of these numbers and not helpful for other operations.
What do you think?
The text was updated successfully, but these errors were encountered: