[std::char] Add MAX_UTF8_LEN and MAX_UTF16_LEN #45795

behnam · 2017-11-05T22:11:03Z

Background

Encoding any character as UTF-8 takes at most 4 bytes (u8), and encoding it as UTF-16 takes at most 2 code units (u16). Both bounds are guaranteed by the encoding specs and are assumed in many places inside Rust libs and applications.

Currently, the magic numbers 4 and 2 appear throughout the code wherever a buffer is created that must be long enough to hold one character encoded as UTF-8 or UTF-16.
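
As a quick illustration of those bounds, here is a minimal sketch using the existing `char::len_utf8` and `char::len_utf16` methods. The helper name `encoded_lens` and the sample characters are chosen only for this example.

```rust
/// Returns (UTF-8 length in bytes, UTF-16 length in u16 code units) for `c`.
/// `encoded_lens` is a hypothetical helper used only for this illustration.
fn encoded_lens(c: char) -> (usize, usize) {
    (c.len_utf8(), c.len_utf16())
}

fn main() {
    // (character, expected UTF-8 bytes, expected UTF-16 code units)
    let samples = [
        ('a', 1, 1),       // U+0061
        ('é', 2, 1),       // U+00E9
        ('€', 3, 1),       // U+20AC
        ('𝄞', 4, 2),       // U+1D11E, outside the Basic Multilingual Plane
        (char::MAX, 4, 2), // U+10FFFF, the largest scalar value
    ];
    for &(c, utf8_len, utf16_len) in samples.iter() {
        assert_eq!(encoded_lens(c), (utf8_len, utf16_len));
    }
}
```

No sample needs more than 4 bytes or 2 code units, which is where the magic numbers in buffer declarations come from.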

Examples

rust/src/libcore/tests/char.rs

Lines 236 to 239 in b7041bf

    
    fn check(input: char, expect: &[u8]) {
        let mut buf = [0; 4];
        let ptr = buf.as_ptr();
        let s = input.encode_utf8(&mut buf);

rust/src/libcore/tests/char.rs

Lines 253 to 256 in b7041bf

    
    fn check(input: char, expect: &[u16]) {
        let mut buf = [0; 2];
        let ptr = buf.as_mut_ptr();
        let b = input.encode_utf16(&mut buf);

Proposal

Add the following public definitions to std::char and core::char, to be used both inside the Rust codebase and publicly.

pub const MAX_UTF8_LEN: usize = 4;
pub const MAX_UTF16_LEN: usize = 2;

Why should we do this?

This will allow the code to be written like this:

let mut buf = [0; char::MAX_UTF16_LEN];
let b = input.encode_utf16(&mut buf);

This will guide users, without requiring them to know the details of the UTF-8/UTF-16 encodings, to allocate the correct amount of memory while writing the code, instead of waiting until a runtime error is raised. Such an error may never occur in basic tests and may only be discovered externally. It also improves readability for anyone reading such code.

Besides using these max-length values for char-level allocations, users can also use them to pre-allocate memory for encoding a list of chars into UTF-8/UTF-16.
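
To make the failure mode concrete, here is a rough sketch of how an undersized buffer can pass simple tests and still fail later. `MAX_UTF8_LEN` is defined locally as a stand-in for the proposed constant, and `try_encode` is a hypothetical helper that uses `std::panic::catch_unwind` only to observe the panic that `encode_utf8` raises when the buffer is too small.

```rust
use std::panic;

// Local stand-in for the constant proposed in this issue (an assumption,
// not an existing std item at the time of the proposal).
const MAX_UTF8_LEN: usize = 4;

/// Encodes `c` into an `N`-byte buffer and returns the encoded length,
/// or `None` if `encode_utf8` panicked because the buffer was too small.
/// `try_encode` is a hypothetical helper for this illustration.
fn try_encode<const N: usize>(c: char) -> Option<usize> {
    panic::catch_unwind(|| {
        let mut buf = [0u8; N];
        c.encode_utf8(&mut buf).len()
    })
    .ok()
}

fn main() {
    // A 2-byte buffer is enough for tests that only use short encodings...
    assert_eq!(try_encode::<2>('a'), Some(1));
    assert_eq!(try_encode::<2>('é'), Some(2));
    // ...but it panics at run time once a 3-byte character shows up.
    assert_eq!(try_encode::<2>('€'), None);
    // A buffer sized by the constant is large enough for any char.
    assert_eq!(try_encode::<MAX_UTF8_LEN>('€'), Some(3));
    assert_eq!(try_encode::<MAX_UTF8_LEN>('𝄞'), Some(4));
}
```

The caught panic still prints its message to stderr. The point is only that the 2-byte version looks correct until the input changes.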
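
A minimal sketch of that pre-allocation use case, assuming the constants exist as proposed (they are defined locally here). The function names `encode_all_utf8` and `encode_all_utf16` are hypothetical and exist only for this example.

```rust
// Local stand-ins for the constants proposed in this issue.
const MAX_UTF8_LEN: usize = 4;
const MAX_UTF16_LEN: usize = 2;

/// Encodes every char as UTF-8 into one Vec, reserving the worst case up front.
fn encode_all_utf8(chars: &[char]) -> Vec<u8> {
    let mut out = Vec::with_capacity(chars.len() * MAX_UTF8_LEN);
    let mut buf = [0u8; MAX_UTF8_LEN];
    for &c in chars {
        out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
    out
}

/// Encodes every char as UTF-16 into one Vec, reserving the worst case up front.
fn encode_all_utf16(chars: &[char]) -> Vec<u16> {
    let mut out = Vec::with_capacity(chars.len() * MAX_UTF16_LEN);
    let mut buf = [0u16; MAX_UTF16_LEN];
    for &c in chars {
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
    out
}

fn main() {
    let chars = ['a', '€', '𝄞'];
    let utf8 = encode_all_utf8(&chars);
    let utf16 = encode_all_utf16(&chars);
    // The worst-case reservation is an upper bound on the actual length.
    assert!(utf8.len() <= chars.len() * MAX_UTF8_LEN);
    assert!(utf16.len() <= chars.len() * MAX_UTF16_LEN);
    println!("{} UTF-8 bytes, {} UTF-16 code units", utf8.len(), utf16.len());
}
```

Reserving the worst case may over-allocate for mostly-ASCII input, so this trades some memory for never reallocating during the loop.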

How do we teach this?

The std/core libs will be updated to use these values wherever possible (see this list), and the docs for the encoding-related functions in the char module will be updated to recommend using these values when allocating memory for the encoding functions.

Alternatives

1) Only update the docs

We can just update the function docs to mention these max-length values, without naming them as const values.

2) New functions for allocations with max limit

Although this can be handy for some users, it would be limited to only one use-case of these numbers and not helpful for other operations.

What do you think?


behnam · 2017-11-05T22:12:19Z

/cc @SimonSapin, @sfackler, @alexcrichton

SimonSapin · 2017-11-05T22:49:49Z

I don’t think there’s a downside in doing this, but I don’t really think it’s worth spending much time on either. Anyway, you don’t need to convince me.

behnam · 2017-11-05T23:13:40Z

Right. A main point here is the educational value, and being able to be more explicit about these numbers in the documentation of char::encode_utf8() and char::encode_utf16().

This will guide users to allocate the correct amount of memory (at authoring time, rather than waiting for a runtime error) without them needing to know the details of the encodings. Also, it increases readability for anyone reading such code.

Let me add this info to the proposal, to make the value clearer.

Thanks, Simon.

alexcrichton · 2017-11-06T15:58:38Z

Seems plausible to me!

Enselic · 2023-09-26T17:06:35Z

Triage: Marking as E-easy since there is an abandoned PR with remaining concerns that seem relatively easy to resolve: #98198

behnam · 2023-09-26T18:12:46Z

Thanks, @Dylan-DPC, for preparing the PR!

Now that some time has passed, here's my take on this improvement, adding to the notes from https://github.com/rust-lang/rust/pull/98198/files#r906356932 :

First step) Let's see if we can add the consts limited to std, making changes only to library/core, library/alloc and library/std files.

Second step) After that, discuss here whether we really want to make the two new consts public.

Also, if we decide to move forward with the second step, we might want to reconsider the naming to better match the existing public API. For example, since we already have char::len_utf16(), maybe these should be named like MAX_LEN_UTF16, considering that it provides the maximum value that can possibly be returned from that function.
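
The relationship between the suggested names and the existing methods can be checked exhaustively. The sketch below defines `MAX_LEN_UTF8` and `MAX_LEN_UTF16` locally (as assumptions following the naming idea above) and scans every `char` to confirm that each constant equals the largest value its `len_*` method returns. `max_lens` is a hypothetical helper for this example.

```rust
// Local stand-ins using the alternative naming suggested above.
const MAX_LEN_UTF8: usize = 4;
const MAX_LEN_UTF16: usize = 2;

/// Scans every Unicode scalar value and returns the largest results of
/// `len_utf8()` and `len_utf16()`.
fn max_lens() -> (usize, usize) {
    let mut max8 = 0;
    let mut max16 = 0;
    // A char range skips the surrogate gap, so only valid chars are visited.
    for c in '\0'..=char::MAX {
        max8 = max8.max(c.len_utf8());
        max16 = max16.max(c.len_utf16());
    }
    (max8, max16)
}

fn main() {
    assert_eq!(max_lens(), (MAX_LEN_UTF8, MAX_LEN_UTF16));
}
```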

HTGAzureX1212 · 2024-01-27T14:51:00Z

@rustbot claim

kennytm added the labels C-feature-request (Category: A feature request, i.e: not implemented / a PR.) and T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.) on Nov 6, 2017

dtolnay added the label C-feature-accepted (Category: A feature request that has been accepted pending implementation.) and removed the label C-feature-request (Category: A feature request, i.e: not implemented / a PR.) on Nov 14, 2017

Dylan-DPC mentioned this issue Jun 17, 2022

Add MAX_UTF{8, 16}_LEN constants #98198

Closed

Enselic added the label E-easy (Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.) on Sep 26, 2023

rustbot assigned HTGAzureX1212 Jan 27, 2024

HTGAzureX1212 mentioned this issue Feb 2, 2024

Add MAX_LEN_UTF8 and MAX_LEN_UTF16 Constants #120580

Open

HTGAzureX1212 mentioned this issue Feb 27, 2024

Tracking Issue for char_max_len #121714

Open

3 tasks


rustbot assigned NoobProgrammer31 and unassigned HTGAzureX1212 Sep 17, 2024

jieyouxu assigned HTGAzureX1212 and unassigned NoobProgrammer31 Nov 14, 2024
