From 9cf35a6c06da5699efcf417eecaa2d33b3a0d13b Mon Sep 17 00:00:00 2001
From: Mark Lodato
Date: Wed, 9 Mar 2022 21:43:04 -0500
Subject: [PATCH] Rework String UTF-8 Documentation

**This Commit**
Adds some clarity around indexing into Strings and the constraints
driving various decisions there.

**Why?**
The [`String` documentation][0] mentions how `String`s can't be indexed
but `Range` has an implementation for `SliceIndex`. This can be
confusing. There are also several statements to explain the lack of
`String` indexing:

- the inability to index into a `String` is an implication of UTF-8
  encoding
- indexing into a `String` could not be constant-time with UTF-8
  encoding
- indexing into a `String` does not have an obvious return type

This last statement made sense but the first two seemed contradictory to
the documentation around [`SliceIndex`][1] which mention:

- one can index into a `String` with a `Range` (also called substring
  slicing but it uses the same syntax and the method name is `index`)
- `Range` indexing into a `String` is constant-time

To resolve this seeming contradiction the documentation is reworked to
more clearly explain what factors drive the decision to disallow
indexing into a `String` with a single number.

[0]: https://doc.rust-lang.org/stable/std/string/struct.String.html#utf-8
[1]: https://doc.rust-lang.org/stable/std/slice/trait.SliceIndex.html#impl-SliceIndex%3Cstr%3E
---
 library/alloc/src/string.rs | 92 +++++++++++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 10 deletions(-)

diff --git a/library/alloc/src/string.rs b/library/alloc/src/string.rs
index 71b6b9b41f5c5..e97c1637fd5a2 100644
--- a/library/alloc/src/string.rs
+++ b/library/alloc/src/string.rs
@@ -117,27 +117,99 @@ use crate::vec::Vec;
 ///
 /// # UTF-8
 ///
-/// `String`s are always valid UTF-8. This has a few implications, the first of
-/// which is that if you need a non-UTF-8 string, consider [`OsString`]. It is
-/// similar, but without the UTF-8 constraint. The second implication is that
-/// you cannot index into a `String`:
+/// `String`s are always valid UTF-8. If you need a non-UTF-8 string, consider
+/// [`OsString`]. It is similar, but without the UTF-8 constraint. Because UTF-8
+/// is a variable width encoding, `String`s are typically smaller than an array of
+/// the same `chars`:
+///
+/// ```
+/// use std::mem;
+///
+/// // `s` is ASCII which represents each `char` as one byte
+/// let s = "hello";
+/// assert_eq!(s.len(), 5);
+///
+/// // A `char` array with the same contents would be longer because
+/// // every `char` is four bytes
+/// let s = ['h', 'e', 'l', 'l', 'o'];
+/// let size: usize = s.into_iter().map(|c| mem::size_of_val(&c)).sum();
+/// assert_eq!(size, 20);
+///
+/// // However, for non-ASCII strings, the difference will be smaller
+/// // and sometimes they are the same
+/// let s = "💖💖💖💖💖";
+/// assert_eq!(s.len(), 20);
+///
+/// let s = ['💖', '💖', '💖', '💖', '💖'];
+/// let size: usize = s.into_iter().map(|c| mem::size_of_val(&c)).sum();
+/// assert_eq!(size, 20);
+/// ```
+///
+/// This raises interesting questions as to how `s[i]` should work.
+/// What should `i` be here? Several options include byte indices and
+/// `char` indices but, because of UTF-8 encoding, only byte indices
+/// would provide constant time indexing. Getting the `i`th `char`, for
+/// example, is available using [`chars`]:
+///
+/// ```
+/// let s = "hello";
+/// let third_character = s.chars().nth(2);
+/// assert_eq!(third_character, Some('l'));
+///
+/// let s = "💖💖💖💖💖";
+/// let third_character = s.chars().nth(2);
+/// assert_eq!(third_character, Some('💖'));
+/// ```
+///
+/// Next, what should `s[i]` return? Because indexing returns a reference
+/// to underlying data it could be `&u8`, `&[u8]`, or something else similar.
+/// Since we're only providing one index, `&u8` makes the most sense but that
+/// might not be what the user expects and can be explicitly achieved with
+/// [`as_bytes()`]:
+///
+/// ```
+/// // The first byte is 104 - the byte value of `'h'`
+/// let s = "hello";
+/// assert_eq!(s.as_bytes()[0], 104);
+/// // or
+/// assert_eq!(s.as_bytes()[0], b'h');
+///
+/// // The first byte is 240 which isn't obviously useful
+/// let s = "💖💖💖💖💖";
+/// assert_eq!(s.as_bytes()[0], 240);
+/// ```
+///
+/// Due to these ambiguities/restrictions, indexing with a `usize` is simply
+/// forbidden:
 ///
 /// ```compile_fail,E0277
 /// let s = "hello";
 ///
-/// println!("The first letter of s is {}", s[0]); // ERROR!!!
+/// // The following will not compile!
+/// println!("The first letter of s is {}", s[0]);
 /// ```
 ///
+/// It is more clear, however, how `&s[i..j]` should work (that is,
+/// indexing with a range). It should accept byte indices (to be constant-time)
+/// and return a `&str` which is UTF-8 encoded. This is also called "string slicing".
+/// Note this will panic if the byte indices provided are not character
+/// boundaries - see [`is_char_boundary`] for more details. See the implementations
+/// for [`SliceIndex`] for more details on string slicing. For a non-panicking
+/// version of string slicing, see [`get`].
+///
 /// [`OsString`]: ../../std/ffi/struct.OsString.html "ffi::OsString"
+/// [`SliceIndex`]: core::slice::SliceIndex
+/// [`as_bytes()`]: str::as_bytes
+/// [`get`]: str::get
+/// [`is_char_boundary`]: str::is_char_boundary
 ///
-/// Indexing is intended to be a constant-time operation, but UTF-8 encoding
-/// does not allow us to do this. Furthermore, it's not clear what sort of
-/// thing the index should return: a byte, a codepoint, or a grapheme cluster.
-/// The [`bytes`] and [`chars`] methods return iterators over the first
-/// two, respectively.
+/// The [`bytes`] and [`chars`] methods return iterators over the bytes and
+/// codepoints of the string, respectively. To iterate over codepoints along
+/// with byte indices, use [`char_indices`].
 ///
 /// [`bytes`]: str::bytes
 /// [`chars`]: str::chars
+/// [`char_indices`]: str::char_indices
 ///
 /// # Deref
 ///
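
The string-slicing paragraph in this patch names `get`, `is_char_boundary`, and `char_indices` but shows no example of them. The following is a minimal sketch of how they might fit together, not part of the patch itself. The helpers `floor_boundary` and `slice_by_chars` are hypothetical names introduced only for illustration; the only standard-library calls used are `str::get`, `str::is_char_boundary`, `str::char_indices`, and `str::len`.

```rust
/// Hypothetical helper: rounds a byte index down to the nearest
/// character boundary, so that it is always safe to slice at.
fn floor_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

/// Hypothetical helper: slices by `char` positions instead of byte
/// offsets. This walks the string, so it is linear-time, unlike
/// slicing with byte indices.
fn slice_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every character boundary, including the end of `s`.
    let mut boundaries = s
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(s.len()));
    let from = boundaries.nth(start)?;
    let to = if end == start {
        from
    } else {
        // `nth` already consumed `start + 1` items, so skip the rest.
        boundaries.nth(end - start - 1)?
    };
    // `get` is the non-panicking form of `&s[from..to]`.
    s.get(from..to)
}
```

With `let s = "💖hello";` (nine bytes), `s.get(0..4)` yields `Some("💖")`, while `s.get(0..2)` yields `None` because byte 2 falls inside the four-byte heart; `&s[0..2]` would panic in the same situation. `slice_by_chars(s, 1, 3)` yields `Some("he")`.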