Clarify indexing into Strings #94794

mlodato517 · 2022-03-10T02:55:21Z

This Commit
Adds some clarity around indexing into Strings.

Why?
I was reading through the Range documentation and saw an
implementation for SliceIndex<str>. I was surprised to see this and
went to read the String documentation and, to me, it seemed to
say (at least) three things:

you cannot index into a String
indexing into a String could not be constant-time
indexing into a String does not have an obvious return type

I absolutely agree with the last point but the first two seemed
contradictory to the documentation around SliceIndex<str>
which mention:

you can do substring slicing (which is probably different than
"indexing" but, because the method is called index and I associate
anything with square brackets with "indexing" it was enough to
confuse me)
substring slicing is constant-time (this may be algorithmic ignorance
on my part but if &s[i..i+1] is O(1) then it seems confusing that
&s[i] could not possibly be O(1))

So I was hoping to clarify a couple things and, hopefully, in this PR
review learn a little more about the nuances here that confused me in
the first place.

rust-highfive · 2022-03-10T02:55:24Z

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

library/alloc/src/string.rs

mlodato517 · 2022-03-10T03:08:33Z

Oh I now realize that the claim that "indexing into a string can't be O(1)" is probably saying that "it can't be constant time under the assumption that &s[i] is returning the ith character" because obviously returning the ith byte or panicking if it isn't an ascii byte (similar to what string subslicing is doing) can be done in O(1) time and, in the case of single bytes, is best served with as_bytes()[i] or something.

Mark-Simulacrum · 2022-03-13T15:33:57Z

Oh I now realize that the claim that "indexing into a string can't be O(1)" is probably saying that "it can't be constant time under the assumption that &s[i] is returning the ith character" because obviously returning the ith byte or panicking if it isn't an ascii byte (similar to what string subslicing is doing) can be done in O(1) time and, in the case of single bytes, is best served with as_bytes()[i] or something.

Yes, that seems like the right understanding.

I generally agree that the use of indexing in these docs may not be particularly clear - maybe it is worth trying to clean that up a little. This PRs changes seem fine to me though, so I'm happy to r+ as-is or if you want to make further revisions review those; I don't have a good sense of how we can help avoid the confusion over indexing here.

mlodato517 · 2022-03-13T20:54:31Z

Thanks for taking a look! Okay, with your confirmation of the non-constant-time indexing being under the assumption that the value is a character index, I think the summation of my misunderstanding is:

"cannot index into a `String`"

Is the language of calling &string[i..j] "indexing" incorrect? If so, and the proper terminology is "slicing" (or something else), I can just update the PR to say something along the lines of:

While you cannot index into a String, you can slice it to get a substring using the syntax &string[i..j] - see [blah blah] for more details

And maybe I can update the SliceIndex documentation at some point to mention that (and maybe The Rust Book section) to have a change of vocabulary.

If "indexing" is not incorrect then I think I'm okay leaving it as is.

"Indexing is intended to be a constant-time operation, but UTF-8 encoding does not allow us to do this."

I could update it to say something like:

Indexing is intended to be a constant-time operation, but UTF-8 encoding does not allow us to do this if the index variable is meant as a char index.

But I don't love that because my follow up question would be "if slicing doesn't take char indices, why would regular indexing?". So I almost want to just delete this line and have constant-time indexing not be documented as a reason for not having indexing. I think the reason that the return value is non-obvious is a great reason on its own!

The Book has this reasoning too with slightly more information but I still don't love it - substring slicing doesn't look for the ith character, it just checks if the ith byte starts a character and we could do the same thing here (I don't think we should! But, in theory s[i] could return a char if i started a char (that's obviously wasteful for ASCII and so I'm glad we don't do that)).

I know your time is super valuable so thank you again for reviewing and apologies for typing a huge wall of text. I just want to make sure I'm helping and not doing it just to get my name on a commit in my favorite repo ;-)

mlodato517 · 2022-03-13T20:58:35Z

I don't know what the proper protocol is after rejiggering the labels so I re-requested review - sorry if this was incorrect!

Mark-Simulacrum · 2022-03-18T00:02:29Z

Is the language of calling &string[i..j] "indexing" incorrect?

I don't think there is standard language or any rules around whether this is indexing or slicing. I would probably call it either myself, not sure I'd lean in one direction.

I think it may make sense, rather than trying to do little edits, to replace the UTF-8 section with an expanded form (and maybe keeping an example) of the following:

Strings can be sliced to get &str, the index unit for slicing the String is bytes. Note that since String is UTF-8 encoded, this is not the same as characters.
String slicing will panic if the byte index is not on a char boundary. See str::is_char_boundary for details on what a char boundary is.
Slicing must be done with a byte index range, since only a range can flexibly represent one or more characters. If you want access to individual bytes, use as_bytes().
Slicing a String is a constant-time operation.
Consider using chars() if you need to iterate codepoints, and char_indices() if you want access to the byte offsets of each codepoint.

Does that seem like it would help clarify things? It seems like this gets the intended message across and can hopefully do so in a way that avoids some of the confusing language in the current text. I'm not quite sure how to best cover the "cannot index with usize, need a range" bit -- my suggestion in the bullets above feels a little weak. Maybe it's OK though.

Mark-Simulacrum · 2022-03-18T00:05:23Z

I don't know what the proper protocol is after rejiggering the labels so I re-requested review - sorry if this was incorrect!

Just switching labels is great, no need to do more than that. But also no harm in doing more either :)

mlodato517 · 2022-03-19T01:39:36Z

it may make sense, rather than trying to do little edits, to replace the UTF-8 section with an expanded form

I think you might be right! I was originally trying to avoid that because it'd sort of be duplicated between The Rust Book, the String docs, and the SliceIndex docs and keeping them up-to-date probably wouldn't be possible but I think spelling everything out is probably the least confusing way to go about this - I'll try and draft something up this weekend! Depending on the final form we can see if we want to go back and edit the book to match.

mlodato517 · 2022-03-20T20:25:18Z

Huh, surprised python x.py test std --doc didn't catch that locally ... I guess I needed alloc instead of std?

library/alloc/src/string.rs

Mark-Simulacrum · 2022-03-30T11:48:01Z

library/alloc/src/string.rs

+/// of UTF-8 encoding we can't find the `i`th `char` in constant-time.
+/// So it should be a byte index. But then what should `s[i]` return?
+/// It could be a `char` but that could be wasteful if the character
+/// starting at byte index `i` is encoded in UTF-8 with fewer than four bytes.


s[i] can't actually return a char, since the Index trait requires &Self::Output. s[i] is *s.index(i) under the hood, but there's no way for String to return a &char in that method, since the char doesn't live anywhere in memory.

That's a really good point I hadn't considered at all! Okay so s[i] could return &char if String was internally Vec<char> instead of Vec<u8>. We have Vec<u8> because of the UTF-8 encoding (because Vec<char> would be UTF-32?). So I guess the logic here is:

we could've chosen any encoding and we chose UTF-8

because we chose UTF-8 we store data as Vec<u8>

because we store data as Vec<u8>, s.index(i) can only return a &u8

Is that correct? I can't tell if the 2nd option is logically required or if there are other ways of storing UTF-8 data

In general 1+2 are correct (there's a 0. which is that UTF-8 is generally a pretty good default due to interoperability with ASCII and common usage throughout e.g. web etc).

We could return anything that's in the vec - for example, &[u8] or &str - but those are weird to return from a particular index accessed.

Hopefully handled in a52ab6946e508fdfe4f8dc6313f9b7d9e0d77b58

Mark-Simulacrum · 2022-03-30T11:51:53Z

library/alloc/src/string.rs

+/// ```
+///
+/// This raises an interesting question as to how `s[i]` should work.
+/// What should `i` be here - a byte index or a `char` index? Because


I think it might be worth noting here that there are other options (e.g., grapheme indexes), but they are similarly not constant time -- really, the problem here is that because these are not constant time operations, you are unlikely to actually want just the nth grapheme or char -- you probably want to iterate, so we have chars() and there's graphemes() somewhere in the ecosystem.

Yeah I had intentionally not mentioned graphemes (and I didn't love them in the previous example) because nothing else in std:: had anything to do with graphemes (to my knowledge!). But I can mention it with the reasoning above!

you are unlikely to actually want just the nth grapheme or char -- you probably want to iterate

I'm also not 100% sold on this reasoning (That we didn't provide an index operation because you probably wanted to iterate anyway because ... what if the user doesn't want to iterate? Seems like a weird assumption to me 🤔) but I think

really, the problem here is that because these are not constant time operations

is incontrovertible. Is it okay if I just focus on that or do we think mentioning the other thing is important? I might just be missing the value! 😅

I think not mentioning graphemes is probably ok (it might make sense to have some notes on that topic in char docs but probably not mentioning here is good).

The point is that iterators "obviously" expose non-constant time access via nth(...) so you essentially already have that API, and since it's typically not what you want we avoid making an easy to use but actually wrong API by implementing index with &u8 return type.

Hopefully handled in a52ab6946e508fdfe4f8dc6313f9b7d9e0d77b58

Mark-Simulacrum · 2022-03-30T11:52:58Z

library/alloc/src/string.rs

+/// assert_eq!(s.as_bytes()[0], 240);
+/// ```
+///
+/// Lacking any obvious API, it is simply forbidden:
 ///
 /// ```compile_fail,E0277
 /// let s = "hello";
 ///
 /// println!("The first letter of s is {}", s[0]); // ERROR!!!


I would probably adjust this to more clearly indicate what the error is: maybe we can say something like "it is forbidden, and will not compile". This page may be one of the first a newcomer to Rust reads, and I think that extra clarity will be helpful.

4891bc9e6e42c0cfd6b0f39dab8a2374cb18fb25

Mark-Simulacrum · 2022-03-30T11:54:46Z

Apologies for the delay in reviewing :)

I am liking the new text quite a bit, I think it reads much better, though I did have some scattered comments.

Mark-Simulacrum · 2022-04-06T12:47:36Z

library/alloc/src/string.rs

-/// you cannot index into a `String`:
+/// `String`s are always valid UTF-8. If you need a non-UTF-8 string, consider
+/// [`OsString`]. It is similar, but without the UTF-8 constraint. Because UTF-8
+/// is a variable width encoding, `String`s can be smaller than an array of


Suggested change

/// is a variable width encoding, `String`s can be smaller than an array of

/// is a variable width encoding, `String`s are typically smaller than an array of

I think we want to push people towards String, and this seems pretty much always true -- for space-efficiency String is essentially always the better option.

I think that's a great idea - 90f3f23495601a20bb04d74000b0a8e87d17444a

Mark-Simulacrum · 2022-04-06T12:48:54Z

library/alloc/src/string.rs

 ///
 /// ```compile_fail,E0277
 /// let s = "hello";
 ///
-/// println!("The first letter of s is {}", s[0]); // ERROR!!!
+/// // The following is forbidden and will not compile!


Suggested change

/// // The following is forbidden and will not compile!

/// // The following will not compile!

504bc254359fa5b97e853eb906b6bb0520f893ea

Mark-Simulacrum · 2022-04-08T22:49:04Z

Thanks, the new text looks good to me. If you could squash the commits down then I'm happy to approve this!

**This Commit** Adds some clarity around indexing into Strings and the constraints driving various decisions there. **Why?** The [`String` documentation][0] mentions how `String`s can't be indexed but `Range` has an implementation for `SliceIndex<str>`. This can be confusing. There are also several statements to explain the lack of `String` indexing: - the inability to index into a `String` is an implication of UTF-8 encoding - indexing into a `String` could not be constant-time with UTF-8 encoding - indexing into a `String` does not have an obvious return type This last statement made sense but the first two seemed contradictory to the documentation around [`SliceIndex<str>`][1] which mention: - one can index into a `String` with a `Range` (also called substring slicing but it uses the same syntax and the method name is `index`) - `Range` indexing into a `String` is constant-time To resolve this seeming contradiction the documentation is reworked to more clearly explain what factors drive the decision to disallow indexing into a `String` with a single number. [0]: https://doc.rust-lang.org/stable/std/string/struct.String.html#utf-8 [1]: https://doc.rust-lang.org/stable/std/slice/trait.SliceIndex.html#impl-SliceIndex%3Cstr%3E

mlodato517 · 2022-04-09T13:28:03Z

Fixing a small typo in the commit message...

Mark-Simulacrum · 2022-04-09T14:16:00Z

@bors r+

bors · 2022-04-09T14:16:02Z

📌 Commit 9cf35a6 has been approved by Mark-Simulacrum

Rollup of 7 pull requests Successful merges: - rust-lang#94794 (Clarify indexing into Strings) - rust-lang#95361 (Make non-power-of-two alignments a validity error in `Layout`) - rust-lang#95369 (Fix `x test src/librustdoc` with `download-rustc` enabled ) - rust-lang#95805 (Left overs of rust-lang#95761) - rust-lang#95808 (expand: Remove `ParseSess::missing_fragment_specifiers`) - rust-lang#95817 (hide another #[allow] directive from a docs example) - rust-lang#95831 (Use bitwise XOR in to_ascii_uppercase) Failed merges: r? `@ghost` `@rustbot` modify labels: rollup

rust-highfive assigned Mark-Simulacrum Mar 10, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 10, 2022

mlodato517 commented Mar 10, 2022

View reviewed changes

library/alloc/src/string.rs Outdated Show resolved Hide resolved

mlodato517 commented Mar 10, 2022

View reviewed changes

library/alloc/src/string.rs Show resolved Hide resolved

mlodato517 commented Mar 10, 2022

View reviewed changes

library/alloc/src/string.rs Show resolved Hide resolved

mlodato517 marked this pull request as ready for review March 10, 2022 02:58

mlodato517 commented Mar 10, 2022

View reviewed changes

library/alloc/src/string.rs Outdated Show resolved Hide resolved

Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 13, 2022

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 13, 2022

mlodato517 requested a review from Mark-Simulacrum March 13, 2022 20:57

Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 18, 2022

This comment has been minimized.

Sign in to view

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 21, 2022

Mark-Simulacrum reviewed Mar 30, 2022

View reviewed changes

library/alloc/src/string.rs Show resolved Hide resolved

Mark-Simulacrum reviewed Mar 30, 2022

View reviewed changes

Mark-Simulacrum reviewed Apr 6, 2022

View reviewed changes

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 9, 2022

Dylan-DPC mentioned this pull request Apr 9, 2022

Rollup of 7 pull requests #95855

Merged

bors merged commit 1ced0b6 into rust-lang:master Apr 9, 2022

rustbot added this to the 1.62.0 milestone Apr 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify indexing into Strings #94794

Clarify indexing into Strings #94794

mlodato517 commented Mar 10, 2022

rust-highfive commented Mar 10, 2022

mlodato517 commented Mar 10, 2022

Mark-Simulacrum commented Mar 13, 2022

mlodato517 commented Mar 13, 2022

mlodato517 commented Mar 13, 2022

Mark-Simulacrum commented Mar 18, 2022

Mark-Simulacrum commented Mar 18, 2022

mlodato517 commented Mar 19, 2022

This comment has been minimized.

mlodato517 commented Mar 20, 2022

Mark-Simulacrum Mar 30, 2022

mlodato517 Mar 30, 2022

Mark-Simulacrum Mar 30, 2022

mlodato517 Apr 6, 2022

Mark-Simulacrum Mar 30, 2022

mlodato517 Mar 30, 2022

mlodato517 Mar 30, 2022

Mark-Simulacrum Mar 30, 2022

mlodato517 Apr 6, 2022

Mark-Simulacrum Mar 30, 2022

mlodato517 Mar 30, 2022 •

edited

Loading

Mark-Simulacrum commented Mar 30, 2022

Mark-Simulacrum Apr 6, 2022

mlodato517 Apr 6, 2022

Mark-Simulacrum Apr 6, 2022

mlodato517 Apr 6, 2022

Mark-Simulacrum commented Apr 8, 2022

mlodato517 commented Apr 9, 2022

Mark-Simulacrum commented Apr 9, 2022

bors commented Apr 9, 2022

	/// is a variable width encoding, `String`s can be smaller than an array of
	/// is a variable width encoding, `String`s are typically smaller than an array of

	/// // The following is forbidden and will not compile!
	/// // The following will not compile!

Clarify indexing into Strings #94794

Clarify indexing into Strings #94794

Conversation

mlodato517 commented Mar 10, 2022

rust-highfive commented Mar 10, 2022

mlodato517 commented Mar 10, 2022

Mark-Simulacrum commented Mar 13, 2022

mlodato517 commented Mar 13, 2022

mlodato517 commented Mar 13, 2022

Mark-Simulacrum commented Mar 18, 2022

Mark-Simulacrum commented Mar 18, 2022

mlodato517 commented Mar 19, 2022

This comment has been minimized.

mlodato517 commented Mar 20, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlodato517 Mar 30, 2022 • edited Loading

Choose a reason for hiding this comment

Mark-Simulacrum commented Mar 30, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mark-Simulacrum commented Apr 8, 2022

mlodato517 commented Apr 9, 2022

Mark-Simulacrum commented Apr 9, 2022

bors commented Apr 9, 2022

mlodato517 Mar 30, 2022 •

edited

Loading