-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify indexing into Strings #94794
Clarify indexing into Strings #94794
Conversation
(rust-highfive has picked a reviewer for you, use r? to override) |
Oh I now realize that the claim that "indexing into a string can't be O(1)" is probably saying that "it can't be constant time under the assumption that |
Yes, that seems like the right understanding. I generally agree that the use of indexing in these docs may not be particularly clear - maybe it is worth trying to clean that up a little. This PRs changes seem fine to me though, so I'm happy to r+ as-is or if you want to make further revisions review those; I don't have a good sense of how we can help avoid the confusion over indexing here. |
Thanks for taking a look! Okay, with your confirmation of the non-constant-time indexing being under the assumption that the value is a character index, I think the summation of my misunderstanding is: "cannot index into a `String`"Is the language of calling
And maybe I can update the If "indexing" is not incorrect then I think I'm okay leaving it as is. "Indexing is intended to be a constant-time operation, but UTF-8 encoding does not allow us to do this."I could update it to say something like:
But I don't love that because my follow up question would be "if slicing doesn't take char indices, why would regular indexing?". So I almost want to just delete this line and have constant-time indexing not be documented as a reason for not having indexing. I think the reason that the return value is non-obvious is a great reason on its own! The Book has this reasoning too with slightly more information but I still don't love it - substring slicing doesn't look for the I know your time is super valuable so thank you again for reviewing and apologies for typing a huge wall of text. I just want to make sure I'm helping and not doing it just to get my name on a commit in my favorite repo ;-) |
I don't know what the proper protocol is after rejiggering the labels so I re-requested review - sorry if this was incorrect! |
I don't think there is standard language or any rules around whether this is indexing or slicing. I would probably call it either myself, not sure I'd lean in one direction. I think it may make sense, rather than trying to do little edits, to replace the UTF-8 section with an expanded form (and maybe keeping an example) of the following:
Does that seem like it would help clarify things? It seems like this gets the intended message across and can hopefully do so in a way that avoids some of the confusing language in the current text. I'm not quite sure how to best cover the "cannot index with usize, need a range" bit -- my suggestion in the bullets above feels a little weak. Maybe it's OK though. |
Just switching labels is great, no need to do more than that. But also no harm in doing more either :) |
I think you might be right! I was originally trying to avoid that because it'd sort of be duplicated between The Rust Book, the |
This comment has been minimized.
This comment has been minimized.
Huh, surprised |
library/alloc/src/string.rs
Outdated
/// of UTF-8 encoding we can't find the `i`th `char` in constant-time. | ||
/// So it should be a byte index. But then what should `s[i]` return? | ||
/// It could be a `char` but that could be wasteful if the character | ||
/// starting at byte index `i` is encoded in UTF-8 with fewer than four bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s[i]
can't actually return a char, since the Index trait requires &Self::Output
. s[i]
is *s.index(i)
under the hood, but there's no way for String to return a &char
in that method, since the char doesn't live anywhere in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a really good point I hadn't considered at all! Okay so s[i]
could return &char
if String
was internally Vec<char>
instead of Vec<u8>
. We have Vec<u8>
because of the UTF-8 encoding (because Vec<char>
would be UTF-32?). So I guess the logic here is:
- we could've chosen any encoding and we chose UTF-8
- because we chose UTF-8 we store data as
Vec<u8>
- because we store data as
Vec<u8>
,s.index(i)
can only return a&u8
Is that correct? I can't tell if the 2nd option is logically required or if there are other ways of storing UTF-8 data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general 1+2 are correct (there's a 0. which is that UTF-8 is generally a pretty good default due to interoperability with ASCII and common usage throughout e.g. web etc).
We could return anything that's in the vec - for example, &[u8] or &str - but those are weird to return from a particular index accessed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully handled in a52ab6946e508fdfe4f8dc6313f9b7d9e0d77b58
library/alloc/src/string.rs
Outdated
/// ``` | ||
/// | ||
/// This raises an interesting question as to how `s[i]` should work. | ||
/// What should `i` be here - a byte index or a `char` index? Because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be worth noting here that there are other options (e.g., grapheme indexes), but they are similarly not constant time -- really, the problem here is that because these are not constant time operations, you are unlikely to actually want just the nth grapheme or char -- you probably want to iterate, so we have chars() and there's graphemes() somewhere in the ecosystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I had intentionally not mentioned graphemes (and I didn't love them in the previous example) because nothing else in std::
had anything to do with graphemes (to my knowledge!). But I can mention it with the reasoning above!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are unlikely to actually want just the nth grapheme or char -- you probably want to iterate
I'm also not 100% sold on this reasoning (That we didn't provide an index operation because you probably wanted to iterate anyway because ... what if the user doesn't want to iterate? Seems like a weird assumption to me 🤔) but I think
really, the problem here is that because these are not constant time operations
is incontrovertible. Is it okay if I just focus on that or do we think mentioning the other thing is important? I might just be missing the value! 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not mentioning graphemes is probably ok (it might make sense to have some notes on that topic in char docs but probably not mentioning here is good).
The point is that iterators "obviously" expose non-constant time access via nth(...) so you essentially already have that API, and since it's typically not what you want we avoid making an easy to use but actually wrong API by implementing index with &u8 return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully handled in a52ab6946e508fdfe4f8dc6313f9b7d9e0d77b58
library/alloc/src/string.rs
Outdated
/// assert_eq!(s.as_bytes()[0], 240); | ||
/// ``` | ||
/// | ||
/// Lacking any obvious API, it is simply forbidden: | ||
/// | ||
/// ```compile_fail,E0277 | ||
/// let s = "hello"; | ||
/// | ||
/// println!("The first letter of s is {}", s[0]); // ERROR!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably adjust this to more clearly indicate what the error is: maybe we can say something like "it is forbidden, and will not compile". This page may be one of the first a newcomer to Rust reads, and I think that extra clarity will be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4891bc9e6e42c0cfd6b0f39dab8a2374cb18fb25
Apologies for the delay in reviewing :) I am liking the new text quite a bit, I think it reads much better, though I did have some scattered comments. |
library/alloc/src/string.rs
Outdated
/// you cannot index into a `String`: | ||
/// `String`s are always valid UTF-8. If you need a non-UTF-8 string, consider | ||
/// [`OsString`]. It is similar, but without the UTF-8 constraint. Because UTF-8 | ||
/// is a variable width encoding, `String`s can be smaller than an array of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// is a variable width encoding, `String`s can be smaller than an array of | |
/// is a variable width encoding, `String`s are typically smaller than an array of |
I think we want to push people towards String, and this seems pretty much always true -- for space-efficiency String is essentially always the better option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a great idea - 90f3f23495601a20bb04d74000b0a8e87d17444a
library/alloc/src/string.rs
Outdated
/// | ||
/// ```compile_fail,E0277 | ||
/// let s = "hello"; | ||
/// | ||
/// println!("The first letter of s is {}", s[0]); // ERROR!!! | ||
/// // The following is forbidden and will not compile! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// // The following is forbidden and will not compile! | |
/// // The following will not compile! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
504bc254359fa5b97e853eb906b6bb0520f893ea
Thanks, the new text looks good to me. If you could squash the commits down then I'm happy to approve this! |
**This Commit** Adds some clarity around indexing into Strings and the constraints driving various decisions there. **Why?** The [`String` documentation][0] mentions how `String`s can't be indexed but `Range` has an implementation for `SliceIndex<str>`. This can be confusing. There are also several statements to explain the lack of `String` indexing: - the inability to index into a `String` is an implication of UTF-8 encoding - indexing into a `String` could not be constant-time with UTF-8 encoding - indexing into a `String` does not have an obvious return type This last statement made sense but the first two seemed contradictory to the documentation around [`SliceIndex<str>`][1] which mention: - one can index into a `String` with a `Range` (also called substring slicing but it uses the same syntax and the method name is `index`) - `Range` indexing into a `String` is constant-time To resolve this seeming contradiction the documentation is reworked to more clearly explain what factors drive the decision to disallow indexing into a `String` with a single number. [0]: https://doc.rust-lang.org/stable/std/string/struct.String.html#utf-8 [1]: https://doc.rust-lang.org/stable/std/slice/trait.SliceIndex.html#impl-SliceIndex%3Cstr%3E
Fixing a small typo in the commit message... |
@bors r+ |
📌 Commit 9cf35a6 has been approved by |
Rollup of 7 pull requests Successful merges: - rust-lang#94794 (Clarify indexing into Strings) - rust-lang#95361 (Make non-power-of-two alignments a validity error in `Layout`) - rust-lang#95369 (Fix `x test src/librustdoc` with `download-rustc` enabled ) - rust-lang#95805 (Left overs of rust-lang#95761) - rust-lang#95808 (expand: Remove `ParseSess::missing_fragment_specifiers`) - rust-lang#95817 (hide another #[allow] directive from a docs example) - rust-lang#95831 (Use bitwise XOR in to_ascii_uppercase) Failed merges: r? `@ghost` `@rustbot` modify labels: rollup
This Commit
Adds some clarity around indexing into Strings.
Why?
I was reading through the
Range
documentation and saw animplementation for
SliceIndex<str>
. I was surprised to see this andwent to read the
String
documentation and, to me, it seemed tosay (at least) three things:
String
String
could not be constant-timeString
does not have an obvious return typeI absolutely agree with the last point but the first two seemed
contradictory to the documentation around
SliceIndex<str>
which mention:
"indexing" but, because the method is called
index
and I associateanything with square brackets with "indexing" it was enough to
confuse me)
on my part but if
&s[i..i+1]
is O(1) then it seems confusing that&s[i]
could not possibly be O(1))So I was hoping to clarify a couple things and, hopefully, in this PR
review learn a little more about the nuances here that confused me in
the first place.