Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify indexing into Strings #94794

Merged
merged 1 commit into from
Apr 9, 2022
Merged

Clarify indexing into Strings #94794

merged 1 commit into from
Apr 9, 2022

Conversation

mlodato517
Copy link
Contributor

This Commit
Adds some clarity around indexing into Strings.

Why?
I was reading through the Range documentation and saw an
implementation for SliceIndex<str>. I was surprised to see this and
went to read the String documentation and, to me, it seemed to
say (at least) three things:

  1. you cannot index into a String
  2. indexing into a String could not be constant-time
  3. indexing into a String does not have an obvious return type

I absolutely agree with the last point but the first two seemed
contradictory to the documentation around SliceIndex<str>
which mention:

  1. you can do substring slicing (which is probably different than
    "indexing" but, because the method is called index and I associate
    anything with square brackets with "indexing" it was enough to
    confuse me)
  2. substring slicing is constant-time (this may be algorithmic ignorance
    on my part but if &s[i..i+1] is O(1) then it seems confusing that
    &s[i] could not possibly be O(1))

So I was hoping to clarify a couple things and, hopefully, in this PR
review learn a little more about the nuances here that confused me in
the first place.

@rust-highfive
Copy link
Collaborator

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 10, 2022
@mlodato517 mlodato517 marked this pull request as ready for review March 10, 2022 02:58
@mlodato517
Copy link
Contributor Author

Oh I now realize that the claim that "indexing into a string can't be O(1)" is probably saying that "it can't be constant time under the assumption that &s[i] is returning the ith character" because obviously returning the ith byte or panicking if it isn't an ascii byte (similar to what string subslicing is doing) can be done in O(1) time and, in the case of single bytes, is best served with as_bytes()[i] or something.

@Mark-Simulacrum
Copy link
Member

Oh I now realize that the claim that "indexing into a string can't be O(1)" is probably saying that "it can't be constant time under the assumption that &s[i] is returning the ith character" because obviously returning the ith byte or panicking if it isn't an ascii byte (similar to what string subslicing is doing) can be done in O(1) time and, in the case of single bytes, is best served with as_bytes()[i] or something.

Yes, that seems like the right understanding.

I generally agree that the use of indexing in these docs may not be particularly clear - maybe it is worth trying to clean that up a little. This PRs changes seem fine to me though, so I'm happy to r+ as-is or if you want to make further revisions review those; I don't have a good sense of how we can help avoid the confusion over indexing here.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 13, 2022
@mlodato517
Copy link
Contributor Author

Thanks for taking a look! Okay, with your confirmation of the non-constant-time indexing being under the assumption that the value is a character index, I think the summation of my misunderstanding is:

"cannot index into a `String`"

Is the language of calling &string[i..j] "indexing" incorrect? If so, and the proper terminology is "slicing" (or something else), I can just update the PR to say something along the lines of:

While you cannot index into a String, you can slice it to get a substring using the syntax &string[i..j] - see [blah blah] for more details

And maybe I can update the SliceIndex documentation at some point to mention that (and maybe The Rust Book section) to have a change of vocabulary.

If "indexing" is not incorrect then I think I'm okay leaving it as is.

"Indexing is intended to be a constant-time operation, but UTF-8 encoding does not allow us to do this."

I could update it to say something like:

Indexing is intended to be a constant-time operation, but UTF-8 encoding does not allow us to do this if the index variable is meant as a char index.

But I don't love that because my follow up question would be "if slicing doesn't take char indices, why would regular indexing?". So I almost want to just delete this line and have constant-time indexing not be documented as a reason for not having indexing. I think the reason that the return value is non-obvious is a great reason on its own!

The Book has this reasoning too with slightly more information but I still don't love it - substring slicing doesn't look for the ith character, it just checks if the ith byte starts a character and we could do the same thing here (I don't think we should! But, in theory s[i] could return a char if i started a char (that's obviously wasteful for ASCII and so I'm glad we don't do that)).

I know your time is super valuable so thank you again for reviewing and apologies for typing a huge wall of text. I just want to make sure I'm helping and not doing it just to get my name on a commit in my favorite repo ;-)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 13, 2022
@mlodato517
Copy link
Contributor Author

I don't know what the proper protocol is after rejiggering the labels so I re-requested review - sorry if this was incorrect!

@Mark-Simulacrum
Copy link
Member

Is the language of calling &string[i..j] "indexing" incorrect?

I don't think there is standard language or any rules around whether this is indexing or slicing. I would probably call it either myself, not sure I'd lean in one direction.

I think it may make sense, rather than trying to do little edits, to replace the UTF-8 section with an expanded form (and maybe keeping an example) of the following:

  • Strings can be sliced to get &str, the index unit for slicing the String is bytes. Note that since String is UTF-8 encoded, this is not the same as characters.
  • String slicing will panic if the byte index is not on a char boundary. See str::is_char_boundary for details on what a char boundary is.
  • Slicing must be done with a byte index range, since only a range can flexibly represent one or more characters. If you want access to individual bytes, use as_bytes().
  • Slicing a String is a constant-time operation.
  • Consider using chars() if you need to iterate codepoints, and char_indices() if you want access to the byte offsets of each codepoint.

Does that seem like it would help clarify things? It seems like this gets the intended message across and can hopefully do so in a way that avoids some of the confusing language in the current text. I'm not quite sure how to best cover the "cannot index with usize, need a range" bit -- my suggestion in the bullets above feels a little weak. Maybe it's OK though.

@Mark-Simulacrum
Copy link
Member

I don't know what the proper protocol is after rejiggering the labels so I re-requested review - sorry if this was incorrect!

Just switching labels is great, no need to do more than that. But also no harm in doing more either :)

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 18, 2022
@mlodato517
Copy link
Contributor Author

it may make sense, rather than trying to do little edits, to replace the UTF-8 section with an expanded form

I think you might be right! I was originally trying to avoid that because it'd sort of be duplicated between The Rust Book, the String docs, and the SliceIndex docs and keeping them up-to-date probably wouldn't be possible but I think spelling everything out is probably the least confusing way to go about this - I'll try and draft something up this weekend! Depending on the final form we can see if we want to go back and edit the book to match.

@rust-log-analyzer

This comment has been minimized.

@mlodato517
Copy link
Contributor Author

Huh, surprised python x.py test std --doc didn't catch that locally ... I guess I needed alloc instead of std?

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 21, 2022
/// of UTF-8 encoding we can't find the `i`th `char` in constant-time.
/// So it should be a byte index. But then what should `s[i]` return?
/// It could be a `char` but that could be wasteful if the character
/// starting at byte index `i` is encoded in UTF-8 with fewer than four bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s[i] can't actually return a char, since the Index trait requires &Self::Output. s[i] is *s.index(i) under the hood, but there's no way for String to return a &char in that method, since the char doesn't live anywhere in memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a really good point I hadn't considered at all! Okay so s[i] could return &char if String was internally Vec<char> instead of Vec<u8>. We have Vec<u8> because of the UTF-8 encoding (because Vec<char> would be UTF-32?). So I guess the logic here is:

  1. we could've chosen any encoding and we chose UTF-8
  2. because we chose UTF-8 we store data as Vec<u8>
  3. because we store data as Vec<u8>, s.index(i) can only return a &u8

Is that correct? I can't tell if the 2nd option is logically required or if there are other ways of storing UTF-8 data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general 1+2 are correct (there's a 0. which is that UTF-8 is generally a pretty good default due to interoperability with ASCII and common usage throughout e.g. web etc).

We could return anything that's in the vec - for example, &[u8] or &str - but those are weird to return from a particular index accessed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully handled in a52ab6946e508fdfe4f8dc6313f9b7d9e0d77b58

/// ```
///
/// This raises an interesting question as to how `s[i]` should work.
/// What should `i` be here - a byte index or a `char` index? Because
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be worth noting here that there are other options (e.g., grapheme indexes), but they are similarly not constant time -- really, the problem here is that because these are not constant time operations, you are unlikely to actually want just the nth grapheme or char -- you probably want to iterate, so we have chars() and there's graphemes() somewhere in the ecosystem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I had intentionally not mentioned graphemes (and I didn't love them in the previous example) because nothing else in std:: had anything to do with graphemes (to my knowledge!). But I can mention it with the reasoning above!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are unlikely to actually want just the nth grapheme or char -- you probably want to iterate

I'm also not 100% sold on this reasoning (That we didn't provide an index operation because you probably wanted to iterate anyway because ... what if the user doesn't want to iterate? Seems like a weird assumption to me 🤔) but I think

really, the problem here is that because these are not constant time operations

is incontrovertible. Is it okay if I just focus on that or do we think mentioning the other thing is important? I might just be missing the value! 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not mentioning graphemes is probably ok (it might make sense to have some notes on that topic in char docs but probably not mentioning here is good).

The point is that iterators "obviously" expose non-constant time access via nth(...) so you essentially already have that API, and since it's typically not what you want we avoid making an easy to use but actually wrong API by implementing index with &u8 return type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully handled in a52ab6946e508fdfe4f8dc6313f9b7d9e0d77b58

/// assert_eq!(s.as_bytes()[0], 240);
/// ```
///
/// Lacking any obvious API, it is simply forbidden:
///
/// ```compile_fail,E0277
/// let s = "hello";
///
/// println!("The first letter of s is {}", s[0]); // ERROR!!!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably adjust this to more clearly indicate what the error is: maybe we can say something like "it is forbidden, and will not compile". This page may be one of the first a newcomer to Rust reads, and I think that extra clarity will be helpful.

Copy link
Contributor Author

@mlodato517 mlodato517 Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4891bc9e6e42c0cfd6b0f39dab8a2374cb18fb25

@Mark-Simulacrum
Copy link
Member

Apologies for the delay in reviewing :)

I am liking the new text quite a bit, I think it reads much better, though I did have some scattered comments.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 30, 2022
/// you cannot index into a `String`:
/// `String`s are always valid UTF-8. If you need a non-UTF-8 string, consider
/// [`OsString`]. It is similar, but without the UTF-8 constraint. Because UTF-8
/// is a variable width encoding, `String`s can be smaller than an array of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// is a variable width encoding, `String`s can be smaller than an array of
/// is a variable width encoding, `String`s are typically smaller than an array of

I think we want to push people towards String, and this seems pretty much always true -- for space-efficiency String is essentially always the better option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a great idea - 90f3f23495601a20bb04d74000b0a8e87d17444a

///
/// ```compile_fail,E0277
/// let s = "hello";
///
/// println!("The first letter of s is {}", s[0]); // ERROR!!!
/// // The following is forbidden and will not compile!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// // The following is forbidden and will not compile!
/// // The following will not compile!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

504bc254359fa5b97e853eb906b6bb0520f893ea

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 6, 2022
@Mark-Simulacrum
Copy link
Member

Thanks, the new text looks good to me. If you could squash the commits down then I'm happy to approve this!

**This Commit**
Adds some clarity around indexing into Strings and the constraints
driving various decisions there.

**Why?**
The [`String` documentation][0] mentions how `String`s can't be indexed
but `Range` has an implementation for `SliceIndex<str>`. This can be
confusing. There are also several statements to explain the lack of
`String` indexing:

- the inability to index into a `String` is an implication of UTF-8
  encoding
- indexing into a `String` could not be constant-time with UTF-8
  encoding
- indexing into a `String` does not have an obvious return type

This last statement made sense but the first two seemed contradictory to
the documentation around [`SliceIndex<str>`][1] which mention:

- one can index into a `String` with a `Range` (also called substring
  slicing but it uses the same syntax and the method name is `index`)
- `Range` indexing into a `String` is constant-time

To resolve this seeming contradiction the documentation is reworked to
more clearly explain what factors drive the decision to disallow
indexing into a `String` with a single number.

[0]: https://doc.rust-lang.org/stable/std/string/struct.String.html#utf-8
[1]: https://doc.rust-lang.org/stable/std/slice/trait.SliceIndex.html#impl-SliceIndex%3Cstr%3E
@mlodato517
Copy link
Contributor Author

Fixing a small typo in the commit message...

@Mark-Simulacrum
Copy link
Member

@bors r+

@bors
Copy link
Contributor

bors commented Apr 9, 2022

📌 Commit 9cf35a6 has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 9, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 9, 2022
Rollup of 7 pull requests

Successful merges:

 - rust-lang#94794 (Clarify indexing into Strings)
 - rust-lang#95361 (Make non-power-of-two alignments a validity error in `Layout`)
 - rust-lang#95369 (Fix `x test src/librustdoc` with `download-rustc` enabled )
 - rust-lang#95805 (Left overs of rust-lang#95761)
 - rust-lang#95808 (expand: Remove `ParseSess::missing_fragment_specifiers`)
 - rust-lang#95817 (hide another #[allow] directive from a docs example)
 - rust-lang#95831 (Use bitwise XOR in to_ascii_uppercase)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 1ced0b6 into rust-lang:master Apr 9, 2022
@rustbot rustbot added this to the 1.62.0 milestone Apr 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants