Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Extend Pattern API to OsStr #2295

Merged
merged 7 commits into from
Apr 9, 2018
Merged

Conversation

kennytm
Copy link
Member

@kennytm kennytm commented Jan 16, 2018

Supersedes the "Pattern API" part of RFC #1309.

cc #900.
cc @Kimundi (#528 "Pattern API 1.0")
cc @SimonSapin (WTF-8)

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jan 16, 2018
@tbu-
Copy link
Contributor

tbu- commented Jan 25, 2018

Probably completely out of place but – &[u8] is also missing the pattern API or useful searching in general.

@aturon
Copy link
Member

aturon commented Feb 7, 2018

Thank you thank you thank you!

@SimonSapin @Kimundi can one or both of you please provide a review of this RFC?

@aturon
Copy link
Member

aturon commented Feb 14, 2018

cc @rust-lang/libs this has been a very longstanding issue. Can we consider landing this API as unstable?

Copy link
Member

@Kimundi Kimundi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good (if somewhat hacky) way to solve the surrogate point issue in doing pattern searches in OsStr. 👍

Short summary: This RFC essentially proposes two things:

  • The ability to slice an OsStr.
  • An modification of the semantics of OsStr-slices under Windows, that allows them to represent unpaired surrogate codepoints by pointing at only a parts of the utf8 encoding of a regular codepoint.

Both are prerequisites to extending the Pattern API, or other searching algorithms, to allow searching patterns inside an OsStr without introducing restrictions on the data the API could handle.

This simplifies the work needed for an follow-up RFC to actually extend the Pattern API to other types like OsStr or [T].


However, it is not clear to me why the proposed semantic for slicing an OsStr is as it is. Assuming we have the codepoint "\u{10000}" and want to split it into the two surrogates "\u{d800}" and "\u{dc00}". Then it seems there are two ways to expose the slicing operations:

  • Hide the proposed scheme behind virtual byte positions (What this RFC proposes).

    OsStr("\u{10000}")      = [u8]("\u{10000}")      = f0 90 80 80
    OsStr("\u{10000}")[..2] = [u8]("\u{10000}")[..3] = f0 90 80
    OsStr("\u{10000}")[2..] = [u8]("\u{10000}")[1..] =    90 80 80
    
  • Expose the proposed scheme, and always slice at real byte positions.

    OsStr("\u{10000}")      = [u8]("\u{10000}")      = f0 90 80 80
    OsStr("\u{10000}")[..3] = [u8]("\u{10000}")[..3] = f0 90 80
    OsStr("\u{10000}")[1..] = [u8]("\u{10000}")[1..] =    90 80 80
    

It seems the main advantage of the first scheme is that there is exactly one slice position that indicates the middle of a surrogate pair. But it is not clear to me if this property is actually needed, since the low level implementations of OsStr searching will deal with the actual byte positions anyway.

I guess it amounts to the question of whether you can take the index returned by, eg, OsStr::find() and use it both as an position to slice-to, and a position to slice-from. The first scheme would allow both, the latter only one of them.


Besides from that, it is also not clear which slice positions should be legal in general across windows and unix. Again I can image two schemes:

  • Allow slicing at each position that would be a legal index in the native format of an OsStr. This means only valid codepoint and half-codepoint positions on windows, and arbitrary byte positions on unix.
  • Allow slicing only at valid unicode boundaries. This means only valid codepoint and half-codepoint positions on windows, and utf8 boundaries on unix.
  • Don't define which positions are legal in general, just that indices returned by the std lib are valid.

Lastly, I think we should also be sure that changing the equality semantic of OsStrs under windows does not impact performance too much, or break existing code, since its afaik the only aspect of the proposal that changes an existing stable API.

@kennytm
Copy link
Member Author

kennytm commented Feb 23, 2018

@Kimundi Thanks for the review 😊

Yes OsStr::find() is the main reason I chose the "middle of the 4-byte sequence" scheme.

let s = OsStr::new("\u{10000}");
assert_eq!(s.len(), 4);
let index = s.find('\u{dc00}').unwrap();
assert_eq!(index, 1);     // if we "Expose the proposed scheme"
let right = &s[index..];  // this will be [90 80 80], fine
let left = &s[..index];   // but this will be [f0] ??????

This also relates to the next question "which slice positions should be legal". We could

  1. make "\u{10000}"[..1] panic, or
  2. make "\u{10000}"[..1] same as "\u{10000}"[..3] and return f0 90 80

The second choice is no better than what this RFC proposed. There would also be the question in the meaning of s[1..3] and s[3..1].

The first choice means if you want the left, you'll need to awkwardly slice with &s[..(index+2)].

My preference about legal slice positions would be:

  1. 0 is always a valid position.
  2. self.len() is always a valid position.
  3. find(), rfind(), find_range(), rfind_range(), match_indices(), rmatch_indices(), match_ranges(), rmatch_ranges() always yield valid positions.

Because we support self.find(OsStr::from_bytes(b"\xff")) on Unix, a legal index cannot be restricted to UTF-8 boundaries.

@Kimundi
Copy link
Member

Kimundi commented Feb 24, 2018

Makes sense! So, would you agree that we could define the slicing semantic as:

  • Under unix, the native OsStr is just a bunch of bytes. Slicing and searching is allowed at any index.
  • Under windows, the native OsStr is (potentially malformed) utf16. Slicing in its wtf8 representation is only allowed at codepoint boundaries, and in the (exact) middle of codepoints that would have been represented as a surrogate pair in utf16. This way there is a 1:1 mapping to the native u16 array.

We can then either expose this definition as a stable guarantee for users, or just use it as a internal one with the guarantee that you can slice at all indices that the Pattern API returns.

Bad slicing positions should then just panic, just as with str.

@kennytm
Copy link
Member Author

kennytm commented Feb 24, 2018

@Kimundi Agreed. Since OsStr did not specify that the encoding must be WTF-8, I think it should better be an internal guarantee.

@Kimundi
Copy link
Member

Kimundi commented Feb 24, 2018

Great! Would you mind adding that as clarification to the RFC text?

@kennytm
Copy link
Member Author

kennytm commented Feb 25, 2018

Done! Added 8b1171c.

Copy link
Member

@Kimundi Kimundi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, the RFC looks good now in my opinion.

I would add two further advantages of the "use real index" alternative:

  • Its less confusing for the user, since any index you get out of an API is identical to an underlying byte position. The overlap of adjacent slices is also not hidden from you.
  • It simplifies the API somewhat, since a few places no longer need to to transform the index to/from the real byte offsets.

But those are minor nits.

@Kimundi
Copy link
Member

Kimundi commented Feb 27, 2018

And since there are no other comments, and we really should start making some progress on this, I'll be moving this to FCP asap. 😄

@SimonSapin: I don't think what is proposed here is too objectionable in regard to WTF-8, since the RFC doesn't expose its encoding directly.

@rfcbot fcp merge

@alexcrichton
Copy link
Member

hm...

@rfcbot fcp merge

@rfcbot
Copy link
Collaborator

rfcbot commented Feb 27, 2018

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added the proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. label Feb 27, 2018
Copy link
Contributor

@SimonSapin SimonSapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong +1 on the changes proposed. (And some remarks on the RFC itself.)

The high-level goal of WTF-8 is to provide an alternative representation (“encoding”) for [u16]. Ideally, everything that is possible to do with [u16] should be possible with OsStr-on-Windows. I think the changes proposed here are probably the best way to further that goal, given the design constraints.

I was mildly concerned by the escalation of complexity in the encoding, but this is mitigated by being an internal implementation detail whose details end users (hopefully) don’t need to care about. Maybe we would have been better off deciding pre-1.0 to make OsStr be [u16] on Windows instead, but now we’re stuck with OsStr being a superset of &str / UTF-8 on all platforms anyway.

The term “WTF-8” has been around with a precise specification for a while, so rather than modifying WTF-8 it makes sense to define a new encoding (with a new name) that is a superset. Kudos to @kennytm for coming up with OMG-WTF-8’s design and a well-written and well-illustrated specification.

/// let path = OsStr::new("/usr/bin/bash");
/// let range = path.rfind_range("/b");
/// assert_eq!(range, Some(8..10));
/// assert_eq!(path[range.unwrap()], OsStr::new("/bin"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the similar example in find_range look wrong. Shouldn’t slicing result in OsStr::new("/b")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

It is trivial to apply the pattern API to `OsStr` on platforms where it is just an `[u8]`. The main
difficulty is on Windows where it is an `[u16]` encoded as WTF-8. This RFC thus focuses on Windows.

We will generalize the encoding of `OsStr` to specify these two capabilities:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://rust-lang.github.io/rfcs/0517-io-os-reform.html#string-handling does mention WTF-8 by name, so this RFC should probably mention the OMG-WTF-8 encoding by name (as a definition of the encoding, not just a sample implementation) even if the encoding remains an implementation detail.

The implementation PR should also make sure to update internal documentation to say that the representation of OsStr on Windows becomes OMG-WTF-8, with a link to the encoding’s full specification.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned.


## Pattern API

This RFC assumes a generalized pattern API which supports more than strings. If the pattern API is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t quite understand what it means for an RFC to assume something. I’d prefer this RFC to propose one concrete API for Pattern and related traits (and maybe list alternatives in the Alternatives section).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've expanded this section and linked to the previous drafts. However, I prefer not to formally propose the pattern API here in this RFC as that is kinda out-of-scope.

@BurntSushi
Copy link
Member

Overall, this RFC looks great, but I think there might be an important question here that isn't covered. In particular, how are folks outside of std expected to implement the Pattern trait for &OsStr? As a concrete example, consider the regex crate. This isn't necessarily theoretical either, since glob matching often wants to happen on a &OsStr, for example. In particular, it looks like std gets the benefit of knowing the internal representation of an &OsStr, but outside crates do not.

What I do today (on Windows of course, the Unix case is easy) is lossily convert the &OsStr into a &str, which normally saves an allocation but does require the UTF-8 validation check. If we assume that's an acceptable cost, then I think my question is answered. But is it an acceptable cost? I've somewhat reasoned "yes" since we've already accepted the cost of transcoding UTF-8 to WTF-8.

@kennytm
Copy link
Member Author

kennytm commented Mar 13, 2018

@BurntSushi

Since we do not want to expose the WTF-8 buffer, this is probably not easy to impl Pattern<&OsStr> for OutsideStdType, although you may be able to cheat by considering the range Haystack::cursor_at_front(os_str) to Haystack::cursor_at_end(os_str) as a [u8]...

regex::bytes::Regex can easily implement Unix's Pattern<&OsStr>. However, for Windows it may not make sense.

regex::Regex could implement Pattern<&OsStr> portably, but it needs to properly handle the cases when non-UTF-8 bytes are encountered.

@BurntSushi
Copy link
Member

Since we do not want to expose the WTF-8 buffer, this is probably not easy to impl Pattern<&OsStr> for OutsideStdType, although you may be able to cheat by considering the range Haystack::cursor_at_front(os_str) to Haystack::cursor_at_end(os_str) as a [u8]...

Right, but that still requires me to assume WTF-8 in the first place.

To be clear, I do think that if this use case becomes important for performance, then we should recognize the people can and will depend on internal implementation details in the name of performance. Does that change how we approach this? If enough people do it, then what was an internal implementation detail will become de facto stabilized.

regex::bytes::Regex can easily implement Unix's Pattern<&OsStr>. However, for Windows it may not make sense.

Hmm could you unpack this? (I understand the Unix side of things. For the purposes of this discussion, Unix is a non-issue.)

regex::Regex could implement Pattern<&OsStr> portably, but it needs to properly handle the cases when non-UTF-8 bytes are encountered.

How does this work if I can't assume WTF-8?

@kennytm
Copy link
Member Author

kennytm commented Mar 14, 2018

To be clear, I do think that if this use case becomes important for performance, then we should recognize the people can and will depend on internal implementation details in the name of performance. Does that change how we approach this? If enough people do it, then what was an internal implementation detail will become de facto stabilized.

I see what you mean now. I think the situation now is no much different from https://internals.rust-lang.org/t/make-std-os-unix-ffi-osstrext-cross-platform/6277/5. Although this RFC focuses on WTF-8, the new methods do not assume (OMG-)WTF-8, even if it is very hard to find other encoding that satisfies all the requirements:

  1. OsStr::new() will not allocate. This rules out all encodings that is not a superset of UTF-8, e.g. CESU-8 cannot be used.
  2. OsStr::to_str() will not allocate if the input is valid UTF-8. This rules out all encodings that will not always store a valid Unicode string as UTF-8, e.g. a union of UTF-8 and UCS-2.

That does not mean an alternative encoding is impossible though. For instance we could slightly tweak WTF-8 to store the lone surrogate codepoints as ff d8 00 to ff df ff and everything should work just fine.

Therefore, unless the libs team thinks otherwise, I suppose we will maintain the status quo i.e. there is no safe way to access the underlying buffer. There's no way we can prevent the de facto WTF-8 assumption, but at least we could get away by claiming "THIS IS UNSAFE!!!1"?

This also means there won't be a stable non-std impl Pattern<&OsStr>.

regex::bytes::Regex can easily implement Unix's Pattern<&OsStr>. However, for Windows it may not make sense.

Hmm could you unpack this? (I understand the Unix side of things. For the purposes of this discussion, Unix is a non-issue.)

On Windows, an OsStr is logically an IntoIterator of u16, it would be strange to allow regex::bytes::Regex which targets &[u8] to support Windows's OsStr.

@BurntSushi
Copy link
Member

BurntSushi commented Mar 14, 2018

@kennytm Oh hmm, let me pop up a level. I'm less coming at this as "a libs team member who is strenuously objecting" and more as "let's make sure we walk into this with our eyes wide open." Of course, I agree that this isn't making the status quo worse, and more to the point, this new pattern API is a wonderful improvement over the status quo. I think my only real aim here is to get us to carefully consider that what is today an internal implementation detail, may tomorrow be a de facto public API, regardless of how insistent we are about it. I don't have the depth of understanding to know exactly what the ramifications of that are though!

More concretely, if the only thing that comes of my bleating is a couple sentences as a documentation somewhere, then I'm happy. :-)

@SimonSapin
Copy link
Contributor

Another reason I would be reluctant to have a public API that expose WTF-8 bytes, more than losing the option to change the encoding later, is the risk that people might not realize this is not UTF-8 and send those bytes over the network or write them to a file. Before we know it, WTF-8 might accidentally become a de-facto requirement for interoperably implementing some protocol or format.

@vitiral
Copy link

vitiral commented Mar 27, 2018

This is a tangential point to this RFC but I just wanted to mention that I developed STFU8 specifically for the use case of serializing/deserializing OsString (or any arbitrary u8/u16 array which can be mostly represented as valid UTF8). Rust could consider using it internally in the future on a to_stfu8() and from_stfu8() methods, although I would not recommend it ATM (due to how young/non-reviewed it is).

I'm a huge 👍 on this RFC.

unsafe fn range_to_self(hs: Self, start: Self::StartCursor, end: Self::EndCursor) -> Self;

// Since a StartCursor and EndCursor may not be comparable, we also need this method
fn is_range_empty(start: Self::StartCursor, end: Self::EndCursor) -> bool;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add something like type StartCursor: Copy + PartialOrd<Self::EndCursor> to handle this possibly?

@rfcbot
Copy link
Collaborator

rfcbot commented Mar 30, 2018

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. and removed proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. labels Mar 30, 2018
@rfcbot
Copy link
Collaborator

rfcbot commented Apr 9, 2018

The final comment period is now complete.

@Centril Centril merged commit 9c1609f into rust-lang:master Apr 9, 2018
@Centril
Copy link
Contributor

Centril commented Apr 9, 2018

Huzzah! 🎉 This RFC has been merged!

Tracking issue: rust-lang/rust#49802

@mina86
Copy link

mina86 commented Feb 14, 2023

Probably completely out of place but – &[u8] is also missing the pattern API or useful searching in general.

There is slight issue with that. str.starts_with("abc") test whether string starts with abc. str.starts_with(&['a', 'b', 'c']) tests whether string starts with one of the letters. What would str.as_bytes().starts_with(b"abc") do?

I actually started looking into implementing this RFC and meant to do patterns for &[u8] to ease myself into dealing with WTF-8 alas turns out the semantics aren’t obvious.

@Diggsey
Copy link
Contributor

Diggsey commented Feb 14, 2023

@mina86 I would say that str.as_bytes().starts_with(b"abc") should check if the string begins with "abc". The array => disjunction functionality present for strings is IMO too implicit and surprising. Instead we should introduce an AnyPattern(...) wrapper which can be used to disambiguate that case.

@mina86
Copy link

mina86 commented Feb 15, 2023

So I’ve been wondering. The RFC calls for StartCursor and EndCursor as separate types and then range_to_self which takes start and end cursors as arguments. Wouldn’t the design be simpler if instead Haystack had a split_at method? As far as I can see, this would get rid of the need for two separate cursor types. This would also be somewhat easier for Pattern to use actually since it wouldn’t need to convert between cursor types. Instead of:

fn strip_prefix_of(self, haystack: H) -> Option<H> {
    if let SearchStep::Match(start, end) = self.into_searcher(haystack).next() {
        let start = haystack.start_from_end_cursor(end);
        Some(unsafe { haystack.range_to_self(start, haystack.cursor_at_back()) })
    } else {
        None
    }
}

fn strip_suffix_of(self, haystack: H) -> Option<H>
where Self::Searcher: ReverseSearcher<H> {
    if let SearchStep::Match(start, end) = self.into_searcher(haystack).next_back() {
        let end = haystack.end_from_start_cursor(start);
        Some(unsafe { haystack.get_range_unchecked(haystack.front(), end) })
    } else {
        None
    }
}

it would be:

fn strip_prefix_of(self, haystack: H) -> Option<H> {
    if let SearchStep::Match(start, end) = self.into_searcher(haystack).next() {
        Some(unsafe { haystack.split_at_cursor_unchecked(end) }.1;
    } else {
        None
    }
}

fn strip_suffix_of(self, haystack: H) -> Option<H>
where Self::Searcher: ReverseSearcher<H> {
    if let SearchStep::Match(start, end) = self.into_searcher(haystack).next_back() {
        Some(unsafe { haystack.split_at_cursor_unchecked(start) }.0;
    } else {
        None
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-needle Needle API related proposals & ideas final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.