-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Extend Pattern API to OsStr #2295
Conversation
Probably completely out of place but – |
Thank you thank you thank you! @SimonSapin @Kimundi can one or both of you please provide a review of this RFC? |
cc @rust-lang/libs this has been a very longstanding issue. Can we consider landing this API as unstable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a good (if somewhat hacky) way to solve the surrogate point issue in doing pattern searches in OsStr
. 👍
Short summary: This RFC essentially proposes two things:
- The ability to slice an
OsStr
. - An modification of the semantics of
OsStr
-slices under Windows, that allows them to represent unpaired surrogate codepoints by pointing at only a parts of the utf8 encoding of a regular codepoint.
Both are prerequisites to extending the Pattern
API, or other searching algorithms, to allow searching patterns inside an OsStr
without introducing restrictions on the data the API could handle.
This simplifies the work needed for an follow-up RFC to actually extend the Pattern
API to other types like OsStr
or [T]
.
However, it is not clear to me why the proposed semantic for slicing an OsStr
is as it is. Assuming we have the codepoint "\u{10000}"
and want to split it into the two surrogates "\u{d800}"
and "\u{dc00}"
. Then it seems there are two ways to expose the slicing operations:
-
Hide the proposed scheme behind virtual byte positions (What this RFC proposes).
OsStr("\u{10000}") = [u8]("\u{10000}") = f0 90 80 80 OsStr("\u{10000}")[..2] = [u8]("\u{10000}")[..3] = f0 90 80 OsStr("\u{10000}")[2..] = [u8]("\u{10000}")[1..] = 90 80 80
-
Expose the proposed scheme, and always slice at real byte positions.
OsStr("\u{10000}") = [u8]("\u{10000}") = f0 90 80 80 OsStr("\u{10000}")[..3] = [u8]("\u{10000}")[..3] = f0 90 80 OsStr("\u{10000}")[1..] = [u8]("\u{10000}")[1..] = 90 80 80
It seems the main advantage of the first scheme is that there is exactly one slice position that indicates the middle of a surrogate pair. But it is not clear to me if this property is actually needed, since the low level implementations of OsStr
searching will deal with the actual byte positions anyway.
I guess it amounts to the question of whether you can take the index returned by, eg, OsStr::find()
and use it both as an position to slice-to, and a position to slice-from. The first scheme would allow both, the latter only one of them.
Besides from that, it is also not clear which slice positions should be legal in general across windows and unix. Again I can image two schemes:
- Allow slicing at each position that would be a legal index in the native format of an
OsStr
. This means only valid codepoint and half-codepoint positions on windows, and arbitrary byte positions on unix. - Allow slicing only at valid unicode boundaries. This means only valid codepoint and half-codepoint positions on windows, and utf8 boundaries on unix.
- Don't define which positions are legal in general, just that indices returned by the std lib are valid.
Lastly, I think we should also be sure that changing the equality semantic of OsStr
s under windows does not impact performance too much, or break existing code, since its afaik the only aspect of the proposal that changes an existing stable API.
@Kimundi Thanks for the review 😊 Yes let s = OsStr::new("\u{10000}");
assert_eq!(s.len(), 4);
let index = s.find('\u{dc00}').unwrap();
assert_eq!(index, 1); // if we "Expose the proposed scheme"
let right = &s[index..]; // this will be [90 80 80], fine
let left = &s[..index]; // but this will be [f0] ?????? This also relates to the next question "which slice positions should be legal". We could
The second choice is no better than what this RFC proposed. There would also be the question in the meaning of The first choice means if you want the My preference about legal slice positions would be:
Because we support |
Makes sense! So, would you agree that we could define the slicing semantic as:
We can then either expose this definition as a stable guarantee for users, or just use it as a internal one with the guarantee that you can slice at all indices that the Bad slicing positions should then just panic, just as with |
@Kimundi Agreed. Since |
Great! Would you mind adding that as clarification to the RFC text? |
1758742
to
840e821
Compare
840e821
to
8b1171c
Compare
Done! Added 8b1171c. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, the RFC looks good now in my opinion.
I would add two further advantages of the "use real index" alternative:
- Its less confusing for the user, since any index you get out of an API is identical to an underlying byte position. The overlap of adjacent slices is also not hidden from you.
- It simplifies the API somewhat, since a few places no longer need to to transform the index to/from the real byte offsets.
But those are minor nits.
And since there are no other comments, and we really should start making some progress on this, I'll be moving this to FCP asap. 😄 @SimonSapin: I don't think what is proposed here is too objectionable in regard to WTF-8, since the RFC doesn't expose its encoding directly. @rfcbot fcp merge |
hm... @rfcbot fcp merge |
Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams: No concerns currently listed. Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strong +1 on the changes proposed. (And some remarks on the RFC itself.)
The high-level goal of WTF-8 is to provide an alternative representation (“encoding”) for [u16]
. Ideally, everything that is possible to do with [u16]
should be possible with OsStr
-on-Windows. I think the changes proposed here are probably the best way to further that goal, given the design constraints.
I was mildly concerned by the escalation of complexity in the encoding, but this is mitigated by being an internal implementation detail whose details end users (hopefully) don’t need to care about. Maybe we would have been better off deciding pre-1.0 to make OsStr
be [u16]
on Windows instead, but now we’re stuck with OsStr
being a superset of &str
/ UTF-8 on all platforms anyway.
The term “WTF-8” has been around with a precise specification for a while, so rather than modifying WTF-8 it makes sense to define a new encoding (with a new name) that is a superset. Kudos to @kennytm for coming up with OMG-WTF-8’s design and a well-written and well-illustrated specification.
text/0000-os-str-pattern.md
Outdated
/// let path = OsStr::new("/usr/bin/bash"); | ||
/// let range = path.rfind_range("/b"); | ||
/// assert_eq!(range, Some(8..10)); | ||
/// assert_eq!(path[range.unwrap()], OsStr::new("/bin")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the similar example in find_range
look wrong. Shouldn’t slicing result in OsStr::new("/b")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
text/0000-os-str-pattern.md
Outdated
It is trivial to apply the pattern API to `OsStr` on platforms where it is just an `[u8]`. The main | ||
difficulty is on Windows where it is an `[u16]` encoded as WTF-8. This RFC thus focuses on Windows. | ||
|
||
We will generalize the encoding of `OsStr` to specify these two capabilities: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://rust-lang.github.io/rfcs/0517-io-os-reform.html#string-handling does mention WTF-8 by name, so this RFC should probably mention the OMG-WTF-8 encoding by name (as a definition of the encoding, not just a sample implementation) even if the encoding remains an implementation detail.
The implementation PR should also make sure to update internal documentation to say that the representation of OsStr
on Windows becomes OMG-WTF-8, with a link to the encoding’s full specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioned.
text/0000-os-str-pattern.md
Outdated
|
||
## Pattern API | ||
|
||
This RFC assumes a generalized pattern API which supports more than strings. If the pattern API is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t quite understand what it means for an RFC to assume something. I’d prefer this RFC to propose one concrete API for Pattern
and related traits (and maybe list alternatives in the Alternatives section).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've expanded this section and linked to the previous drafts. However, I prefer not to formally propose the pattern API here in this RFC as that is kinda out-of-scope.
Overall, this RFC looks great, but I think there might be an important question here that isn't covered. In particular, how are folks outside of What I do today (on Windows of course, the Unix case is easy) is lossily convert the |
Since we do not want to expose the WTF-8 buffer, this is probably not easy to
|
Right, but that still requires me to assume WTF-8 in the first place. To be clear, I do think that if this use case becomes important for performance, then we should recognize the people can and will depend on internal implementation details in the name of performance. Does that change how we approach this? If enough people do it, then what was an internal implementation detail will become de facto stabilized.
Hmm could you unpack this? (I understand the Unix side of things. For the purposes of this discussion, Unix is a non-issue.)
How does this work if I can't assume WTF-8? |
I see what you mean now. I think the situation now is no much different from https://internals.rust-lang.org/t/make-std-os-unix-ffi-osstrext-cross-platform/6277/5. Although this RFC focuses on WTF-8, the new methods do not assume (OMG-)WTF-8, even if it is very hard to find other encoding that satisfies all the requirements:
That does not mean an alternative encoding is impossible though. For instance we could slightly tweak WTF-8 to store the lone surrogate codepoints as Therefore, unless the libs team thinks otherwise, I suppose we will maintain the status quo i.e. there is no safe way to access the underlying buffer. There's no way we can prevent the de facto WTF-8 assumption, but at least we could get away by claiming "THIS IS UNSAFE!!!1"? This also means there won't be a stable non-std
On Windows, an |
@kennytm Oh hmm, let me pop up a level. I'm less coming at this as "a libs team member who is strenuously objecting" and more as "let's make sure we walk into this with our eyes wide open." Of course, I agree that this isn't making the status quo worse, and more to the point, this new pattern API is a wonderful improvement over the status quo. I think my only real aim here is to get us to carefully consider that what is today an internal implementation detail, may tomorrow be a de facto public API, regardless of how insistent we are about it. I don't have the depth of understanding to know exactly what the ramifications of that are though! More concretely, if the only thing that comes of my bleating is a couple sentences as a documentation somewhere, then I'm happy. :-) |
Another reason I would be reluctant to have a public API that expose WTF-8 bytes, more than losing the option to change the encoding later, is the risk that people might not realize this is not UTF-8 and send those bytes over the network or write them to a file. Before we know it, WTF-8 might accidentally become a de-facto requirement for interoperably implementing some protocol or format. |
This is a tangential point to this RFC but I just wanted to mention that I developed STFU8 specifically for the use case of serializing/deserializing I'm a huge 👍 on this RFC. |
text/0000-os-str-pattern.md
Outdated
unsafe fn range_to_self(hs: Self, start: Self::StartCursor, end: Self::EndCursor) -> Self; | ||
|
||
// Since a StartCursor and EndCursor may not be comparable, we also need this method | ||
fn is_range_empty(start: Self::StartCursor, end: Self::EndCursor) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add something like type StartCursor: Copy + PartialOrd<Self::EndCursor>
to handle this possibly?
🔔 This is now entering its final comment period, as per the review above. 🔔 |
The final comment period is now complete. |
Huzzah! 🎉 This RFC has been merged! Tracking issue: rust-lang/rust#49802 |
There is slight issue with that. I actually started looking into implementing this RFC and meant to do patterns for |
@mina86 I would say that |
So I’ve been wondering. The RFC calls for StartCursor and EndCursor as separate types and then range_to_self which takes start and end cursors as arguments. Wouldn’t the design be simpler if instead Haystack had a split_at method? As far as I can see, this would get rid of the need for two separate cursor types. This would also be somewhat easier for Pattern to use actually since it wouldn’t need to convert between cursor types. Instead of: fn strip_prefix_of(self, haystack: H) -> Option<H> {
if let SearchStep::Match(start, end) = self.into_searcher(haystack).next() {
let start = haystack.start_from_end_cursor(end);
Some(unsafe { haystack.range_to_self(start, haystack.cursor_at_back()) })
} else {
None
}
}
fn strip_suffix_of(self, haystack: H) -> Option<H>
where Self::Searcher: ReverseSearcher<H> {
if let SearchStep::Match(start, end) = self.into_searcher(haystack).next_back() {
let end = haystack.end_from_start_cursor(start);
Some(unsafe { haystack.get_range_unchecked(haystack.front(), end) })
} else {
None
}
} it would be: fn strip_prefix_of(self, haystack: H) -> Option<H> {
if let SearchStep::Match(start, end) = self.into_searcher(haystack).next() {
Some(unsafe { haystack.split_at_cursor_unchecked(end) }.1;
} else {
None
}
}
fn strip_suffix_of(self, haystack: H) -> Option<H>
where Self::Searcher: ReverseSearcher<H> {
if let SearchStep::Match(start, end) = self.into_searcher(haystack).next_back() {
Some(unsafe { haystack.split_at_cursor_unchecked(start) }.0;
} else {
None
}
} |
Supersedes the "Pattern API" part of RFC #1309.
omgwtf8
cc #900.
cc @Kimundi (#528 "Pattern API 1.0")
cc @SimonSapin (WTF-8)