Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OS string string-like interface #1309

Closed
wants to merge 3 commits into from

Conversation

wthrowe
Copy link

@wthrowe wthrowe commented Oct 6, 2015

@wthrowe wthrowe changed the title Add RFC for OS string string-like interface OS string string-like interface Oct 6, 2015
@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Oct 6, 2015
@Stebalien
Copy link
Contributor

See rust-lang/rust#26499 (related discussion)

@wthrowe
Copy link
Author

wthrowe commented Oct 6, 2015

Thanks. I'd missed that discussion.

starts_with and ends_with (and contains) could certainly be implemented for general OsStr arguments, although I don't think the straight byte comparison proposed there handles WTF-8 unpaired surrogates correctly.

That also suggests the question of whether these functions should take things like &str or things like T: AsRef<str>. I see std interfaces going both ways. Which is preferred for new interfaces?

@alexcrichton alexcrichton self-assigned this Oct 8, 2015
```rust
/// Returns true if the string starts with a valid UTF-8 sequence
/// equal to the given `&str`.
fn starts_with_str(&self, prefix: &str) -> bool;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding string-like APIs to OsStr, it may be best to try to stick to original str APIs as much as possible, while also allowing all kinds of fun functionality for OsStr. Along those lines, perhaps this API could look like:

fn starts_with<S: AsRef<OsStr>>(&self, prefix: S) -> bool;

That should cover this use case as well as something like os_str.starts_with(&other_os_string).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable. I'll make that change.

@alexcrichton
Copy link
Member

Thanks for the RFC @wthrowe! It does seem high time that we start expanding the API for OsStr, so thanks for pushing on it!

My thoughts on this in the past have been primarily along the lines of:

  • All extensions should mirror str in one form or another wherever possible
  • Questions about encoding and such tend to work best if we can somehow sidestep the entire question (depends on the question at hand)
  • The encoding question may end up meaning that some str APIs aren't quite appropriate for OS strings, but it's certainly a space to explore!

@wthrowe
Copy link
Author

wthrowe commented Oct 8, 2015

I'll look into writing up equivalents of some of the more general pattern matching str methods and replacing things those make obsolete. The only difficulty I can think of immediately is figuring out what patterns that match the empty string should do.

@alexcrichton
Copy link
Member

Yeah I agree that str.split("") has a pretty good meaning whereas os_str.split("") may not. This may not be able to go the route of a "full Pattern trait" just yet which is totally fine, just food for thought!


This is analogous to the existing `OsStr::to_string_lossy` method, but
transfers ownership. This operation can be done without a copy if the
`OsString` contains UTF-8 data or if the platform is Windows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading between lines I think you already know that, but WTF-8 to UTF-8 conversion can be done in place: https://simonsapin.github.io/wtf-8/#converting-wtf-8-utf-8

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. In fact, the internal WTF-8 implementation in libstd already has an into_string_lossy that does the right thing. It just isn't exposed at the OsString level.

Also adds more explanation of how OS strings are interpreted.
This seems to not prevent anything actually useful and avoids
confusion.  Also matches `str` better.
@wthrowe
Copy link
Author

wthrowe commented Oct 17, 2015

New version, now with more patterns!

The previous proposed functions have mostly been replaced with a new interface matching str as closely as possible, but with a few new OsStr-specific operations.

I did decide to not propose slice_shift_char in the end, as I think I'd prefer to go the route of proposing more general functions for both str and OsStr, probably along the lines of fn split_prefix<P: Pattern<'a>>(&'a self, pat: P) -> Option<(&'a str, &'a Self)>, kind of like the current str::split_at. (slice_shift_char is then split_prefix(|_| true), except for some minor type differences.) (The idea that this would also be nice for str was pointed out by @SimonSapin in some line comments above.) It does make this proposal less useful in isolation, but I think it's a better direction in the long run.

The main uncertainty in the new version is what to do with patterns that match the empty string. The non-iterator functions (like contains) have pretty obvious interpretations, but the iterator ones (like matches) don't. I've listed a few possibilities in the text, but I think some more discussion is needed on this.

@wthrowe
Copy link
Author

wthrowe commented Oct 17, 2015

One more thought on patterns: With more complicated implementors of Pattern (like regular expressions or something) the concept of "matches the empty string" is not even terribly meaningful. A pattern might match some empty strings but not others, depending on context. The most reasonable idea might be to just declare that patterns are matched against each Unicode section separately. (And maybe at the beginning and end of the string in any case, since it would feel odd for .starts_with("") to ever give false.)


```rust
/// Returns true if `needle` is a substring of `self`.
fn contains_os<S: AsRef<OsStr>>(&self, needle: S) -> bool;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be a bit onerous to remember that contains and contains_os are both methods on an OsStr. I think we'd be in a better place (e.g. especially in mirroring str) if we could avoid extra methods like this. It may involve perhaps a new Pattern trait down below, but we may also be able to substitute AsRef<OsStr> for Pattern because str is in theory far more ubiquitous than OsStr

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some kind of OsPattern trait implemented for P: Pattern and OsStr could probably work. (Implementing for AsRef<OsStr> as well would be forbidden by coherence, I believe.) There might be some trickiness due to starts_with and contains having different bounds. Have to think about it.

Edit: I think that can be dealt with by bounds on the trait methods.

No reason such a thing couldn't be used for replace as well, except that for some reason str doesn't accept patterns there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I forget the exact reason that replace on strings doesn't take a pattern, but I think there may be a good reason? (cc @Kimundi)

Otherwise yeah having an OsPattern trait seems not-too-bad here perhaps

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a possible rough design:

Trait OsPattern<'a> is basically identical to Pattern<'a> except that the strs become OsStrs, the Searcher is bound on OsSearcher<'a>, and is_contained_in has an extra bound (more on that later).

Safe trait OsSearcher<'a> only has methods haystack (same as Searcher<'a>) and is_prefix_of.

Safe trait ReverseOsSearcher<'a> requires OsSearcher<'a> and adds is_suffix_of. OsStr::ends_with requires this bound.

Safe trait FullOsSearcher<'a> requires OsSearcher<'a> and adds is_contained_in. OsStr::contains requires this bound.

If we additionally want to support replace with patterns, then we also have:
Unsafe trait IndexedOsSearcher<'a> requires FullOsSearcher<'a> and adds the remainder of the Searcher<'a> methods, except that all the usize returns are changed to OsIndex<'a>s. OsStr::replace requires this bound.

Enum OsIndex<'a> has variants Unicode(&'a str, usize) and NonUnicode(&'a OsStr, NonUnicodeIndex). The first field in each variant is expected to be a substring of the haystack, and the second is an index into that substring. NonUnicodeIndex is a struct with basically no public interface. (OS-specific interfaces may be added later.)

If we decide to add methods like the split_prefix mentioned above we will likely need to add more variants on the OsSearcher trait. Ideally that would be figured out before any of this is stabilized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah in the limit I think this may end up duplicating the API surface area of the string pattern traits, but there's also a question as to whether this needs to be so full-blown just yet. In theory OS strings are used much more rarely than regular strings (especially for various text manipulation routines), so we may be able to get by with a much smaller API surface area.

I wonder if perhaps this could expose a relatively straightforward, if not as generic, API today which leaves room to this sort of expansion in the future but doesn't take the leap quite just yet?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reflecting on this, I think at the very least IndexedOsSearcher is silly, because I can't think of any case where one would implement FullOsSearcher but not it, so they can be merged.

The only reasonable simplification of the API that I can think of that can be generalized to something like this is to just bound on the Pattern family of traits. Then there are no new traits needed, and everything works except for starts_with, ends_with, contains, and replace with an OsStr. That's probably not too bad as a restriction, and I believe changing to OsPattern at some later time would be fully backwards compatible.

@aturon aturon assigned Kimundi and unassigned alexcrichton Mar 2, 2016
@alexcrichton
Copy link
Member

I've unfortunately not been able to find much time to allocate to this RFC, so @Kimundi (who originally developed the pattern API for strings) is gonna take over this.

@Kimundi
Copy link
Member

Kimundi commented Mar 15, 2016

Hi, just wanted to say that haven't forgotten about this RFC, but I'm currently investigating how a fully general pattern API (for arbitrary slice types) would look like, and whether there might be incompatibilities or method name clashes with this RFC.

@Diggsey
Copy link
Contributor

Diggsey commented Apr 6, 2016

@Kimundi Thanks for the update - could really do with this functionality!

@Kimundi
Copy link
Member

Kimundi commented Apr 24, 2016

Update:

I have a incomplete, undocumented prototype at https://github.com/Kimundi/rust_pattern_api_v2 that provides the same Pattern API + iterators for both & and &mut of str, [T], and OsStr.

I need to find time to do a proper writeup for my findings, but in regard to this RFC the cliff notes are:

  • There are two separate Pattern API's for OsStr:
    • one for searching sub-str slices which is identical to the str API: Patterns str, char, char predicates.
    • one for searching sub-OsStr slices with Patterns OsStr, str, char, char predicates.
      • searching for OsStr patterns has the edge case of needing to reject patterns on windows that start/end with the wrong kind of leading/trailing surrogate, since the Pattern API would not always be able to return slices to the haystack string that contains them. The alternative to a panic/err result would be to make the Pattern impl return Cow<OsStr> values.
  • If both are provided in some capacity, like in this RFC, they should probably be provided consistently and uniformly.
    • Eg. there should for example be symmetrical sets of Pattern-using foo and foo_os methods like contains() and contains_os(), instead of the proposed mix of, eg, contains() that works similar to taking a OsStr Pattern vs starts_with() taking a str Pattern.
  • The "unicode embedded in OsStr" behavior seems to be implementable without going through the split_unicode layer and requiring cloneable Patterns.

@alexcrichton
Copy link
Member

The libs team discussed this RFC recently and the conclusion was that while we'd like to implement something along these lines the RFC will need much of a revamp now and we unfortunately haven't been able to reach the author, so we're going to close. We're of course quite willing to entertain RFCs in this area though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
postponed RFCs that have been postponed and may be revisited at a later time. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants