Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a StringSegment type #22004

Closed
terrajobst opened this issue May 26, 2017 · 23 comments
Closed

Add a StringSegment type #22004

terrajobst opened this issue May 26, 2017 · 23 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Runtime
Milestone

Comments

@terrajobst
Copy link
Member

ASP.NET Core has a type called StringSegment. It allows substring operations on string to not allocate but return segments using offset and length.

It's the spiritual equivalent of ArraySegment<T> for string. Note that Span<T> will also allow slicing, but spans have different type system constraints that not all consumers can -- or want to -- opt into.

Proposed API Shape

Starting point is the existing API shape from ASP.NET. We should, however, diff it against the latest surface of string.

namespace System.Text
{
    public struct StringSegment : IEquatable<StringSegment>, IEquatable<string>
    {
        public static readonly StringSegment Empty;
        public StringSegment(string buffer);
        public StringSegment(string buffer, int offset, int length);
        public string Buffer { get; }
        public int Offset { get; }
        public int Length { get; }
        public string Value { get; }
        public bool HasValue { get; }
        public char this[int index] { get; }
        public override bool Equals(object obj);
        public bool Equals(StringSegment other);
        public bool Equals(StringSegment other, StringComparison comparisonType);
        public static bool Equals(StringSegment a, StringSegment b, StringComparison comparisonType);
        public bool Equals(string text);
        public bool Equals(string text, StringComparison comparisonType);
        public override int GetHashCode();
        public static bool operator ==(StringSegment left, StringSegment right);
        public static bool operator !=(StringSegment left, StringSegment right);
        public static implicit operator StringSegment(string value);
        public bool StartsWith(string text, StringComparison comparisonType);
        public bool EndsWith(string text, StringComparison comparisonType);
        public string Substring(int offset);
        public string Substring(int offset, int length);
        public StringSegment Subsegment(int offset);
        public StringSegment Subsegment(int offset, int length);
        public int IndexOf(char c, int start, int count);
        public int IndexOf(char c, int start);
        public int IndexOf(char c);
        public int IndexOfAny(char[] anyOf, int startIndex, int count);
        public int IndexOfAny(char[] anyOf, int startIndex);
        public int IndexOfAny(char[] anyOf);
        public int LastIndexOf(char value);
        public StringSegment Trim();
        public StringSegment TrimStart();
        public StringSegment TrimEnd();
        public StringTokenizer Split(char[] chars);
        public static bool IsNullOrEmpty(StringSegment value);
        public override string ToString();
    }
}

Other APIs

In order to make the API useful, we could also define methods on String that allow returning segments, e.g.

namespace String
{
    public partial class String
    {
        public StringSegment SubSegment(int startIndex)   // or Subsegment
        public StringSegment SubSegment(int startIndex, int length)
    }
}
@terrajobst
Copy link
Member Author

@davidfowl
Copy link
Member

We should add an implicit conversion to Span<char> as well.

@khellang
Copy link
Member

It's really unfortunate that there needs to be so many different types to essentially do the same thing. Is it really necessary?

Note that Span<T> will also allow slicing, but spans have different type system constraints that not all consumers can -- or want to -- opt into.

Can you expand on these constraints? Are you talking about stack-only?

@javiercn
Copy link
Member

I would also consider things like StringTokenizer https://github.com/aspnet/Common/blob/7c7031f145413221be316719683edb0bf1b9bf40/src/Microsoft.Extensions.Primitives/StringTokenizer.cs

In general, any abstraction that allows classic string operations but without allocating.

@jnm2
Copy link
Contributor

jnm2 commented Sep 15, 2017

Your proposed API shape includes StringTokenizer:

    public StringTokenizer Split(char[] chars);

Is that intentional?

@jnm2
Copy link
Contributor

jnm2 commented Sep 15, 2017

So I'm trying to get a picture: if I want to do slicing I'll always use ReadOnlySpan<char> up to the point where I find I need to put it on the heap; then I'll refactor everything to use StringSegment rather than ReadOnlySpan<char> up to that point (because you can't convert Span to Segment without allocating).
That would make it desirable for the instance and extension methods on ReadOnlySpan<char> and StringSegment to match as much as possible to make the switch easy.

Then, when I bring the StringSegment back onto the stack to use it, I'll convert it to a ReadOnlySpan<char> before doing any further processing on it because it has speed benefits (?) over StringSegment.

How close am I?

@danmoseley
Copy link
Member

Cc @ stephentoub

@danmoseley
Copy link
Member

@stephentoub , rather.

@stephentoub
Copy link
Member

@KrzysztofCwalina, could we make ReadOnlyMemory<char> work with string efficiently? I have to believe that if we were starting over today, we wouldn't have both Memory<T> and ArraySegment<T> and would just have the former, so if StringSegment is the correlary to ArraySegment<T>, seems logical we'd just use ReadOnlyMemory<char>... but maybe that's too complicated from an implementation perspective?

Regardless of whether we end up using ReadOnlyMemory<char> or a new StringSegment though, it makes sense to have one of them, given the ByRefLike-constraints on ReadOnlySpan<char>. And as with Memory<T> vs Span<T>, I expect the guidance would be that methods should accept as an argument the lowest-level type they're capable of accepting. string, char[], Span<char>, ArraySegment<char>, Memory<char>, ReadOnlyMemory<char>, and so on are all convertible to ReadOnlySpan<char>, so if you can accept one of those, you're going to be compatible with as many inputs as possible. But if you're asynchronous or an iterator or otherwise would need to store the input onto the heap or use it as a generic parameter or other things that would violate the ByRefLike constraints, you'd need to instead accept a StringSegment or a ReadOnlyMemory<char>, whichever we landed on.

@KrzysztofCwalina
Copy link
Member

I think it boils down to whether we are ok with this segment type being subject to struct tearing.

@stephentoub
Copy link
Member

@KrzysztofCwalina, can you elaborate? Both StringSegment and ReadOnlyMemory are multi-field structs, both subject to tearing. Or are you saying you'd make StringSegment a class?

@KrzysztofCwalina
Copy link
Member

Sorry, I was just first trying to decide between span like type vs memory like type.

@benaadams
Copy link
Member

StringSegment has is usefulness over ReadOnlyMemory<char> when you need to give string, offset, length to an apis that don't understand ROM and wants strings as converting back to string becomes allocating and expensive.

However should be conversion from StringSegment -> ReadOnlyMemory<char>

@stephentoub
Copy link
Member

StringSegment has is usefulness over ReadOnlyMemory when you need to give string, offset, length to an apis that don't understand ROM and wants strings as converting back to string becomes allocating and expensive.

Maybe, but ReadOnlyMemory<char> already has DangerousTryGetArray for getting out the underlying array/offset/length. If ReadOnlyMemory<char> had support for string, you could imagine an API for extracting an object/offset/length as well, where object could be an array, a string, an OwnedMemory, etc.

@benaadams
Copy link
Member

That could work :)

@davkean
Copy link
Member

davkean commented Oct 5, 2017

Side note, I'd guestimate that in building large solutions 30% - 40% of all MSBuild string allocations are short-lived and would greatly benefit from a StringSegment-like type. I'm prototyping some changes and have just directly grabbed ASP.NET's version of this to see how we can benefit from this.

To full see the benefit here for us - I'd like to see allocation-free Path APIs along with this; both MSBuild and VS make large amount of short-lived strings via Path.GetDirectoryName, Path.GetExtension, etc that end up just being compared or combined with other strings.

@davkean
Copy link
Member

davkean commented Oct 5, 2017

@JeremyKuhne for the Path-related request ^

@benaadams
Copy link
Member

Is this now covered by ReadOnlyMemory<char> and ReadOnlySpan<char>?

@stephentoub
Copy link
Member

I believe so.

@karelz
Copy link
Member

karelz commented May 10, 2018

Closing then

@karelz karelz closed this as completed May 10, 2018
@davkean
Copy link
Member

davkean commented May 11, 2018

Is there a diff that shows what APIs we ended up? I'm interested in making sure we have string-like APIs and equivalent APIs on Path.

@Tratcher
Copy link
Member

One demonstrative excersise would be to convert the internals of https://github.com/aspnet/Common/blob/dev/src/Microsoft.Extensions.Primitives/StringSegment.cs and see if all of the functionality is available.

@stephentoub
Copy link
Member

One demonstrative excersise would be to convert the internals of https://github.com/aspnet/Common/blob/dev/src/Microsoft.Extensions.Primitives/StringSegment.cs and see if all of the functionality is available.

From a quick skim, I expect all of the needed functionality is there with the exception of:

  • Split. There's an issue open at https://github.com/dotnet/corefx/issues/26528.
  • Trim. There are Trim* methods for ReadOnlySpan<char> but not for ReadOnlyMemory<char>. In most cases, span-based methods are sufficient, e.g. IndexOf, because you can just get the memory's span, but in the case of Trim, it returns back a slice of the input, and it's not trivial to get the memory slice from the trimmed memory's span.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 22, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Runtime
Projects
None yet
Development

No branches or pull requests