Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Add Regex.Enumerate(ReadOnlySpan<char>) which is allocation free #65011

Closed
joperezr opened this issue Feb 8, 2022 · 4 comments · Fixed by #67794
Closed

[API Proposal]: Add Regex.Enumerate(ReadOnlySpan<char>) which is allocation free #65011

joperezr opened this issue Feb 8, 2022 · 4 comments · Fixed by #67794
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Milestone

Comments

@joperezr
Copy link
Member

joperezr commented Feb 8, 2022

This issue is breaking down one part of this original proposal: #59629

cc: @stephentoub

Background and motivation

We have an ongoing Regex investment for .NET 7 which is adding span-based matching APIs that would incur in allocation free operations. The ones being worked on in #59629 encompass only the IsMatch overloads, which are APIs where the caller only cares about finding if the input is a match or not, but don't actually want the Match object back. We can't really provide APIs that work over ReadOnlySpan<char>, are alloc-free and return a Match object since the Match object holds the string that matched with the captures, and because Match are Object types, they can't hold a span as a field.

Enumerate would not permit access to the full list of groups and captures, just the index/offset of the top-level capture, but in doing so, these Enumerate methods can become amortized zero-alloc: the enumerator is a ref struct, no objects are yielded, the input is a span, and the matching engine can reuse the internal Match object (and its supporting arrays) just as is done today with IsMatch to make it ammortized zero-alloc. If someone still needs the full details, they can fall back to using strings to begin with and the existing either Match or Matches, or (for some patterns, e.g. ones that don’t have anchors or lookaheads or lookbehinds that might go beyond the matching boundaries) re-run the engine with Match(matchString) just on the string representing the area of the input that matched. (The trouble with adding Regex.Match/Matches overloads for spans is the Match and MatchCollection types can’t store a Span; thus various surface area on these types couldn’t function with spans, like NextMatch… if we were to accept that, we could add span-based methods for those as well, but it would likely be confusing and inconsistent).

API Proposal

namespace System.Text.RegularExpressions
{
+    public readonly ref struct ValueMatch
+    {
+        public Range Range { get { throw null; } }
+    }
    public class Regex : ISerializable
    {
+        public ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options, TimeSpan matchTimeout) { throw null; }
+        public ref struct ValueMatchEnumerator
+        {
+            public readonly ValueMatchEnumerator GetEnumerator() { throw null; }
+            public bool MoveNext() { throw null; }
+            public readonly ValueMatch Current { get { throw null; } }
+        }
    }
}

API Usage

Regex regex = new Regex(@"\b\w+\b");
ReadOnlySpan<char> span = loremIpsum.AsSpan();
int lowerCaseWords = 0;
foreach (ValueMatch word in regex.EnumerateMatches(span))
{
    if (span[word.Range][0] >= 'a' && span[word.Range][0] <= 'z')
    {
        lowerCaseWords++;
    }
}

Alternative Designs

One of the topics for discussion here is that the fact that the Enumerate method won't give access to capture data might be confusing for consumers, as it is not plain and obvious that the intention of this method is to be allocation free, so it is possible that consumers might expect to get a Match enumerable back. One of the ideas that have been suggested by @stephentoub is that we could also provide ref struct versions of Match, Group and Capture which basically would reference the ReadOnlySpan<char> and that would be what is returned by Enumerate instead, which would make the return value be more intuitive and more what consumers might expect.

Risks

We want to make sure that by introducing this new API, we don't introduce confusion on whether this should be used over some other existing API like Matches which already returns an enumerable of Match objects back. We have to make sure that we make it obvious when one should be used over the other.

@joperezr joperezr added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.RegularExpressions labels Feb 8, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 8, 2022
@ghost
Copy link

ghost commented Feb 8, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

This issue is breaking down one part of this original proposal: #59629

We have an ongoing Regex investment for .NET 7 which is adding span-based matching APIs that would incur in allocation free operations. The ones being worked on in #59629 encompass only the IsMatch overloads, which are APIs where the caller only cares about finding if the input is a match or not, but don't actually want the Match object back. We can't really provide APIs that work over ReadOnlySpan<char>, are alloc-free and return a Match object since the Match object holds the string that matched with the captures, and because Match are Object types, they can't hold a span as a field.

Enumerate would not permit access to the full list of groups and captures, just the index/offset of the top-level capture, but in doing so, these Enumerate methods can become amortized zero-alloc: the enumerator is a ref struct, no objects are yielded, the input is a span, and the matching engine can reuse the internal Match object (and its supporting arrays) just as is done today with IsMatch to make it ammortized zero-alloc. If someone still needs the full details, they can fall back to using strings to begin with and the existing either Match or Matches, or (for some patterns, e.g. ones that don’t have anchors or lookaheads or lookbehinds that might go beyond the matching boundaries) re-run the engine with Match(matchString) just on the string representing the area of the input that matched. (The trouble with adding Regex.Match/Matches overloads for spans is the Match and MatchCollection types can’t store a Span; thus various surface area on these types couldn’t function with spans, like NextMatch… if we were to accept that, we could add span-based methods for those as well, but it would likely be confusing and inconsistent).

API Proposal

namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input);
+        public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern);
+        public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+        public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);

+        public ref struct MatchPositionEnumerator
+        {
+            public MatchPositionEnumerator GetEnumerator();
+            public bool MoveNext();
+            public Range Current { get; } // could also add a new struct for the return type here rather than System.Range
+        }
    }
}

API Usage

foreach (Range r in Regex.Enumerate(spanInput,pattern))
{
    Process(spanInput[r]);
} 

Alternative Designs

One of the topics for discussion here is that the fact that the Enumerate method won't give access to capture data might be confusing for consumers, as it is not plain and obvious that the intention of this method is to be allocation free, so it is possible that consumers might expect to get a Match enumerable back. One of the ideas that have been suggested by @stephentoub is that we could also provide ref struct versions of Match, Group and Capture which basically would reference the ReadOnlySpan<char> and that would be what is returned by Enumerate instead, which would make the return value be more intuitive and more what consumers might expect.

Risks

We want to make sure that by introducing this new API, we don't introduce confusion on whether this should be used over some other existing API like Matches which already returns an enumerable of Match objects back. We have to make sure that we make it obvious when one should be used over the other.

Author: joperezr
Assignees: -
Labels:

api-suggestion, area-System.Text.RegularExpressions

Milestone: -

@joperezr
Copy link
Member Author

joperezr commented Apr 8, 2022

I have done some experimentation on this, and have also discussed offline with both @stephentoub and @bartonjs. Based on this I have made a few changes on my prototype and I will update the above proposal as such. Here are the main points:

  • Instead of naming the method Enumerate we will change it to be EnumerateMatches. There are a lot of similarities between the API being proposed and the existing regex.Matches(string input) API, but the main difference between the two will be that the one being proposed will operate over a passed in span and will be amortized allocation free. After chatting with @bartonjs, using a name like EnumerateMatches will provide a better connotation to indicate that it will not pre-compute all of the values.
  • Instead of having Current return Range, we will return a custom struct in order to be able to, in the future, add support for also having capture data in that struct. We will now introduce a ValueMatch struct which is what will be returned from Current that for now will only contain a Range, but in the future it might contain other things that might expose Capture data.
  • Since what we are returning now is a ValueMatch, we will rename MatchPositionEnumerator -> ValueMatchEnumerator.

I will update the main description to apply the above suggestions.

@joperezr joperezr added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Apr 8, 2022
@joperezr
Copy link
Member Author

joperezr commented Apr 8, 2022

For Capture support, we can later propose something like:

namespace System.Text.RegularExpressions
{
    public readonly ref struct ValueMatch
    {
        public Range Range { get { throw null; } }
+        public Match? Match { get { throw null; } }
    }
    public class Regex : ISerializable
    {
+        public ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures,  [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options, TimeSpan matchTimeout) { throw null; }
    }

@joperezr joperezr added the blocking Marks issues that we want to fast track in order to unblock other important work label Apr 8, 2022
@bartonjs
Copy link
Member

bartonjs commented Apr 8, 2022

Video

We broke Range up into Index and Length to better match the current Match type. We can easily add a Range property later if it's desired.

The ValueMatch type is being approved as a ref struct because we have concerns that it will need it with a more complete implementation of ValueMatch. In either this release, or a subsequent one, when we're confident we can drop it we would like to do so.

namespace System.Text.RegularExpressions
{
    // NOTE: This is being approved as a ref struct because we have concerns that it will need it with
    // a more complete implementation of ValueMatch.
    // If the ref-ness can safely be removed, we'd rather this was a simple (non-ref) struct.
    public readonly ref struct ValueMatch
    {
        public int Index { get; }
        public int Length { get; }
    }

    public class Regex : ISerializable
    {
        public ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input) { throw null; }
        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { throw null; }
        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options) { throw null; }
        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, TimeSpan matchTimeout) { throw null; }
        public ref struct ValueMatchEnumerator
        {
            public readonly ValueMatchEnumerator GetEnumerator() { throw null; }
            public bool MoveNext() { throw null; }
            public readonly ValueMatch Current { get { throw null; } }
        }
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed blocking Marks issues that we want to fast track in order to unblock other important work api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 8, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 9, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 13, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants