Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Access Regex group without allocations #73223

Open
ronaldvdv opened this issue Aug 2, 2022 · 8 comments
Open

[API Proposal]: Access Regex group without allocations #73223

ronaldvdv opened this issue Aug 2, 2022 · 8 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.RegularExpressions
Milestone

Comments

@ronaldvdv
Copy link

Background and motivation

In class Capture (namespace System.Text.RegularExpressions) we now have the nice addition of the ValueSpan property which allows me to access the captured text efficiently. However, to be able to access that property I would still need to access Match.Groups[num] which would allocate the full GroupCollection and all Group instances, which (for my specific use case) defeats the purpose a bit.

API Proposal

namespace System.Text.RegularExpressions;

public class Match : Group
{
    public ReadOnlySpan<char> GetGroupValueSpan(int groupnum);
}

API Usage

string text = "One car red car blue car";
string pat = @"(\w+)\s+(car)";
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
Match m = r.Match(text);
ReadOnlySpan<char> word = m.GetGroupValueSpan(1); // matches "One"

Alternative Designs

  • We might need similar methods for returning the captured text as a string
  • We may need overloads that accept the group name as a string

Risks

No response

@ronaldvdv ronaldvdv added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Aug 2, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 2, 2022
@ghost
Copy link

ghost commented Aug 2, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

In class Capture (namespace System.Text.RegularExpressions) we now have the nice addition of the ValueSpan property which allows me to access the captured text efficiently. However, to be able to access that property I would still need to access Match.Groups[num] which would allocate the full GroupCollection and all Group instances, which (for my specific use case) defeats the purpose a bit.

API Proposal

namespace System.Text.RegularExpressions;

public class Match : Group
{
    public ReadOnlySpan<char> GetGroupValueSpan(int groupnum);
}

API Usage

string text = "One car red car blue car";
string pat = @"(\w+)\s+(car)";
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
Match m = r.Match(text);
ReadOnlySpan<char> word = m.GetGroupValueSpan(1); // matches "One"

Alternative Designs

  • We might need similar methods for returning the captured text as a string
  • We may need overloads that accept the group name as a string

Risks

No response

Author: ronaldvdv
Assignees: -
Labels:

api-suggestion, area-System.Text.RegularExpressions

Milestone: -

@svick
Copy link
Contributor

svick commented Aug 2, 2022

We might need similar methods for returning the captured text as a string

What would be the reason for that? The whole point of this new method is to avoid allocations, but then you also add a version that has an extra allocation? Especially since anyone who does need a string can just call ToString() on the returned span.

We may need overloads that accept the group name as a string

This one does sound useful to me.

@ronaldvdv
Copy link
Author

Agree! The idea/question mark came from the fact that we provide both ValueSpan and Value on Capture, but I suspect that is for backward compatibility only.

@joperezr
Copy link
Member

joperezr commented Aug 2, 2022

Thanks for the proposal @ronaldvdv. Not sure if you are aware, but we added some amortized-allocation free APIs that loop through matches in .NET 7, in particular, we added regex.EnumerateMatches(ReadOnlySpan<char> input). Today, this returns an enumerator that eventually gets you ValueMatch structs which contain the index and the length of the match. We didn't have time to add groups and captures support on that ValueMatch yet as there are still some things we need to consider in terms of design in order to continue having this be allocation-free. My uber point here, is that I think we would probably prefer to have this sort of support on that ValueMatch type as opposed to Match since if you really care about allocations, Match object is probably still not the best alternative now that we have other options.

@joperezr joperezr added this to the Future milestone Aug 2, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Aug 2, 2022
@ronaldvdv
Copy link
Author

Interesting!! Yes, indeed if we have an amortized-allocation free option, it's nice to extend that to include access to individual captures within the match.

  • If we would match a single Regex pattern against a large set of strings consecutively, would I be sharing the same single Match instance throughout those many calls to Regex.EnumerateMatches()?
  • If we would reuse that Match instance, can we also reuse or pool the two-dimensional integer arrays in Match._matches (which I think are still being used by ValueMatch under the hood)?

@joperezr
Copy link
Member

joperezr commented Aug 2, 2022

If we would match a single Regex pattern against a large set of strings consecutively, would I be sharing the same single Match instance throughout those many calls to Regex.EnumerateMatches()?

Assuming you are using the same regex object to call for the different inputs, then yes. When EnumerateMatches is called, only one internal Match object will be created RegexRunner._runmatch (if it wasn't created already by a previous call to the same regex instance) and this will get reused both while enumerating all of the results on each input, as well as everytime you provide a new input to match (in which case it just gets reset first before it gets used).

If we would reuse that Match instance, can we also reuse or pool the two-dimensional integer arrays in Match._matches (which I think are still being used by ValueMatch under the hood)?

That is the tricky part and why it wasn't added yet for 7.0. ValueSpan is a ref struct which only contains the index and length of the match, it can't really keep a reference to the _runmatch object it was used to create it since as we've discussed above, this will get reused so there is no guarantees that the group or _matches info will be accurate/relevant when we are querying it. Also, if possible we would like to avoid having to allocate arrays for the groups and matches when creating a ValueSpan as that would make it no longer alloc-free. I added a draft of a proposal of what it would look like to add this where you can optionally ask EnumerateMatches to also include Capture data with the expense of knowing that this might incur in few allocations: #65011 (comment)

@ronaldvdv
Copy link
Author

Thanks for providing more background! I still struggle a little bit to understand how the approach described in #65011 would help for the original question in this issue. I hope you don't mind me asking a follow-up here.

If we would tell EnumerateMatches to also store capture data, would that mean Match instance(s) are populated just like when we would not be using ValueMatch at all? Then how would that improve the performance for getting group data compared to the existing implementation?

It seems to me we would still need an additional method that skips the creation of Group instances as proposed above. That method could either be placed in either ValueMatch (if it has access to the group data within its Match instance) or on Match itself, but ultimately it would be using data from Match so I'm not sure what the added value of ValueMatch would be in the solution to the problem at hand. Would it provide pooling/reuse of the Match instance and would EnumerateMatches be the only way to enjoy that benefit, r is there something else I'm missing?

@joperezr
Copy link
Member

joperezr commented Aug 3, 2022

I hope you don't mind me asking a follow-up here.

Not at all, questions are always welcomed 😄

If we would tell EnumerateMatches to also store capture data, would that mean Match instance(s) are populated just like when we would not be using ValueMatch at all?

Nothing would change with the internal Match object, that would still have the same fields that track capture data, but remember that this Match object gets reused, so we can't rely on it to get the capture groups data. That means that the way to get this info would have to be through the ValueMatch instance that gets returned for each match, meaning we would need to copy this data over to that match. The current info on that match are only two ints, which means that this match instance is allocation-free today. If we wanted to also store the capture data on it, we would have to then allocate a ValueTuple<int index, int length> array to store that capture data, but that means our ValueMatch is no longer allocation free, as it needs to allocate that array. This is why we would want that to be an opt-in option in case you care about capture data, but if your pattern doesn't have captures or you don't care about them, then we remain allocation-free.

Then how would that improve the performance for getting group data compared to the existing implementation?

It would improve since we wouldn't have to return a Match object, as we could instead just reuse the internal runner's runmatch field and we could also operate over passed in spans. Current implementation is never allocation-free, as whenever you get a Match object back, it means we have to allocate that match object for every match you try. That's the main advantage of working with EnumerateMatches and ValueMatch matches instead, since those are ref structs (which don't need allocation) and for every match you try we just reuse the only-one-time-allocated internal runmatch.

Would it provide pooling/reuse of the Match instance and would EnumerateMatches be the only way to enjoy that benefit, r is there something else I'm missing?

Exactly. With this approach you only allocate Match and it's fields (including groups) once, and reuse them, and all ValueMatch returned only have one ValueTuple<int,int>[] allocation in case you had capturing groups and you cared about the capture data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.RegularExpressions
Projects
None yet
Development

No branches or pull requests

3 participants