-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use bit mask in Regex's IsWordChar #84003
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsImproves throughput of IsWordChar on non-ASCII chars by ~15%. Roslyn has also fixed the codegen for is/is not patterns, so the switch expressions used here instead have been switched back.
|
...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three blocks of identical code (not counting the source emitting version nor the test) for the WordCategories constant. Could we pull this out somewhere common so we don't eventually get them out of sync?
There are three copies of the const. We could remove one of the three by combining two of them into a field const, but the one in the source generator needs to remain separate. The categories that make up \w are also never going to change. |
Working on the coalescing of the constant... Also I noticed a what looks like a typo int chDiv8 = ch >> 3;
return (uint)chDiv8 < (uint)ascii.Length ?
(ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
((WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0 ||
(ch == ZeroWidthJoiner | ch == ZeroWidthNonJoiner)); That last line shoudn't the |
@stephentoub is there a performance history that explains the duplicate code in public static bool IsBoundaryWordChar(char ch)
{
// According to UTS#18 Unicode Regular Expressions (http://www.unicode.org/reports/tr18/)
// RL 1.4 Simple Word Boundaries The class of <word_character> includes all Alphabetic
// values from the Unicode character database, from UnicodeData.txt [UData], plus the U+200C
// ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER.
const char ZeroWidthNonJoiner = '\u200C', ZeroWidthJoiner = '\u200D';
return IsWordChar(ch) || ch == ZeroWidthJoiner || ch == ZeroWidthNonJoiner;
} For branch prediction and common data, I would expect the ordering shown above to be more valuable than doing the cheaper, but less likely character equality checks first... |
Because then you'll be doing the extra boundary checks even in the ASCII case where they're unnecessary. |
That is on purpose based on perf testing and the elimination of the branch on average being better. |
Glad I asked! I've got the PR ready then... one other question. Would it be ugly to expose this constant in some way so the Emitter generated code could just reference it? Is there precedent for that sort of construct? |
Yes. We want the emitted code to be as readable as possible, which means it's better for it to read: const int WordCategories =
1 << (int)UnicodeCategory.UppercaseLetter |
1 << (int)UnicodeCategory.LowercaseLetter |
1 << (int)UnicodeCategory.TitlecaseLetter |
1 << (int)UnicodeCategory.ModifierLetter |
1 << (int)UnicodeCategory.OtherLetter |
1 << (int)UnicodeCategory.NonSpacingMark |
1 << (int)UnicodeCategory.DecimalDigitNumber |
1 << (int)UnicodeCategory.ConnectorPunctuation; than to read: const int WordCategories = 0x4013F; |
Reminds me this runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs Lines 90 to 92 in c3983a2
|
PR submitted, I did make a change in Emitter, but ONLY to change the name to My follow-up question for the #84003 (comment) is more along the lines of should we expose a public constant from RegexCharClass that is referenced in the Emitter's output code such that the emitted code looks like " return (uint)chDiv8 < (uint)ascii.Length ?",
" (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :",
" (RegexCharClass.WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;",
"}",
}); |
No, I don't believe there's enough value in that (and RegexCharClass is internal). This is how we currently do the check; that could change in the future, and we'd be exposing public API for what's an implementation detail. |
Improves throughput of IsWordChar on non-ASCII chars by ~15%.
Roslyn has also fixed the codegen for is/is not patterns, so the switch expressions used here instead have been switched back.