-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unroll String.Equals for constant input [0..16] length #64821
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsI decided to push my old prototype for review 🙂
We can unroll Other potential candidates: Demo: bool Test1(string s) => s.StartsWith("t", StringComparison.Ordinal);
bool Test2(string s) => s.StartsWith("te", StringComparison.Ordinal);
bool Test3(string s) => s.StartsWith("tes", StringComparison.Ordinal);
bool Test4(string s) => s.StartsWith("test", StringComparison.Ordinal);
bool Test4(string s, string var) => s.StartsWith(var, StringComparison.Ordinal); Codegen diff: https://www.diffchecker.com/Y7yBNGt4 @dotnet/jit-contrib @stephentoub @jkotas @GrabYourPitchforks
|
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
Outdated
Show resolved
Hide resolved
What does it do with this? s.StartsWith("test5", StringComparison.Ordinal) |
Current impl will use the defaul impl, but can be easily extended |
de2711c
to
65d29fb
Compare
Added bool Test(string? s) => s == "Hi"; codegen: ; Method Foo2:Test(System.String):bool:this
G_M50453_IG01: ;; offset=0000H
;; bbWeight=1 PerfScore 0.00
G_M50453_IG02: ;; offset=0000H
4885D2 test rdx, rdx
7416 je SHORT G_M50453_IG04
;; bbWeight=1 PerfScore 1.25
G_M50453_IG03: ;; offset=0005H
48B80200000048006900 mov rax, 0x69004800000002
48394208 cmp qword ptr [rdx+8], rax
0F94C0 sete al
0FB6C0 movzx rax, al
EB02 jmp SHORT G_M50453_IG05
;; bbWeight=0.50 PerfScore 3.25
G_M50453_IG04: ;; offset=001BH
33C0 xor eax, eax
;; bbWeight=0.50 PerfScore 0.12
G_M50453_IG05: ;; offset=001DH
C3 ret
;; bbWeight=1 PerfScore 1.00
; Total bytes of code: 30 Length and firstChar are merged into a single cmp (the trick I learned from Levi in Discord) PS: codegen is still suboptimal, but it's a completely different issue |
Pushed a demo for vectorized unrolling (len >= 8 && len <= 16): bool Test1(string s) => s == "ProxyHeader"; new codegen: ; Method Tests:Test1(System.String):bool:this
G_M52821_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M52821_IG02: ;; offset=0003H
4885D2 test rdx, rdx
7504 jne SHORT G_M52821_IG04
G_M52821_IG03: ;; offset=0008H
33C0 xor eax, eax
EB32 jmp SHORT G_M52821_IG05
G_M52821_IG04: ;; offset=000CH
C5F910052C000000 vmovupd xmm0, xmmword ptr [reloc @RWD00]
C5FA6F4A0C vmovdqu xmm1, xmmword ptr [rdx+12]
C5F1EF0D2F000000 vpxor xmm1, xmm1, xmmword ptr [reloc @RWD16]
8B4208 mov eax, dword ptr [rdx+8]
C5F9EF4402FC vpxor xmm0, xmm0, xmmword ptr [rdx+rax-4]
C5F1EBC0 vpor xmm0, xmm1, xmm0
C4E27917C0 vptest xmm0, xmm0
0F94C0 sete al
0FB6C0 movzx rax, al
G_M52821_IG05: ;; offset=0039H
C3 ret
RWD00 dq 0065004800790078h, 0072006500640061h
RWD16 dq 0078006F00720050h, 0061006500480079h
; Total bytes of code: 58 Unfortunately inliner goes out of time budget if I enable SIMD path |
You can always up the limit, if it's close. say from 10x to12x would be ok. But 10x to 100x maybe not so much. |
Good point, thanks. So overall I decided to only keep |
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
Outdated
Show resolved
Hide resolved
Consider a method that does 100 string comparisons against constant strings. Are all these 100 string comparisons going to be optimized, without hitting inliner's budget? |
Also, it would be interesting to measure JIT time of a method like this (before/after). |
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
Outdated
Show resolved
Hide resolved
…son.cs Co-authored-by: Günther Foidl <[email protected]>
… into unroll-startswith
Indeed 😁 |
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
Outdated
Show resolved
Hide resolved
…son.cs Co-authored-by: Miha Zupan <[email protected]>
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
Outdated
Show resolved
Hide resolved
…startswith # Conflicts: # src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
4156270
to
e0bd4ba
Compare
@jkotas So the budget is common for all the callees for a specific root and it gets eaten very quickly with string.Equals, e.g.: static bool Foo(string s)
{
if (s == "11111111" ||
s == "22222222" ||
s == "33333333" ||
s == "44444444" ||
s == "55555555")
{
return true;
}
return false;
} codegen: ; Method UnrolledStringEquals:Foo(System.String):bool
G_M33664_IG01: ;; offset=0000H
56 push rsi
4883EC20 sub rsp, 32
C5F877 vzeroupper
488BF1 mov rsi, rcx
;; bbWeight=1 PerfScore 2.50
G_M33664_IG02: ;; offset=000BH
4885F6 test rsi, rsi
7436 je SHORT G_M33664_IG04
;; bbWeight=1 PerfScore 1.25
G_M33664_IG03: ;; offset=0010H
837E0808 cmp dword ptr [rsi+8], 8
7530 jne SHORT G_M33664_IG04
C5FA6F460C vmovdqu xmm0, xmmword ptr [rsi+12]
C5FA6F4E0C vmovdqu xmm1, xmmword ptr [rsi+12]
C5F1EF0DF8000000 vpxor xmm1, xmm1, xmmword ptr [reloc @RWD00]
C5F9EF05F0000000 vpxor xmm0, xmm0, xmmword ptr [reloc @RWD00]
C5F1EBC0 vpor xmm0, xmm1, xmm0
C4E27917C0 vptest xmm0, xmm0
0F94C2 sete dl
0FB6D2 movzx rdx, dl
EB02 jmp SHORT G_M33664_IG05
;; bbWeight=0.50 PerfScore 11.29
G_M33664_IG04: ;; offset=0041H
33D2 xor edx, edx
;; bbWeight=0.50 PerfScore 0.12
G_M33664_IG05: ;; offset=0043H
85D2 test edx, edx
0F85B4000000 jne G_M33664_IG11
;; bbWeight=1 PerfScore 1.25
G_M33664_IG06: ;; offset=004BH
4885F6 test rsi, rsi
7436 je SHORT G_M33664_IG07
837E0808 cmp dword ptr [rsi+8], 8
7530 jne SHORT G_M33664_IG07
C5FA6F460C vmovdqu xmm0, xmmword ptr [rsi+12]
C5FA6F4E0C vmovdqu xmm1, xmmword ptr [rsi+12]
C5F1EF0DC8000000 vpxor xmm1, xmm1, xmmword ptr [reloc @RWD16]
C5F9EF05C0000000 vpxor xmm0, xmm0, xmmword ptr [reloc @RWD16]
C5F1EBC0 vpor xmm0, xmm1, xmm0
C4E27917C0 vptest xmm0, xmm0
0F94C2 sete dl
0FB6D2 movzx rdx, dl
EB02 jmp SHORT G_M33664_IG08
;; bbWeight=0.50 PerfScore 11.92
G_M33664_IG07: ;; offset=0081H
33D2 xor edx, edx
;; bbWeight=0.50 PerfScore 0.12
G_M33664_IG08: ;; offset=0083H
85D2 test edx, edx
7573 jne SHORT G_M33664_IG11
4885F6 test rsi, rsi
7436 je SHORT G_M33664_IG09
837E0808 cmp dword ptr [rsi+8], 8
7530 jne SHORT G_M33664_IG09
C5FA6F460C vmovdqu xmm0, xmmword ptr [rsi+12]
C5FA6F4E0C vmovdqu xmm1, xmmword ptr [rsi+12]
C5F1EF0D9C000000 vpxor xmm1, xmm1, xmmword ptr [reloc @RWD32]
C5F9EF0594000000 vpxor xmm0, xmm0, xmmword ptr [reloc @RWD32]
C5F1EBC0 vpor xmm0, xmm1, xmm0
C4E27917C0 vptest xmm0, xmm0
0F94C2 sete dl
0FB6D2 movzx rdx, dl
EB02 jmp SHORT G_M33664_IG10
;; bbWeight=0.50 PerfScore 12.54
G_M33664_IG09: ;; offset=00BDH
33D2 xor edx, edx
;; bbWeight=0.50 PerfScore 0.12
G_M33664_IG10: ;; offset=00BFH
85D2 test edx, edx
7532 jne SHORT G_M33664_IG11
48BA805565D474010000 mov rdx, 0x174D4655580 ; "44444444"
488B12 mov rdx, gword ptr [rdx]
488BCE mov rcx, rsi
E8503EFEFF call System.String:<Equals>g__EqualsUnrolled_9_to_16|38_1(System.String,System.String):bool
85C0 test eax, eax
7519 jne SHORT G_M33664_IG11
48BA885565D474010000 mov rdx, 0x174D4655588 ; "55555555"
488B12 mov rdx, gword ptr [rdx]
488BCE mov rcx, rsi
E8BF43FEFF call System.String:Equals(System.String,System.String):bool
85C0 test eax, eax
740B je SHORT G_M33664_IG13
;; bbWeight=0.50 PerfScore 5.38
G_M33664_IG11: ;; offset=00F5H
B801000000 mov eax, 1
;; bbWeight=0.50 PerfScore 0.12
G_M33664_IG12: ;; offset=00FAH
4883C420 add rsp, 32
5E pop rsi
C3 ret
;; bbWeight=0.50 PerfScore 0.88
G_M33664_IG13: ;; offset=0100H
33C0 xor eax, eax
;; bbWeight=0.50 PerfScore 0.12
G_M33664_IG14: ;; offset=0102H
4883C420 add rsp, 32
5E pop rsi
C3 ret
;; bbWeight=0.50 PerfScore 0.88
RWD00 dq 0031003100310031h, 0031003100310031h
RWD16 dq 0032003200320032h, 0032003200320032h
RWD32 dq 0033003300330033h, 0033003300330033h
; Total bytes of code: 264 so no budget for `s == "55555555"
since it can't do 100 Equals I don't see anything terrible and in general I guess there are not so many, I also tested the test I added - no visible effects. What is nice is that for non-unrollable cases jit ignores those calls as it is able to remove simple dead branches during import. Any opinion on this? It feels great to be able to declare rules to unroll stuff in pure C# but yeah, several long equals can eat inliner's budget and we won't be able to inline other stuff in the current root. |
hm... |
So the only thing I can do here is:
|
So it was an interesting challenge but I feel like it is not worth the trouble. Maybe it will be more relevant for Spans, especially for Spans of bytes (UTF8). Also, as an alternative good solution similar to this we can use |
How much do you lose in perf by not force-inlining the |
it's pretty bad, e.g. this is what happens when the masks are not folded to constants: static Vector128<ushort> Foo(string b)
{
return
Sse2.Xor(
Vector128.Create(
b[0], b[1], b[2], b[3],
b[4], b[5], b[6], b[7]),
Vector128.Create(
b[b.Length - 8], b[b.Length - 7],
b[b.Length - 6], b[b.Length - 5],
b[b.Length - 4], b[b.Length - 3],
b[b.Length - 2], b[b.Length - 1]));
} ; Assembly listing for method XX:Foo(System.String):System.Runtime.Intrinsics.Vector128`1[UInt16]
G_M28844_IG01: ;; offset=0000H
55 push ebp
8BEC mov ebp, esp
56 push esi
C5F877 vzeroupper
;; bbWeight=1 PerfScore 3.25
G_M28844_IG02: ;; offset=0007H
8B4204 mov eax, dword ptr [edx+4]
85C0 test eax, eax
0F8403010000 je G_M28844_IG05
0FB77208 movzx esi, word ptr [edx+8]
C5F96EC6 vmovd xmm0, esi
83F801 cmp eax, 1
0F86F1000000 jbe G_M28844_IG05
C5F9C4420A01 vpinsrw xmm0, xmm0, word ptr [edx+10], 1
83F802 cmp eax, 2
0F86E1000000 jbe G_M28844_IG05
C5F9C4420C02 vpinsrw xmm0, xmm0, word ptr [edx+12], 2
83F803 cmp eax, 3
0F86D1000000 jbe G_M28844_IG05
C5F9C4420E03 vpinsrw xmm0, xmm0, word ptr [edx+14], 3
83F804 cmp eax, 4
0F86C1000000 jbe G_M28844_IG05
C5F9C4421004 vpinsrw xmm0, xmm0, word ptr [edx+16], 4
83F805 cmp eax, 5
0F86B1000000 jbe G_M28844_IG05
C5F9C4421205 vpinsrw xmm0, xmm0, word ptr [edx+18], 5
83F806 cmp eax, 6
0F86A1000000 jbe G_M28844_IG05
C5F9C4421406 vpinsrw xmm0, xmm0, word ptr [edx+20], 6
83F807 cmp eax, 7
0F8691000000 jbe G_M28844_IG05
C5F9C4421607 vpinsrw xmm0, xmm0, word ptr [edx+22], 7
8D70F8 lea esi, [eax-8]
3BF0 cmp esi, eax
0F837F000000 jae G_M28844_IG05
0FB77442F8 movzx esi, word ptr [edx+2*eax-8]
C5F96ECE vmovd xmm1, esi
8D70F9 lea esi, [eax-7]
3BF0 cmp esi, eax
736E jae SHORT G_M28844_IG05
C5F1C44C42FA01 vpinsrw xmm1, xmm1, word ptr [edx+2*eax-6], 1
8D70FA lea esi, [eax-6]
3BF0 cmp esi, eax
735F jae SHORT G_M28844_IG05
C5F1C44C42FC02 vpinsrw xmm1, xmm1, word ptr [edx+2*eax-4], 2
8D70FB lea esi, [eax-5]
3BF0 cmp esi, eax
7350 jae SHORT G_M28844_IG05
C5F1C44C42FE03 vpinsrw xmm1, xmm1, word ptr [edx+2*eax-2], 3
8D70FC lea esi, [eax-4]
3BF0 cmp esi, eax
7341 jae SHORT G_M28844_IG05
;; bbWeight=1 PerfScore 78.75
G_M28844_IG03: ;; offset=00C8H
C5F1C40C4204 vpinsrw xmm1, xmm1, word ptr [edx+2*eax], 4
8D70FD lea esi, [eax-3]
3BF0 cmp esi, eax
7333 jae SHORT G_M28844_IG05
C5F1C44C420205 vpinsrw xmm1, xmm1, word ptr [edx+2*eax+2], 5
8D70FE lea esi, [eax-2]
3BF0 cmp esi, eax
7324 jae SHORT G_M28844_IG05
C5F1C44C420406 vpinsrw xmm1, xmm1, word ptr [edx+2*eax+4], 6
8D70FF lea esi, [eax-1]
3BF0 cmp esi, eax
7315 jae SHORT G_M28844_IG05
C5F1C44C420607 vpinsrw xmm1, xmm1, word ptr [edx+2*eax+6], 7
C5F9EFC1 vpxor xmm0, xmm0, xmm1
C5F91101 vmovupd xmmword ptr [ecx], xmm0
;; bbWeight=1 PerfScore 27.58
G_M28844_IG04: ;; offset=0100H
5E pop esi
5D pop ebp
C3 ret
;; bbWeight=1 PerfScore 2.00
G_M28844_IG05: ;; offset=0103H
E809891520 call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
; Total bytes of code 265, prolog size 7, PerfScore 139.88, instruction count 70, allocated bytes for code 283 (MethodHash=985a8f53) for method XX:Foo(System.String):System.Runtime.Intrinsics.Vector128`1[UInt16]
; ============================================================ |
At this point it sounds like it's actually simpler to do everything in JIT directly - I'll return with a prototype in JIT this weekend 🙂 Going to close this one, just to keep as an example of doing things in pure C# only. in jit I'll be able to optimize corner special cases (e.g. 8 chars), StartsWith, no budget problems, no throughput problems, AVX2 path (for up to 32byte), access to block weights, Spans... |
This PR unrolls
String.Equals
when the 2nd argument is a constant string of [0..16] length (can be extended to e.g. 32 via AVX2 easily but needs some work with inliner's budget)Example
New codegen:
@dotnet/jit-contrib @stephentoub @jkotas @GrabYourPitchforks