Skip to content

Latest commit

 

History

History
131 lines (123 loc) · 11.1 KB

README.md

File metadata and controls

131 lines (123 loc) · 11.1 KB

NOTE: This is all based on the work done by Adrien Grand over here: decode-128-ints-benchmark

DISCLAIMER: I'm doing my best to stumble through some decompiled assembly here. I'm not particularly experienced with this, so please verify results for yourself :)

Explore more efficient ways to handle the special-case in PFOR decoding of 0 bits-per-value (all deltas are the same) but where the "same value" is something other than 1. The PFOR encoding in Lucene (PForUtil) will look for cases where all deltas in the same block are the same, and encode it with "zero bits-per-value". It will then record the value once. The common case for this is "dense blocks" where all deltas are 1 (i.e., a very common term that many docs contain). There can be rare cases though where all deltas are the same but not 1. This benchmark tries to explore more efficient ways to handle this decoding, particularly around trying to get the HotSpot compiler to auto-vectorize the loops.

The benchmark currently compares a "baseline" decoder to a "candidate" decoder, allowing different implementations of prefixSumOf to be experimented with.

My current results are showing the baseline to be more performant, primarily due to the "multiplication loop" in the candidate not getting vectorized. Nothing is getting vectorized in the baseline, but it's handling everything in one loop vs. the candidate needing a loop plus two vectorized instructions (the array copy and the "addition loop").

Benchmark                                  (bitsPerValue)  (exceptionCount)  (sameVal)   Mode  Cnt  Score   Error   Units
DecodeBenchmark.pForDeltaBaselineDecoder                0                 0          2  thrpt   10  8.034 ± 0.022  ops/us
DecodeBenchmark.pForDeltaCandidateDecoder               0                 0          2  thrpt   10  6.849 ± 0.116  ops/us

When looking at the decompiled assembly the HostSpot appears to be creating, here's the main loop for the baseline:

  1.23%   ↗  0x00007f9607a02910: movslq %r10d,%rbx
  1.19%   │  0x00007f9607a02913: mov    %rbx,%rdi
  0.90%   │  0x00007f9607a02916: add    $0x1,%rdi
  1.49%   │  0x00007f9607a0291a: imul   %r9,%rdi
  1.34%   │  0x00007f9607a0291e: add    %rcx,%rdi
  1.45%   │  0x00007f9607a02921: mov    %rdi,0x10(%r8,%rbx,8)
  1.53%   │  0x00007f9607a02926: mov    %rbx,%rdi
  1.45%   │  0x00007f9607a02929: add    $0x4,%rdi
  1.42%   │  0x00007f9607a0292d: imul   %r9,%rdi
  1.31%   │  0x00007f9607a02931: add    %rcx,%rdi
  1.49%   │  0x00007f9607a02934: mov    %rbx,%rdx
  1.42%   │  0x00007f9607a02937: add    $0x3,%rdx
  1.38%   │  0x00007f9607a0293b: imul   %r9,%rdx
  1.16%   │  0x00007f9607a0293f: add    %rcx,%rdx
  1.75%   │  0x00007f9607a02942: mov    %rbx,%rsi
  1.16%   │  0x00007f9607a02945: add    $0x2,%rsi
  1.49%   │  0x00007f9607a02949: imul   %r9,%rsi
  3.92%   │  0x00007f9607a0294d: add    %rcx,%rsi
  2.83%   │  0x00007f9607a02950: mov    %rsi,0x18(%r8,%rbx,8)
  2.65%   │  0x00007f9607a02955: mov    %rdx,0x20(%r8,%rbx,8)
  1.72%   │  0x00007f9607a0295a: mov    %rdi,0x28(%r8,%rbx,8)  ;*lastore {reexecute=0 rethrow=0 return_oop=0}
          │                                                ; - gsmiller.PForBaselineDecoder::prefixSumOf@24 (line 12)
          │                                                ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
          │                                                ; - gsmiller.DecodeBenchmark::pForDeltaBaselineDecoder@42 (line 22)
          │                                                ; - gsmiller.generated.DecodeBenchmark_pForDeltaBaselineDecoder_jmhTest::pForDeltaBaselineDecoder_thrpt_jmhStub@151 (line 240)
  1.53%   │  0x00007f9607a0295f: add    $0x4,%r10d         ;*iadd {reexecute=0 rethrow=0 return_oop=0}
          │                                                ; - gsmiller.PForBaselineDecoder::prefixSumOf@17 (line 12)
          │                                                ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
          │                                                ; - gsmiller.DecodeBenchmark::pForDeltaBaselineDecoder@42 (line 22)
          │                                                ; - gsmiller.generated.DecodeBenchmark_pForDeltaBaselineDecoder_jmhTest::pForDeltaBaselineDecoder_thrpt_jmhStub@151 (line 240)
  2.01%   │  0x00007f9607a02963: cmp    $0x7d,%r10d
          ╰  0x00007f9607a02967: jl     0x00007f9607a02910  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                           ; - gsmiller.PForBaselineDecoder::prefixSumOf@8 (line 11)
                                                           ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
                                                           ; - gsmiller.DecodeBenchmark::pForDeltaBaselineDecoder@42 (line 22)
                                                           ; - gsmiller.generated.DecodeBenchmark_pForDeltaBaselineDecoder_jmhTest::pForDeltaBaselineDecoder_thrpt_jmhStub@151 (line 240)

Compare this to the three main operations in the candidate (array copy, multiplication loop and add loop): Looks like the array copy (which is getting vectorized):

  0.04%       0x00007f132fa052b9: mov    %r10,0x98(%rsp)
  0.04%       0x00007f132fa052c1: mov    0x90(%rsp),%rdi
              0x00007f132fa052c9: mov    %r10,%rsi
              0x00007f132fa052cc: mov    $0x80,%edx
              0x00007f132fa052d1: vzeroupper 
  0.23%       0x00007f132fa052d4: movabs $0x7f1327dd0f60,%r10
  0.12%       0x00007f132fa052de: callq  *%r10              ;*invokestatic arraycopy {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - gsmiller.PForCandidateDecoder::prefixSumOf@9 (line 12)
                                                            ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
                                                            ; - gsmiller.DecodeBenchmark::pForDeltaCandidateDecoder@42 (line 29)
                                                            ; - gsmiller.generated.DecodeBenchmark_pForDeltaCandidateDecoder_jmhTest::pForDeltaCandidateDecoder_thrpt_jmhStub@151 (line 240)

Looks like the main "multiplication loop" (looks pretty non-vectorized)

  0.58%   ↗│  0x00007f132fa05350: mov    %r13,%r10
  1.81%   ││  0x00007f132fa05353: imul   0x10(%rbp,%r9,8),%r10
  0.96%   ││  0x00007f132fa05359: mov    %r10,0x10(%rbp,%r9,8)
  2.69%   ││  0x00007f132fa0535e: mov    %r13,%r10
  0.19%   ││  0x00007f132fa05361: imul   0x18(%rbp,%r9,8),%r10
  1.23%   ││  0x00007f132fa05367: mov    %r10,0x18(%rbp,%r9,8)
  1.54%   ││  0x00007f132fa0536c: mov    %r13,%r10
  1.77%   ││  0x00007f132fa0536f: imul   0x20(%rbp,%r9,8),%r10
  0.19%   ││  0x00007f132fa05375: mov    %r10,0x20(%rbp,%r9,8)
  2.54%   ││  0x00007f132fa0537a: mov    %r13,%r10
  0.54%   ││  0x00007f132fa0537d: imul   0x28(%rbp,%r9,8),%r10
  1.42%   ││  0x00007f132fa05383: mov    %r10,0x28(%rbp,%r9,8)  ;*lastore {reexecute=0 rethrow=0 return_oop=0}
          ││                                                ; - gsmiller.PForCandidateDecoder::prefixSumOf@30 (line 14)
          ││                                                ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
          ││                                                ; - gsmiller.DecodeBenchmark::pForDeltaCandidateDecoder@42 (line 29)
          ││                                                ; - gsmiller.generated.DecodeBenchmark_pForDeltaCandidateDecoder_jmhTest::pForDeltaCandidateDecoder_thrpt_jmhStub@151 (line 240)
  1.27%   ││  0x00007f132fa05388: add    $0x4,%r9d          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ││                                                ; - gsmiller.PForCandidateDecoder::prefixSumOf@31 (line 13)
          ││                                                ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
          ││                                                ; - gsmiller.DecodeBenchmark::pForDeltaCandidateDecoder@42 (line 29)
          ││                                                ; - gsmiller.generated.DecodeBenchmark_pForDeltaCandidateDecoder_jmhTest::pForDeltaCandidateDecoder_thrpt_jmhStub@151 (line 240)
  2.46%   ││  0x00007f132fa0538c: cmp    %r11d,%r9d
          ╰│  0x00007f132fa0538f: jl     0x00007f132fa05350  ;*putfield tmp {reexecute=0 rethrow=0 return_oop=0}
           │                                                ; - gsmiller.ForUtil::<init>@9 (line 186)
           │                                                ; - gsmiller.DecodeBenchmark::pForDeltaCandidateDecoder@8 (line 28)
           │                                                ; - gsmiller.generated.DecodeBenchmark_pForDeltaCandidateDecoder_jmhTest::pForDeltaCandidateDecoder_thrpt_jmhStub@151 (line 240)

And the "addition loop" (which appears vectorized)

  0.23%  ↗    0x00007f132fa04ec0: vpaddq 0x10(%rbp,%r11,8),%ymm0,%ymm1
  0.69%  │    0x00007f132fa04ec7: vmovdqu %ymm1,0x10(%rbp,%r11,8)  ;*lastore {reexecute=0 rethrow=0 return_oop=0}
         │                                                  ; - gsmiller.PForCandidateDecoder::prefixSumOf@54 (line 17)
         │                                                  ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
         │                                                  ; - gsmiller.DecodeBenchmark::pForDeltaCandidateDecoder@42 (line 29)
         │                                                  ; - gsmiller.generated.DecodeBenchmark_pForDeltaCandidateDecoder_jmhTest::pForDeltaCandidateDecoder_thrpt_jmhStub@151 (line 240)
  0.35%  │    0x00007f132fa04ece: add    $0x4,%r11d         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
         │                                                  ; - gsmiller.PForCandidateDecoder::prefixSumOf@55 (line 16)
         │                                                  ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
         │                                                  ; - gsmiller.DecodeBenchmark::pForDeltaCandidateDecoder@42 (line 29)
         │                                                  ; - gsmiller.generated.DecodeBenchmark_pForDeltaCandidateDecoder_jmhTest::pForDeltaCandidateDecoder_thrpt_jmhStub@151 (line 240)
  0.08%  │    0x00007f132fa04ed2: cmp    %ecx,%r11d
         ╰    0x00007f132fa04ed5: jl     0x00007f132fa04ec0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - gsmiller.PForCandidateDecoder::prefixSumOf@44 (line 16)
                                                            ; - gsmiller.AbstractPForDecoder::decodeAndPrefixSum@33 (line 39)
                                                            ; - gsmiller.DecodeBenchmark::pForDeltaCandidateDecoder@42 (line 29)
                                                            ; - gsmiller.generated.DecodeBenchmark_pForDeltaCandidateDecoder_jmhTest::pForDeltaCandidateDecoder_thrpt_jmhStub@151 (line 240)