
efficient vector masks / reducing bitvector-mask-conversions #95 (Open)

mchristianl commented Feb 11, 2022

Hi,

I have stumbled upon conversions between bool vectors and SIMD masks that Julia/LLVM generates in the amd64 assembly, and I am wondering whether there is any technique or idiom to reduce these runtime conversions.

This concerns SIMD programming in Julia in general and might not be specific to the SIMD.jl package, although it could be that some mask representation is missing in SIMD.jl, so I hope this is the right place to discuss it.

My concerns regarding the topic are:

  1. Would it make sense to expose "quadword masks" (and similar), as returned by vector comparisons on amd64, in Julia?
    • ... and how could that be done?
    • ... or does such an exposure belong in the LLVM domain?
  2. If not in Julia, how are vector masks meant to be used in plain LLVM then?
    • LLVM calls these masks "condition vectors", but condition vectors and boolean SIMD vectors seem to differ conceptually.
  3. And generally, how architecture-dependent is SIMD-mask programming across Julia's supported x86 and ARM targets, in 64 and 32 bit?
    • i.e. "how does this work under ARM?"

Background

The conversions occur when using vifelse from the SIMD.jl package, which produces a select statement in LLVM assembly and a vblendvpd instruction in amd64 assembly. Now I found out that

  • Julia's Vec{N, Bool} seems to be represented as <N x i8>,
  • the generated LLVM statement select seems to take <N x i1> masks,
  • for the amd64 instruction vblendvpd "The mask bits are the most significant bit in each quadword element of the mask register",
  • the LLVM statement for comparison fcmp seems to produce <N x i1> results,
  • and the amd64 instruction for comparison vcmpltpd produces "a quadword mask of all 1s (comparison true) or all 0s (comparison false)".
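
These representations can be inspected directly; e.g. the zext from <N x i1> to <N x i8> already shows up in the LLVM code of a bare comparison:

using SIMD
# show the LLVM IR of a bare vector comparison: the fcmp produces <4 x i1>,
# which is then zext'ed to <4 x i8> to represent the Vec{4,Bool} result
code_llvm((a, b) -> a < b, (Vec{4,Float64}, Vec{4,Float64}); debuginfo=:none)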

I understand that Julia's and LLVM's representation conversions are "opaque"/architecture-independent and meant to be optimized away in most cases, but for tiny SIMD "kernels" the architecture and the mask representation affect the way one would write a program.

E.g. I found a technique where one "loads from the mask once per 16 elements and reuses the vector by bit shifting" (a minimal sketch follows), which has an amd64-specific smell to it. Such techniques only work when we can treat the SIMD mask as a "quadword mask", which may or may not be in the scope of Julia because of its architecture dependence (?)
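
A scalar sketch of that idiom (function name and loop structure are mine, not from the linked source):

# walk a 16-lane bitmask with bit shifts instead of materializing
# a boolean vector for every lane
function masked_sum(xs::NTuple{16,Float64}, mask::UInt16)
  acc = 0.0
  m   = mask                     # load the mask once per 16 elements
  for i in 1:16
    if m & 0x0001 != 0x0000      # test the current lane's bit
      acc += xs[i]
    end
    m >>= 1                      # shift the next lane's bit into position
  end
  return acc
end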

Example

select_min(a,b)               = vifelse(                           (a < b)        , a, b ) # unmasked
select_min(a,b,mask_a,mask_b) = vifelse( vifelse((mask_a & mask_b),(a < b),mask_a), a, b ) # variant 1
select_min(a,b,mask_a,mask_b) = vifelse(     mask_a & ((~mask_b) | (a < b))       , a, b ) # variant 2

To illustrate the issue, think of a merge of two vectors by a comparison where each input comes with a mask.
I.e. from vectors a and b we want to component-wise select the smaller element, but a lane may only be taken from a or b when its respective mask is true. This could be a vector-reduction step in a masked minimum operation:

a < b   mask_a   mask_b   masked_cond   masked minimum
  0       0        0           *              *
  0       0        1           0              b
  0       1        0           1              a
  0       1        1           0              b
  1       0        0           *              *
  1       0        1           0              b
  1       1        0           1              a
  1       1        1           1              a

A boolean expression to realize this behaviour would be

  ma & mb .? (a < b) .: ma
= (ma & mb) & (a < b)  |  ~(ma & mb) & ma
= ma & (~mb | (a < b))
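
A quick brute-force check of this simplification in plain Julia (names mine):

# exhaustively verify:  ma & mb .? (a < b) .: ma  ==  ma & (~mb | (a < b))
for ma in (false, true), mb in (false, true), lt in (false, true)  # lt = (a < b)
  select_form = (ma & mb) ? lt : ma
  simplified  = ma & (!mb | lt)
  @assert select_form == simplified
end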

When extending this masked minimum into a masked findmin, one strategy would be to reuse the mask on a separate index vector. I stick with the first variant, although this specific function could be implemented differently:

using SIMD
function select_min(
         a :: Vec{4,Float64} # 4×64 float
  ,      b :: Vec{4,Float64} # 4×64 float
  , mask_a :: Vec{4,Bool}    # 4×8 least-significant-bit bool
  , mask_b :: Vec{4,Bool}    # 4×8 least-significant-bit bool
  )
  cond_A = a < b                    #✓# `cond_A` is a 4×64 most-significant-bit mask
                                    #↯# `cond_A` gets converted to a 4×8 least-significant-bit bool
  cond_B = mask_a & mask_b          #✓# `cond_B` is a 4×8 least-significant-bit bool
                                    #↯# `cond_B` gets converted to a 4×8 most-significant-bit mask
  masked_cond = 
    vifelse(cond_B, cond_A, mask_a) #✓# `masked_cond` is a 4×8 least-significant-bit bool
                                    #↯# `masked_cond` gets converted to a 4×64 most-significant-bit mask
  return vifelse(masked_cond, a, b)
  # think here of multiple other "companion-components" that should fit into registers such as
  #   return ( vifelse(masked_cond, a      , b      )
  #          , vifelse(masked_cond, a_index, b_index)
  #          , vifelse(masked_cond, a_data , b_data )
  #          )
end
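
For concreteness, a possible call (input values made up; the expected result follows the truth table above):

a  = Vec(1.0, 5.0, 3.0, 4.0)
b  = Vec(2.0, 2.0, 9.0, 1.0)
ma = Vec(true, true, false, true)
mb = Vec(true, false, true, true)
select_min(a, b, ma, mb)   # expected: <4 x Float64>[1.0, 5.0, 9.0, 1.0]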

In the listing above, I've marked the implicit conversions with a small symbol (↯). The conversions become "visible" when looking at the generated assembly with

N = 4
T = Float64
code_native(select_min,(Vec{N,T},Vec{N,T},Vec{N,Bool},Vec{N,Bool}); debuginfo=:none)

Then we can see that the mask conversions account for a lot of instructions and some additional registers.
I have annotated the assembly as far as I understood it:

movq         %rdi, %rax
vmovupd     (%rsi), %ymm0               ;✓; load a (to ymm0)
vmovupd     (%rdx), %ymm1               ;✓; load b (to ymm1)
vcmpltpd     %ymm1, %ymm0, %ymm2        ;✓; ymm2  = a < b
                                        ; ; ⇒ ymm2 == cond_A
                                        ; ; packing of cond_A:
vextractf128    $1, %ymm2, %xmm3        ;↯;   xmm3  = upper part of ymm2
vpackssdw    %xmm3, %xmm2, %xmm2        ;↯;   xmm2  = ... packing ... of xmm3
vpackssdw    %xmm2, %xmm2, %xmm2        ;↯;   xmm2  = ... packing ... of xmm2
vpacksswb    %xmm2, %xmm2, %xmm2        ;↯;   xmm2  = ... packing ... of xmm2
movabsq    $.rodata.cst16, %rdx         ;↯;   ...
vpand       (%rdx), %xmm2, %xmm2        ;↯;   xmm2 &= some bitmask
                                        ; ;   ⇒ xmm2 == packed cond_A
movl        (%rcx), %ecx                ;✓; ecx   = mask_a
vmovd         %ecx, %xmm3               ;✓; xmm3  = mask_a
andl         (%r8), %ecx                ;✓; ecx   = mask_a & mask_b
vmovd         %ecx, %xmm4               ;✓; xmm4  = mask_a & mask_b
                                        ; ; ⇒ xmm4 = cond_B
                                        ; ; unpacking of cond_B:
movabsq $140542381220720, %rcx          ;↯;   ...
vpand       (%rcx), %xmm4, %xmm4        ;↯;   xmm4 &= some bitmask (?)
vpmovzxbd    %xmm4, %xmm4               ;↯;   xmm4  = xmm4[0...1...2...3...]
movabsq  $140542381220736, %rcx         ;↯;   ...
vpshufb     (%rcx), %xmm4, %xmm4        ;↯;   xmm4 = "Packed Shuffle Bytes"
vpsllw          $7, %xmm4, %xmm4        ;↯;   xmm4 <<= 7
                                        ; ;   ⇒ xmm4 == unpacked cond_B
vpblendvb    %xmm4, %xmm2, %xmm3, %xmm2 ;✓; xmm2  = xmm4 .? xmm2 .: xmm3
                                        ; ; ⇒ xmm2 = masked_cond
                                        ; ; unpacking of masked_cond:
vpmovzxbd    %xmm2, %xmm2               ;↯;   xmm2  = xmm2[0...1...2...3...]
vpslld       $31, %xmm2, %xmm2          ;↯;   xmm2 <<= 31
vpmovsxdq    %xmm2, %ymm2               ;↯;   ymm2  = "Packed Move with Sign Extend" xmm2
                                        ; ;   ⇒ ymm2 == unpacked masked_cond
vblendvpd    %ymm2, %ymm0, %ymm1, %ymm0 ;✓; ymm0  = ymm2 .? a .: b
vmovapd      %ymm0, (%rdi)              ;✓; store result
vzeroupper
retq

The reason for these conversions may already be visible in the LLVM code:

define void @julia_mselect_65501(
    [1 x <4 x double>]* noalias nocapture sret([1 x <4 x double>]) %0
  , [1 x <4 x double>]* nocapture nonnull readonly align 16 dereferenceable(32) %1
  , [1 x <4 x double>]* nocapture nonnull readonly align 16 dereferenceable(32) %2
  , [1 x <4 x     i8>]* nocapture nonnull readonly align  4 dereferenceable( 4) %3
  , [1 x <4 x     i8>]* nocapture nonnull readonly align  4 dereferenceable( 4) %4
  ) #0 {
top:
  %5       = getelementptr inbounds [1 x <4 x double>] ....            ; ...
  %6       = getelementptr inbounds [1 x <4 x double>] ....            ; ...
  %7       = load     <4 x double>, <4 x double>* %5, align 16         ; %7 = a
  %8       = load     <4 x double>, <4 x double>* %6, align 16         ; %8 = b
  %res.i   = fcmp olt <4 x double> %7, %8                              ; %res.i  = a < b          (4 x i1)
  %resb.i  = zext     <4 x     i1> %res.i to <4 x i8>                  ; %resb.i = a < b          (4 x i8)
                                                                       ; ⇒ %resb.i = cond_A
  %9       = getelementptr inbounds [1 x <4 x     i8>] ....            ; ...
  %10      = getelementptr inbounds [1 x <4 x     i8>] ....            ; ...
  %11      = load   <4 x i8>, <4 x i8>*  %9, align 4                   ; %11 = mask_a             (4 x i8)
  %12      = load   <4 x i8>, <4 x i8>* %10, align 4                   ; %12 = mask_b             (4 x i8)
  %13      = and    <4 x i8> %12, %11                                  ; %13 = mask_a & mask_b    (4 x i8)
                                                                       ; ⇒ %13 == cond_B          (4 x i8)
  %cond.i2 = trunc  <4 x i8> %13 to <4 x i1>                           ; %cond.i2 = cond_B        (4 x i1)
  %res.i3  = select <4 x i1> %cond.i2, <4 x i8> %resb.i, <4 x i8> %11  ; ⇒ %res.i3 == masked_cond (4 x i8)
  %cond.i  = trunc  <4 x i8> %res.i3 to <4 x i1>                       ; %cond.i = masked_cond    (4 x i1)
  %res.i1  = select <4 x i1> %cond.i, <4 x double> %7, <4 x double> %8 ; %res.i1 = masked_cond .? .a : .b
  %14      = getelementptr inbounds [1 x <4 x double>] ....            ;
  store <4 x double> %res.i1, <4 x double>* %14, align 32              ; store result
  ret void
}

Although I am not fluent in amd64, I hope that something like this snippet would be possible:

vmovupd  (%rsi), %ymm0               ;✓; load      a (to ymm0)
vmovupd  (%rdx), %ymm1               ;✓; load      b (to ymm1)
vmovupd  (%r??), %ymm3               ;✓; load mask_a (to ymm3)
vmovupd  (%r??), %ymm4               ;✓; load mask_b (to ymm4)
                                     ; ; ⇒ these registers might already be loaded anyways
vcmpltpd  %ymm1, %ymm0, %ymm2        ;✓; ymm2 = a < b (cond_A)
vandps    %ymm3, %ymm4, %ymm4        ;✓; ymm4 = mask_a & mask_b (cond_B)
vpblendvb %ymm4, %ymm2, %ymm3, %ymm2 ;✓; ymm2  = ymm4 .? ymm2 .: ymm3 (masked_cond)
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0 ;✓; ymm0  = ymm2 .? a .: b

which makes use of only 5 ymm registers and could therefore be unrolled three times without register spilling.

Would it be possible to achieve this in Julia somehow?
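
One (untested) direction might be to hand-write the whole kernel as a single llvmcall so that the masks never leave the <4 x i1> domain; a sketch, assuming the SIMD.LVec layout of Vec{4,Bool} as <4 x i8>:

using SIMD

@generated function select_min_fused(a::Vec{4,Float64}, b::Vec{4,Float64},
                                     mask_a::Vec{4,Bool}, mask_b::Vec{4,Bool})
  s = """
  %ma  = trunc <4 x i8> %2 to <4 x i1>
  %mb  = trunc <4 x i8> %3 to <4 x i1>
  %cmp = fcmp olt <4 x double> %0, %1
  %mab = and <4 x i1> %ma, %mb
  %mc  = select <4 x i1> %mab, <4 x i1> %cmp, <4 x i1> %ma
  %res = select <4 x i1> %mc, <4 x double> %0, <4 x double> %1
  ret <4 x double> %res
  """
  return :( $(Expr(:meta,:inline));
            Vec(Base.llvmcall($s, SIMD.LVec{4,Float64},
                Tuple{SIMD.LVec{4,Float64},SIMD.LVec{4,Float64},
                      SIMD.LVec{4,Bool},SIMD.LVec{4,Bool}},
                a.data, b.data, mask_a.data, mask_b.data)) )
end

Whether LLVM then actually emits the short 4-instruction sequence hoped for above would need checking with code_native.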

Edit/PS: it seems that the whole issue generalizes to SIMD vectors with element sizes greater than a single byte.

mchristianl (Author) commented

There seems to be an llvm.vp.* namespace for Vector Predication Intrinsics, including llvm.vp.reduce.fmin.* intrinsics

🥳

  • The first operand is the start value of the reduction...
  • The second operand is the vector on which the reduction is performed...
  • The third operand is the vector mask...
  • The fourth operand is the explicit vector length of the operation.

But when trying to use llvm.vp.reduce.fmin.v4f32 like so

function minimum_llvm2(xs::Vec{4,Float32},mask::Vec{4,Bool},x0::Float32)
  s = """
  declare float @llvm.vp.reduce.fmin.v4f32(float, <4 x float>, <4 x i1>, i32)
  define float @entry(
      <4 x float> %0
    , <4 x    i8> %1
    , float       %2
    ) #0 {
  top:
    %3 = trunc <4 x i8> %1 to <4 x i1>
    %res.i = call reassoc float @llvm.vp.reduce.fmin.v4f32(float %2, <4 x float> %0, <4 x i1> %3, i32 4)
    ret float %res.i
  }
  """
  Base.llvmcall((s, "entry"), Float32, Tuple{SIMD.LVec{4,Float32},SIMD.LVec{4,Bool},Float32}
    , xs.data, mask.data, x0)
end

I get a Symbols not found error:

julia> minimum_llvm2(Vec(1.0f0,2.0f0,3.0f0,4.0f0),Vec(false,true,true,false),0.0f0)
JIT session error: Symbols not found: [ llvm.vp.reduce.fmin.v4f32 ]
Failure value returned from cantFail wrapped call
Failed to materialize symbols: { (JuliaOJIT, { jfptr_minimum_llvm2_224, julia_minimum_llvm2_223 }) }
UNREACHABLE executed at /usr/include/llvm/Support/Error.h:782!

Could this be due to wrong LLVM code, or is this functionality really missing? Is there a way to obtain a list of all llvm.* intrinsics supported on my system? 🤔

This is on

Julia Version 1.7.1
Commit ac5cc99908* (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, skylake)
Environment:
  JULIA_LOAD_PATH = /usr/share/gmsh/api/julia/:
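
Until the llvm.vp.* intrinsics are available, one workaround might be to blend masked-out lanes with the identity element of the reduction and use SIMD.jl's plain (unmasked) minimum; a sketch:

using SIMD

# replace masked-out lanes by +Inf (the identity of min), then reduce
masked_minimum(xs::Vec{4,Float32}, mask::Vec{4,Bool}) =
  minimum(vifelse(mask, xs, Vec{4,Float32}(Inf32)))

masked_minimum(Vec(1.0f0,2.0f0,3.0f0,4.0f0), Vec(false,true,true,false))  # -> 2.0f0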

mchristianl (Author) commented Feb 12, 2022

It seems that an explicit conversion from Vec{N,Bool} to the amd64 bitmask representation makes LLVM omit the implicit mask conversions. Using two helpers bitvec2mask and msb2bitvec (defined below), the slightly modified version

using SIMD
function select_min(
         a :: Vec{4,Float64}
  ,      b :: Vec{4,Float64}
  , mask_a :: Vec{4, UInt64}
  , mask_b :: Vec{4, UInt64}
  )
  cond_A      = bitvec2mask(UInt64, a < b)                                  
  cond_B      = mask_a & mask_b
  masked_cond = vifelse(msb2bitvec(cond_B), cond_A, mask_a)
  return        vifelse(msb2bitvec(masked_cond), a, b)
end

with

code_native(select_min,(Vec{4,Float64},Vec{4,Float64},Vec{4,UInt64},Vec{4,UInt64});debuginfo=:none)

produces the desired assembly

movq       %rdi, %rax
vmovupd         (%rsi), %ymm0        ; load a (to ymm0)
vmovupd         (%rdx), %ymm1        ; load b (to ymm1)
vcmpltpd  %ymm1, %ymm0, %ymm2        ; %ymm2 = a < b
                                     ; ⇒ ymm2 == cond_A
vmovupd         (%rcx), %ymm3        ; load mask_a (to ymm3)
vandpd    (%r8), %ymm3, %ymm4        ; ymm4 = mask_a & mask_b
                                     ; ⇒ ymm4 == cond_B
vblendvpd %ymm4, %ymm2, %ymm3, %ymm2 ; ymm2 = ymm4 .? ymm2 .: ymm3
                                     ; ⇒ ymm2 == masked_cond
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0 ; ymm0 = masked_cond .? a .: b
vmovapd   %ymm0, (%rdi)              ; store result
vzeroupper
retq

where bitvec2mask converts a single-byte least-significant-bit bool to a multi-byte all-ones-or-all-zeros mask

# @code_native debuginfo=:none bitvec2mask(UInt64,Vec(false,true,true,false))
#   vpmovzxbq (%rsi), %ymm0
#   vpxor      %xmm1, %xmm1, %xmm1
#   vpsubq     %ymm0, %ymm1, %ymm0
#   vmovdqa    %ymm0, (%rdi)
@inline bitvec2mask(::Type{U}, bs::Vec{N,Bool}) where {U,N} =
  ~Vec((U.(Tuple(bs)))...) + one(U)

and msb2bitvec converts a multi-byte most-significant-bit mask back to a single-byte least-significant-bit bool

# @code_native debuginfo=:none msb2bitvec(bitvec2mask(UInt64,Vec(false,true,true,false)))
#   vmovdqu         (%rdi), %ymm0
#   vpsrlq      $63, %ymm0, %ymm0
#   vextracti128 $1, %ymm0, %xmm1
#   vpackusdw        %xmm1, %xmm0, %xmm0
#   vpackusdw        %xmm0, %xmm0, %xmm0
#   vpackuswb        %xmm0, %xmm0, %xmm0
@inline msb2bitvec(m::Vec{N,U}) where {N,U <: Unsigned} = 
  reinterpret(Vec{N,Bool},Vec((UInt8.(Tuple(m >> (sizeof(U)*8-1))))...))
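
A quick round-trip check of the two helpers (values chosen arbitrarily):

m = bitvec2mask(UInt64, Vec(false, true, true, false))
@assert Tuple(m) == (zero(UInt64), typemax(UInt64), typemax(UInt64), zero(UInt64))
@assert Tuple(msb2bitvec(m)) == (false, true, true, false)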

I wonder how LLVM figures that out ... 🤯

Edit/PS: interestingly, using bitvec2msb instead of bitvec2mask also works, but generates a single additional shift-left instruction vpsllq that does not affect the result 🤔

@inline bitvec2msb(::Type{U}, bs::Vec{N,Bool}) where {U,N} =
  Vec((U.(Tuple(bs)))...) << (sizeof(U)*8-1)

Edit2/PPS: the same assembly is produced when using LLVM's sign extension sext instead of bitvec2mask

function select_min(...)
  ...
  cond_A      = sext(UInt64, a < b)                                  
  ...
end

where sext is created like so

@generated function sext(::Type{T},x::Vec{N,Bool}) where {N,T}
  t = SIMD.Intrinsics.llvm_type(T)
  s = """
  %2 = trunc <$N x i8> %0 to <$N x i1>
  %3 = sext  <$N x i1> %2 to <$N x $t>
  ret <$N x $t> %3
  """
  return :( $(Expr(:meta,:inline)); Vec(Base.llvmcall($s,SIMD.LVec{$N,$T},Tuple{SIMD.LVec{$N,Bool}},x.data)) )
end

LLVM presumably recognizes the ~Vec((U.(Tuple(bs)))...) + one(U) pattern in bitvec2mask as exactly this sign extension.

Edit3/PPPS: instead of msb2bitvec, of course, signbit also works:

function select_min4(
         a :: Vec{4,Float64}
  ,      b :: Vec{4,Float64}
  , mask_a :: Vec{4,  Int64}
  , mask_b :: Vec{4,  Int64}
  )
  cond_A      = sext(Int64, a < b)                                  
  cond_B      = mask_a & mask_b
  masked_cond = vifelse(signbit(cond_B), cond_A, mask_a)
  return        vifelse(signbit(masked_cond), a, b)
end
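
For completeness, a possible call with sext-encoded masks (values made up, matching the earlier example):

a  = Vec(1.0, 5.0, 3.0, 4.0)
b  = Vec(2.0, 2.0, 9.0, 1.0)
ma = sext(Int64, Vec(true, true, false, true))
mb = sext(Int64, Vec(true, false, true, true))
select_min4(a, b, ma, mb)   # expected: <4 x Float64>[1.0, 5.0, 9.0, 1.0]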
