efficient vector masks / reducing bitvector-mask-conversions #95
Comments
There seems to be an LLVM intrinsic for masked (vector-predicated) reductions, `llvm.vp.reduce.fmin` 🥳
But when trying to use it via

```julia
function minimum_llvm2(xs::Vec{4,Float32}, mask::Vec{4,Bool}, x0::Float32)
    s = """
    declare float @llvm.vp.reduce.fmin.v4f32(float, <4 x float>, <4 x i1>, i32)

    define float @entry(<4 x float> %0, <4 x i8> %1, float %2) #0 {
    top:
        %3 = trunc <4 x i8> %1 to <4 x i1>
        %res.i = call reassoc float @llvm.vp.reduce.fmin.v4f32(float %2, <4 x float> %0, <4 x i1> %3, i32 4)
        ret float %res.i
    }
    """
    Base.llvmcall((s, "entry"), Float32, Tuple{SIMD.LVec{4,Float32}, SIMD.LVec{4,Bool}, Float32},
                  xs.data, mask.data, x0)
end
```

I get a "Symbols not found" error.
Could this be due to wrong LLVM code, or is this functionality really missing? Are there ways to obtain a list of all supported intrinsics? This is on
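In case the `vp` intrinsics are unavailable, a minimal workaround sketch (a hypothetical helper, assuming only SIMD.jl's `vifelse` and `minimum` reduction, not the intrinsic above) is to blend the masked-out lanes with the initial value and reduce normally:

```julia
using SIMD

# Sketch: emulate a masked fmin reduction without @llvm.vp.reduce.fmin by
# replacing inactive lanes with the initial value x0, then reducing.
function minimum_masked(xs::Vec{4,Float32}, mask::Vec{4,Bool}, x0::Float32)
    blended = vifelse(mask, xs, Vec{4,Float32}(x0))  # inactive lanes become x0
    return min(minimum(blended), x0)                 # fold in x0, as the vp intrinsic does
end
```

Whether LLVM turns this into a single masked reduction instruction is of course a separate question.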
It seems to be the case that an explicit conversion from `Vec{4,Bool}` to full-width integer masks helps:

```julia
using SIMD

function select_min(
    a      :: Vec{4,Float64},
    b      :: Vec{4,Float64},
    mask_a :: Vec{4,UInt64},
    mask_b :: Vec{4,UInt64},
)
    cond_A = bitvec2mask(UInt64, a < b)
    cond_B = mask_a & mask_b
    masked_cond = vifelse(msb2bitvec(cond_B), cond_A, mask_a)
    return vifelse(msb2bitvec(masked_cond), a, b)
end
```

With `code_native(select_min, (Vec{4,Float64}, Vec{4,Float64}, Vec{4,UInt64}, Vec{4,UInt64}); debuginfo=:none)` this produces the desired assembly

```asm
movq      %rdi, %rax
vmovupd   (%rsi), %ymm0               ; load a (to ymm0)
vmovupd   (%rdx), %ymm1               ; load b (to ymm1)
vcmpltpd  %ymm1, %ymm0, %ymm2         ; ymm2 = a < b            ⇒ ymm2 == cond_A
vmovupd   (%rcx), %ymm3               ; load mask_a (to ymm3)
vandpd    (%r8), %ymm3, %ymm4         ; ymm4 = mask_a & mask_b  ⇒ ymm4 == cond_B
vblendvpd %ymm4, %ymm2, %ymm3, %ymm2  ; ymm2 = cond_B .? cond_A .: mask_a  ⇒ ymm2 == masked_cond
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0  ; ymm0 = masked_cond .? a .: b
vmovapd   %ymm0, (%rdi)               ; store result
vzeroupper
retq
```
retq where # @code_native debuginfo=:none bitvec2mask(UInt64,Vec(false,true,true,false))
# vpmovzxbq (%rsi), %ymm0
# vpxor %xmm1, %xmm1, %xmm1
# vpsubq %ymm0, %ymm1, %ymm0
# vmovdqa %ymm0, (%rdi)
@inline bitvec2mask(::Type{U}, bs::Vec{N,Bool}) where {U,N} =
~Vec((U.(Tuple(bs)))...) + one(U) and # @code_native debuginfo=:none msb2bitvec(bitvec2mask(UInt64,Vec(false,true,true,false)))
# vmovdqu (%rdi), %ymm0
# vpsrlq $63, %ymm0, %ymm0
# vextracti128 $1, %ymm0, %xmm1
# vpackusdw %xmm1, %xmm0, %xmm0
# vpackusdw %xmm0, %xmm0, %xmm0
# vpackuswb %xmm0, %xmm0, %xmm0
@inline msb2bitvec(m::Vec{N,U}) where {N,U <: Unsigned} =
reinterpret(Vec{N,Bool},Vec((UInt8.(Tuple(m >> (sizeof(U)*8-1))))...)) A wonder how LLVM figures that out ... 🤯 Edit/PS: interestingly, using @inline bitvec2msb(::Type{U}, bs::Vec{N,Bool}) where {U,N} =
Vec((U.(Tuple(bs)))...) << (sizeof(U)*8-1) Edit2/PPS: the same assembly is produced when using LLVM's signed extension function select_min(...)
...
cond_A = sext(UInt64, a < b)
...
end where @generated function sext(::Type{T},x::Vec{N,Bool}) where {N,T}
t = SIMD.Intrinsics.llvm_type(T)
s = """
%2 = trunc <$N x i8> %0 to <$N x i1>
%3 = sext <$N x i1> %2 to <$N x $t>
ret <$N x $t> %3
"""
return :( $(Expr(:meta,:inline)); Vec(Base.llvmcall($s,LVec{$N,$T},Tuple{LVec{$N,Bool}},x.data)) )
which might be recognized by LLVM when it sees

Edit3/PPPS: instead of the unsigned masks plus `msb2bitvec`, the same can be written with signed masks and `signbit`:

```julia
function select_min4(
    a      :: Vec{4,Float64},
    b      :: Vec{4,Float64},
    mask_a :: Vec{4,Int64},
    mask_b :: Vec{4,Int64},
)
    cond_A = sext(Int64, a < b)
    cond_B = mask_a & mask_b
    masked_cond = vifelse(signbit(cond_B), cond_A, mask_a)
    return vifelse(signbit(masked_cond), a, b)
end
```
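A quick usage sketch with hypothetical inputs (relying on the `sext` and `select_min4` definitions above; an all-ones lane, i.e. `-1`, means "active"):

```julia
using SIMD

a      = Vec{4,Float64}((1.0, 5.0, 2.0, 8.0))
b      = Vec{4,Float64}((3.0, 4.0, 9.0, 7.0))
mask_a = Vec{4,Int64}((-1, -1,  0, -1))   # lanes where a is valid
mask_b = Vec{4,Int64}((-1,  0, -1, -1))   # lanes where b is valid

select_min4(a, b, mask_a, mask_b)
# lane 1: both valid, a < b  -> 1.0 (from a)
# lane 2: only a valid       -> 5.0 (from a)
# lane 3: only b valid       -> 9.0 (from b)
# lane 4: both valid, b < a  -> 7.0 (from b)
```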
Hi,
I have stumbled upon conversions between bool vectors and SIMD masks that Julia/LLVM generates in the amd64 assembly, and I am wondering whether there is any technique or idiom to reduce these runtime conversions.
This is related to SIMD programming in Julia and might not be related to the package SIMD.jl, although it could be that some representation for masks is missing in SIMD.jl, so I hope this is the right place to discuss this. My concerns regarding the topic are:
Background
The conversions occur when using `vifelse` from the SIMD.jl package, which produces a `select` statement in LLVM assembly and a `vblendvpd` instruction in amd64 assembly. Now I found out that

- `Vec{N, Bool}` seems to be represented as `<N x i8>`,
- `select` seems to take `<N x i1>` masks,
- `fcmp` seems to produce `<N x i1>` results.

I understand that Julia's and LLVM's representation conversions are "opaque"/architecture-independent and meant to be optimized away in most cases, but for tiny-tiny SIMD "kernels" the architecture and the mask representation affect the way one would write a program.
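These representations can be inspected directly. A minimal sketch (a hypothetical `blend` kernel, assuming SIMD.jl and the standard `code_llvm`; the exact IR depends on the Julia/LLVM versions) showing the `<4 x i8>` → `<4 x i1>` truncation in front of the `select`:

```julia
using SIMD, InteractiveUtils

# Tiny vifelse kernel: the Bool mask vector is stored as <4 x i8>, so the IR
# contains a trunc to <4 x i1> before the select.
blend(mask::Vec{4,Bool}, a::Vec{4,Float64}, b::Vec{4,Float64}) = vifelse(mask, a, b)

code_llvm(blend, (Vec{4,Bool}, Vec{4,Float64}, Vec{4,Float64}); debuginfo=:none)
```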
For example, I found a technique where they "load from the mask once per 16 elements, and reusing the vector by bit shifting", which has an amd64-specific smell to it. Such techniques only work when we can treat the SIMD mask as a "quadword mask", which may or may not be in the scope of Julia because of its architecture dependence (?)
Example
To illustrate the issue, think of a merge of two vectors by a comparison, where each input comes with a mask. I.e. from vectors `a` and `b` we want to component-wise select the smaller element, but only when `mask_a` or `mask_b` is true. This could be a vector-reduction step in a masked minimum operation.

A boolean expression to realize this behaviour would be
When extending this masked minimum into a masked findmin, one strategy would be to re-use the mask on a separate index vector. I stick with the first variant, although this specific function could be implemented differently:
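A sketch of that first variant with `Vec{4,Bool}` masks (my reconstruction using SIMD.jl's `vifelse`, not necessarily the original snippet; the `↯` comments mark where implicit mask conversions happen):

```julia
using SIMD

function select_min(a::Vec{4,Float64}, b::Vec{4,Float64},
                    mask_a::Vec{4,Bool}, mask_b::Vec{4,Bool})
    cond_A = a < b                                 # ↯ fcmp yields <4 x i1>, stored back as <4 x i8>
    cond_B = mask_a & mask_b
    masked_cond = vifelse(cond_B, cond_A, mask_a)  # ↯ <4 x i8> truncated to <4 x i1> for the select
    return vifelse(masked_cond, a, b)              # ↯ another <4 x i8> → <4 x i1> truncation
end
```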
Here, I've marked the implicit conversions with a small ↯ symbol. The conversions become "visible" when looking at the generated assembly with `code_native`; then we can see that the mask conversions account for a lot of instructions and some additional registers.
I have annotated the assembly as far as I understood it:
The reason for these conversions' occurrence might already be present in the LLVM code:
Although I am not fluent in amd64, I hope that something like this snippet would be possible:
which makes use of 5 ymm registers and therefore could be unrolled three times without register spilling and whatnot.
Would it be possible to achieve this in Julia somehow?
edit/PS: It seems that the whole issue generalizes to SIMD vectors with element sizes greater than a single byte.