Requesting Support for fmaddsub
#88
Comments
This looks reasonable. Are you sure this intrinsic exists on your system? The To go further, I recommend the following:
This allows people to see easily exactly what changes you made. Then report exactly what system you are using (and run |
Sure thing. I'm certain my system is capable of using fma (I regularly use instructions from it with C++ intrinsics), but I just don't know if it exists in LLVM or how to access it which is largely why I put up the issue. I've made a draft PR and I'll get some more info later today. Here's my version info a priori
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 11th Gen Intel(R) Core(TM) i5-11600K @ 3.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)
|
See #89 |
LLVM supports the intrinsic. The question is whether the machine architecture that Julia tells LLVM to use includes the fma extensions. What error message do you see? |
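A quick way to check what Julia is telling LLVM to target is to look at the native code for a plain scalar `fma`. This is only a rough sketch using Base and InteractiveUtils: the string search for `vfmadd` is an x86-specific heuristic I am assuming here, not an official feature query, and it will report `false` on other architectures.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Capture the native code Julia generates for a scalar fused multiply-add.
asm = sprint(io -> code_native(io, fma, (Float64, Float64, Float64)))

println("CPU name reported by Julia: ", Sys.CPU_NAME)
println("LLVM version: ", Base.libllvm_version)
# Heuristic (assumption): on x86 with FMA enabled, a vfmadd* instruction should appear.
println("fma lowers to a vfmadd instruction: ", occursin("vfmadd", asm))
```

If the last line prints `false` on an x86 machine, the FMA extension is probably not part of the target Julia selected, and the x86 fmaddsub intrinsics are unlikely to work either.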
I toyed around with it inside
julia> using SIMD
julia> x = Vec(1.5, 2.5)
<2 x Float64>[1.5, 2.5]
julia> xd = x.data
(VecElement{Float64}(1.5), VecElement{Float64}(2.5))
julia> SIMD.Intrinsics.fma_fmaddsub(xd, xd, xd)
(VecElement{Float64}(3.75), VecElement{Float64}(8.75))
For some reason it's just calling Then, I tried implementing it a different way using
function fmaddsub(a::LVec{N, T}, b::LVec{N, T}, c::LVec{N, T}) where {N, T<:FloatingTypes}
    Base.llvmcall("llvm.x86.fma.vfmaddsub_pd", LVec{N, T}, (LVec{N, T}, LVec{N, T}, LVec{N, T}), a, b, c)
end
which (using the same previous example) gave me
I know that I can theoretically access it through Julia, because I was talking with some people over at VectorizationBase and someone got it to work there and pushed it up. I didn't want to just leave this hanging, but working with LLVM is definitely not in my wheelhouse. |
This is my implementation:
and this is the result I get:
which subtracts the first and adds the second element. Note that Note also the spelling of the intrinsic: there is a letter |
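For comparing results, a scalar reference for the lane behaviour described above (subtract in the first element, add in the second) could look like the following. This is a minimal sketch in plain Julia; `fmaddsub_ref` is a hypothetical name, not something SIMD.jl provides.

```julia
# Hypothetical scalar reference, not part of SIMD.jl.
# Julia indices are 1-based, so index 1 is the "first" element:
# odd indices compute a*b - c, even indices compute a*b + c.
function fmaddsub_ref(a::NTuple{N,T}, b::NTuple{N,T}, c::NTuple{N,T}) where {N,T<:AbstractFloat}
    ntuple(i -> isodd(i) ? fma(a[i], b[i], -c[i]) : fma(a[i], b[i], c[i]), Val(N))
end

r = fmaddsub_ref((1.5, 2.5), (1.5, 2.5), (1.5, 2.5))
println(r)  # (0.75, 8.75)
```

With the same inputs as the REPL session earlier in the thread, a plain `fma` gives `(3.75, 8.75)`, so the first element is the one that tells the two operations apart.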
Ah, I totally missed the underscore I left in, sloppy mistake. Check out the PR again and see how you feel about what I did (I tried to imitate the library's previous syntax). I added Also, it seems like at least my version of LLVM doesn't support AVX512 based intrinsics for these (I know that my CPU is capable of the AVX512 instructions, and I can perform those instructions normally through C), so I excluded those using comments and left open the possibility for adding them back in easily later. |
I added comments. These comments assume that you plan to submit this to SIMD.jl, and include things such as test cases etc. I don't know which LLVM version supports what intrinsics. I find LLVM's documentation in this respect sparse, i.e. I usually have to look at its source code to find out. I also noticed that |
Sure, I've come too far to turn back now. I agree with the problem of finding what version supports which intrinsics, I've had the same issue, but it seems like these were incorporated long enough ago that it shouldn't be a problem for Julia (though I'll check a baseline LLVM version). I can't tell you why |
I'm looking at this and I'm starting to rethink whether this does belong in SIMD.jl. In the original thread I posted I thought it did because this library provides a lot of functionality for the |
Made a comment about this on the PR: #89 (comment). |
So, is there any progress about this? |
Not really, at least for now (I will make no promises). I looked into it and started, but it turned out that VectorizationBase.jl had this (or rather, it was implemented around the time I opened this). I apologize for the dead issue 😅 . |
It appears to be possible to emit these kinds of instructions with generic LLVM without calling specific intrinsic functions. I generally agree with @KristofferC to allow LLVM to generate (though fragile) specialized instructions and to keep this package lean. As mentioned, VectorizationBase.jl seems to handle many different instructions for various architectures, so if users want them they can use that package. Nonetheless, here is a general version of these instructions, with some more details at heltonmc/SIMDMath.jl#4. For example,
julia> @code_native SIMDMath.faddsub(a, b)
julia_faddsub_1812: # @julia_faddsub_1812
; ┌ @ none within `faddsub`
.cfi_startproc
# %bb.0: # %top
; │┌ @ none within `macro expansion`
vaddsubpd %ymm1, %ymm0, %ymm0
retq
This generates the target-specific instruction on my machine, while delivering a fairly optimal instruction on my ARM computer, as shown in the other thread. However, it is not quite combining the multiply with the |
Maybe you need to emit "fast" fp instructions for LLVM to be allowed to change it to a fused operation? |
Yes! I also tried a couple of variations similar to
# a*b - c for i = 1, 3, ...
# a*b + c for i = 0, 2, ...
@inline @generated function fmsubadd(x::LVec{N, T}, y::LVec{N, T}, z::LVec{N, T}) where {N, T <: FloatTypes}
@assert iseven(N) "Vector length must be even"
shfl = join((string("i32 ", Int32(i-1), ", i32 ", Int32(N+i)) for i in 1:2:N), ", ")
s = """
%4 = fmul <$N x $(LLVMType[T])> %0, %1
%5 = fadd fast <$N x $(LLVMType[T])> %4, %2
%6 = fsub fast <$N x $(LLVMType[T])> %4, %2
%7 = shufflevector <$N x $(LLVMType[T])> %5, <$N x $(LLVMType[T])> %6, <$N x i32> <$shfl>
ret <$N x $(LLVMType[T])> %7
"""
return :(
llvmcall($s, LVec{N, T}, Tuple{LVec{N, T}, LVec{N, T}, LVec{N, T}}, x, y, z)
)
end
But they all gave similar instructions (though I think this approach will work for this issue; it just needs to fit into a form LLVM likes...)
julia> @code_native SIMDMath.fmaddsub(a, b, c)
julia_fmaddsub_1848: # @julia_fmaddsub_1848
; ┌ @ none within `fmaddsub`
.cfi_startproc
# %bb.0: # %top
; │┌ @ none within `macro expansion`
vmulpd %ymm1, %ymm0, %ymm0
vaddsubpd %ymm2, %ymm0, %ymm0
retq
|
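One variation that might be worth trying is to put the `contract` fast-math flag on the multiply as well as on the add and subtract, since, as far as I understand, LLVM wants the flag on both operations before it treats the pair as fusable. A minimal sketch, assuming a fixed vector of four `Float64` and a hypothetical name `fmsubadd4` (not part of SIMD.jl or SIMDMath.jl). I have not verified which instructions this produces, so check it with `@code_native` on your own machine.

```julia
# Four Float64 lanes, which Julia passes to LLVM as <4 x double>.
const V4 = NTuple{4, VecElement{Float64}}

# Hypothetical helper: lanes 0 and 2 take a*b + c, lanes 1 and 3 take a*b - c.
# `contract` only permits fusing the multiply into the add/sub; it does not force it.
@inline function fmsubadd4(a::V4, b::V4, c::V4)
    Base.llvmcall("""
        %4 = fmul contract <4 x double> %0, %1
        %5 = fadd contract <4 x double> %4, %2
        %6 = fsub contract <4 x double> %4, %2
        %7 = shufflevector <4 x double> %5, <4 x double> %6, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
        ret <4 x double> %7
        """, V4, Tuple{V4, V4, V4}, a, b, c)
end

# Small wrapper so plain tuples can be used for a quick check.
tovec(t::NTuple{4, Float64}) = map(VecElement, t)
fromvec(v::V4) = map(x -> x.value, v)

r4 = fromvec(fmsubadd4(tovec((1.0, 2.0, 3.0, 4.0)),
                       tovec((2.0, 2.0, 2.0, 2.0)),
                       tovec((1.0, 1.0, 1.0, 1.0))))
println(r4)
```

The inputs are chosen so that every product and sum is exact, which means the result is the same whether or not the backend actually fuses the operations.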
I was talking in the discourse about SIMD complex numbers (I hadn't seen #60 until a minute or two ago) and I'm trying to get a working implementation of Complex using Vec types, at least as a preliminary type. I can shave off a few clock cycles from the multiplication itself if I'm able to access the fmaddsub/fmsubadd intrinsics, but unfortunately, I can't get those working on my computer. I'm not sure what the issue is, but I'd put up a PR if I knew more LLVM. I'm putting my meager attempt at it below
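To make the motivation concrete: with a complex number stored as an interleaved pair `(re, im)`, the product (a + bi)(c + di) = (ac - bd) + (ad + bc)i is one ordinary multiply followed by one fmaddsub. The sketch below spells that out with plain tuples; `fmaddsub2` and `complex_mul` are hypothetical names used for illustration, and a real implementation would use `Vec` shuffles rather than tuple indexing.

```julia
# Hypothetical 2-lane fmaddsub on plain tuples:
# first element is a*b - c, second element is a*b + c.
fmaddsub2(a, b, c) = (fma(a[1], b[1], -c[1]), fma(a[2], b[2], c[2]))

# Complex multiply for x = (a, b) and y = (c, d), each stored as (re, im).
function complex_mul(x::NTuple{2,Float64}, y::NTuple{2,Float64})
    x_re = (x[1], x[1])                      # broadcast a
    x_im = (x[2], x[2])                      # broadcast b
    y_sw = (y[2], y[1])                      # swap to (d, c)
    t = (x_im[1] * y_sw[1], x_im[2] * y_sw[2])  # (b*d, b*c)
    return fmaddsub2(x_re, y, t)             # (a*c - b*d, a*d + b*c)
end

z = complex_mul((1.0, 2.0), (3.0, 4.0))
println(z)  # (-5.0, 10.0)
```

This matches `(1 + 2im) * (3 + 4im) == -5 + 10im` in Base Julia.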