Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functions are not exported: add support for NTuple{N,Core.VecElement{T}} ? #14

Open
chriselrod opened this issue Dec 21, 2018 · 3 comments

Comments

@chriselrod
Copy link

chriselrod commented Dec 21, 2018

The functions aren't exported, because that would be type piracy on the base types Float64 and Float32.

Therefore, why not extend these functions to also work on NTuple{N,Core.VecElement{Float64}} and NTuple{N,Core.VecElement{Float32}}?

Check this out! vexp is a version of exp from this library, just modified so that it works on SIMD vectors:

julia> includet("/home/chriselrod/Documents/progwork/julia/SLEEF_test2.jl")
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.2/SIMDPirates/jEO2E.ji for SIMDPirates [21efa798-c60a-11e8-04d3-e1a92915a26a]

julia> vx = ntuple(Val(8)) do x VE(randn()) end
(VecElement{Float64}(-1.0607753663132373), VecElement{Float64}(0.9882746379097759), VecElement{Float64}(-1.3145487866655596), VecElement{Float64}(-0.6473402975019208), VecElement{Float64}(-2.1551085376634402), VecElement{Float64}(1.3938713996464838), VecElement{Float64}(-0.31154734396155076), VecElement{Float64}(-0.4621337814318952))

julia> vexp(vx)
(VecElement{Float64}(0.34618728428218104), VecElement{Float64}(2.686595121845405), VecElement{Float64}(0.26859548976775793), VecElement{Float64}(0.5234361113361462), VecElement{Float64}(0.11589061143405407), VecElement{Float64}(4.030423267688383), VecElement{Float64}(0.7323129390890033), VecElement{Float64}(0.629938060288801))

julia> using SLEEFwrap
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.2/SLEEFwrap/GD9JX.ji for SLEEFwrap [6e748a96-c50b-11e8-2c8a-ed394f64ea02]

julia> SLEEFwrap.exp(vx)
(VecElement{Float64}(0.34618728428218104), VecElement{Float64}(2.686595121845405), VecElement{Float64}(0.26859548976775793), VecElement{Float64}(0.5234361113361462), VecElement{Float64}(0.11589061143405406), VecElement{Float64}(4.030423267688383), VecElement{Float64}(0.7323129390890034), VecElement{Float64}(0.629938060288801))

julia> using BenchmarkTools

julia> @benchmark SLEEFwrap.exp($vx)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4599907975615880458
  --------------
  minimum time:     8.605 ns (0.00% GC)
  median time:      8.670 ns (0.00% GC)
  mean time:        9.035 ns (0.00% GC)
  maximum time:     32.975 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark vexp($vx)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4599907975615880458
  --------------
  minimum time:     6.529 ns (0.00% GC)
  median time:      6.560 ns (0.00% GC)
  mean time:        6.857 ns (0.00% GC)
  maximum time:     24.878 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> versioninfo()
Julia Version 1.2.0-DEV.27
Commit 046755c71a (2018-12-17 22:41 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Note, the answers do disagree on 2 of the 8 answers. Float64(exp(BigFloat(x))) suggests they were each right on one of those instances, so more testing would be needed to see how accurate they are.

32 bit benchmark:

julia> vx32 = ntuple(Val(16)) do x VE(randn(Float32)) end;

julia> @benchmark vexp($vx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4603241027206639343
  --------------
  minimum time:     5.030 ns (0.00% GC)
  median time:      5.056 ns (0.00% GC)
  mean time:        5.083 ns (0.00% GC)
  maximum time:     21.156 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark SLEEFwrap.exp($vx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4603241027206639343
  --------------
  minimum time:     5.175 ns (0.00% GC)
  median time:      5.220 ns (0.00% GC)
  mean time:        5.232 ns (0.00% GC)
  maximum time:     21.366 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Here is the script:

using SIMDPirates, SLEEF
import SLEEF: unsafe_trunc, L2U, L2L, exp_kernel, ldexp2k, max_exp, min_exp, MLN2E, exponent_bias, split_exponent, integer2float, exp_kernel
using VectorizationBase: extract_data, VE
using Base.Math: significand_bits
using Base.Math: @horner


@inline function exp_kernel(x::Vec{W,Float64}) where W
    @pirate begin
        c11 = 2.08860621107283687536341e-09
        c10 = 2.51112930892876518610661e-08
        c9  = 2.75573911234900471893338e-07
        c8  = 2.75572362911928827629423e-06
        c7  = 2.4801587159235472998791e-05
        c6  = 0.000198412698960509205564975
        c5  = 0.00138888888889774492207962
        c4  = 0.00833333333331652721664984
        c3  = 0.0416666666666665047591422
        c2  = 0.166666666666666851703837
        c1  = 0.50
        @horner x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
    end
end

@inline function exp_kernel(x::Vec{W,Float32}) where W
    @pirate begin
        c6 = 0.000198527617612853646278381f0
        c5 = 0.00139304355252534151077271f0
        c4 = 0.00833336077630519866943359f0
        c3 = 0.0416664853692054748535156f0
        c2 = 0.166666671633720397949219f0
        c1 = 0.5f0
        @horner x c1 c2 c3 c4 c5 c6
    end
end

@inline split_exponent(::Type{Float64}, q::Vec{W,<:Integer}) where W = _split_exponent(q, UInt(9), UInt(31), UInt(2))
@inline split_exponent(::Type{Float32}, q::Vec{W,<:Integer}) where W = _split_exponent(q, UInt(6), UInt(31), UInt(2))

@inline function integer2float(::Type{Float64}, m::Vec{W,<:Integer}) where W
    SIMDPirates.pirate_reinterpret(Vec{W,Float64}, (@pirate (m % Int64)  << significand_bits(Float64)))
end
@inline function integer2float(::Type{Float32}, m::Vec{W,<:Integer}) where W
    SIMDPirates.pirate_reinterpret(Vec{W,Float32}, (@pirate (m % Int32) << significand_bits(Float32)))
end

@inline function ldexpk(x::Vec{W,T}, q::Vec{W,<:Integer}) where {W,T<:Union{Float32,Float64}}
    @pirate begin
        bias = exponent_bias(T)
        emax = exponent_raw_max(T)
        m, q = split_exponent(T, q)
        m += bias
        m = vifelse(m < 0, 0, m)
        m = vifelse(m > emax, emax, m)
        q += bias
        u = integer2float(T, m)
        x = x * u * u * u * u
        u = integer2float(T, q)
        x * u
    end
end
@inline function pow2i(::Type{T}, q::Vec{W,I}) where {W,T<:Union{Float32,Float64},I<:Integer}
    @pirate integer2float(T, q + I(exponent_bias(T)))
end
@inline function SLEEF.ldexp2k(x::Vec{W,T}, e::Vec{W,<:Integer}) where {W,T<:Union{Float32,Float64}}
    @pirate x * pow2i(T, e >> 1) * pow2i(T, e - (e >> 1))
end
integertype(::Type{Float32}) = Int32
integertype(::Type{Float64}) = Int64

function vexp(d::Vec{W,T}) where {W,T<:Union{Float32,Float64}}
    @pirate begin
        q = round(T(MLN2E) * d)

        qi = ntuple(Val(W)) do w
            VE(unsafe_trunc(integertype(T), q[w].value))
        end

        s = muladd(q, -L2U(T), d)
        s = muladd(q, -L2L(T), s)

        u = exp_kernel(s)

        u = s * s * u + s + 1

        u = ldexp2k(u, qi)

        u = vifelse(d > max_exp(T), T(Inf), u)
        u = vifelse(d < min_exp(T), T(0), u)

        return u
    end
end

I only lightly modified functions from this library, and it already beats my library wrapping SLEEF!

I would be happy to make a PR myself.
However, this is heavily dependent on my library SIMDPirates, which is a copy of the better known and registered SIMD.jl.
So, an alternative would be to add support for that library.
However, any function returning structs wrapping one or more V = NTuple{N,Core.VecElement{T}} where sizeof(V) == 64 causes segmentation faults (no issue if they're used inside the function but not returned).

I failed to install SLEEFwrap on someone's Windows computer, because I couldn't figure out how to compile a library (SLEEF) using cmake on Windows. As a long term solution, switching to a pure-Julia solution would be great.

I would be happy to add a PR to add SIMD support, either using SIMD.jl or SIMDPirates.jl.
Otherwise, I'll fork this library and adapt it on my own.

@musm
Copy link
Owner

musm commented Dec 22, 2018

Thank you @chriselrod for the details, I don't have much time since I'm on vacation now. Please feel free to ping me if you don't hear a response soon. I'll try to take a look so we can resolve this issue satisfactorily. Thanks again.

@musm
Copy link
Owner

musm commented Dec 25, 2018

Have you tried to upstream SIMDPirates changes to SIMD ? BTW have you seen https://github.com/KristofferC/SIMDIntrinsics.jl , could that help as well?

I think we should have to incorporate your suggestions into SLEEF.

@chriselrod
Copy link
Author

chriselrod commented Dec 25, 2018

I haven't upstreamed the changes. The major changes are on working with NTuple{N,VecElement{T}}s directly without type pirating, which is a big change. I believe that leads to better code whenever functions are not inlined.
If everything is inlined, the compiler eliminates the structs. Otherwise, it generates a lot of extra move instructions.

But another change I made -- less kosher -- is that SIMDPirates.vmul is lazy, dispatching on muladd if there is an addition. Immediate evaluation can be forced. Just like muladd can be forced via calling muladd but the default is split instructions with normal scalar code; I just switched the default to fusion.

But, my code using SIMDPirates above (I set Random.seed!(1) so the random vectors were identical):

julia> vexp(vx)
(VecElement{Float64}(1.3462029292549422), VecElement{Float64}(1.4657923768059056), VecElement{Float64}(0.5501113994952206), VecElement{Float64}(0.9896091174905127), VecElement{Float64}(0.43213084511420985), VecElement{Float64}(1.364941183228046), VecElement{Float64}(9.92530765261393), VecElement{Float64}(0.10361363469405041))

julia> @benchmark vexp($vx)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4608741578183204540
  --------------
  minimum time:     6.527 ns (0.00% GC)
  median time:      6.549 ns (0.00% GC)
  mean time:        6.572 ns (0.00% GC)
  maximum time:     25.553 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

vs my PR on SLEEF using SIMD:

julia> @inline vexp(x) = SLEEF.exp(x).elts # segfaults aren't fun
vexp (generic function with 1 method)

julia> vexp(vx)
(VecElement{Float64}(1.3462029292549422), VecElement{Float64}(1.4657923768059056), VecElement{Float64}(0.5501113994952206), VecElement{Float64}(0.9896091174905127), VecElement{Float64}(0.43213084511420985), VecElement{Float64}(1.3649411832280463), VecElement{Float64}(9.92530765261393), VecElement{Float64}(0.1036136346940504))

julia> @benchmark vexp($vx)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4608741578183204540
  --------------
  minimum time:     7.595 ns (0.00% GC)
  median time:      7.628 ns (0.00% GC)
  mean time:        7.650 ns (0.00% GC)
  maximum time:     27.178 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

And exp was one of the functions that performed best relative to SLEEFwrap! So it'll take more investigating to figure out what exactly is going on, and how we can improve.

Here is the @code_native from the SIMDPirates version:

.text
movabsq	$140024019057584, %rax  # imm = 0x7F59E1EA7BB0
vmulpd	(%rax){1to8}, %zmm0, %zmm1
vrndscalepd	$4, %zmm1, %zmm2
vcvttpd2qq	%zmm2, %zmm1
movabsq	$140024019057704, %rax  # imm = 0x7F59E1EA7C28
vbroadcastsd	(%rax), %zmm3
movabsq	$140024019057720, %rax  # imm = 0x7F59E1EA7C38
vcmpnltpd	(%rax){1to8}, %zmm0, %k1
movabsq	$140024019057592, %rax  # imm = 0x7F59E1EA7BB8
vcmpnltpd	%zmm0, %zmm3, %k2
vfmadd231pd	(%rax){1to8}, %zmm2, %zmm0
movabsq	$140024019057600, %rax  # imm = 0x7F59E1EA7BC0
vfmadd231pd	(%rax){1to8}, %zmm2, %zmm0
movabsq	$140024019057608, %rax  # imm = 0x7F59E1EA7BC8
vbroadcastsd	(%rax), %zmm2
movabsq	$140024019057616, %rax  # imm = 0x7F59E1EA7BD0
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057624, %rax  # imm = 0x7F59E1EA7BD8
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057632, %rax  # imm = 0x7F59E1EA7BE0
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057640, %rax  # imm = 0x7F59E1EA7BE8
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057648, %rax  # imm = 0x7F59E1EA7BF0
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057656, %rax  # imm = 0x7F59E1EA7BF8
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057664, %rax  # imm = 0x7F59E1EA7C00
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057672, %rax  # imm = 0x7F59E1EA7C08
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057680, %rax  # imm = 0x7F59E1EA7C10
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
movabsq	$140024019057688, %rax  # imm = 0x7F59E1EA7C18
vfmadd213pd	(%rax){1to8}, %zmm0, %zmm2
vmulpd	%zmm2, %zmm0, %zmm2
movabsq	$140024019057696, %rax  # imm = 0x7F59E1EA7C20
vaddpd	(%rax){1to8}, %zmm0, %zmm3
vfmadd231pd	%zmm2, %zmm0, %zmm3
vpsrlq	$1, %zmm1, %zmm0
vpsllq	$52, %zmm0, %zmm2
vpbroadcastq	(%rax), %zmm4
vpaddq	%zmm4, %zmm2, %zmm2
vpsubq	%zmm0, %zmm1, %zmm0
vpsllq	$52, %zmm0, %zmm0
vpaddq	%zmm4, %zmm0, %zmm0
vmulpd	%zmm0, %zmm2, %zmm0
movabsq	$140024019057712, %rax  # imm = 0x7F59E1EA7C30
vbroadcastsd	(%rax), %zmm1
vmulpd	%zmm0, %zmm3, %zmm1 {%k2}
vmovapd	%zmm1, %zmm0 {%k1} {z}
retq
nopw	%cs:(%rax,%rax)

and from SIMD:

.text
vmovupd	(%rsi), %zmm1
movabsq	$139644718079952, %rax  # imm = 0x7F0191D0DFD0
vmulpd	(%rax){1to8}, %zmm1, %zmm0
vrndscalepd	$4, %zmm0, %zmm2
vcvttpd2qq	%zmm2, %zmm0
movabsq	$139644718080072, %rax  # imm = 0x7F0191D0E048
vbroadcastsd	(%rax), %zmm3
movabsq	$139644718080088, %rax  # imm = 0x7F0191D0E058
vcmpnltpd	(%rax){1to8}, %zmm1, %k1
movabsq	$139644718079960, %rax  # imm = 0x7F0191D0DFD8
vcmpnltpd	%zmm1, %zmm3, %k2
vfmadd231pd	(%rax){1to8}, %zmm2, %zmm1
movabsq	$139644718079968, %rax  # imm = 0x7F0191D0DFE0
vfmadd231pd	(%rax){1to8}, %zmm2, %zmm1
movabsq	$139644718079976, %rax  # imm = 0x7F0191D0DFE8
vbroadcastsd	(%rax), %zmm2
movabsq	$139644718079984, %rax  # imm = 0x7F0191D0DFF0
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718079992, %rax  # imm = 0x7F0191D0DFF8
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080000, %rax  # imm = 0x7F0191D0E000
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080008, %rax  # imm = 0x7F0191D0E008
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080016, %rax  # imm = 0x7F0191D0E010
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080024, %rax  # imm = 0x7F0191D0E018
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080032, %rax  # imm = 0x7F0191D0E020
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080040, %rax  # imm = 0x7F0191D0E028
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080048, %rax  # imm = 0x7F0191D0E030
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
movabsq	$139644718080056, %rax  # imm = 0x7F0191D0E038
vfmadd213pd	(%rax){1to8}, %zmm1, %zmm2
vmulpd	%zmm1, %zmm1, %zmm3
vmulpd	%zmm2, %zmm3, %zmm2
vaddpd	%zmm2, %zmm1, %zmm1
movabsq	$139644718080064, %rax  # imm = 0x7F0191D0E040
vaddpd	(%rax){1to8}, %zmm1, %zmm1
vpsrlq	$1, %zmm0, %zmm2
vpsllq	$52, %zmm2, %zmm3
vpbroadcastq	(%rax), %zmm4
vpaddq	%zmm4, %zmm3, %zmm3
vmulpd	%zmm3, %zmm1, %zmm1
vpsubq	%zmm2, %zmm0, %zmm0
vpsllq	$52, %zmm0, %zmm0
vpaddq	%zmm4, %zmm0, %zmm0
movabsq	$139644718080080, %rax  # imm = 0x7F0191D0E050
vbroadcastsd	(%rax), %zmm2
vmulpd	%zmm0, %zmm1, %zmm2 {%k2}
vmovapd	%zmm2, %zmm0 {%k1} {z}
vmovapd	%zmm0, (%rdi)
movq	%rdi, %rax
vzeroupper
retq
nopl	(%rax)

Here I do see that SIMDPirates fused one vmul and vadd into a vfma.
It also has 1 less each of vmovupd, vmovapd, and movq. SIMDPirates produces a single vmovapd at the end, while SIMD produces 2 of them, and one each of vmovupd and movq.

However, when I add @inline to SLEEF.exp:

julia> @noinline vexp(x) = SLEEF.exp(x).elts # segfaults still aren't fun
vexp (generic function with 1 method)

julia> @benchmark vexp($vx)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4608741578183204540
  --------------
  minimum time:     6.606 ns (0.00% GC)
  median time:      6.637 ns (0.00% GC)
  mean time:        6.656 ns (0.00% GC)
  maximum time:     23.897 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Which is much closer to the SIMDPirates version. The @code_native also shows the movq and second vmovapd are now gone. But because we passed in a Vec instead of an NTuple{N,VE{T}}, we still have the vmovupd at the start of the function.

I wonder how much of the time difference is just function call overhead. I added @inline vs @noinline on the segfault-preventing-wrapper to the SLEEF.exp calls to try and make sure there was only a single non-inlined call per benchmark.

When it comes to the functions like log, the trig stuff, etc, I'll have to take a look at the other functions to see what's going on. Brief glances made it look like they vectorized completely.
I wonder how much time is taken on all the vifelse calls. Obviously something went wrong with them.

EDIT:
Something cool is that we can pick wider vectors to reuse some of the movabsqs, etc:

julia> vx16 = ntuple(Val(16)) do x VE(randn()) end 

julia> @benchmark vexp($vx16) # SIMDPirates version
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4610329866701751455
  --------------
  minimum time:     10.090 ns (0.00% GC)
  median time:      10.206 ns (0.00% GC)
  mean time:        10.245 ns (0.00% GC)
  maximum time:     39.443 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

Less than 0.65 nanoseconds per double precision exp calculation on a single core -- that's pretty cool!
(On a computer with avx512).

EDIT:
I'll have to look at SIMDIntrinsics.jl.
It doesn't support masked instructions yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants