Added support for SIMD.jl; WIP #15

Open
wants to merge 7 commits into
base: master

@chriselrod chriselrod commented Dec 25, 2018

  • Add tests for Vec{N,T} where T <: FloatTypes.
  • Make sure all of these tests also pass.
  • Investigate performance regressions vs the SLEEF C library.

Overview of this PR:
The C library SLEEF (SIMD Library for Evaluating Elementary Functions) provides vectorized elementary functions, so it makes sense for SLEEF.jl to support SIMD.jl's Vec{N,T} vector type.

This PR provides preliminary support.

using SIMD, SLEEF, SLEEFwrap, BenchmarkTools, Random
@inline extract(x) = x.elts # 64-byte vectors segfault when returned while wrapped in a struct
sv8 = Vec{8,Float32}(ntuple(Val(8)) do x Core.VecElement(randexp(Float32)) end)
dv4 = Vec{4,Float64}(ntuple(Val(4)) do x Core.VecElement(randexp(Float64)) end)
sv16 = Vec{16,Float32}(ntuple(Val(16)) do x Core.VecElement(randexp(Float32)) end)
dv8 = Vec{8,Float64}(ntuple(Val(8)) do x Core.VecElement(randexp(Float64)) end)
function bench(jl, c, x)
    display(@benchmark extract($jl($x)))
    display(@benchmark $c(extract($x)))
end

Testing a bunch of functions:
exp:

julia> bench(SLEEF.exp, SLEEFwrap.exp, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.545 ns (0.00% GC)
  median time:      5.686 ns (0.00% GC)
  mean time:        5.816 ns (0.00% GC)
  maximum time:     23.974 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.689 ns (0.00% GC)
  median time:      4.722 ns (0.00% GC)
  mean time:        4.740 ns (0.00% GC)
  maximum time:     23.272 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> bench(SLEEF.exp, SLEEFwrap.exp, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.408 ns (0.00% GC)
  median time:      7.449 ns (0.00% GC)
  mean time:        7.467 ns (0.00% GC)
  maximum time:     24.513 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.615 ns (0.00% GC)
  median time:      6.722 ns (0.00% GC)
  mean time:        6.737 ns (0.00% GC)
  maximum time:     20.488 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> bench(SLEEF.exp, SLEEFwrap.exp, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4677168108795565027
  --------------
  minimum time:     5.691 ns (0.00% GC)
  median time:      5.731 ns (0.00% GC)
  mean time:        5.779 ns (0.00% GC)
  maximum time:     22.034 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4677168108795565027
  --------------
  minimum time:     5.256 ns (0.00% GC)
  median time:      5.287 ns (0.00% GC)
  mean time:        5.297 ns (0.00% GC)
  maximum time:     14.432 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> bench(SLEEF.exp, SLEEFwrap.exp, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4613273474792594525
  --------------
  minimum time:     7.284 ns (0.00% GC)
  median time:      7.321 ns (0.00% GC)
  mean time:        7.336 ns (0.00% GC)
  maximum time:     25.833 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4613273474792594525
  --------------
  minimum time:     11.036 ns (0.00% GC)
  median time:      11.553 ns (0.00% GC)
  mean time:        11.370 ns (0.00% GC)
  maximum time:     38.117 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

log

julia> bench(SLEEF.log, SLEEFwrap.log, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.225 ns (0.00% GC)
  median time:      15.276 ns (0.00% GC)
  mean time:        15.310 ns (0.00% GC)
  maximum time:     31.264 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.967 ns (0.00% GC)
  median time:      10.042 ns (0.00% GC)
  mean time:        10.065 ns (0.00% GC)
  maximum time:     32.280 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.log, SLEEFwrap.log, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.762 ns (0.00% GC)
  median time:      16.993 ns (0.00% GC)
  mean time:        16.964 ns (0.00% GC)
  maximum time:     30.792 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.829 ns (0.00% GC)
  median time:      12.873 ns (0.00% GC)
  mean time:        12.897 ns (0.00% GC)
  maximum time:     27.613 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.log, SLEEFwrap.log, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4552958378306737260
  --------------
  minimum time:     16.331 ns (0.00% GC)
  median time:      16.536 ns (0.00% GC)
  mean time:        16.543 ns (0.00% GC)
  maximum time:     42.043 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4552958378306737260
  --------------
  minimum time:     8.060 ns (0.00% GC)
  median time:      8.115 ns (0.00% GC)
  mean time:        8.130 ns (0.00% GC)
  maximum time:     31.205 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.log, SLEEFwrap.log, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  -4651049139759164439
  --------------
  minimum time:     18.395 ns (0.00% GC)
  median time:      18.477 ns (0.00% GC)
  mean time:        18.613 ns (0.00% GC)
  maximum time:     45.013 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  -4651049139759164439
  --------------
  minimum time:     11.021 ns (0.00% GC)
  median time:      11.084 ns (0.00% GC)
  mean time:        11.114 ns (0.00% GC)
  maximum time:     35.427 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

sin

julia> bench(SLEEF.sin, SLEEFwrap.sin, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.354 ns (0.00% GC)
  median time:      19.471 ns (0.00% GC)
  mean time:        19.612 ns (0.00% GC)
  maximum time:     37.226 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.906 ns (0.00% GC)
  median time:      9.953 ns (0.00% GC)
  mean time:        9.972 ns (0.00% GC)
  maximum time:     21.988 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.sin, SLEEFwrap.sin, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.163 ns (0.00% GC)
  median time:      28.265 ns (0.00% GC)
  mean time:        28.329 ns (0.00% GC)
  maximum time:     52.633 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.484 ns (0.00% GC)
  median time:      10.541 ns (0.00% GC)
  mean time:        10.568 ns (0.00% GC)
  maximum time:     27.162 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.sin, SLEEFwrap.sin, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4569599948461514222
  --------------
  minimum time:     20.364 ns (0.00% GC)
  median time:      20.458 ns (0.00% GC)
  mean time:        20.502 ns (0.00% GC)
  maximum time:     47.938 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4569599948461514222
  --------------
  minimum time:     10.426 ns (0.00% GC)
  median time:      10.565 ns (0.00% GC)
  mean time:        10.587 ns (0.00% GC)
  maximum time:     33.371 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> bench(SLEEF.sin, SLEEFwrap.sin, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4605730538145129761
  --------------
  minimum time:     28.796 ns (0.00% GC)
  median time:      28.919 ns (0.00% GC)
  mean time:        29.123 ns (0.00% GC)
  maximum time:     55.898 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4605730538145129761
  --------------
  minimum time:     11.913 ns (0.00% GC)
  median time:      12.026 ns (0.00% GC)
  mean time:        12.050 ns (0.00% GC)
  maximum time:     33.233 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

tan

julia> bench(SLEEF.tan, SLEEFwrap.tan, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.797 ns (0.00% GC)
  median time:      36.895 ns (0.00% GC)
  mean time:        36.988 ns (0.00% GC)
  maximum time:     58.675 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.273 ns (0.00% GC)
  median time:      16.346 ns (0.00% GC)
  mean time:        16.381 ns (0.00% GC)
  maximum time:     34.868 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> bench(SLEEF.tan, SLEEFwrap.tan, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.512 ns (0.00% GC)
  median time:      51.640 ns (0.00% GC)
  mean time:        52.010 ns (0.00% GC)
  maximum time:     73.956 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     986
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.053 ns (0.00% GC)
  median time:      14.161 ns (0.00% GC)
  mean time:        14.179 ns (0.00% GC)
  maximum time:     31.734 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> bench(SLEEF.tan, SLEEFwrap.tan, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  -4606600161933539213
  --------------
  minimum time:     38.064 ns (0.00% GC)
  median time:      38.202 ns (0.00% GC)
  mean time:        38.285 ns (0.00% GC)
  maximum time:     62.710 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  -4606600161933539213
  --------------
  minimum time:     18.630 ns (0.00% GC)
  median time:      18.712 ns (0.00% GC)
  mean time:        18.756 ns (0.00% GC)
  maximum time:     44.121 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> bench(SLEEF.tan, SLEEFwrap.tan, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4609617611958208877
  --------------
  minimum time:     55.713 ns (0.00% GC)
  median time:      55.881 ns (0.00% GC)
  mean time:        56.035 ns (0.00% GC)
  maximum time:     78.817 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     984
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4609617611958208877
  --------------
  minimum time:     17.800 ns (0.00% GC)
  median time:      17.898 ns (0.00% GC)
  mean time:        18.053 ns (0.00% GC)
  maximum time:     42.916 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

cbrt

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, sv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.845 ns (0.00% GC)
  median time:      32.018 ns (0.00% GC)
  mean time:        32.143 ns (0.00% GC)
  maximum time:     54.500 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.222 ns (0.00% GC)
  median time:      26.324 ns (0.00% GC)
  mean time:        26.364 ns (0.00% GC)
  maximum time:     43.927 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, dv4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.175 ns (0.00% GC)
  median time:      36.303 ns (0.00% GC)
  mean time:        36.564 ns (0.00% GC)
  maximum time:     57.701 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.349 ns (0.00% GC)
  median time:      29.205 ns (0.00% GC)
  mean time:        29.250 ns (0.00% GC)
  maximum time:     46.513 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, sv16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4584898609104978811
  --------------
  minimum time:     34.463 ns (0.00% GC)
  median time:      34.570 ns (0.00% GC)
  mean time:        34.634 ns (0.00% GC)
  maximum time:     58.556 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4584898609104978811
  --------------
  minimum time:     23.273 ns (0.00% GC)
  median time:      25.731 ns (0.00% GC)
  mean time:        25.492 ns (0.00% GC)
  maximum time:     50.618 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> bench(SLEEF.cbrt, SLEEFwrap.cbrt, dv8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4607167657796590655
  --------------
  minimum time:     42.291 ns (0.00% GC)
  median time:      42.392 ns (0.00% GC)
  mean time:        42.476 ns (0.00% GC)
  maximum time:     65.205 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4607167657796590655
  --------------
  minimum time:     26.524 ns (0.00% GC)
  median time:      26.741 ns (0.00% GC)
  mean time:        26.800 ns (0.00% GC)
  maximum time:     46.431 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

Performance is currently often 2 to 3x slower than SLEEFwrap.jl (which wraps the C library).
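
As a rough check of that figure, the minimum times reported above can be turned into Julia/C ratios. The sketch below hard-codes a handful of those timings; the names `min_times` and `ratio` are illustrative only and are not part of either package.

```julia
# Minimum times in ns, copied from the benchmark output above:
# (SLEEF.jl time, SLEEFwrap.jl time) per function and vector variable.
min_times = Dict(
    ("exp", "sv8")  => (5.545, 4.689),
    ("log", "sv8")  => (15.225, 9.967),
    ("sin", "sv8")  => (19.354, 9.906),
    ("sin", "dv4")  => (28.163, 10.484),
    ("tan", "dv4")  => (51.512, 14.053),
    ("cbrt", "sv8") => (31.845, 25.222),
)

# A ratio above 1 means the pure-Julia version is slower than the C wrapper.
ratio(t) = t[1] / t[2]

for (key, t) in sort(collect(min_times); by = first)
    println(rpad(join(key, " "), 10), round(ratio(t); digits = 2))
end
```

On these samples exp and cbrt come out within about 1.3x of the C wrapper, log at about 1.5x, and sin and tan between roughly 2x and 3.7x.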

@coveralls
Coverage increased (+36.7%) to 65.182% when pulling 8b83a5a on chriselrod:master into b089af5 on musm:master.

coveralls commented Dec 25, 2018

Coverage increased (+36.6%) to 65.074% when pulling e57ed3c on chriselrod:master into b089af5 on musm:master.

Owner

musm commented Dec 25, 2018

Awesome progress. Can you please remove the Manifest file?

src/log.jl Outdated
(d < 0 || isnan(d)) && (x = T(NaN))
d == 0 && (x = -T(Inf))

x = muladd(x, t, T(MLN2) * e)
Owner

I'm not sure you can safely replace the previous code with a muladd. I recall that this replacement actually makes it less accurate, so it misses the ulp requirements.

Author

@chriselrod chriselrod Dec 25, 2018

I went through and reverted all of the muladds I've added.
I had avoided adding any in the Doubles code, figuring it was necessary there. But now I'll avoid touching anything unless someone confirms (or I learn enough) to say it's okay.

Owner

Yes, let's keep the PR as minimal as possible. We can open further PRs if we find that the accuracy is unaffected, although I'm pretty sure it is affected, since I recall trying this.
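
To see why such a replacement is not automatically safe: `muladd(a, b, c)` is allowed to compile to a fused multiply-add, which rounds once instead of twice, so its result can differ from `a * b + c`. A minimal sketch, using `fma` to force the fused form (the input value is chosen for illustration and does not come from SLEEF.jl):

```julia
x = 1.0 + ldexp(1.0, -30)    # 1 + 2^-30, exactly representable in Float64

separate = x * x - 1.0       # the product is rounded to Float64, then 1 is subtracted
fused    = fma(x, x, -1.0)   # x*x - 1 is computed exactly and rounded once

println(separate == fused)   # the two forms disagree for this input
println(fused - separate)    # the low-order term that the separate form dropped
```

In this case the fused form happens to be the more exact one, but code that depends on a specific sequence of roundings can lose accuracy when the compiler is free to pick either form, which is presumably the concern raised in this thread.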

invy = 1 / y.hi
zhi = x.hi * invy
Double(zhi, (fma(-zhi, y.hi, x.hi) + fma(-zhi, y.lo, x.lo)) * invy)
end

-@inline function ddiv(x::T, y::T) where {T<:IEEEFloat}
+@inline function ddiv(x::vIEEEFloat, y::vIEEEFloat)
Owner

So I'm guessing these changes in the type signature are required to operate on the vector version?
I'm sure you are aware that the change is not equivalent to the version on master.
I just want to confirm.

Author

The old version forced x and y to be of the same type. The current version does not.
The reason for that change is that often one of x or y will be a scalar, and the other a Vec.
If you would like type checking to enforce that you don't mix Float32 and Float64s, we could:

function foo(x::vIEEEFloat, y::vIEEEFloat)
    @assert eltype(x) == eltype(y)
    ...

or

function foo(x::T1, y::T2) where {T <: IEEEFloat, T1 <: Union{T,Vec{<:Any,T}}, T2 <: Union{T,Vec{<:Any,T}}}
    ...

I haven't tested the second option, but I think something like it should work.
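
The first option can be sketched without SIMD.jl installed. In the block below, `FakeVec` is a hypothetical stand-in for `Vec{N,T}`, and `vFloat` and `checked_eltype` are made-up names playing the role of `vIEEEFloat` and `foo`; the second option is left untested here as well.

```julia
# Hypothetical stand-in for SIMD.jl's Vec{N,T}; not the real type.
struct FakeVec{N,T}
    elts::NTuple{N,T}
end
Base.eltype(::Type{FakeVec{N,T}}) where {N,T} = T

# Assumed analogue of the PR's vIEEEFloat: a scalar float or a vector of floats.
const vFloat = Union{Float32,Float64,FakeVec{<:Any,Float32},FakeVec{<:Any,Float64}}

# Scalar/vector mixes are accepted, but both arguments must share an element type.
function checked_eltype(x::vFloat, y::vFloat)
    @assert eltype(x) == eltype(y)
    return eltype(x)
end

v = FakeVec((1f0, 2f0, 3f0, 4f0))
println(checked_eltype(2f0, v))   # scalar Float32 with a Float32 vector is accepted
```

Calling `checked_eltype(2.0, v)` instead trips the assertion, because a Float64 scalar is mixed with a Float32 vector.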

src/exp.jl Outdated
@@ -26,48 +26,49 @@ const min_exp2(::Type{Float32}) = -150f0
c3 = 0.5550410866482046596e-1
c2 = 0.2402265069591012214
c1 = 0.6931471805599452862
-return @horner x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
+@horner x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
Owner

Can you please leave in the explicit return? Thanks.

Author

Added explicit returns.

src/hyp.jl Outdated
@@ -48,16 +50,17 @@ over_th(::Type{Float32}) = 18.714973875f0

Compute hyperbolic tangent of `x`.
"""
-function tanh(x::T) where {T<:Union{Float32,Float64}}
+function tanh(x::V) where V <: FloatType
Owner

Stylistically I'd prefer we left this as where {V <: FloatType}

https://github.com/jrevels/YASGuide

Type variable bindings should always be enclosed within {} brackets when using where syntax, e.g. Vector{Vector{T} where T} is good, Vector{Vector{T}} where {T} is good, Vector{Vector{T}} where T is bad.

The return keyword should always be omitted from return statements within short-form method definitions (f(...) = ...). The return keyword should never be omitted from return statements within any other context (function ... end, macro ... end, etc.).
If a function does not have a clearly appropriate return value, then explicitly return nothing.
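
A small sketch of those three rules applied together; the function names are made up for illustration and do not come from SLEEF.jl:

```julia
# Short-form definition: braces around the type variable, no `return` keyword.
double(x::T) where {T<:Real} = 2x

# Long-form definition: braces around the type variable, explicit `return`.
function halve(x::T) where {T<:AbstractFloat}
    return x / 2
end

# No meaningful return value: explicitly return `nothing`.
function log_call(io::IO, msg::AbstractString)
    println(io, msg)
    return nothing
end

println(double(3), " ", halve(1.0))
```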

Author

Got it. Added brackets.

src/SLEEF.jl Outdated

EquivalentInteger(::Type{Float64}) = Int == Int32 ? Int32 : Int64
EquivalentInteger(::Type{Float32}) = Int32
EquivalentInteger(::Type{Vec{N,Float64}}) where N = Int == Int32 ? Vec{N,Int32} : Vec{N,Int64}
Owner

Please put braces around the where clause. Otherwise this is really hard to comprehend :)

Author

I also converted these functions to long form to make them even easier to read.
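
For illustration, a long-form version with braced where clauses might look roughly like the sketch below. It is not the code from the PR: `FakeVec` is a hypothetical stand-in for SIMD.jl's `Vec{N,T}`, and `equivalent_integer` is a made-up lowercase name.

```julia
# Hypothetical stand-in for SIMD.jl's Vec{N,T}; not the real type.
struct FakeVec{N,T}
    elts::NTuple{N,T}
end

function equivalent_integer(::Type{Float64})
    # Int is either Int32 or Int64, so this expression always evaluates to Int.
    return Int == Int32 ? Int32 : Int64
end

function equivalent_integer(::Type{Float32})
    return Int32
end

# Vector types map to a vector of the matching integer type with the same width N.
function equivalent_integer(::Type{FakeVec{N,T}}) where {N,T}
    return FakeVec{N,equivalent_integer(T)}
end

println(equivalent_integer(FakeVec{8,Float32}))
```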

src/SLEEF.jl Outdated
const IntegerType32 = Union{Int32,Vec{<:Any,Int32}}
const IntegerType = Union{IntegerType64,IntegerType32}

EquivalentInteger(::Type{Float64}) = Int == Int32 ? Int32 : Int64
Owner

I think these should all just return Int.

Can we also rename this to fpinttype? The function is quite similar to https://github.com/JuliaLang/julia/blob/master/base/atomics.jl#L331, except that ours would always return the machine word size.

If you look at the previous code, we always use Int, even for code that operates on Float32, because using Int32 for Float32 inputs on a 64-bit machine can crush the range of the calculations for the trig functions (I'm pretty sure this is true, if I recall correctly).

Author

@chriselrod chriselrod Dec 26, 2018

I changed the name from EquivalentInteger to fpinttype.

If Int is Int64 and you're doing vector operations on Float32, the resulting integer vectors will have twice as many bytes, taking up twice the register space.

This can have a significant impact on runtime. In the case of exp, it can add 1-2 nanoseconds.
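
The size argument can be checked with plain tuples, which is how SIMD.jl vectors are stored underneath. A minimal sketch, assuming 512-bit (64-byte) registers as in the listings that follow; the variable names are illustrative:

```julia
lanes = 16
register_bytes = 64   # one 512-bit zmm register

float_bytes = sizeof(NTuple{lanes,Float32})
int32_bytes = sizeof(NTuple{lanes,Int32})
int64_bytes = sizeof(NTuple{lanes,Int64})

println("16 x Float32: ", float_bytes ÷ register_bytes, " register(s)")
println("16 x Int32:   ", int32_bytes ÷ register_bytes, " register(s)")
println("16 x Int64:   ", int64_bytes ÷ register_bytes, " register(s)")
```

Sixteen Int64 values need two full registers, so every integer operation on them has to be issued twice.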

With 64 bit integers:

.text
vmovups	(%rsi), %zmm2
movabsq	$139690975831012, %rax  # imm = 0x7F0C56FE23E4
vmulps	(%rax){1to16}, %zmm2, %zmm0
vrndscaleps	$4, %zmm0, %zmm3
vextractf64x4	$1, %zmm3, %ymm0
vcvttps2qq	%ymm0, %zmm0
vcvttps2qq	%ymm3, %zmm1
movabsq	$139690975831052, %rax  # imm = 0x7F0C56FE240C
vbroadcastss	(%rax), %zmm4
movabsq	$139690975831060, %rax  # imm = 0x7F0C56FE2414
vcmpnltps	(%rax){1to16}, %zmm2, %k1
movabsq	$139690975831016, %rax  # imm = 0x7F0C56FE23E8
vcmpnltps	%zmm2, %zmm4, %k2
vfmadd231ps	(%rax){1to16}, %zmm3, %zmm2
movabsq	$139690975831020, %rax  # imm = 0x7F0C56FE23EC
vfmadd231ps	(%rax){1to16}, %zmm3, %zmm2
movabsq	$139690975831024, %rax  # imm = 0x7F0C56FE23F0
vbroadcastss	(%rax), %zmm3
movabsq	$139690975831028, %rax  # imm = 0x7F0C56FE23F4
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831032, %rax  # imm = 0x7F0C56FE23F8
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831036, %rax  # imm = 0x7F0C56FE23FC
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831040, %rax  # imm = 0x7F0C56FE2400
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
movabsq	$139690975831044, %rax  # imm = 0x7F0C56FE2404
vfmadd213ps	(%rax){1to16}, %zmm2, %zmm3
vmulps	%zmm2, %zmm2, %zmm4
vmulps	%zmm3, %zmm4, %zmm3
vaddps	%zmm3, %zmm2, %zmm2
movabsq	$139690975831048, %rax  # imm = 0x7F0C56FE2408
vaddps	(%rax){1to16}, %zmm2, %zmm2
vpsraq	$1, %zmm1, %zmm3
vpsraq	$1, %zmm0, %zmm4
movabsq	$139690975831064, %rax  # imm = 0x7F0C56FE2418
vpbroadcastq	(%rax), %zmm5
vpaddq	%zmm5, %zmm4, %zmm6
vpaddq	%zmm5, %zmm3, %zmm7
movabsq	$139690975831104, %rax  # imm = 0x7F0C56FE2440
vmovdqa32	(%rax), %zmm8
vpermt2d	%zmm6, %zmm8, %zmm7
vpslld	$23, %zmm7, %zmm6
vmulps	%zmm6, %zmm2, %zmm2
vpsubq	%zmm3, %zmm1, %zmm1
vpsubq	%zmm4, %zmm0, %zmm0
vpaddq	%zmm5, %zmm0, %zmm0
vpaddq	%zmm5, %zmm1, %zmm1
vpermt2d	%zmm0, %zmm8, %zmm1
vpslld	$23, %zmm1, %zmm0
movabsq	$139690975831056, %rax  # imm = 0x7F0C56FE2410
vbroadcastss	(%rax), %zmm1
vmulps	%zmm0, %zmm2, %zmm1 {%k2}
vmovaps	%zmm1, %zmm0 {%k1} {z}
vmovaps	%zmm0, (%rdi)
movq	%rdi, %rax
vzeroupper
retq
nopw	%cs:(%rax,%rax)

This is 60 lines. With 32 bit integers, we have 49 lines:

.text
vmovups	(%rsi), %zmm0
movabsq	$139690975841852, %rax  # imm = 0x7F0C56FE4E3C
vmulps	(%rax){1to16}, %zmm0, %zmm1
vrndscaleps	$4, %zmm1, %zmm1
vcvttps2dq	%zmm1, %zmm2
movabsq	$139690975841892, %rax  # imm = 0x7F0C56FE4E64
vbroadcastss	(%rax), %zmm3
movabsq	$139690975841900, %rax  # imm = 0x7F0C56FE4E6C
vcmpnltps	(%rax){1to16}, %zmm0, %k1
movabsq	$139690975841856, %rax  # imm = 0x7F0C56FE4E40
vcmpnltps	%zmm0, %zmm3, %k2
vfmadd231ps	(%rax){1to16}, %zmm1, %zmm0
movabsq	$139690975841860, %rax  # imm = 0x7F0C56FE4E44
vfmadd231ps	(%rax){1to16}, %zmm1, %zmm0
movabsq	$139690975841864, %rax  # imm = 0x7F0C56FE4E48
vbroadcastss	(%rax), %zmm1
movabsq	$139690975841868, %rax  # imm = 0x7F0C56FE4E4C
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841872, %rax  # imm = 0x7F0C56FE4E50
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841876, %rax  # imm = 0x7F0C56FE4E54
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841880, %rax  # imm = 0x7F0C56FE4E58
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
movabsq	$139690975841884, %rax  # imm = 0x7F0C56FE4E5C
vfmadd213ps	(%rax){1to16}, %zmm0, %zmm1
vmulps	%zmm0, %zmm0, %zmm3
vmulps	%zmm1, %zmm3, %zmm1
vaddps	%zmm1, %zmm0, %zmm0
movabsq	$139690975841888, %rax  # imm = 0x7F0C56FE4E60
vaddps	(%rax){1to16}, %zmm0, %zmm0
vpsrld	$1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm3
vpbroadcastd	(%rax), %zmm4
vpaddd	%zmm4, %zmm3, %zmm3
vmulps	%zmm3, %zmm0, %zmm0
vpsubd	%zmm1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm1
vpaddd	%zmm4, %zmm1, %zmm1
movabsq	$139690975841896, %rax  # imm = 0x7F0C56FE4E68
vbroadcastss	(%rax), %zmm2
vmulps	%zmm1, %zmm0, %zmm2 {%k2}
vmovaps	%zmm2, %zmm0 {%k1} {z}
vmovaps	%zmm0, (%rdi)
movq	%rdi, %rax
vzeroupper
retq
nopl	(%rax)

Here are the parts related to this, in the 64 bit int version:

vrndscaleps	$4, %zmm0, %zmm3
vextractf64x4	$1, %zmm3, %ymm0
vcvttps2qq	%ymm0, %zmm0
vcvttps2qq	%ymm3, %zmm1
...
vpsraq	$1, %zmm1, %zmm3
vpsraq	$1, %zmm0, %zmm4
movabsq	$139690975831064, %rax  # imm = 0x7F0C56FE2418
vpbroadcastq	(%rax), %zmm5
vpaddq	%zmm5, %zmm4, %zmm6
vpaddq	%zmm5, %zmm3, %zmm7
movabsq	$139690975831104, %rax  # imm = 0x7F0C56FE2440
vmovdqa32	(%rax), %zmm8
vpermt2d	%zmm6, %zmm8, %zmm7
vpslld	$23, %zmm7, %zmm6
vmulps	%zmm6, %zmm2, %zmm2
vpsubq	%zmm3, %zmm1, %zmm1
vpsubq	%zmm4, %zmm0, %zmm0
vpaddq	%zmm5, %zmm0, %zmm0
vpaddq	%zmm5, %zmm1, %zmm1
vpermt2d	%zmm0, %zmm8, %zmm1
vpslld	$23, %zmm1, %zmm0

32 bit int:

vrndscaleps	$4, %zmm1, %zmm1
vcvttps2dq	%zmm1, %zmm2
...
vpsrld	$1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm3
vpbroadcastd	(%rax), %zmm4
vpaddd	%zmm4, %zmm3, %zmm3
vmulps	%zmm3, %zmm0, %zmm0
vpsubd	%zmm1, %zmm2, %zmm1
vpslld	$23, %zmm1, %zmm1

The 64-bit version allocates and operates on two registers instead of one.

However, you are correct:

julia> using SLEEF, SIMD

julia> x = Vec{16,Float32}((1f3,1f5,1f7,1f9,1f11,1f13,1f15,1f17,1f19,1f21,1f23,1f25,1f27,1f29,1f31,1f33))
<16 x Float32>[1000.0, 100000.0, 1.0e7, 1.0e9, 1.0e11, 1.0e13, 1.0e15, 1.0e17, 1.0e19, 1.0e21, 1.0e23, 1.0e25, 1.0e27, 1.0e29, 1.0e31, 1.0e33]

julia> SLEEF.sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.13669702, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]

julia> using SLEEFwrap

julia> Vec{16,Float32}(SLEEFwrap.sin(x.elts)) # lets print pretty
<16 x Float32>[0.82687956, 0.0357488, 0.42054778, 0.5458434, 0.99810874, 0.96887577, 0.9944343, -0.5699717, 0.5780979, 0.7704365, -0.925232, -0.40585858, -0.97865087, 0.8592228, -0.039693512, 0.33392745]

julia> sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.42054778, 0.5458434, 0.99810874, 0.96887577, 0.9944343, -0.5699717, 0.5780979, 0.7704365, -0.925232, -0.40585858, -0.97865087, 0.8592228, -0.039693512, 0.33392745]

julia> using BenchmarkTools

julia> @benchmark SLEEFwrap.sin($x.elts)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4400699998796492385
  --------------
  minimum time:     77.960 ns (0.00% GC)
  median time:      79.492 ns (0.00% GC)
  mean time:        79.636 ns (0.00% GC)
  maximum time:     125.621 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     968

julia> @benchmark SLEEF.sin($x).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4400699998796492385
  --------------
  minimum time:     19.888 ns (0.00% GC)
  median time:      19.992 ns (0.00% GC)
  mean time:        20.038 ns (0.00% GC)
  maximum time:     49.129 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark sin($x).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4400699998796492385
  --------------
  minimum time:     198.555 ns (0.00% GC)
  median time:      199.522 ns (0.00% GC)
  mean time:        200.183 ns (0.00% GC)
  maximum time:     254.833 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     600

The C library's sin (called through SLEEFwrap) still has solid range on these trig functions, but it slows down dramatically compared to values close to 0:

julia> using SIMD: VE

julia> vx16 = Vec{16,Float32}(ntuple(Val(16)) do x VE(randn(Float32)) end);
julia> @benchmark SLEEF.sin($vx16).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4570671147679511021
  --------------
  minimum time:     19.904 ns (0.00% GC)
  median time:      20.008 ns (0.00% GC)
  mean time:        20.170 ns (0.00% GC)
  maximum time:     48.694 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark SLEEFwrap.sin($vx16.elts)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4570671147679511021
  --------------
  minimum time:     10.420 ns (0.00% GC)
  median time:      10.553 ns (0.00% GC)
  mean time:        10.575 ns (0.00% GC)
  maximum time:     33.753 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark sin($vx16).elts
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  4570671147679511021
  --------------
  minimum time:     60.156 ns (0.00% GC)
  median time:      60.584 ns (0.00% GC)
  mean time:        60.808 ns (0.00% GC)
  maximum time:     83.603 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     982

The C library behaves much better in both cases: for values near 0 it is roughly twice as fast, and for extreme values it is the only one of the two that returns correct answers.

Perhaps we could use 32 bit integers for functions like exp, and 64 bit integers for the periodic trig functions?

Author

I replaced all instances of fpinttype(T) with Int in the trig file (locally), but:

julia> SLEEF.sin(x)
<16 x Float32>[0.82687956, 0.0357488, 0.13669702, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]

is still the answer I get.
That is because of line 213:

u = vifelse((!isinf(t)) & (isnegzero(t) | (abs(t) > TRIG_MAX(T))), T(-0.0), u)

and SLEEF.TRIG_MAX(Float32) returning 1.0f7.
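
The effect of that line can be sketched in scalar form. This is not SLEEF.jl's code: `TRIG_MAX_F32`, `isnegzero`, and `mask_out_of_range` are local stand-ins written for this example, with the threshold set to the 1f7 value reported here.

```julia
const TRIG_MAX_F32 = 1f7                 # value reported for SLEEF.TRIG_MAX(Float32)

isnegzero(x::Float32) = x === -0f0       # stand-in helper; === compares the bit pattern

# Scalar analogue of the vifelse: finite inputs beyond the threshold (and -0.0)
# are replaced by -0.0, everything else keeps the computed value u.
function mask_out_of_range(t::Float32, u::Float32)
    return (!isinf(t) && (isnegzero(t) || abs(t) > TRIG_MAX_F32)) ? -0f0 : u
end

println(mask_out_of_range(1f3, 0.5f0))   # within range: u is kept
println(mask_out_of_range(1f9, 0.5f0))   # beyond the threshold: replaced by -0.0
```

This matches the pattern in the output: the lanes for 1f9 and above are -0.0, while 1f7 itself is not masked because the comparison is strict.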
However, that still does not explain:

julia> (SLEEF.sin(1e7),SLEEF.sin(1f7),SLEEF.sin_fast(1f7),sin(1e7))
(0.4205477931907825, 0.13669702f0, 0.4205478f0, 0.4205477931907825)

why SLEEF.sin is getting the wrong answer here.

Checking out the latest master (rather than this PR)...

julia> (SLEEF.sin(1e7),SLEEF.sin(1f7),SLEEF.sin_fast(1f7),sin(1e7))
(0.4205477931907825, 0.13669702f0, 0.4205478f0, 0.4205477931907825)

so it is an existing problem.

If you're using TRIG_MAX(::Type{Float32}) = 1f7, then 32 bit integers should be okay.
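
A quick range check supports this, under the assumption that the integer being stored is roughly the input divided by π, as in a typical sine argument reduction. The exact quantity SLEEF.jl converts is not shown in this thread, and `quotient_fits_int32` is a made-up helper.

```julia
# Does |x| / pi stay below 2^31, so that the rounded quotient fits in an Int32?
# (Assumed stand-in for the argument-reduction quotient.)
function quotient_fits_int32(x::Float32)
    q = abs(x) * Float32(1 / pi)
    return q < Float32(typemax(Int32))   # Float32(typemax(Int32)) rounds to 2^31
end

for x in (39000f0, 1f7, 1f10)
    println(x, " => ", quotient_fits_int32(x))
end
```

Under that assumption, inputs up to 1f7 leave plenty of headroom in an Int32, while an input such as 1f10 would not fit.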

Owner

@musm musm Dec 26, 2018

OK, I see, but the trig functions are only guaranteed over the following range:

Notes
The trigonometric functions are tested to return values with specified accuracy when the argument is within the following range:

Double (Float64) precision trigonometric functions : [-1e+14, 1e+14]
Single (Float32) precision trigonometric functions : [-39000, 39000]

not up to 1f7.

If I recall correctly, it's a lot faster for non-vectorized code to always use the machine-sized Int, even when operating on 32-bit floats.

However, according to your analysis, this is not true for the vector versions.

Author

AFAIK, 32- and 64-bit operations should be equally fast when not vectorized, as long as you don't need to promote from one to the other. (Note that pointers on 64-bit machines are 64 bits.)

The difference is when they are vectorized. Half the bits means you can fit twice as many into a register, and operate on twice as many per operation.
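
The lane count follows directly from the element size. A minimal sketch; `lanes_per_register` is a made-up helper, and the register widths are the usual 128, 256, and 512 bits:

```julia
# Number of elements of type T that fit in one SIMD register of the given width.
lanes_per_register(::Type{T}, register_bits::Integer) where {T} =
    register_bits ÷ (8 * sizeof(T))

for bits in (128, 256, 512)
    println(bits, "-bit register: ",
            lanes_per_register(Int32, bits), " x Int32, ",
            lanes_per_register(Int64, bits), " x Int64")
end
```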
