Support Core.LLVMPtr? #80

Open
chengchingwen opened this issue Jan 14, 2021 · 6 comments

@chengchingwen

There is a Core.LLVMPtr type which is used by the GPU-related packages. Currently SIMD.jl restricts the pointer type to Ptr, so a Core.LLVMPtr can't be used directly, but it seems workable by reinterpreting the Core.LLVMPtr to a Ptr. Is it possible to update the signature to allow Core.LLVMPtr?
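The workaround I mean is essentially the following (a minimal sketch inside a device kernel; the element type Float32 and the array name din are just placeholders):

# inside a CUDA.jl kernel, where `din::CuDeviceArray{Float32}`:
llvmptr  = pointer(din)                        # returns a Core.LLVMPtr{Float32,1}
plainptr = reinterpret(Ptr{Float32}, llvmptr)  # reinterpret to a plain Ptr
v = vloada(Vec{4, Float32}, plainptr)          # SIMD.jl accepts a Ptr here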

@eschnett
Owner

At first glance this looks straightforward.

  • Which of SIMD's functions would you want to call this way? All functions which take a Ptr as an argument? Do you have an example of, or a pointer to, calling a SIMD function this way? I am asking because I have not used Core.LLVMPtr myself, and its documentation is sparse.
  • How would Core.LLVMPtr be used in an LLVM intrinsic? Do you have an example?
  • SIMD switched to a more modern internal representation for Julia 1.6. Which Julia version are you using? Would supporting this for Julia 1.6 be good enough?

@KristofferC
Collaborator

What's the use case? Do you have some example code?

@chengchingwen
Author

I'm trying vectorized memory access with CUDA.jl. Following some discussion on CUDA.jl's issue tracker, it's said to be doable with SIMD.jl's vloada and vstorea. However, newer versions of CUDA.jl switched the underlying pointer type of CuDeviceArray to Core.LLVMPtr. Here is the example code I'm talking about:

function device_copy_vector4_kernel(din::CuDeviceArray{T}, dout::CuDeviceArray{T}, N) where T
  idx = threadIdx().x + (blockIdx().x - 1) * blockDim().x
  s = blockDim().x * gridDim().x
  i = idx
  @inbounds while i <= fld(N, 4)
    dinp  = reinterpret(Ptr{T}, pointer(din,  4i - 3)) # `pointer` returns a `Core.LLVMPtr`
    doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
    v = vloada(Vec{4, T}, dinp) # v = din[lane+i] 
    vstorea(v, doutp) # dout[lane+i] = v 
    i += s
  end

  rdiff = 4i - N # 1 <=  rdiff <= 3 
  @inbounds if rdiff == 1 # 3 elements remaining
    dinp  = reinterpret(Ptr{T}, pointer(din,  4i - 3))
    doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
    v = vloada(Vec{2, T}, dinp)
    vstorea(v, doutp)
    dout[N] = din[N]
  elseif rdiff == 2
    dinp  = reinterpret(Ptr{T}, pointer(din,  4i - 3))
    doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
    v = vloada(Vec{2, T}, dinp)
    vstorea(v, doutp)
  elseif rdiff == 3
    dout[N] = din[N]
  end
  return
end
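For completeness, I launch it roughly like this (the launch configuration is only an illustration, not tuned):

using CUDA, SIMD

N    = 2^20
din  = CUDA.rand(Float32, N)
dout = CUDA.zeros(Float32, N)
threads = 256
blocks  = cld(cld(N, 4), threads)  # each thread copies 4 elements per iteration
@cuda threads=threads blocks=blocks device_copy_vector4_kernel(din, dout, N)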

How would Core.LLVMPtr be used in an LLVM intrinsic? Do you have an example?

I'm not familiar with Core.LLVMPtr myself; I hope the above example provides enough information to answer your question.

Which Julia version are you using? Would supporting this for Julia 1.6 be good enough?

I'm currently using 1.5.3, but I think supporting 1.6 is good enough.

@eschnett
Owner

@chengchingwen I've looked at the example, and I think I understand now what you are trying to do. I still have a question, though:

Is the example code above working for you, but just overly complex (requiring the reinterpret), or is it failing in some way?

@chengchingwen
Author

It seems to work for me. The above example works fine, and so far I haven't found any test case that causes a failure (but I know little about Core.LLVMPtr, so it's possible I didn't test it properly).

@chriselrod
Contributor

chriselrod commented Jun 27, 2021

Another reason to support this is gather/scatter of 32-bit types on 64-bit builds of Julia.
If you use Ptr, it'll break up the gather/scatter into two half-width operations.
With LLVMPtr, it'll instead use a single full-width operation.

For example:

julia> using VectorizationBase, SIMD

julia> A = rand(Float32,16,16);

julia> getindex.(Ref(A), 1:16, 1:16)'
1×16 adjoint(::Vector{Float32}) with eltype Float32:
 0.183864  0.903298  0.978251  0.174473  0.697349  0.431003  0.685901  0.37883    0.846123  0.141322  0.730092  0.483814  0.459655  0.15992  0.618906

julia> VectorizationBase.vload(stridedpointer(A),  (VectorizationBase.Vec(ntuple(i -> 17i - 17, Val(16))...),))
VectorizationBase.Vec{16, Float32}<0.18386412f0, 0.90329814f0, 0.97825146f0, 0.17447329f0, 0.6973493f0, 0.43100262f0, 0.6859009f0, 0.3788302f0, 0.28956485f0, 0.84612334f0, 0.1413225f0, 0.7300916f0, 0.48381412f0, 0.45965457f0, 0.15991974f0, 0.61890626f0>

julia> SIMD.vgather(vec(A),  SIMD.Vec(ntuple(i -> 17i - 16, Val(16))...))
<16 x Float32>[0.18386412, 0.90329814, 0.97825146, 0.17447329, 0.6973493, 0.43100262, 0.6859009, 0.3788302, 0.28956485, 0.84612334, 0.1413225, 0.7300916, 0.48381412, 0.45965457, 0.15991974, 0.61890626]

VectorizationBase:

# julia> @code_native debuginfo=:none syntax=:intel  VectorizationBase.vload(stridedpointer(A),  (VectorizationBase.Vec(ntuple(i -> 17i - 17, Val(16))...),))
        .text
        mov     rax, qword ptr [rsi]
        vmovups zmm0, zmmword ptr [rdx]
        kxnorw  k1, k0, k0
        vgatherdps      zmm1 {k1}, zmmword ptr [rax + 4*zmm0]
        mov     rax, rdi
        vmovaps zmmword ptr [rdi], zmm1
        vzeroupper
        ret
        nop     word ptr cs:[rax + rax]

SIMD:

# julia> @code_native debuginfo=:none syntax=:intel SIMD.vgather(vec(A),  SIMD.Vec(ntuple(i -> 17i - 16, Val(16))...))
        .text
        vpsllq  zmm0, zmmword ptr [rdx + 64], 2
        vpsllq  zmm1, zmmword ptr [rdx], 2
        vmovq   xmm2, qword ptr [rsi]           # xmm2 = mem[0],zero
        mov     rax, -4
        vmovq   xmm3, rax
        vpaddq  xmm2, xmm2, xmm3
        vpbroadcastq    zmm2, xmm2
        vpaddq  zmm1, zmm2, zmm1
        vpaddq  zmm0, zmm2, zmm0
        kxnorw  k1, k0, k0
        vgatherqps      ymm2 {k1}, ymmword ptr [zmm0]
        kxnorw  k1, k0, k0
        vgatherqps      ymm0 {k1}, ymmword ptr [zmm1]
        mov     rax, rdi
        vinsertf64x4    zmm0, zmm0, ymm2, 1
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret
        nop     word ptr [rax + rax]

VectorizationBase does it in an awkward way by assembling a large llvmcall expression, so strictly speaking LLVMPtr shouldn't be necessary for full-width gather/scatter.
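That said, if SIMD.jl did want to accept LLVMPtr directly, one possible shape for the change is simply widening the pointer arguments to a union (just a sketch; FastPtr is a hypothetical alias, not something SIMD.jl defines, and the llvmcall bodies would also need the matching address space):

# Hypothetical alias covering plain pointers and address-space-qualified
# LLVM pointers (the address-space parameter is left free).
const FastPtr{T} = Union{Ptr{T}, Core.LLVMPtr{T}}

# Methods that currently dispatch on `ptr::Ptr{T}` could then dispatch on
# `ptr::FastPtr{T}` instead.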
