Support Core.LLVMPtr? #80

Open
chengchingwen opened this issue Jan 14, 2021 · 6 comments

@chengchingwen

There is a Core.LLVMPtr type which is used by the GPU-related packages. Currently SIMD.jl restricts the pointer type to Ptr, so a Core.LLVMPtr can't be used directly, but it seems workable by reinterpreting the Core.LLVMPtr to a Ptr. Is it possible to update the signature to allow Core.LLVMPtr?
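The workaround I mean is essentially the following (a minimal sketch inside a device kernel; the element type Float32 and the array name din are just placeholders):

# inside a CUDA.jl kernel, where `din::CuDeviceArray{Float32}`:
llvmptr  = pointer(din)                        # returns a Core.LLVMPtr{Float32,1}
plainptr = reinterpret(Ptr{Float32}, llvmptr)  # reinterpret to a plain Ptr
v = vloada(Vec{4, Float32}, plainptr)          # SIMD.jl accepts a Ptr here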

@eschnett
Owner

At first glance this looks straightforward.

  • Which of SIMD's functions would you want to call this way? All functions which take a Ptr as an argument? Do you have an example of, or a pointer to, calling a SIMD function this way? I am asking because I have not used Core.LLVMPtr myself, and its documentation is sparse.
  • How would Core.LLVMPtr be used in an LLVM intrinsic? Do you have an example?
  • SIMD switched to a more modern internal representation for Julia 1.6. Which Julia version are you using? Would supporting this for Julia 1.6 be good enough?

@KristofferC
Collaborator

What's the use case? Do you have some example code?

@chengchingwen
Author

I'm trying vectorized memory access with CUDA.jl. Following some discussion on CUDA.jl's issue tracker, it's said to be doable with SIMD.jl's vloada and vstorea. However, newer versions of CUDA.jl switched the underlying pointer type of CuDeviceArray to Core.LLVMPtr. Here is the example code I'm talking about:

function device_copy_vector4_kernel(din::CuDeviceArray{T}, dout::CuDeviceArray{T}, N) where T
  idx = threadIdx().x + (blockIdx().x - 1) * blockDim().x
  s = blockDim().x * gridDim().x
  i = idx
  @inbounds while i <= fld(N, 4)
    dinp  = reinterpret(Ptr{T}, pointer(din,  4i - 3)) # `pointer` returns a `Core.LLVMPtr`
    doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
    v = vloada(Vec{4, T}, dinp) # v = din[lane+i] 
    vstorea(v, doutp) # dout[lane+i] = v 
    i += s
  end

  rdiff = 4i - N # 1 <=  rdiff <= 3 
  @inbounds if rdiff == 1 # 3 elements remaining
    dinp  = reinterpret(Ptr{T}, pointer(din,  4i - 3))
    doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
    v = vloada(Vec{2, T}, dinp)
    vstorea(v, doutp)
    dout[N] = din[N]
  elseif rdiff == 2
    dinp  = reinterpret(Ptr{T}, pointer(din,  4i - 3))
    doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
    v = vloada(Vec{2, T}, dinp)
    vstorea(v, doutp)
  elseif rdiff == 3
    dout[N] = din[N]
  end
  return
end
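For completeness, I launch it roughly like this (the launch configuration is only an illustration, not tuned):

using CUDA, SIMD

N    = 2^20
din  = CUDA.rand(Float32, N)
dout = CUDA.zeros(Float32, N)
threads = 256
blocks  = cld(cld(N, 4), threads)  # each thread copies 4 elements per iteration
@cuda threads=threads blocks=blocks device_copy_vector4_kernel(din, dout, N)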

How would Core.LLVMPtr be used in an LLVM intrinsic? Do you have an example?

I'm not familiar with Core.LLVMPtr myself; I hope the above example provides enough information to answer your question.

Which Julia version are you using? Would supporting this for Julia 1.6 be good enough?

I'm currently using 1.5.3, but I think supporting 1.6 is good enough.

@eschnett
Owner

@chengchingwen I've looked at the example, and I think I understand now what you are trying to do. I still have a question, though:

Is the example code above working for you, but just overly complex (requiring the reinterpret), or is it failing in some way?

@chengchingwen
Author

It seems to work for me. The above example works fine, and so far I haven't found any test case that causes a failure (but I know little about Core.LLVMPtr, so it's possible I didn't test it properly).

@chriselrod
Contributor

chriselrod commented Jun 27, 2021

Another reason to support this is gather/scatter of 32-bit types on 64-bit builds of Julia.
If you use Ptr, it'll break up the gather/scatter into two half-width operations.
With LLVMPtr, it'll instead use a single full-width operation.

For example:

julia> using VectorizationBase, SIMD

julia> A = rand(Float32,16,16);

julia> getindex.(Ref(A), 1:16, 1:16)'
1×16 adjoint(::Vector{Float32}) with eltype Float32:
 0.183864  0.903298  0.978251  0.174473  0.697349  0.431003  0.685901  0.37883    0.846123  0.141322  0.730092  0.483814  0.459655  0.15992  0.618906

julia> VectorizationBase.vload(stridedpointer(A),  (VectorizationBase.Vec(ntuple(i -> 17i - 17, Val(16))...),))
VectorizationBase.Vec{16, Float32}<0.18386412f0, 0.90329814f0, 0.97825146f0, 0.17447329f0, 0.6973493f0, 0.43100262f0, 0.6859009f0, 0.3788302f0, 0.28956485f0, 0.84612334f0, 0.1413225f0, 0.7300916f0, 0.48381412f0, 0.45965457f0, 0.15991974f0, 0.61890626f0>

julia> SIMD.vgather(vec(A),  SIMD.Vec(ntuple(i -> 17i - 16, Val(16))...))
<16 x Float32>[0.18386412, 0.90329814, 0.97825146, 0.17447329, 0.6973493, 0.43100262, 0.6859009, 0.3788302, 0.28956485, 0.84612334, 0.1413225, 0.7300916, 0.48381412, 0.45965457, 0.15991974, 0.61890626]

VectorizationBase:

# julia> @code_native debuginfo=:none syntax=:intel  VectorizationBase.vload(stridedpointer(A),  (VectorizationBase.Vec(ntuple(i -> 17i - 17, Val(16))...),))
        .text
        mov     rax, qword ptr [rsi]
        vmovups zmm0, zmmword ptr [rdx]
        kxnorw  k1, k0, k0
        vgatherdps      zmm1 {k1}, zmmword ptr [rax + 4*zmm0]
        mov     rax, rdi
        vmovaps zmmword ptr [rdi], zmm1
        vzeroupper
        ret
        nop     word ptr cs:[rax + rax]

SIMD:

# julia> @code_native debuginfo=:none syntax=:intel SIMD.vgather(vec(A),  SIMD.Vec(ntuple(i -> 17i - 16, Val(16))...))
        .text
        vpsllq  zmm0, zmmword ptr [rdx + 64], 2
        vpsllq  zmm1, zmmword ptr [rdx], 2
        vmovq   xmm2, qword ptr [rsi]           # xmm2 = mem[0],zero
        mov     rax, -4
        vmovq   xmm3, rax
        vpaddq  xmm2, xmm2, xmm3
        vpbroadcastq    zmm2, xmm2
        vpaddq  zmm1, zmm2, zmm1
        vpaddq  zmm0, zmm2, zmm0
        kxnorw  k1, k0, k0
        vgatherqps      ymm2 {k1}, ymmword ptr [zmm0]
        kxnorw  k1, k0, k0
        vgatherqps      ymm0 {k1}, ymmword ptr [zmm1]
        mov     rax, rdi
        vinsertf64x4    zmm0, zmm0, ymm2, 1
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret
        nop     word ptr [rax + rax]

VectorizationBase does it in an awkward way by assembling a large llvmcall expression, so strictly speaking LLVMPtr shouldn't be necessary for full-width gather/scatter.
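That said, if SIMD.jl did want to accept LLVMPtr directly, one possible shape for the change is simply widening the pointer arguments to a union (just a sketch; FastPtr is a hypothetical alias, not something SIMD.jl defines, and the llvmcall bodies would also need the matching address space):

# Hypothetical alias covering plain pointers and address-space-qualified
# LLVM pointers (the address-space parameter is left free).
const FastPtr{T} = Union{Ptr{T}, Core.LLVMPtr{T}}

# Methods that currently dispatch on `ptr::Ptr{T}` could then dispatch on
# `ptr::FastPtr{T}` instead.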
