Support Core.LLVMPtr? #80
At first glance this looks straightforward.
What's the use case? Do you have some example code?
I'm trying vectorized memory access with CUDA.jl. Following some discussion on CUDA.jl's issue tracker, it's said to be doable with SIMD.jl:

```julia
function device_copy_vector4_kernel(din::CuDeviceArray{T}, dout::CuDeviceArray{T}, N) where T
    idx = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    s = blockDim().x * gridDim().x
    i = idx
    @inbounds while i <= fld(N, 4)
        dinp = reinterpret(Ptr{T}, pointer(din, 4i - 3))  # `pointer` returns a `Core.LLVMPtr`
        doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
        v = vloada(Vec{4, T}, dinp)  # v = din[lane+i]
        vstorea(v, doutp)            # dout[lane+i] = v
        i += s
    end
    rdiff = 4i - N  # 1 <= rdiff <= 3
    @inbounds if rdiff == 1      # 3 elements remaining
        dinp = reinterpret(Ptr{T}, pointer(din, 4i - 3))
        doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
        v = vloada(Vec{2, T}, dinp)
        vstorea(v, doutp)
        dout[N] = din[N]
    elseif rdiff == 2            # 2 elements remaining
        dinp = reinterpret(Ptr{T}, pointer(din, 4i - 3))
        doutp = reinterpret(Ptr{T}, pointer(dout, 4i - 3))
        v = vloada(Vec{2, T}, dinp)
        vstorea(v, doutp)
    elseif rdiff == 3            # 1 element remaining
        dout[N] = din[N]
    end
    return
end
```
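As a CPU-side illustration of the pattern the kernel's inner loop relies on (a sketch I'm adding for clarity, not part of the kernel above): `vloada`/`vstorea` on plain `Ptr`s copy four lanes at a time.

```julia
using SIMD

# CPU-side sketch of the kernel's inner step (illustrative only):
# copy four Float32 lanes at a time through aligned vector load/store.
a = Float32.(1:8)
b = zeros(Float32, 8)
GC.@preserve a b for i in 1:4:8
    v = vloada(Vec{4, Float32}, pointer(a, i))  # load a[i:i+3] as one vector
    vstorea(v, pointer(b, i))                   # store into b[i:i+3]
end
@assert b == a
```

On the GPU, `pointer(::CuDeviceArray, i)` returns a `Core.LLVMPtr` rather than a `Ptr`, which is why the kernel needs the `reinterpret` calls.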
I'm not familiar with
I'm currently using 1.5.3 but I think supporting 1.6 is good enough
@chengchingwen I've looked at the example, and I think I understand now what you are trying to do. I still have a question, though: Is the example code above working for you, but just overly complex (requiring the
It seems to work for me. The above example works fine, and so far I haven't found any test case that causes a failure. (But I know nothing about the
Another reason to support this is the generated code. For example:

```julia
julia> using VectorizationBase, SIMD

julia> A = rand(Float32, 16, 16);

julia> getindex.(Ref(A), 1:16, 1:16)'
1×16 adjoint(::Vector{Float32}) with eltype Float32:
 0.183864  0.903298  0.978251  0.174473  0.697349  0.431003  0.685901  0.37883  …  0.846123  0.141322  0.730092  0.483814  0.459655  0.15992  0.618906

julia> VectorizationBase.vload(stridedpointer(A), (VectorizationBase.Vec(ntuple(i -> 17i - 17, Val(16))...),))
VectorizationBase.Vec{16, Float32}<0.18386412f0, 0.90329814f0, 0.97825146f0, 0.17447329f0, 0.6973493f0, 0.43100262f0, 0.6859009f0, 0.3788302f0, 0.28956485f0, 0.84612334f0, 0.1413225f0, 0.7300916f0, 0.48381412f0, 0.45965457f0, 0.15991974f0, 0.61890626f0>

julia> SIMD.vgather(vec(A), SIMD.Vec(ntuple(i -> 17i - 16, Val(16))...))
<16 x Float32>[0.18386412, 0.90329814, 0.97825146, 0.17447329, 0.6973493, 0.43100262, 0.6859009, 0.3788302, 0.28956485, 0.84612334, 0.1413225, 0.7300916, 0.48381412, 0.45965457, 0.15991974, 0.61890626]
```

VectorizationBase:

```asm
# julia> @code_native debuginfo=:none syntax=:intel VectorizationBase.vload(stridedpointer(A), (VectorizationBase.Vec(ntuple(i -> 17i - 17, Val(16))...),))
        .text
        mov     rax, qword ptr [rsi]
        vmovups zmm0, zmmword ptr [rdx]
        kxnorw  k1, k0, k0
        vgatherdps      zmm1 {k1}, zmmword ptr [rax + 4*zmm0]
        mov     rax, rdi
        vmovaps zmmword ptr [rdi], zmm1
        vzeroupper
        ret
        nop     word ptr cs:[rax + rax]
```

SIMD:

```asm
# julia> @code_native debuginfo=:none syntax=:intel SIMD.vgather(vec(A), SIMD.Vec(ntuple(i -> 17i - 16, Val(16))...))
        .text
        vpsllq  zmm0, zmmword ptr [rdx + 64], 2
        vpsllq  zmm1, zmmword ptr [rdx], 2
        vmovq   xmm2, qword ptr [rsi]           # xmm2 = mem[0],zero
        mov     rax, -4
        vmovq   xmm3, rax
        vpaddq  xmm2, xmm2, xmm3
        vpbroadcastq    zmm2, xmm2
        vpaddq  zmm1, zmm2, zmm1
        vpaddq  zmm0, zmm2, zmm0
        kxnorw  k1, k0, k0
        vgatherqps      ymm2 {k1}, ymmword ptr [zmm0]
        kxnorw  k1, k0, k0
        vgatherqps      ymm0 {k1}, ymmword ptr [zmm1]
        mov     rax, rdi
        vinsertf64x4    zmm0, zmm0, ymm2, 1
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret
        nop     word ptr [rax + rax]
```

VectorizationBase does it in an awkward way by assembling a large
There is a `Core.LLVMPtr` type which is used by the GPU-related packages. Currently SIMD.jl restricts the pointer type to `Ptr`, so `Core.LLVMPtr` can't be used directly, but it seems workable by reinterpreting the `Core.LLVMPtr` to a `Ptr`. Is it possible to update the signature to allow `Core.LLVMPtr`?
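A minimal sketch of what the requested signature change might look like. This is hypothetical: the `AnyPtr` alias and `my_vloada` wrapper are illustrative names I'm introducing, not SIMD.jl's actual internals; the `reinterpret` step is the workaround described above.

```julia
using SIMD

# Hypothetical sketch: accept either pointer flavor in the public entry
# points. `Core.LLVMPtr{T,AS}` carries an address-space parameter, so the
# union leaves it free. Requires Julia >= 1.6, where `Core.LLVMPtr` exists.
const AnyPtr{T} = Union{Ptr{T}, Core.LLVMPtr{T, AS} where AS}

function my_vloada(::Type{Vec{N, T}}, ptr::AnyPtr{T}) where {N, T}
    # Convert to a plain `Ptr` for the existing code path; a real
    # implementation would instead thread the address space through to LLVM.
    vloada(Vec{N, T}, reinterpret(Ptr{T}, ptr))
end
```

With such a signature the kernel above could drop its explicit `reinterpret` calls and pass `pointer(din, 4i - 3)` straight to the load/store functions.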