Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling complex arrays #60

Open
hofmannmartin opened this issue Jan 13, 2020 · 20 comments
Open

Handling complex arrays #60

hofmannmartin opened this issue Jan 13, 2020 · 20 comments

Comments

@hofmannmartin
Copy link

For optimization of an algorithm operating on complex data I wanted to use SIMD. However, trying to naively handle complex arrays using reinterpret failed. E.g.

using SIMD

vcomplex = Vector{ComplexF32}(undef, 10)
vreal = reinterpret(Float32,vcomplex)

xs = vload(Vec{4,Float32}, vreal, 1)

throws an error

MethodError: no method matching vload(::Type{Vec{4,Float32}}, ::Base.ReinterpretArray{Float32,1,Complex{Float32},Array{Complex{Float32},1}}, ::Int64)
Closest candidates are:
  vload(::Type{Vec{N,T}}, !Matched::Union{DenseArray{T,1}, SubArray{T,1,P,I,true} where I<:Union{Tuple{Vararg{Real,N} where N}, Tuple{AbstractUnitRange,Vararg{Any,N} where N}} where P}, ::Integer) where {N, T} at /home/moeddel/.julia/packages/SIMD/Am38N/src/SIMD.jl:1405
  vload(::Type{Vec{N,T}}, !Matched::Union{DenseArray{T,1}, SubArray{T,1,P,I,true} where I<:Union{Tuple{Vararg{Real,N} where N}, Tuple{AbstractUnitRange,Vararg{Any,N} where N}} where P}, ::Integer, !Matched::Val{Aligned}) where {N, T, Aligned} at /home/moeddel/.julia/packages/SIMD/Am38N/src/SIMD.jl:1405
  vload(::Type{Vec{N,T}}, !Matched::Union{DenseArray{T,1}, SubArray{T,1,P,I,true} where I<:Union{Tuple{Vararg{Real,N} where N}, Tuple{AbstractUnitRange,Vararg{Any,N} where N}} where P}, ::Integer, !Matched::Val{Aligned}, !Matched::Val{Nontemporal}) where {N, T, Aligned, Nontemporal} at /home/moeddel/.julia/packages/SIMD/Am38N/src/SIMD.jl:1405
  ...

Am I missing something, or is this feature not yet supported?

@eschnett
Copy link
Owner

Complex numbers are not supported. They should, of course.

There are two possible ways to implement SIMD algorithms for complex numbers:

  • Handle real and imaginary parts separately. This works well with SIMD operations, but requires disentangling real and imaginary parts when loading values, and combining them when storing.
  • Handle complex numbers as a unit. This simplifies loading and storing, and works fine when adding or subtracting, but complex multiplication (or other nonlinear operations) require swizzling real and imaginary parts.

I know that certain architectures (Blue Gene/Q, old Intel CPUs with 128-bit SIMD lanes) support the second approach. I think that modern CPUs prefer the first approach, but I am not sure.

Do you know which approach would be best, or do you have pointers to literature?

@hofmannmartin
Copy link
Author

I prefer the second approach, since it is also the one used by openblas routines such as https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/caxpy_microk_haswell-2.c. Here for example numbers are permuted (shufflevector) inside the registers to exchange real and imaginary part.

@eschnett
Copy link
Owner

Looking at OpenBLAS is a good idea. Let's go with this approach.

Implementing addition etc. should be straightforward (no shuffling required). Complex multiplication requires shuffling.

@hofmannmartin
Copy link
Author

and i² can be taken into account by a multiplication with a constant vector with alternating signs (e.g. Vec{4,Float32}((-1,1,-1,1)) because it only occurs in the real part.

@hofmannmartin
Copy link
Author

conj too could be implemented by multiplication with the aforementioned vector.

@louisponet
Copy link

Will this require a specific complex Vec with double the data underlying? How does one acquire an NTuple{N,VecElement{ComplexF64}} from llvmcall?

I looked a bit through base in for example simdloops over imaginary things, there it seems to be loading elementptrs, put it in a 2x wide double vec and applies the rest of the math from inside the loop.

@BioTurboNick
Copy link
Contributor

Related to the original error, what's the reason that SIMD isn't able to load from a ReinterpretArray?

I've been trying to bang my head against this for a while and keep getting blocked. It seems like this should work?

a = Vector{UInt8}(undef, 16)
b = reinterpret(UInt32, a)
c = vload(Vec{4, UInt32}, b, 1)

@hofmannmartin
Copy link
Author

As far as I can tell there is simply no method for vload, which supports ReinterpretArray. Only DenseArray, SubArray and Tuple are supported and ReinterpretArray is no subtype of either of these. However, I have not looked into the code, so I have no idea how hard it is to make this addition.

@BioTurboNick
Copy link
Contributor

BioTurboNick commented Aug 20, 2020

Seems like all I had to do was change these lines to get ReinterpretArrays to work:

#ContiguousArray{T,N} = Union{DenseArray{T,N}, ContiguousSubArray{T,N}}
ContiguousArray{T,N} = Union{DenseArray{T,N}, ContiguousSubArray{T,N}, Base.ReinterpretArray{T,1,T2,A} where {A <: Union{DenseArray{T2,N}, ContiguousSubArray{T2,N}}} where {T2}}

#FastContiguousArray{T,N} = Union{DenseArray{T,N}, Base.FastContiguousSubArray{T,N}}
FastContiguousArray{T,N} = Union{DenseArray{T,N}, Base.FastContiguousSubArray{T,N}, Base.ReinterpretArray{T,1,T2,A} where {A <: Union{DenseArray{T2,N}, Base.FastContiguousSubArray{T2,N}}} where {T2}}

Is there something I'm missing?

Edit: PR #72 for this

@hofmannmartin
Copy link
Author

Were you able to run my initial code above with this?

@BioTurboNick
Copy link
Contributor

@hofmannmartin Yes, no errors. Is this what you expect?

vcomplex = Vector{ComplexF32}(undef, 10)
#=
10-element Array{Complex{Float32},1}:
   4.98157f-25 + 0.0f0im
  5.368139f-26 + 0.0f0im
    2.28642f-9 + 0.0f0im
  2.2862459f-9 + 0.0f0im
 3.8254035f-25 + 0.0f0im
  5.088271f-25 + 0.0f0im
 1.3259766f-25 + 0.0f0im
 1.0201263f-25 + 0.0f0im
 1.3262132f-25 + 0.0f0im
 1.3262034f-25 + 0.0f0im
=#

vreal = reinterpret(Float32,vcomplex)
#=
20-element reinterpret(Float32, ::Array{Complex{Float32},1}):
 4.98157f-25
 0.0
 5.368139f-26 
 0.0
 2.28642f-9   
 0.0
 2.2862459f-9 
 0.0
 3.8254035f-25
 0.0
 5.088271f-25 
 0.0
 1.3259766f-25
 0.0
 1.0201263f-25
 0.0
 1.3262132f-25
 0.0
 1.3262034f-25
 0.0
=#

xs = vload(Vec{4,Float32}, vreal, 1)
# <4 x Float32>[4.98157f-25, 0.0, 5.368139f-26, 0.0]

@hofmannmartin
Copy link
Author

Alternating real and imaginary values are indeed what I expected. Might worth writing a small unit test?

@willow-ahrens
Copy link

I would also like to see complex vectors supported. I won't have time for a few weeks, but if I were to file a PR to add support for complex vectors, would anyone be interested in reviewing?

@eschnett
Copy link
Owner

I would be happy to review. Feel free to demonstrate an incomplete work-in-progress version if you are looking for feedback.

The challenging part will be complex number multiplication; this will require shuffling the vectors.

A different internal representation would be to store real and imaginary parts in separate SIMD vectors, i.e. to have Complex{Vec{4,Float32}}. This will make arithmetic operations simple, but puts the onus on loading and storing values, requiring the shuffling then.

@KristofferC
Copy link
Collaborator

The fact that there are two storages possible and depending on your use case, either one might be better makes it a bit unclear to me if SIMD.jl should really pick one. If anything, I personally think having them as separate vector makes a lot of sense because that would make it much more easy to translate scalar code to vector code.

@hofmannmartin
Copy link
Author

Having them separated is the easy part. There is a section in LoopVectorization, which discusses this approach. More interesting to handle, is the native storage order like it is done in OpenBLAS.

I also have an example of a hand written kernel for this function. However, it would be great if this would work out of the box without having to tailor the kernel.

@willow-ahrens
Copy link

All the number types currently supported by SIMD.jl directly translate to native vector instructions. Because the complex vectors can be implemented (almost or entirely?) using existing SIMD.jl methods, it seems the main point of adding complex vector support to SIMD.jl is to use the Vec abstraction to automate the process of manually vectorizing complex code. As is pointed out in that LoopVectorization link, there are several similar "composite" number types which might be vectorized via a similar process, such as doubledouble, raising similar implementation decisions regarding data layout, etc.

Would it make sense to treat the Complex case as an example and test case for a minimal composite type vector interface? We might use a different base type ("VecComplex" or "ComplexVec") which supports the ability to choose between implementations.

@eschnett
Copy link
Owner

It seems both approaches would be useful. Whoever does the work should decide which to explore, with the understanding that SIMD might need to support both approaches in the future, so this road shouldn't be blocked.

I suggest the following:

  • Vec{4,Complex{Float64}} will store internally the equivalent of a Vec{8,Float64}, alternating between real and imaginary parts. Most vector operations should work out of the box. Complex number multiplication will need to shuffle vector lanes, as done in @hofmannmartin's code.
  • We can, in the future, use a new type CompositeVec{N,...} (better names to be suggested) to automatically split types into components.

@photor
Copy link

photor commented Nov 18, 2022

It seems both approaches would be useful. Whoever does the work should decide which to explore, with the understanding that SIMD might need to support both approaches in the future, so this road shouldn't be blocked.

I suggest the following:

* `Vec{4,Complex{Float64}}` will store internally the equivalent of a `Vec{8,Float64}`, alternating between real and imaginary parts. Most vector operations should work out of the box. Complex number multiplication will need to shuffle vector lanes, as done in @hofmannmartin's code.

* We can, in the future, use a new type `CompositeVec{N,...}` (better names to be suggested) to automatically split types into components.

Any progress?

@willow-ahrens
Copy link

willow-ahrens commented Nov 18, 2022

I never found time to do this. Perhaps someone else does?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants