-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIMD types #2299
Comments
@stevengj You have a fair bit of experience with this kind of stuff in FFTW, and it would be great to hear your thoughts on the topic. |
I have been expecting the use of SIMD in Julia for quite a while -- this is definitely worth the efforts. Here are something that might be useful (later) for implementing math functions with SIMD. Agner maintains a lot of good stuff related to SIMD-based optimization: http://www.agner.org/optimize/#asmlib Another place that is worth looking at is NT2 -- the authors of this library developed a huge collection of SIMD routines (covering nearly every math function that people would ever use). This library itself has a heavy dependencies on Boost for expression templates --- but I think the SIMD core routines can be easily ported though. They seem to have extensively tested the accuracy and performance, but you may talk to them to confirm. Intel has a Short Vector Math Library (SVML) that comes with Intel compiler, which provides a highly optimized SIMD routines for most C99 math functions (with both SSE and AVX version). If you purchase their compiler, this library is redistributable. Last time when I look at the *.so files that come with MATLAB (Linux), I found that they are using this library. The routines here are extremely highly optimized by Intel engineers, and the accuracy are strictly controlled (below 4 ULP, or in other words two least significant bits). AMD LibM implements some SIMD math functions -- but far less than Intel SVML, have no AVX support, and the performance is not as good. It is free of charge though -- but not open source. You might have to pay attention to their license. |
If we just want SIMD vector kernels, it may be easiest to find or write a library of kernels in C/ASM. It may be that the best interface to SIMD is just vectorized operations, not special SIMD types. If we have a VecFloat64 type, and the purpose of it is to reinterpret arrays to that type if you want better performance, but you have to be careful about alignment, that's not a great programming model. I guess my question is how do we want to use this other than inside Our current bitstypes are llvm integers, and bitcasting them to vector types may not be optimal. Maybe a slightly insane idea is to use tuples, and say that the representation of
and say that that will be treated as a SIMD vector. I'm not sure how that will interact with treating these as C structs (@vtjnash ?), but that situation could hardly be worse so it may not matter. Tuples seem to have all the right properties though, including the virtue that they have no defined native representation yet. |
My vote is for using vectorized operations to access SIMD, multiple threads, and hopefully using the same infrastructure to offer GPU processing. Telling the JIT that you want to add to vectors would seem to give it a better chance of optimizing the code then having it figure out what you are doing in a loop. |
This is what I posted on the list last night about accessing LLVM IR from within Julia. I think the issues with focusing on direct SIMD instruction access may be similar. So the more I look at this the less appealing it seems to be. At some point you end up having to rewrite the Julia Codegen code and LLVM C++ convenience classes in Julia. Then you would have to track them in a parallel development path. Not to mention what would have to be done when Julia statically compiles... Maybe what I am looking for is a modularization the Julia's C++ JIT codegen code. The GPU (and "dangerous" CPU code with bounds checking moved to before the for loops, etc...) would then have direct access to parts of the stock Julia codegen (or access to the same LLVM library) and memory management. The codegen modules could register requirements they have for memory allocation and then all memory allocations could meet their specific critiera (alignment, etc...). They would also be able to tag objects in memory, one use of this would be to lock arrays when they are being processed on the GPU. I think the intermediate GPU code could be kept in resources that the CPU JIT compiler would deal with, so they could be exported to libraries. |
@JeffBezanson I do believe what we generally need is more than a set of vectorized functions implemented using SIMD. It is easy to implement or port a C/C++ library that provides a set of function like void vec_add(int n, const double *a, const double *b, double *r);
...... For people who have Intel VML, they may just do ccall on such functions. The problem here can be a little bit more complicated. For example, if you want to do So just consider the expression for (int i = 0; i < n; i += 4)
{
va = _mm256_load_pd(a + i);
vb = _mm256_load_pd(b + i);
t1 = _mm256_add_pd(va, vb);
t2 = _mm256_sub_pd(va, vb);
vr = _mm256_mul_pd(t1, t2);
_mm256_store_pd(r + i, vr);
} We have to pack the computations into the loop body and do it in one pass, instead of calling several vectorized functions. Also, all such Also, the use of SIMD is not just for doing element-wise calculation. A lot of things can benefit from it, which for example including small matrix algebra. The use of 2x2, 3x3, and even 4x4 matrices to transform geometric points are everywhere in graphics and image processing. With directly exposition of SIMD instructions, it is possible to develop a library for such kind of things (without compromising performance). It should be worth noting here that the performance of BLAS is very poor for matrices smaller than 16 x 16 (sometimes just an order of magnitude slower than naive C loops). With width = get_vec_width(VecFloat64)
m = n / width
for i = 0 : m-1
idx = i * width + 1
va = vecload(a, idx)
vb = vecload(b, idx)
t1 = va + vb # overloaded for VecFloat64
t2 = va - vb
vr = t1 .* t2
vecstore(vr, r, idx)
end So we need a specific type to represent The codes above did not deal with trailing elements etc, but that would be much easier to do if we settle on how we do the main loop. In terms of memory alignment. It is true that working with 32-byte (for AVX) or 16-byte (for SSE) aligned memory is faster than unaligned. But that is not a must. It is easy to guarantee that all instances of Array are aligned (as you can control the memory allocation). However, there is no such guarantee for SubArray etc, in such cases, you can simply use unaligned load for things that are not aligned --- the performance difference (in my benchmark and what I see from some others) is below 20% --- definitely better than first copying the things to a temporary. Anyway, I think for all these such things to possibly happen, the first step is to expose SIMD instructions. |
I am not really opposed to adding in SIMD data structures and intrinsic, it would give Julia a speed boost in these situations. However my concern is that if you start using the SIMD instructions in loops at to high of a level, you start to hide the intent of the computation, so that by the time it gets to LLVM or a GPU backend it cannot safely do the optimizations that could be done (at least without Polly (http://polly.llvm.org/) and I am not certain what the overhead of the Polly evaluation is). |
@kk49: I'm not at all clear about what you're advocating for or against here. Part of the issue is that we don't really have clear or consistent terminology for the various approaches that people are discussing here. |
@StefanKarpinski I think adding SIMD operators and data types would be useful, but I don't think it is the final solution for implementing MAP and REDUCE operations for SIMD/multithreaded CPUs or GPUs. A solution will have to deal with memory management, kernel generation, and kernel scheduling. What @lindahua proposes makes sense but where should this transformation be done? As part of a Julia library, a replacement for array.jl, in codegen.cpp, or LLVM JIT/Static compilation? |
Honestly, I think that worrying about GPU and SIMD at the same time is too much – it's paralyzing. Let's tackle one massively hard problem at a time. The reason I like the approach I've laid out for SIMD (based on Viral's ideas) is a that it's simple, it's very flexible and fairly general, and it's clear what's going on – and it maps very cleanly onto LLVM code gen. There's no magical translation going on – it's just regular Julia code that implements vectorized computations by recasting bits types to corresponding SIMD bits types. Jacking things into code gen and memory management or "modularizing Julia's C++ JIT codegen" seems crazy to me. |
I think that with the right interface, writing SIMD versions of vectorized operations generically could be quite clean. Let me try to give an example. function +(a::Array{Float64}, b::Array{Float64})
# check that a and b have the same shape
c = similar(a)
ahead, abody, atail = simd_split(a)
bhead, bbody, btail = simd_split(b)
chead, cbody, ctail = simd_split(c)
for i = 1:length(chead)
chead[i] = ahead[i] + bhead[i]
end
for i = 1:length(cbody)
cbody[i] = abody[i] + bbody[i]
end
for i = 1:length(ctail)
ctail[i] = atail[i] + btail[i]
end
return c
end The |
I agree with Stefan that introducing several SIMD types and a set of operations/functions on them is a clean and easy way, which is pretty much the same as what I was thinking. @kk49 I do appreciate your long term perspective -- but what you suggested might require a fundamental restructuring of Julia code-base -- which is perhaps quite difficult in a near term. You definitely have much more experience with GPU computation than I have. Here is just my two cents. GPU computation is fundamentally different from what we typically do on CPUs. You may have to plan ahead what should be put on device, what on host, and how they communicate (so as to reduce the traffic between host and devices). For supporting GPU in Julia, what about the approach taken by Jacket? It is built on top of MATLAB -- which is worse than Julia in terms of language design -- and still achieves amazing performance. I am pretty sure they did not mess up with the internal code-gen of MATLAB -- all what they do is to build their own VM and interface with it through mex. Julia provides a more convenient way to interface with C functions, so I believe this is achievable without modifying the internal code-gen of Julia. |
@StefanKarpinski The code example is nice. Just one point. Why would you ever want a head part? I think the head part should always have a length of zero. Just consider such an expression If both If In my opinion, for non-aligned arrays, you can simply used unaligned load. The performance difference is no more than 20% in various benchmarks. Introducing a head part might lead to inconsistent splitting between operands in an expression. |
I was assuming that it was possible that our arrays were not aligned to the requirements of SIMD instructions, but if we can skip that part all the better. But the example shows that even handling arbitrarily aligned arrays would be feasible. |
SubArrays could be unaligned. The choice of aligned and non aligned access will be with the programmer anyways. I agree the first order of work would be to get something working. |
I get dahua's point now – the head thing only helps if the two arrays are unaligned I the same way, which obviously they generally wouldn't be. Just doing unaligned fetches seems like the best way to go. |
@lindahua my DeMat library implements the core Jacket functionality of vector (MAP) kernel creation & launching and data movement. (maybe more, I'm not sure what they do for "devectorization"). Getting access to "complex" math operations like sin and cos is a problem right now, since the CUDA compiler generates the assembly for them. The LLVM PTX people are working on that... |
Yes, that's what I mean. If two arrays are unaligned by a different offset, introducing the head part does no help in such situation --- this was the lesson that I learned when I was implementing a C++ library. I also read the disassembly generated by Intel compiler with vectorization turned on. If I don't tell them the arrays are aligned, the compiler simply emits unaligned load/store -- which is nearly as fast as a hand-coded loop using aligned loads. |
@StefanKarpinski I definitely understand that it is crazy :) What about multithreading? |
Good god, you're a madman, @kk49! Are you trying to give me a heart attack? |
I'm hoping for SIMD support that facilitates heavy optimization of spatial convolution from within Julia. That's more involved than just pointwise operations and the best implementation choice depends on the shape and size of the convolution kernel (and the exact set of available SIMD operations). For example a horizontal and a vertical 1D kernel would require very different approaches for optimal performance. I guess reinterpreting arrays would be a functional way to access the data but base library operations won't cut it. I'd rather get my hands dirty at the abstraction level of C intrinsics. |
I'm finding that a bit hard to interpret? Which way of doing this are you saying will or won't cut if for your application? |
The main problem is probably that I'm not really sure myself how far the original proposal would suffice. Maybe it can do more than I was thinking. I'll try to prepare some more specific examples of what I'd like to accomplish. |
Here are two examples of spatial convolutions with small kernels. Let's first assume we filter an image with a horizontal 1x3 kernel. In Julia this would be, with one column of padding on each side of the image: out = similar(in)
for j = 2:size(in,2)-1
for i = 1:size(in,1)
sum = 0.0f0
for k = 1:3
sum += in[i,j+(k-2)] * kernel[k]
end
out[i,j] = sum
end
end In mixed Julia and C intrinsics syntax this could relatively straighforwardly be implemented with SSE2 operations as c = zeros(__m128, 3)
for k = 1:3
c[k] = _mm_load1_ps(kernel[k])
end
out = similar(in)
for j = 2:size(in,2)-1
for i = 1:4:size(in,1)
sum = _mm_set1_ps(0.0f0)
for k = 1:3
x = _mm_load_ps(in[i,j+(k-2)])
sum = _mm_add_ps(sum, _mm_mul_ps(x, c[k]))
end
_mm_storeu_ps(out[i,j], sum)
end
end Obviously this assumes that the height is a multiple of 4 so that alignment comes out right in all columns and there are no tails to handle. The image is supposed to be single precision and _mm_load_ps(a[i,j]) to load a[i:i+3,j] into a SIMD register. If we instead look at a vertical 3x1 filter, the Julia code is nearly the same: out = similar(in)
for j = 1:size(in,2)
for i = 2:size(in,1)-1
sum = 0.0f0
for k = 1:3
sum += in[i+(k-2),j] * kernel[k]
end
out[i,j] = sum
end
end For a SIMD implementation one option is to transpose the image and use the code above but the transposition is costly, especially for such a small kernel as this one. With SSSE3 an in-place option is to do something like this (ignoring heads and tails and assuming alignment comes out right): c1 = _mm_load1_ps(kernel[1])
c2 = _mm_load1_ps(kernel[2])
c3 = _mm_load1_ps(kernel[3])
out = similar(in)
for j = 1:size(in,2)
for i = 5:size(in,1)-4
x1 = _mm_castps_si128(_mm_load_ps(in[i-4,j]))
y2 = _mm_load_ps(in[i,j])
x2 = _mm_castps_si128(y2)
x3 = _mm_castps_si128(_mm_load_ps(in[i+4,j]))
sum = _mm_mul_ps(y2, c2)
y1 = _mm_castsi128_ps(_mm_alignr_epi8(x2,x1,12))
sum = _mm_add_ps(sum, _mm_mul_ps(y1, c1))
y3 = _mm_castsi128_ps(_mm_alignr_epi8(x3,x2,4))
sum = _mm_add_ps(sum, _mm_mul_ps(y3, c3))
_mm_storeu_ps(out[i,j], sum)
end
end After thinking more about it I can imagine the first case could fit with the original proposal but I'm more doubtful about the second case. Anyway I wouldn't mind having the option to program SIMD in Julia at this level of abstraction. It's not pretty, so there certainly should be more convenient alternatives to get SIMD vectorization, but Julia's code generation facilities would come in very handy to avoid having to do this entirely by hand. |
@GunnarFarneback I guess probably your problem can be much easier addressed by faster implementations of If the code is really performance critical, then I think the fastest way to do that without GPU is to do a Currently, it may be just better to focus on making @StefanKarpinski's plan working. |
I should point out that this was originally @ViralBShah's plan; we then fleshed it out together and I just happen to be the one who put it up here. |
This was only a simplified example of the kinds of things I want to do. Of course it can be done in C (or in practice more likely C++) but that is exactly what I'm trying to avoid. |
Here's an old stackoverflow thread that is relevant. Seems like we are not the only ones who want such a thing: There is also the Intel SPMD program compiler that is BSD licensed: |
The Intel SPMD is quite interesting. We probably may not be able to use it directly (as it is a compiler dedicated to a specific C-extension language) though. But might be good to borrow some of the concepts there. Still, I think the easiest thing to do first is to provide SIMD types (either as opaque bit types or immutable composite, I am fine with either way), and expose intrinsic operations for them. Everything else can always be built on top later through high level interface. |
I had done some experimental work to add a vecFloat64 type. I have published this in the branch |
I'm very interested in having SIMD available in Julia. Any news on this topic? I agree with many comments above, that providing inlined SIMD instructions is a sensible first step. I'm not familiar with LLVM/JIT infrastructure Julia is built on, so I must ask: is it technically possible to guarantee that a SIMD instruction in Julia program will be translated to an equivalent single SIMD instruction in executable binaries? If this can be of any help, .NET and Mono, which are similarily JITed environments have SIMD API: http://tirania.org/blog/archive/2008/Nov-03.html and https://github.com/mono/Mono.Simd.Math. For higher level features like SIMDfied matrix operations, I suggest to require proper memory alignment and 4n or 8n size for arguments. In most cases where performance is so critical, it is not a big problem to specify suitable alignment and size. Misaligned or weird sized data could use only non-SIMD functions initially. |
Some of this work is likely to be similar to the recent Float16 work and the refactoring should help here as well. Cc: @loladiro |
I am looking forward to this for quite a while. Once SIMD intrinsics are in place, I can start experimenting stuff in NumericExtensions.jl. |
My take is that having two levels of SIMD support is good. The intrinsic level seems to work best for aggressive coders. Ideally, it should permit code to be written in a vector-width agnostic way (first there was SSE, then AVX, and AVX-512 has been announced). Overloading should be able to cover that need. For higher level use, I'm a fan of the OpenMP 4.0 "simd" loops (descendant of Intel's "pragma simd"). It lets the programmer tell the compiler there are no "simd gotchas" (notably guarantees on iteration progress) and the compiler takes care of the details. Maybe it could be expressed as a @simd macro preceding a Julia for loop? Reduction functionality has to be considered. Markup similar to reduction clauses is one way. Pattern-matching "var op=" is another way. I used to think the pattern-matching approach was awful, but I seen systems do this and I now appreciate that it has a certain simplicity, particularly if trying to extend to prefix-scan patterns. |
@ArchRobison Do take a look at the approach in #4202. |
@ViralBShah Did you mean how CoinMP.jl invokes the Coin library or something else? I'm not clear on the relation to SIMD programming. Is your |
@ViralBShah mistyped. He meant #4042, I believe. |
That's right. My simd branch is no longer around around and the approach that keno is taking in the above pull request is the right way to exploit forward. |
So tuple branch does not satisfy having SIMD types? |
It would be nice to enforce a common naming scheme for typealiases of SIMD tuples in Base. |
We have SIMD types now in the form of tuples, but since I separated out llvmcall we can't actually really do anything with them (LLVM will probably still be able to better optimize and emit appropriate SIMD operations). |
The approach we have taken is to exploit SIMD through compiler optimizations rather than exposing SIMD types. Should we close this? |
Sure. We can always reopen if we change our minds. |
Is there a wrap up anywhere? I can't really infer, what this means for FixedSizeArrays and operations that work on homogeneous composite types. I guess if FixedSizeArrays are implemented via NTuple simd acceleration should be available, right? What about these constructs: immutable Point{N, T}
vec::NTuple{N, T}
end Construction of |
In limited cases (local variables) we already map tuples to simd types. With the tuple redesign this will apply much more broadly. You'll be able to have a flat array of NTuple{4,Float32}, at which point we'll basically have full simd types. |
I have been floating the idea to expose and use SIMD instructions in a generic way. @JeffBezanson has suggested that this is quite feasible. @StefanKarpinski suggested VecInt and VecFloat bits types in this thread:
https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/h8EGtQvdq9U
To refine this further, it may be that we need VecInt32, VecInt64, VecFloat32, and VecFloat64 bits types. Arrays of these Vec types could correspond to simply reinterpreting arrays of the Int32, Int64, Float32, Float64.
We already do 16-byte alignment for memory allocation, although we may need 32-byte alignment for AVX instructions. This could be wasteful for small arrays, but there could be real performance gains for 2x2 and 3x3 matrix operations. As suggested in the thread above by @StefanKarpinski, we may need to do operations on the non-aligned portions in the head and tail separately.
Once we have this, we would probably want to implement all the vectorized base library operations using SIMD, which will give a nice performance boost until we have some kind of threading or fork/shmem based parallelism for Array objects.
Then there are also operations such as trigonometric and various math library functions, which are unlikely to be as accurate as openlibm, but a whole lot faster. Perhaps these should not be used by default, but available as vecsin etc.
The text was updated successfully, but these errors were encountered: