More flexible kernel selection #37
Conversation
```
@@ -42,6 +42,7 @@ install:
# the main build
script:
- |
  rustc --print cfg -Ctarget-cpu=native &&
```
I take it this is only for reporting? There is no need to `cargo build` with `-Ctarget-cpu=native` hereafter, since that only affects compile-time optimization?
Yes, this is just for reporting, so we can see what kind of machine the Travis test is running on!

I'm not that interested in `-Ctarget-cpu=native` at the moment, since we now use runtime target feature detection to deliver good performance by default instead. I do keep an eye on performance so it doesn't regress if the user supplies `-Ctarget-cpu=native`.
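Runtime target feature detection of the kind mentioned here can be sketched with the standard library's detection macro. This is a hypothetical standalone example (`kernel_name` is not a function in the crate), only illustrating how the kernel family could be chosen at runtime:

```rust
// Hypothetical sketch of runtime target-feature detection.
// On x86/x86_64 we query the CPU at runtime; elsewhere we fall back.
fn kernel_name() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return "avx";
        }
        if is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    "fallback"
}

fn main() {
    // Prints which kernel family this machine would select.
    println!("selected kernel: {}", kernel_name());
}
```

Because the check happens at runtime, a single binary built without `-Ctarget-cpu=native` can still dispatch to the fastest kernel the CPU supports.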
What I am currently unhappy with is that our definition of […]. The same is true for the kernel sizes themselves, which depend on the latency of the various operations and the number of operations that can be executed simultaneously. Does it make sense to introduce compile-time flags to prefer certain kernels?
@SuperFluffy this PR enables custom MR/NR and all other parameters, so I'm happy with that as a good first step. We can now resolve to any specific GemmKernel at runtime/compile time inside […].

I'd look at if and how we can detect microarch- and configuration-specific parameters (cache sizes) at build or runtime. I'm wary of only two things: optimizing for computers we don't have, which doesn't make much sense to me, and going too deep into this. I don't set out to replace BLAS or BLIS, but to have good performance by relatively simple and maintainable means. Again, I think performance that's enabled by default/automatically is the most interesting, since that's what's actually going to be used most of the time. That said, if we get more advanced options, there will be a need for enabling/disabling them.

This PR is also important for restoring performance on non-x86, since we can get separate settings for the fallback kernels.
461b782 to aa5cb3f
This PR seems to give a very small improvement for sgemm avx on my configuration, probably due to amortizing the feature detection, or changed inlining decisions.

```
name                 63 ns/iter  62 ns/iter  diff ns/iter  diff %
layout_f32_032::ccc  2,025       1,984       -41           -2.02%
layout_f32_032::ccf  2,029       1,989       -40           -1.97%
layout_f32_032::cfc  2,273       2,239       -34           -1.50%
layout_f32_032::cff  2,279       2,247       -32           -1.40%
layout_f32_032::fcc  1,773       1,726       -47           -2.65%
layout_f32_032::fcf  1,772       1,731       -41           -2.31%
layout_f32_032::ffc  2,029       2,040       11            0.54%
layout_f32_032::fff  2,034       2,032       -2            -0.10%
mat_mul_f32::m004    211         210         -1            -0.47%
mat_mul_f32::m006    211         209         -2            -0.95%
mat_mul_f32::m008    164         165         1             0.61%
mat_mul_f32::m012    531         529         -2            -0.38%
mat_mul_f32::m016    488         462         -26           -5.33%
mat_mul_f32::m032    2,060       1,987       -73           -3.54%
mat_mul_f32::m064    13,231      13,076      -155          -1.17%
mat_mul_f32::m127    95,668      93,970      -1,698        -1.77%
```

But the best improvement is that each kernel, including the fallback kernel, can now make its own inlining decisions and size decisions, and they can all optimize better (and autovectorize, if applicable).
Introduce a trait for selecting the gemm implementation (runtime selection). We use a trait because we want the pattern of a callback that is generic in the `K: GemmKernel` parameter; closures can't be generic, so a trait is needed.
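The callback pattern might be sketched like this (hypothetical names and kernel parameters; the crate's actual traits differ):

```rust
// Hypothetical sketch: each kernel carries its own parameters.
trait GemmKernel {
    const MR: usize;
    const NR: usize;
    fn name() -> &'static str;
}

struct Avx;
impl GemmKernel for Avx {
    const MR: usize = 8;
    const NR: usize = 8;
    fn name() -> &'static str { "avx" }
}

struct Fallback;
impl GemmKernel for Fallback {
    const MR: usize = 4;
    const NR: usize = 4;
    fn name() -> &'static str { "fallback" }
}

// The "generic callback": a closure can't be generic over K,
// but a trait method can.
trait GemmSelect<R> {
    fn select<K: GemmKernel>(self) -> R;
}

struct Launch;
impl GemmSelect<&'static str> for Launch {
    fn select<K: GemmKernel>(self) -> &'static str {
        // In a real crate this would launch the gemm loop with K.
        K::name()
    }
}

// Detection picks a concrete K and hands control back to the callback.
fn detect<R>(have_avx: bool, callback: impl GemmSelect<R>) -> R {
    if have_avx {
        callback.select::<Avx>()
    } else {
        callback.select::<Fallback>()
    }
}

fn main() {
    println!("{}", detect(true, Launch));
    println!("{}", detect(false, Launch));
}
```

The trait keeps kernel selection in one place while letting the caller's code be instantiated once per concrete kernel type.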
This finally separates the kernel impls for sgemm avx, sse2 and fallback, so that they can all use their own tweaked parameters.
This finally separates the kernel impls for dgemm avx, sse2 and fallback, so that they can all use their own tweaked parameters.
Like the other loop macros: we don't unroll them in debug builds, so that debug builds compile as quickly as possible.
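A macro of this shape could look like the following (a simplified, hypothetical version; the crate's actual loop macros differ in names and counts):

```rust
// Hypothetical sketch: unroll a fixed-count loop in release builds only,
// keeping debug builds quick to compile.
macro_rules! loop4 {
    ($i:ident, $body:expr) => {{
        #[cfg(debug_assertions)]
        {
            // Debug: a plain loop, minimal generated code.
            for $i in 0..4 { $body; }
        }
        #[cfg(not(debug_assertions))]
        {
            // Release: written-out iterations the optimizer can fold.
            let $i = 0; $body;
            let $i = 1; $body;
            let $i = 2; $body;
            let $i = 3; $body;
        }
    }};
}

fn main() {
    let mut sum = 0;
    loop4!(i, { sum += i });
    println!("sum = {sum}");
}
```

Both expansions compute the same result; only the shape of the emitted code differs between the two build profiles.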
This stands in for const generics; we only need a few sizes anyway. Use it for kernel sizes.
This function should be specialized depending on the kernel row length, so that its `copy_nonoverlapping` call gets instantiated with a fixed-length inline copy. This is necessary now that we have many different kernel configurations.
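Such specialization might be sketched as follows (`ConstNum`, `U4`, `U8`, and `pack_row` are hypothetical names standing in for the crate's actual static-dimension types and packing function):

```rust
use std::ptr;

// Hypothetical ad-hoc static-dimension types standing in for const generics.
trait ConstNum {
    const VALUE: usize;
}
struct U4;
struct U8;
impl ConstNum for U4 { const VALUE: usize = 4; }
impl ConstNum for U8 { const VALUE: usize = 8; }

// Packing copy specialized per kernel row length: `W::VALUE` is a
// compile-time constant, so copy_nonoverlapping is instantiated as a
// fixed-length inline copy in each monomorphization.
unsafe fn pack_row<W: ConstNum>(src: *const f32, dst: *mut f32) {
    ptr::copy_nonoverlapping(src, dst, W::VALUE);
}

fn main() {
    let src = [1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let mut dst = [0.0_f32; 8];
    // A 4-wide kernel gets a 4-element copy; an 8-wide kernel an 8-element one.
    unsafe { pack_row::<U4>(src.as_ptr(), dst.as_mut_ptr()) };
    println!("packed: {:?}", &dst[..4]);
}
```

Because the length is a constant in each instantiation, the compiler can replace the call with a handful of vector moves instead of a generic `memcpy`.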
Move the MR/NR constants into each function, since they don't make sense for the whole file anymore.
Allow separate GemmKernel implementations per feature, so that we can tweak kernel parameters per feature. This also allows us to restore the performance of the fallback kernels.
We introduce a selector trait; when entering for example `sgemm`, we pass control to `sgemm_kernel::detect`, and it selects which kernel implementation to use. Control passes back to the gemm module and it launches the `gemm_loop` using the selected kernel.

This approach allows us to insert a cache (ifunc strategy, #23) in the gemm module if we want to cache the function pointer to avoid running detection many times. But it's unlikely that we are going to need that.
This is also an improvement since we don't need to run detection on every kernel invocation! Repeated detection is just an atomic load and compare, so it's pretty cheap, but still.
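The function-pointer cache mentioned above could be sketched like this (hypothetical names; the thread refers to this as the ifunc strategy, and the crate doesn't currently implement it):

```rust
use std::sync::OnceLock;

type GemmFn = fn(usize) -> usize;

fn gemm_avx(n: usize) -> usize { n * 2 }       // stand-in kernel
fn gemm_fallback(n: usize) -> usize { n * 2 }  // stand-in kernel

// Detection runs once; after that, every call to gemm() is just an
// atomic load of the cached function pointer.
static SELECTED: OnceLock<GemmFn> = OnceLock::new();

fn detect() -> GemmFn {
    // Stand-in for real CPU feature detection.
    if cfg!(target_arch = "x86_64") { gemm_avx } else { gemm_fallback }
}

fn gemm(n: usize) -> usize {
    let f = SELECTED.get_or_init(detect);
    f(n)
}

fn main() {
    println!("gemm(21) = {}", gemm(21));
}
```

`OnceLock` (stable since Rust 1.70) handles the synchronization; in this sketch both kernels compute the same result, so which one is cached doesn't change the output.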
This change slightly increases the amount of compiled code due to instantiating the gemm loop with multiple kernels, but the difference in compile time is relatively small (about +25% in a crate that already compiles very quickly: less than 3 seconds in release mode on my configuration).
We also add some ad-hoc types for static dimensions, so that the matrix packing function is instantiated appropriately for each kernel width, now that kernel widths are more diverse.