
More flexible kernel selection #37

Merged: 11 commits, Dec 7, 2018
Conversation

@bluss (Owner) commented Dec 4, 2018

Allow separate GemmKernel implementations per feature, so that we can tweak kernel parameters per feature. This also allows us to restore the performance of the fallback kernels.

We introduce a selector trait: when entering, for example, sgemm, control passes to sgemm_kernel::detect, which selects the kernel implementation to use. Control then passes back to the gemm module, which launches gemm_loop with the selected kernel.
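As a rough illustration of that selector pattern, here is a minimal sketch with simplified stand-ins for the crate's traits and kernel types (the names and signatures are illustrative, not the crate's actual ones, and it assumes an x86/x86_64 target for the feature-detection macro):

    pub trait GemmKernel {
        type Elem;
        // MR, NR and the microkernel itself would live here.
    }

    struct KernelAvx;
    struct KernelFallback;

    impl GemmKernel for KernelAvx { type Elem = f32; }
    impl GemmKernel for KernelFallback { type Elem = f32; }

    // The callback must be generic in the kernel type K, and closures
    // can't be generic, so a trait with a generic method stands in.
    pub trait GemmSelect<T> {
        fn select<K: GemmKernel<Elem = T>>(self, kernel: K);
    }

    // In the spirit of sgemm_kernel::detect: pick a kernel at runtime,
    // then hand control back through the selector.
    fn detect<G: GemmSelect<f32>>(selector: G) {
        if is_x86_feature_detected!("avx") {
            return selector.select(KernelAvx);
        }
        selector.select(KernelFallback)
    }

    // On the gemm module side, the selector launches the gemm loop with
    // whichever kernel was chosen.
    struct LaunchLoop;

    impl GemmSelect<f32> for LaunchLoop {
        fn select<K: GemmKernel<Elem = f32>>(self, _kernel: K) {
            // gemm_loop::<K>(...) would run here, monomorphized per kernel.
        }
    }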

This approach allows us to insert a cache (ifunc strategy #23) in the gemm module if we want to cache the function pointer to avoid running detection many times. But it's unlikely that we are going to need that.

This is also an improvement because we no longer run detection on every kernel invocation. (Repeated detection is just an atomic load and compare, so it's pretty cheap, but still.)
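If the cached function pointer ever does become worthwhile, a hypothetical sketch of the ifunc-style cache could look like the following; the names and signatures are made up for illustration, and it assumes a target where function pointers are pointer-sized and the detection macro is available:

    use std::mem;
    use std::sync::atomic::{AtomicUsize, Ordering};

    type GemmLoopFn = fn(m: usize, k: usize, n: usize);

    // 0 means "not resolved yet"; otherwise the selected loop is stored here.
    static GEMM_LOOP: AtomicUsize = AtomicUsize::new(0);

    fn gemm_loop_avx(_m: usize, _k: usize, _n: usize) { /* gemm_loop::<KernelAvx> */ }
    fn gemm_loop_fallback(_m: usize, _k: usize, _n: usize) { /* gemm_loop::<KernelFallback> */ }

    fn sgemm(m: usize, k: usize, n: usize) {
        let mut f = GEMM_LOOP.load(Ordering::Relaxed);
        if f == 0 {
            // First call: run detection once, then cache the chosen loop.
            f = if is_x86_feature_detected!("avx") {
                gemm_loop_avx as usize
            } else {
                gemm_loop_fallback as usize
            };
            GEMM_LOOP.store(f, Ordering::Relaxed);
        }
        // Sound here because f always holds a valid GemmLoopFn by this point.
        let f: GemmLoopFn = unsafe { mem::transmute(f) };
        f(m, k, n)
    }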

This change slightly increases the amount of compiled code, due to instantiating the gemm loop with multiple kernels, but the difference in compile time is relatively small (about +25% in a crate that already compiles very quickly: less than 3 seconds in release mode on my configuration).

We also add some ad-hoc types for static dimensions, so that the matrix packing function is instantiated appropriately for each kernel width, now that kernel widths are more diverse.
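A minimal sketch of such static-dimension types, assuming illustrative names (const generics were not yet stable when this PR was written):

    // Ad-hoc type-level integers standing in for const generics.
    pub trait ConstNum {
        const VALUE: usize;
    }

    pub struct U4;
    pub struct U8;

    impl ConstNum for U4 { const VALUE: usize = 4; }
    impl ConstNum for U8 { const VALUE: usize = 8; }

    // Anything generic over MR: ConstNum is monomorphized once per kernel
    // width, so MR::VALUE is a compile-time constant in each instance.
    fn row_width<MR: ConstNum>() -> usize {
        MR::VALUE
    }

Calling row_width::<U8>() then compiles down to the constant 8; the packing sketch near the end of this page uses the same mechanism.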

@@ -42,6 +42,7 @@ install:
# the main build
script:
- |
rustc --print cfg -Ctarget-cpu=native &&
A Contributor commented on the diff:
Am I right that this is only for reporting? There's no need to cargo build with -Ctarget-cpu=native after this, since that only affects compile-time optimization?

@bluss (Owner, Author) replied Dec 5, 2018:

Yes, this is just for reporting, so we can see what kind of machine the Travis test is running on!

I'm not that interested in -Ctarget-cpu=native at the moment, since we now use runtime target-feature detection to deliver good performance by default instead. I do keep an eye on making sure performance doesn't regress when the user supplies -Ctarget-cpu=native.
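For context, rustc --print cfg -Ctarget-cpu=native prints the cfg values (including target features) the compiler would enable for the host CPU; abridged output on a typical x86_64 Linux machine looks something like:

    debug_assertions
    target_arch="x86_64"
    target_endian="little"
    target_env="gnu"
    target_feature="avx"
    target_feature="avx2"
    target_feature="fma"
    target_feature="sse2"
    target_os="linux"
    unix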

@SuperFluffy (Contributor) commented:
What I am currently unhappy with is that our definitions of KernelAvx or KernelFma (once the other PR is merged) all assume Sandy Bridge as the optimal architecture by hardcoding 8x4 or 8x8 kernels, respectively. These are not necessarily optimal for every architecture. For example, if I recall correctly, Haswell has two FMA units while Zen only has one, so the optimal kernels for the two would differ slightly.

The same is true for the kernel sizes themselves, which depend on the latencies of the various operations and on how many operations can execute simultaneously.

Does it make sense to introduce compile-time flags to prefer certain kernels?

@bluss (Owner, Author) commented Dec 5, 2018

@SuperFluffy this PR enables custom MR/NR and all other parameters, so I'm happy with that as a good first step. We can now resolve to any specific GemmKernel at runtime or compile time inside dgemm_kernel::select. And we'll soon move things out into files named by architecture as well.

I'd look at whether and how we can detect microarchitecture- and configuration-specific parameters (cache sizes) at build time or runtime. I'm wary of only two things: optimizing for computers we don't have, which doesn't make much sense to me, and going too deep into this. I don't set out to replace BLAS or BLIS, but to get good performance by relatively simple and maintainable means.

Again, I think performance that's enabled by default/automatically is the most interesting kind, since that's what's actually going to be used most of the time. That said, if we get more advanced options, there will need to be a way to enable or disable them.

This PR is also important for restoring performance on non-x86 since we can get separate settings for the fallback kernels.

@bluss changed the title from WIP More flexible kernel selection to More flexible kernel selection on Dec 5, 2018
@bluss force-pushed the kernel-select branch 2 times, most recently from 461b782 to aa5cb3f, on December 5, 2018 at 23:32
@bluss (Owner, Author) commented Dec 6, 2018

This PR seems to give a very small improvement for sgemm avx on my configuration, probably due to amortizing the feature detection or to changed inlining decisions.

 name                 63 ns/iter  62 ns/iter  diff ns/iter  diff % 
 layout_f32_032::ccc  2,025       1,984                -41  -2.02% 
 layout_f32_032::ccf  2,029       1,989                -40  -1.97% 
 layout_f32_032::cfc  2,273       2,239                -34  -1.50% 
 layout_f32_032::cff  2,279       2,247                -32  -1.40% 
 layout_f32_032::fcc  1,773       1,726                -47  -2.65% 
 layout_f32_032::fcf  1,772       1,731                -41  -2.31% 
 layout_f32_032::ffc  2,029       2,040                 11   0.54% 
 layout_f32_032::fff  2,034       2,032                 -2  -0.10% 
 mat_mul_f32::m004    211         210                   -1  -0.47% 
 mat_mul_f32::m006    211         209                   -2  -0.95% 
 mat_mul_f32::m008    164         165                    1   0.61% 
 mat_mul_f32::m012    531         529                   -2  -0.38% 
 mat_mul_f32::m016    488         462                  -26  -5.33% 
 mat_mul_f32::m032    2,060       1,987                -73  -3.54% 
 mat_mul_f32::m064    13,231      13,076              -155  -1.17% 
 mat_mul_f32::m127    95,668      93,970            -1,698  -1.77%

But the best improvement is that each kernel, including the fallback kernel, can now make its own inlining and size decisions, and they can all optimize better (and autovectorize, where applicable).

Commit messages in this PR:

- Introduce a trait for selecting the gemm implementation (runtime selection). We use a trait since we want the pattern of a callback that is generic in the K: GemmKernel parameter; we can't have generic closures, so a trait it is.
- Separate the kernel impls for sgemm avx, sse2 and fallback, so that they can each use their own tweaked parameters.
- Separate the kernel impls for dgemm avx, sse2 and fallback, so that they can each use their own tweaked parameters.
- Like the other loop macros: don't unroll them in debug builds, so that debug builds are as quick as possible.
- Add ad-hoc types for static dimensions; this stands in for const generics, and we only need a few sizes anyway. Use them for kernel sizes.
- Specialize the packing function on the kernel row length, so that its copy_nonoverlapping call gets instantiated with a fixed-length inline copy (see the sketch below). This is necessary now that we have many different kernel configurations.
- Move the MR/NR constants into each function, since they no longer make sense for the whole file.
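A rough sketch of that packing specialization, reusing the ConstNum stand-in from earlier (names are illustrative, not the crate's actual ones):

    use std::ptr;

    // Because MR::VALUE is a compile-time constant in each instantiation,
    // copy_nonoverlapping lowers to a short fixed-length inline copy
    // rather than a call to memcpy.
    //
    // Safety: the caller must ensure src has rows * MR::VALUE readable
    // elements and dst the same number writable, non-overlapping with src.
    unsafe fn pack_panel<MR: ConstNum>(mut dst: *mut f32, src: *const f32, rows: usize) {
        for i in 0..rows {
            ptr::copy_nonoverlapping(src.add(i * MR::VALUE), dst, MR::VALUE);
            dst = dst.add(MR::VALUE);
        }
    }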