More flexible kernel selection #37
Conversation
```
@@ -42,6 +42,7 @@ install:
# the main build
script:
- |
  rustc --print cfg -Ctarget-cpu=native &&
```
I take it this is only for reporting? There is no need to `cargo build` with `-Ctarget-cpu=native` hereafter, since that only affects compile-time optimization?
Yes, this is just for reporting, so we can see what kind of machine the Travis test is running on!

I'm not that interested in `-Ctarget-cpu=native` at the moment, since we now use runtime target feature detection to deliver good performance by default instead. I do keep an eye on performance so it doesn't regress if the user supplies `-Ctarget-cpu=native`.
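Runtime target feature detection of the kind mentioned here can be sketched with the standard library's detection macro. This is a hypothetical standalone example (`kernel_name` is not a function in the crate), only illustrating how the kernel family could be chosen at runtime:

```rust
// Hypothetical sketch of runtime target-feature detection.
// On x86/x86_64 we query the CPU at runtime; elsewhere we fall back.
fn kernel_name() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return "avx";
        }
        if is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }
    "fallback"
}

fn main() {
    // Prints which kernel family this machine would select.
    println!("selected kernel: {}", kernel_name());
}
```

Because the check happens at runtime, a single binary built without `-Ctarget-cpu=native` can still dispatch to the fastest kernel the CPU supports.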
What I am currently unhappy with is that our definition of […]. The same is true for the kernel sizes themselves, which depend on the latency of the various operations and the number of operations that can be executed simultaneously. Does it make sense to introduce compile-time flags to prefer certain kernels?
@SuperFluffy this PR enables custom MR/NR and all other parameters, so I'm happy with that as a good first step. We can now resolve to any specific GemmKernel at runtime/compile time inside […].

I'd look at if and how we can detect microarch- and configuration-specific parameters (cache sizes) at build or runtime. I'm wary of only two things: optimizing for computers we don't have, which doesn't make much sense to me, and going too deep into this. I don't set out to replace BLAS or BLIS, but to have good performance by relatively simple and maintainable means. Again, I think performance that's enabled by default/automatically is the most interesting, since that's what's actually going to be used most of the time. That said, if we get more advanced options, there will be a need for enabling/disabling them.

This PR is also important for restoring performance on non-x86, since we can get separate settings for the fallback kernels.
461b782 to aa5cb3f
This PR seems to give a very small improvement for sgemm avx on my configuration, probably due to amortizing the feature detection, or changed inlining decisions.

```
name                 63 ns/iter  62 ns/iter  diff ns/iter  diff %
layout_f32_032::ccc  2,025       1,984       -41           -2.02%
layout_f32_032::ccf  2,029       1,989       -40           -1.97%
layout_f32_032::cfc  2,273       2,239       -34           -1.50%
layout_f32_032::cff  2,279       2,247       -32           -1.40%
layout_f32_032::fcc  1,773       1,726       -47           -2.65%
layout_f32_032::fcf  1,772       1,731       -41           -2.31%
layout_f32_032::ffc  2,029       2,040       11            0.54%
layout_f32_032::fff  2,034       2,032       -2            -0.10%
mat_mul_f32::m004    211         210         -1            -0.47%
mat_mul_f32::m006    211         209         -2            -0.95%
mat_mul_f32::m008    164         165         1             0.61%
mat_mul_f32::m012    531         529         -2            -0.38%
mat_mul_f32::m016    488         462         -26           -5.33%
mat_mul_f32::m032    2,060       1,987       -73           -3.54%
mat_mul_f32::m064    13,231      13,076      -155          -1.17%
mat_mul_f32::m127    95,668      93,970      -1,698        -1.77%
```

But the best improvement is that each kernel, including the fallback kernel, can now make its own inlining decisions and size decisions, and they can all optimize better (and autovectorize, if applicable).
Introduce a trait for selecting the gemm implementation (runtime selection). We use a trait because we want the pattern of a callback that is generic in the `K: GemmKernel` parameter; closures can't be generic, so a trait is needed.
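The callback pattern might be sketched like this (hypothetical names and kernel parameters; the crate's actual traits differ):

```rust
// Hypothetical sketch: each kernel carries its own parameters.
trait GemmKernel {
    const MR: usize;
    const NR: usize;
    fn name() -> &'static str;
}

struct Avx;
impl GemmKernel for Avx {
    const MR: usize = 8;
    const NR: usize = 8;
    fn name() -> &'static str { "avx" }
}

struct Fallback;
impl GemmKernel for Fallback {
    const MR: usize = 4;
    const NR: usize = 4;
    fn name() -> &'static str { "fallback" }
}

// The "generic callback": a closure can't be generic over K,
// but a trait method can.
trait GemmSelect<R> {
    fn select<K: GemmKernel>(self) -> R;
}

struct Launch;
impl GemmSelect<&'static str> for Launch {
    fn select<K: GemmKernel>(self) -> &'static str {
        // In a real crate this would launch the gemm loop with K.
        K::name()
    }
}

// Detection picks a concrete K and hands control back to the callback.
fn detect<R>(have_avx: bool, callback: impl GemmSelect<R>) -> R {
    if have_avx {
        callback.select::<Avx>()
    } else {
        callback.select::<Fallback>()
    }
}

fn main() {
    println!("{}", detect(true, Launch));
    println!("{}", detect(false, Launch));
}
```

The trait keeps kernel selection in one place while letting the caller's code be instantiated once per concrete kernel type.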
This finally separates the kernel impls for sgemm avx, sse2 and fallback, so that they can all use their own tweaked parameters.
This finally separates the kernel impls for dgemm avx, sse2 and fallback, so that they can all use their own tweaked parameters.
Like the other loop macros: we don't unroll them in debug builds, so that debug builds compile as quickly as possible.
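A macro of this shape could look like the following (a simplified, hypothetical version; the crate's actual loop macros differ in names and counts):

```rust
// Hypothetical sketch: unroll a fixed-count loop in release builds only,
// keeping debug builds quick to compile.
macro_rules! loop4 {
    ($i:ident, $body:expr) => {{
        #[cfg(debug_assertions)]
        {
            // Debug: a plain loop, minimal generated code.
            for $i in 0..4 { $body; }
        }
        #[cfg(not(debug_assertions))]
        {
            // Release: written-out iterations the optimizer can fold.
            let $i = 0; $body;
            let $i = 1; $body;
            let $i = 2; $body;
            let $i = 3; $body;
        }
    }};
}

fn main() {
    let mut sum = 0;
    loop4!(i, { sum += i });
    println!("sum = {sum}");
}
```

Both expansions compute the same result; only the shape of the emitted code differs between the two build profiles.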
This stands in for const generics; we only need a few sizes anyway. Use it for kernel sizes.
This function should be specialized depending on the kernel row length, so that its `copy_nonoverlapping` call gets instantiated with a fixed-length inline copy. This is necessary now that we have many different kernel configurations.
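Such specialization might be sketched as follows (`ConstNum`, `U4`, `U8`, and `pack_row` are hypothetical names standing in for the crate's actual static-dimension types and packing function):

```rust
use std::ptr;

// Hypothetical ad-hoc static-dimension types standing in for const generics.
trait ConstNum {
    const VALUE: usize;
}
struct U4;
struct U8;
impl ConstNum for U4 { const VALUE: usize = 4; }
impl ConstNum for U8 { const VALUE: usize = 8; }

// Packing copy specialized per kernel row length: `W::VALUE` is a
// compile-time constant, so copy_nonoverlapping is instantiated as a
// fixed-length inline copy in each monomorphization.
unsafe fn pack_row<W: ConstNum>(src: *const f32, dst: *mut f32) {
    ptr::copy_nonoverlapping(src, dst, W::VALUE);
}

fn main() {
    let src = [1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let mut dst = [0.0_f32; 8];
    // A 4-wide kernel gets a 4-element copy; an 8-wide kernel an 8-element one.
    unsafe { pack_row::<U4>(src.as_ptr(), dst.as_mut_ptr()) };
    println!("packed: {:?}", &dst[..4]);
}
```

Because the length is a constant in each instantiation, the compiler can replace the call with a handful of vector moves instead of a generic `memcpy`.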
Move the MR/NR constants into each function, since they don't make sense for the whole file anymore.
Allow separate GemmKernel implementations per feature, so that we can tweak kernel parameters per feature. This also allows us to restore the performance of the fallback kernels.
We introduce a selector trait; when entering for example `sgemm`, we pass control to `sgemm_kernel::detect`, and it selects which kernel implementation to use. Control passes back to the gemm module and it launches the `gemm_loop` using the selected kernel.

This approach allows us to insert a cache (ifunc strategy, #23) in the gemm module if we want to cache the function pointer to avoid running detection many times. But it's unlikely that we are going to need that.
This is also an improvement since we don't need to run detection on every kernel invocation! Repeated detection is just an atomic load and compare, so it's pretty cheap, but still.
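The function-pointer cache mentioned above could be sketched like this (hypothetical names; the thread refers to this as the ifunc strategy, and the crate doesn't currently implement it):

```rust
use std::sync::OnceLock;

type GemmFn = fn(usize) -> usize;

fn gemm_avx(n: usize) -> usize { n * 2 }       // stand-in kernel
fn gemm_fallback(n: usize) -> usize { n * 2 }  // stand-in kernel

// Detection runs once; after that, every call to gemm() is just an
// atomic load of the cached function pointer.
static SELECTED: OnceLock<GemmFn> = OnceLock::new();

fn detect() -> GemmFn {
    // Stand-in for real CPU feature detection.
    if cfg!(target_arch = "x86_64") { gemm_avx } else { gemm_fallback }
}

fn gemm(n: usize) -> usize {
    let f = SELECTED.get_or_init(detect);
    f(n)
}

fn main() {
    println!("gemm(21) = {}", gemm(21));
}
```

`OnceLock` (stable since Rust 1.70) handles the synchronization; in this sketch both kernels compute the same result, so which one is cached doesn't change the output.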
This change slightly increases the amount of compiled code due to instantiating the gemm loop with multiple kernels, but the difference in compile time is relatively small (about +25% in a crate that already compiles very quickly: less than 3 seconds in release mode on my configuration).
We also add some ad-hoc types for static dimensions, so that the matrix packing function is instantiated appropriately for each kernel width, now that kernel widths are more diverse.