Use fma, fused multiply add, for architectures supporting fma #35
Let's land the pure avx dgemm first
These performance gains are just lovely: [dgemm benchmark chart]

Sgemm is not as impressive, but has some serious improvement as well: [sgemm benchmark chart]
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue (Dec 3, 2018)
That's amazing
SuperFluffy added further commits to SuperFluffy/matrixmultiply that referenced this issue (Dec 4, Dec 4, Dec 5, and Dec 7, 2018)
This introduces a new trait `DgemmMultiplyAdd` that selects fused multiply-add if available, and multiplication followed by addition if not. Tests for the avx and fma kernels are disabled for now.
Modern Intel architectures supporting the fma instruction sets can perform the first loop, which calculates the matrix-matrix product between panels a and b, in one go using `_mm256_fmadd_pd`. We should implement these and see how they affect performance.