Incorrect answer in nrm2 computation on Neoverse-n1 #2998
I believe Neoverse just reuses the ThunderX2 code path, so cc @ashwinyes who wrote that code originally.
Also cc @ianshmean @yuyichao
I'm not very familiar with Julia. So this is trying to compute nrm2 of a 100-element vector in which every element is 0 except one, which is -Inf. Right?
That is correct.
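For readers without Julia at hand, the input described above can be sketched in plain Python. This does not call OpenBLAS; it only shows what a direct sum-of-squares evaluation gives under IEEE 754 arithmetic. The position of the -Inf element is an assumption, since the thread only says that one element differs.

```python
import math

# 100-element vector: all zeros except a single -Inf.
# Placing -Inf at index 0 is an assumption made for illustration.
x = [0.0] * 100
x[0] = -math.inf

# Direct (unscaled) Euclidean norm: sqrt(sum of squares).
# (-inf) * (-inf) is +inf, and adding zeros leaves it at +inf.
sum_sq = sum(v * v for v in x)
norm = math.sqrt(sum_sq)

print(norm)  # inf
```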
0.0 is divided by that Inf element down the road; it should be NaN, since that result is mathematically undefined.
There's no division in this definition. Also every other kernel gives
I see it here (and the same in the reference Fortran), so giving Inf is wrong.
The logical option would be to scan the inputs for all kinds of NaNs and use the reference algorithm if any are found.
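The scan-and-fall-back idea could look roughly like the sketch below. The names `nrm2_with_fallback`, `naive_nrm2` and `traced_reference` are hypothetical stand-ins, not OpenBLAS functions, and the sketch checks for any non-finite value (NaN or ±Inf) rather than NaN alone, which is an assumption about what the scan would need to cover.

```python
import math
from typing import Callable, Sequence

Kernel = Callable[[Sequence[float]], float]


def nrm2_with_fallback(x: Sequence[float], fast: Kernel, reference: Kernel) -> float:
    """Use `fast` unless the input contains NaN or +/-Inf, else use `reference`."""
    if all(math.isfinite(v) for v in x):
        return fast(x)
    return reference(x)


# Stand-in kernel, for illustration only: direct sqrt of the sum of squares.
def naive_nrm2(x: Sequence[float]) -> float:
    return math.sqrt(sum(v * v for v in x))


# Stand-in "reference" kernel that records when it is called.
calls = []


def traced_reference(x: Sequence[float]) -> float:
    calls.append(len(x))
    return naive_nrm2(x)


x = [0.0] * 99 + [-math.inf]
result = nrm2_with_fallback(x, naive_nrm2, traced_reference)
finite_result = nrm2_with_fallback([3.0, 4.0], naive_nrm2, traced_reference)
```

The cost of this design is one extra pass over the input before the fast kernel runs, which is the trade-off such a pre-scan would have to justify.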
That C code gives
The answer should be It would take some time for me to fix it, as I don't have the free cycles to look at it. As a temporary workaround, Neoverse could switch to the C code kernel, accepting a trade-off in performance. Also, the C code version will not be parallelized.
Adding @docularxu to the loop, who has SVE BLAS implementation experience on Arm.
For the moment, @Keno, what I saw is: The code in
Yes, the routines were switched back in #3048.
I lack the hardware to test this, but I'm now suspecting that the error is not so much in the assembly but in the final computation of the square root (which is done in the C code at the very end of the file) - if the embedded assembly returns the -Inf to it (as it should), this will trivially cause a domain error in sqrt(), leading to the NaN result. (The ARMV8 nrm2.S obviously does everything including the sqrt in assembly, which I guess is what makes it return
Sorry, I could not find time to look at this earlier. I will look at it in the coming days. Note: I wrote the C implementation only for doing multithreaded nrm2 for large input vectors (>=10000). Otherwise it should be the same as nrm2.s.
Not quite the same though - your C implementation is called in both the large and the small case; in the latter it still calls the new assembly without the FSQRT and then does a C sqrt() on its result.
In fact, now that I look at it, there are some more differences in the assembly implementation for double precision. The C sqrt is not the issue. There was an Inf / Inf happening in the assembly code, resulting in NaN. #3052 should fix this issue. @Keno Please test. The single precision implementation should be correct with the existing code itself.
Bit late to the party, but I can confirm the bug reported above appears to be fixed in latest
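To illustrate how an Inf / Inf division can turn a norm into NaN, here is a minimal Python sketch of a scaled sum-of-squares loop, modeled on the classic scaling approach used by reference nrm2 implementations. It is not a transcription of the OpenBLAS assembly, and `scaled_nrm2` is a hypothetical name. In this particular sketch the division only appears once a second infinite element is encountered; the assembly evidently reached the same kind of division by a different route.

```python
import math
from typing import Sequence


def scaled_nrm2(x: Sequence[float]) -> float:
    """Euclidean norm via running scale and scaled sum of squares."""
    scale, ssq = 0.0, 1.0
    for v in x:
        if math.isnan(v):
            return math.nan  # propagate NaN inputs directly
        if v == 0.0:
            continue  # zeros contribute nothing
        a = abs(v)
        if scale < a:
            # New largest magnitude: rescale the accumulated sum.
            ssq = 1.0 + ssq * (scale / a) ** 2
            scale = a
        else:
            # With a == scale == inf this computes inf / inf, which is NaN.
            ssq += (a / scale) ** 2
    return scale * math.sqrt(ssq)


one_inf = scaled_nrm2([0.0] * 99 + [-math.inf])  # 0 / inf == 0, result stays inf
two_inf = scaled_nrm2([math.inf, -math.inf])     # inf / inf poisons ssq with NaN
finite = scaled_nrm2([3.0, 4.0])
```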
This also

* drops a patch (`deps/patches/neoverse-generic-kernels.patch`) not needed anymore for an [old bug](OpenMathLib/OpenBLAS#2998) fixed upstream in OpenBLAS. This results in ~5x speedup in the computation of `BLAS.nrm2` (and hence `LinearAlgebra.norm` for vectors longer than `LinearAlgebra.NRM2_CUTOFF` (== 32) elements) when the neoversen1 kernels are used, e.g. by default on all Apple Silicon CPUs
* adds a regression test for the above bug
* updates other patches when building openblas from source

Corresponding PR in Yggdrasil: JuliaPackaging/Yggdrasil#7202.
Neoverse N1 (AWS Graviton2):
Works ok with the generic armv8 kernels (and on other architectures)