Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect answer in nrm2 computation on Neoverse-n1 #2998

Closed
Keno opened this issue Nov 20, 2020 · 18 comments
Closed

Incorrect answer in nrm2 computation on Neoverse-n1 #2998

Keno opened this issue Nov 20, 2020 · 18 comments

Comments

@Keno
Copy link
Contributor

Keno commented Nov 20, 2020

Neoverse N1 (AWS Graviton2):

julia> using LinearAlgebra
julia> a = zeros(Float64, 100)
julia> a[1] = -Inf
julia> BLAS.nrm2(a)
NaN
julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.12 DYNAMIC_ARCH NO_AFFINITY neoversen1 MAX_THREADS=32"

Works ok with the generic armv8 kernels (and on other architectures)

OPENBLAS_CORETYPE=armv8 ./julia
julia> using LinearAlgebra
julia> a = zeros(Float64, 100)
julia> a[1] = -Inf
julia> BLAS.nrm2(a)
Inf
julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.12 DYNAMIC_ARCH NO_AFFINITY armv8 MAX_THREADS=32"
@Keno
Copy link
Contributor Author

Keno commented Nov 20, 2020

I believe Neoverse just reuses the ThunderX2 code path, so cc @ashwinyes who wrote that code originally.

@Keno
Copy link
Contributor Author

Keno commented Nov 20, 2020

Also cc @ianshmean @yuyichao

@ashwinyes
Copy link
Contributor

Not much familiar with Julia lang. So this trying to get nrm2 of 100 element vector with all but one element as 0. The exception being -Inf . Right ?

@Keno
Copy link
Contributor Author

Keno commented Nov 20, 2020

That is correct.

@brada4
Copy link
Contributor

brada4 commented Nov 21, 2020

0.0 is divided by that Inf element down the road, it should be NaN since that math result is undefined.

@Keno
Copy link
Contributor Author

Keno commented Nov 21, 2020

There's no division in this definition. Also every other kernel gives Inf here ;).

@brada4
Copy link
Contributor

brada4 commented Nov 22, 2020

I see it here (and in reference fortran same) , so giving Inf is wrong.
https://github.com/xianyi/OpenBLAS/blob/ce3651516f12079f3ca2418aa85b9ad571c3a391/kernel/arm/nrm2.c#L69-L76

@brada4
Copy link
Contributor

brada4 commented Nov 22, 2020

Logical option would be to scan all kinds of NaNs in inputs and use reference algorithm if those are found.

@Keno
Copy link
Contributor Author

Keno commented Nov 22, 2020

That C code gives Inf on the input in question - 0.0/Inf is 0.0 not NaN. Regardless, even if it didn't, it doesn't change the mathematical definition of what this function does (which is just summing the squares of all the elements), making the correct answer definitely Inf. The division in there is just a rescaling.

@ashwinyes
Copy link
Contributor

The answer should be Inf. There could be some corner case not handled properly in the assembly implementation for ThunderX2.

It would take some time for me to fix it as I don't have the free cycles to look at it.

As a temporary workaround, Neoverse could switch to the C code kernel accepting a trade off in performance. Also the C code version will not be parallelized.

@AshokBhat
Copy link

AshokBhat commented Dec 29, 2020

Adding @docularxu to the loop, who has SVE BLAS implementation experience on Arm.
Hi @docularxu, would you have some bandwidth to solve this issue with @ashwinyes?
Regards
Ashok

@docularxu
Copy link

For the moment, @Keno , what I saw is:
armv8 uses: kernel/arm64/nrm2.S
neoversen1 uses: kernel/arm64/dznrm2_thunderx2t99.c

The code in dznrm2_thunderx2t99.c is actually embedded assembly, which I feel hard to understand (I got no experience in nrm2 kernel implementation, sorry) it, plus to know how Inf vs. NaN take effects in the source. But as a short cut, you guessed, if you can compile the OpenBLAS lib by yourself, try this modification: it uses the armv8 64-bit .S implementation. Maybe faster than falling back to pure C.

diff --git a/kernel/arm64/KERNEL.NEOVERSEN1 b/kernel/arm64/KERNEL.NEOVERSEN1
index ea010db4..074d7215 100644
--- a/kernel/arm64/KERNEL.NEOVERSEN1
+++ b/kernel/arm64/KERNEL.NEOVERSEN1
@@ -91,10 +91,10 @@ IDAMAXKERNEL   = iamax_thunderx2t99.c
 ICAMAXKERNEL   = izamax_thunderx2t99.c
 IZAMAXKERNEL   = izamax_thunderx2t99.c
 
-SNRM2KERNEL    = scnrm2_thunderx2t99.c
-DNRM2KERNEL    = dznrm2_thunderx2t99.c
-CNRM2KERNEL    = scnrm2_thunderx2t99.c
-ZNRM2KERNEL    = dznrm2_thunderx2t99.c
+SNRM2KERNEL    = nrm2.S
+DNRM2KERNEL    = nrm2.S
+CNRM2KERNEL    = znrm2.S
+ZNRM2KERNEL    = znrm2.S
 
 DDOTKERNEL     = dot_thunderx2t99.c
 SDOTKERNEL     = dot_thunderx2t99.c

@Keno
Copy link
Contributor Author

Keno commented Dec 31, 2020

Yes, the routines were switched back in #3048.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Dec 31, 2020

I lack the hardware to test this, but I'm now suspecting that the error is not so much in the assembly but in the final computation of the square root (which is done in the C code at the very end of the file) - if the embedded assembly returns the -Inf to it (as it should), this will trivially cause a domain error in sqrt(), leading to the NaN result. (The ARMV8 nrm2.S obviously does everything including the sqrt in assembly, which I guess is what makes it return Inf for srqt(-inf) simply because it lacks the error handling of the C library routine.)
So if I am not completely mistaken, a simple if ssq < 0) return INFINITY (or HUGE_VAL, or abs(ssq) if we need to support pre-C99 compilers here) before the sqrt should take care of this special case with (probably) minimal performance impact...

@ashwinyes
Copy link
Contributor

Sorry. I could not find time to look at this earlier. I will look at this in coming days.

Note: I wrote the C implementation only for doing multithreaded nrm2 for large input vectors (>=10000). Otherwise it should be same as nrm2.s .

@martin-frbg
Copy link
Collaborator

Not quite the same though - your C implementation is called in both large and small cases, in the latter it still calls the new assembly without the FSQRT and then does a C sqrt() on its result

@ashwinyes
Copy link
Contributor

In fact, now when I look at it, there are some more differences in the assembly implementation for double precision. The C sqrt is not the issue. There was a Inf / Inf happening in the assembly code resulting in NaN.

#3052 should fix this issue. @Keno Please test.

The single precision implementation should be correct with the existing code itself.

@giordano
Copy link
Contributor

Bit late to the party, but I can confirm the bug reported above appears to be fixed in latest develop branch.

giordano added a commit to JuliaLang/julia that referenced this issue Jan 26, 2024
This also

* drops a patch (`deps/patches/neoverse-generic-kernels.patch`) not
needed anymore for an [old
bug](OpenMathLib/OpenBLAS#2998) fixed upstream
in OpenBLAS. This results in ~5x speedup in the computation of
`BLAS.nrm2` (and hence `LinearAlgebra.norm` for vectors longer than
`LinearAlgebra.NRM2_CUTOFF` (== 32) elements) when the neoversen1
kernels are used, e.g. by default on all Apple Silicon CPUs
* adds a regression test for the above bug
* updates other patches when building openblas from source

Corresponding PR in Yggdrasil:
JuliaPackaging/Yggdrasil#7202.
KristofferC pushed a commit to JuliaLang/LinearAlgebra.jl that referenced this issue Nov 14, 2024
This also

* drops a patch (`deps/patches/neoverse-generic-kernels.patch`) not
needed anymore for an [old
bug](OpenMathLib/OpenBLAS#2998) fixed upstream
in OpenBLAS. This results in ~5x speedup in the computation of
`BLAS.nrm2` (and hence `LinearAlgebra.norm` for vectors longer than
`LinearAlgebra.NRM2_CUTOFF` (== 32) elements) when the neoversen1
kernels are used, e.g. by default on all Apple Silicon CPUs
* adds a regression test for the above bug
* updates other patches when building openblas from source

Corresponding PR in Yggdrasil:
JuliaPackaging/Yggdrasil#7202.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants