Incorrect answer in nrm2 computation on Neoverse-n1 #2998

Keno · 2020-11-20T04:13:49Z

Neoverse N1 (AWS Graviton2):

julia> using LinearAlgebra
julia> a = zeros(Float64, 100)
julia> a[1] = -Inf
julia> BLAS.nrm2(a)
NaN
julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.12 DYNAMIC_ARCH NO_AFFINITY neoversen1 MAX_THREADS=32"

Works ok with the generic armv8 kernels (and on other architectures)

OPENBLAS_CORETYPE=armv8 ./julia
julia> using LinearAlgebra
julia> a = zeros(Float64, 100)
julia> a[1] = -Inf
julia> BLAS.nrm2(a)
Inf
julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.12 DYNAMIC_ARCH NO_AFFINITY armv8 MAX_THREADS=32"

Keno · 2020-11-20T04:31:26Z

I believe Neoverse just reuses the ThunderX2 code path, so cc @ashwinyes who wrote that code originally.

Keno · 2020-11-20T04:48:54Z

Also cc @ianshmean @yuyichao

ashwinyes · 2020-11-20T06:22:30Z

I'm not very familiar with Julia. So this is trying to get the nrm2 of a 100-element vector in which every element is 0 except one, which is -Inf. Right?

Keno · 2020-11-20T09:46:33Z

That is correct.

brada4 · 2020-11-21T21:30:52Z

0.0 is divided by that Inf element down the road; it should be NaN, since that math result is undefined.

Keno · 2020-11-21T21:33:44Z

There's no division in this definition. Also every other kernel gives Inf here ;).

brada4 · 2020-11-22T11:24:34Z

I see it here (and the same in the reference Fortran), so giving Inf is wrong.
https://github.com/xianyi/OpenBLAS/blob/ce3651516f12079f3ca2418aa85b9ad571c3a391/kernel/arm/nrm2.c#L69-L76

brada4 · 2020-11-22T14:00:03Z

A logical option would be to scan the inputs for all kinds of NaNs and use the reference algorithm if any are found.

Keno · 2020-11-22T16:32:09Z

That C code gives Inf on the input in question - 0.0/Inf is 0.0 not NaN. Regardless, even if it didn't, it doesn't change the mathematical definition of what this function does (which is just summing the squares of all the elements), making the correct answer definitely Inf. The division in there is just a rescaling.

ashwinyes · 2020-11-22T16:45:07Z

The answer should be Inf. There could be some corner case not handled properly in the assembly implementation for ThunderX2.

It would take some time for me to fix it as I don't have the free cycles to look at it.

As a temporary workaround, Neoverse could switch to the C code kernel accepting a trade off in performance. Also the C code version will not be parallelized.

AshokBhat · 2020-12-29T10:51:02Z

Adding @docularxu to the loop, who has SVE BLAS implementation experience on Arm.
Hi @docularxu, would you have some bandwidth to solve this issue with @ashwinyes?
Regards
Ashok

docularxu · 2020-12-31T09:41:09Z

For the moment, @Keno , what I saw is:
armv8 uses: kernel/arm64/nrm2.S
neoversen1 uses: kernel/arm64/dznrm2_thunderx2t99.c

The code in dznrm2_thunderx2t99.c is actually embedded assembly, which I find hard to understand (I have no experience with nrm2 kernel implementations, sorry), let alone work out how Inf vs. NaN propagate through it. But as a shortcut, as you guessed, if you can compile the OpenBLAS library yourself, try this modification: it uses the armv8 64-bit .S implementation, which may be faster than falling back to pure C.

diff --git a/kernel/arm64/KERNEL.NEOVERSEN1 b/kernel/arm64/KERNEL.NEOVERSEN1
index ea010db4..074d7215 100644
--- a/kernel/arm64/KERNEL.NEOVERSEN1
+++ b/kernel/arm64/KERNEL.NEOVERSEN1
@@ -91,10 +91,10 @@ IDAMAXKERNEL   = iamax_thunderx2t99.c
 ICAMAXKERNEL   = izamax_thunderx2t99.c
 IZAMAXKERNEL   = izamax_thunderx2t99.c
 
-SNRM2KERNEL    = scnrm2_thunderx2t99.c
-DNRM2KERNEL    = dznrm2_thunderx2t99.c
-CNRM2KERNEL    = scnrm2_thunderx2t99.c
-ZNRM2KERNEL    = dznrm2_thunderx2t99.c
+SNRM2KERNEL    = nrm2.S
+DNRM2KERNEL    = nrm2.S
+CNRM2KERNEL    = znrm2.S
+ZNRM2KERNEL    = znrm2.S
 
 DDOTKERNEL     = dot_thunderx2t99.c
 SDOTKERNEL     = dot_thunderx2t99.c

Keno · 2020-12-31T10:03:09Z

Yes, the routines were switched back in #3048.

martin-frbg · 2020-12-31T16:47:24Z

I lack the hardware to test this, but I'm now suspecting that the error is not so much in the assembly but in the final computation of the square root (which is done in the C code at the very end of the file) - if the embedded assembly returns the -Inf to it (as it should), this will trivially cause a domain error in sqrt(), leading to the NaN result. (The ARMV8 nrm2.S obviously does everything including the sqrt in assembly, which I guess is what makes it return Inf for sqrt(-inf) simply because it lacks the error handling of the C library routine.)
So if I am not completely mistaken, a simple if (ssq < 0) return INFINITY (or HUGE_VAL, or abs(ssq) if we need to support pre-C99 compilers here) before the sqrt should take care of this special case with (probably) minimal performance impact...

ashwinyes · 2021-01-01T07:19:20Z

Sorry. I could not find time to look at this earlier. I will look at this in coming days.

Note: I wrote the C implementation only for doing multithreaded nrm2 for large input vectors (>=10000). Otherwise it should be the same as nrm2.S.

martin-frbg · 2021-01-01T07:44:01Z

Not quite the same though - your C implementation is called in both the large and small cases; in the latter it still calls the new assembly without the FSQRT and then does a C sqrt() on its result.

ashwinyes · 2021-01-01T10:52:46Z

In fact, now when I look at it, there are some more differences in the assembly implementation for double precision. The C sqrt is not the issue. There was an Inf / Inf happening in the assembly code resulting in NaN.

#3052 should fix this issue. @Keno Please test.

The single precision implementation should be correct with the existing code itself.
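
For reference, the arithmetic behind the Inf / Inf diagnosis above can be checked in plain C. This is a sketch of the IEEE 754 behaviour only, not of the ThunderX2 assembly; the thread does not say which operands met in the kernel, and `scaled_square` is a hypothetical helper.

```c
#include <math.h>

/* Hypothetical rescaling step: square of |x| divided by the current scale. */
static double scaled_square(double x, double scale)
{
    double r = fabs(x) / scale;
    return r * r;
}
```

Calling `scaled_square(-INFINITY, INFINITY)` divides Inf by Inf and yields NaN, whereas `scaled_square(0.0, INFINITY)` yields 0.0, which is why a kernel that divides an infinite element by an infinite scale produces NaN where one that never forms that quotient produces Inf.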

giordano · 2024-01-24T15:48:38Z

Bit late to the party, but I can confirm the bug reported above appears to be fixed in the latest develop branch.

This also
* drops a patch (`deps/patches/neoverse-generic-kernels.patch`) not needed anymore for an [old bug](OpenMathLib/OpenBLAS#2998) fixed upstream in OpenBLAS. This results in ~5x speedup in the computation of `BLAS.nrm2` (and hence `LinearAlgebra.norm` for vectors longer than `LinearAlgebra.NRM2_CUTOFF` (== 32) elements) when the neoversen1 kernels are used, e.g. by default on all Apple Silicon CPUs
* adds a regression test for the above bug
* updates other patches when building openblas from source

Corresponding PR in Yggdrasil: JuliaPackaging/Yggdrasil#7202.

Keno added a commit to JuliaPackaging/Yggdrasil that referenced this issue Nov 20, 2020

[OpenBLAS] Disable specialized Neoverse-N1 kernels for nrm2

e94a194

See OpenMathLib/OpenBLAS#2998

Keno added a commit to JuliaPackaging/Yggdrasil that referenced this issue Nov 20, 2020

[OpenBLAS] Disable specialized Neoverse-N1 kernels for nrm2

29a016f

See OpenMathLib/OpenBLAS#2998

Keno mentioned this issue Nov 20, 2020

[OpenBLAS] Disable specialized Neoverse-N1 kernels for nrm2 JuliaPackaging/Yggdrasil#2145

Merged

vchuravy pushed a commit to JuliaPackaging/Yggdrasil that referenced this issue Nov 20, 2020

[OpenBLAS] Disable specialized Neoverse-N1 kernels for nrm2 (#2145)

9d1fdad

See OpenMathLib/OpenBLAS#2998

martin-frbg mentioned this issue Nov 22, 2020

WIP/Test Add utest checking NRM2 with inf #3003

Closed

martin-frbg mentioned this issue Dec 21, 2020

Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1 #3048

Merged

martin-frbg closed this as completed May 2, 2021

giordano mentioned this issue Feb 22, 2022

[OpenBLAS] Add v0.3.20 JuliaPackaging/Yggdrasil#4485

Merged

martin-frbg mentioned this issue Nov 17, 2022

Updated RISC-V vector support for review. #3808

Closed

giordano mentioned this issue Jan 24, 2024

[OpenBLAS] Build the BFloat16 kernels in OpenBLAS JuliaPackaging/Yggdrasil#7202

Merged

giordano mentioned this issue Jan 26, 2024

[OpenBLAS_jll] Update to new build with BFloat16 kernels JuliaLang/julia#53059

Merged
