GEMM numerical errors with complex single-precision (cgemm) on Apple M2 #3995

albestro · 2023-04-07T14:01:07Z

On Apple M2 it seems there is a numerical error when using CGEMM, so just with single precision complex type. Actually, we found problems with CHER2K as well, but our speculation is that it probably shares code with GEMM.

Hereafter, a small reproducer and an excerpt of internal OpenBLAS tests where it is reported CGEMM failure.

Reproducer

Here a reproducer of the GEMM problem.

#include <complex>
#include <iostream>

extern "C" {
  void zgemm_(const char*, const char*,
      int*, int*, int*,
      std::complex<double>*,
      std::complex<double>*, int*,
      std::complex<double>*, int*,
      std::complex<double>*,
      std::complex<double>*, int*);

  void cgemm_(const char*, const char*,
      int*, int*, int*,
      std::complex<float>*,
      std::complex<float>*, int*,
      std::complex<float>*, int*,
      std::complex<float>*,
      std::complex<float>*, int*);
}

template <class T>
using gemm_func_t = void (*)(const char*, const char*,
    int*, int*, int*,
    std::complex<T>*,
    std::complex<T>*, int*,
    std::complex<T>*, int*,
    std::complex<T>*,
    std::complex<T>*, int*);

template <class T>
void test_gemm(gemm_func_t<T> gemm) {
  std::complex<T> a(1, 1);
  std::complex<T> b(1, 1);
  std::complex<T> c(1, 1);

  int m = 1, n = 1, k = 1;
  int ld = 1;

  std::complex<T> alpha{0.0, 1.0}, beta{0.0, 0.0};

  gemm("N", "N",
      &m, &n, &k,
      &alpha,
      &a, &ld,
      &b, &ld,
      &beta,
      &c, &ld);

  std::cout << c << "\n";
}

int main() {
  test_gemm<float>(&cgemm_);
  test_gemm<double>(&zgemm_);
}

Whose output on an Apple M2 is

$ ./example
(0,0)
(-2,0)

when it is supposed to give exactly the same result, i.e. (-2, 0).

OpenBLAS test

Here an excerpt of the output of OpenBLAS tests, where a failure related to CGEMM is reported (it was the only failure).

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (  -0.716881    ,  -0.369739    )  (  -0.716881    ,  -0.369739    )
       2  (   0.418750    ,    1.08548    )  (   0.418750    ,    1.08548    )
       3  (   0.211250    ,   -1.68483    )  (    1.21094    ,  -0.470845    )
 ******* CGEMM  FAILED ON CALL NUMBER:
   9674: CGEMM ('T','N',  3,  1, 31,( 0.7,-0.9), A, 32, B, 32,( 1.0, 0.0), C,  4).

Thanks to @rasolca for helping investigating this.

The text was updated successfully, but these errors were encountered:

rasolca · 2023-04-07T15:26:02Z

Note: It looks like that cgemm ignores the imaginary part of alpha.

martin-frbg · 2023-04-07T15:47:11Z

Hmm that's weird, M1/M2 (VORTEX target) uses the same CGEMM kernel as Neoverse and most of the older Cortex A cpu lineup. At least the testsuite error should have shown up elsewhere before.

martin-frbg · 2023-04-07T20:40:01Z

Can you please add which compiler (and version) you are using ? (And I assume you are trying a reasonably recent release, or the develop branch - do not get misled by the old master branch.) Our CI records no CGEMM error on M1, and I would assume the (documented parts of the) M2 architecture to be sufficiently similar.
Autodetection may be a different matter - can you please provide the output of the "hw.cpufamily" sysctl ? Unless you specified TARGET=VORTEX, you will probably have obtained a (mostly) generic ARMV8 build (which nevertheless should have a working CGEMM, given the commonalities mentioned above)

albestro · 2023-04-11T06:57:48Z

Thanks @martin-frbg for having a look at that, and sorry for my incomplete report.

Currently, I built [email protected] (using spack) with [email protected] and [email protected] as Fortran compiler (both got via Homebrew).

Following up your other question about CPUFAMILY, I get this (?strange) value

$ sysctl hw.cpufamily
hw.cpufamily: -634136515

and about how spack builds openblas, I think it does not pass any specific TARGET, just ARCH=arm64 DYNAMIC_ARCH=1.

'make' '-j12' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'DYNAMIC_ARCH=1' 'DYNAMIC_OLDER=1' 'TARGET=GENERIC' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib' 'libs' 'netlib' 'shared'

martin-frbg · 2023-04-11T07:36:50Z

Thank you - the sysctl output looks like it overflowed - I have just pushed a PR with an id value I found on the 'net, but it would only be used for target autodetection in host-only buildd anyway, not DYNAMIC_ARCH.
(DYNAMIC_ARCH should fall back to generic ARMV8 when it encounters an unknown cpu - which should work at least in theory, maybe your compiler version is over-optimizing either the function itself or its test code). Did you do the build on the M2, or on some other hardware (which would influence compiler options used for the common parts of the code) ?

albestro · 2023-04-11T12:59:03Z

I'm building on my M2 machine for my M2 machine. Since I'm using spack, I'm not fully aware of the options I should configure the build with.

For instance, if you think that DYNAMIC_ARCH should not be used, or you would propose a different configuration for M2, I can give it a try and eventually fix the spack package.

This is the logic in spack where it decides to use DYNAMIC_ARCH=1 TARGET=GENERIC, which might just be a fallback that we could improve.

martin-frbg · 2023-04-11T15:18:12Z

If you are only building for that machine, please try TARGET=VORTEX instead of DYNAMIC_ARCH=1 TARGET=GENERIC

martin-frbg · 2023-04-12T14:23:26Z

btw your build options worked fine on M1 with Apple clang 14.0.0 (clang-1400.0.29.202) and gfortran 12.2 from Homebrew
(this was on gcc104 in the GCC Compile Farm, a Mac Mini - unfortunately I do not have access to an M2-based system).

Ironically, building for "DYNAMIC_ARCH" on a Mac currently gets you everything except the "VORTEX" target, resulting in the runtime fallback to somewhat generic ARMV8 code (and even if "VORTEX" was there, it would not have been autoselected due to the unknown id)

albestro · 2023-04-13T08:12:01Z

Sorry for the delay. I've temporarily changed the build recipe in spack to follow what you suggested about using ARCH=arm64 TARGET=VORTEX.

The differences in build lines are these

- 'make' '-j12' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'DYNAMIC_ARCH=1' 'DYNAMIC_OLDER=1' 'TARGET=GENERIC' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib' 'libs' 'netlib' 'shared'
+ 'make' '-j1' 'tests' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=VORTEX' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib'

But now we get more test failures.

➜  ~ spack install --test root openblas
[+] /usr (external perl-5.34.0-pphzgjefjymx3xzk2ihl72a7suutdknm)
==> Installing openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl
==> No binary for openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl found: installing from source
==> Using cached archive: /Users/ialberto/spack/var/spack/cache/_source-cache/archive/5d/5d9491d07168a5d00116cdc068a40022c3455bf9293c7cb86a65b1054d7e5114.tar.gz
==> No patches needed for openblas
==> openblas: Executing phase: 'edit'
==> openblas: Executing phase: 'build'
==> Error: ProcessError: Command exited with status 2:
    'make' '-j1' 'tests' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=VORTEX' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib'

5 errors found in build log:
     9204      ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
     9205                            EXPECTED RESULT                    COMPUTED RESULT
     9206            1  (  -0.764301    ,    1.64338    )  (   -1.60147    ,   0.377609    )
     9207            2  (   -1.26522    ,   0.713745    )  (   -1.05814    ,  -0.197428    )
     9208            3  (  -0.340585    ,  -0.117354    )  (  -0.307966    ,   0.307697    )
     9209           THESE ARE THE RESULTS FOR COLUMN   9
  >> 9210     ******* CGEMM  FAILED ON CALL NUMBER:
     9211       11601: CGEMM ('N','T',  3, 31, 31,( 0.7,-0.9), A,  4, B, 32,( 1.3,-1.1), C,  4).
     9212
     9213      CHEMM  PASSED THE COMPUTATIONAL TESTS (  1296 CALLS)
     9214
     9215      ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
     9216                            EXPECTED RESULT                    COMPUTED RESULT
     9217            1  (  -0.214247E-01,  -0.479739    )  (   0.224413    ,  -0.191207    )
     9218            2  (   0.518189    ,  -0.525011    )  (   0.449746    ,   0.532335E-01)
     9219           THESE ARE THE RESULTS FOR COLUMN   7
  >> 9220     ******* CSYMM  FAILED ON CALL NUMBER:
     9221         610: CSYMM ('R','L',  2,  7,( 0.7,-0.9), A,  8, B,  3,( 0.0, 0.0), C,  3)    .
     9222
     9223      CTRMM  PASSED THE COMPUTATIONAL TESTS (  2592 CALLS)
     9224
     9225      CTRSM  PASSED THE COMPUTATIONAL TESTS (  2592 CALLS)
     9226

     ...

     9243           11  (   0.887918    ,   -1.45399    )  (   0.887918    ,   -1.45399    )
     9244           12  (  -0.470259    ,  -0.537774    )  (  -0.470259    ,  -0.537774    )
     9245           13  (    1.05278    ,    2.97257    )  (    1.05279    ,    2.97257    )
     9246           14  (  -0.886690    ,   0.321344    )  (  -0.886690    ,   0.321344    )
     9247           15  (   0.996669    ,    1.55196    )  (   0.996669    ,    1.55196    )
     9248           THESE ARE THE RESULTS FOR COLUMN  17
  >> 9249     ******* CHER2K FAILED ON CALL NUMBER:
     9250        1295: CHER2K('L','C', 31, 31,( 0.7,-0.9), A, 32, B, 32, 1.0, C, 32)           .
     9251
     9252      CSYR2K PASSED THE COMPUTATIONAL TESTS (  1296 CALLS)
     9253
     9254      END OF TESTS
     9255     OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat3 < ./zblat3.dat

     ...

     11028    TEST 33/36 dnrm2:dnrm2_tiny [FAIL]
     11029      ERR: test_dnrm2.c:65  expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)
     11030    TEST 34/36 potrf:bug_695 [OK]
     11031    TEST 35/36 potrf:smoketest_trivial [OK]
     11032    TEST 36/36 kernel_regress:skx_avx [OK]
     11033    RESULTS: 36 tests (35 ok, 1 failed, 0 skipped) ran in 7 ms
  >> 11034    make[1]: *** [run_test] Error 1
  >> 11035    make: *** [tests] Error 2

See build log for details:
  /var/folders/n3/4fl85lys2s12bm0zf6xqcdmw0000gp/T/ialberto/spack-stage/spack-stage-openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl/spack-build-out.txt

martin-frbg · 2023-04-13T10:46:50Z

Thanks - this is somewhat surprising as it would suggest the M2 is markedly different from the M1 (unless it is the compiler playing optimization tricks on us - the CTRMM kernels should be exactly the same in both ARMV8 and VORTEX setups and the DNRM2 failure in the utest corner case does not show up on M1). Do you get any suspicious compiler warnings in the build log ? Oh, and do you get the CGEMM/CSYMM/CHER2K failures in both test and ctest ?

albestro · 2023-04-13T12:48:33Z

I will check thoroughly in the next hours (hopefully by the end of tomorrow).

Just a quick feedback about a couple of tests I did just before your last comment that I didn't had the time to report.

In addition to previous tests with [email protected], I gave a try to [email protected] and [email protected] too with same options ARCH=arm64 TARGET=VORTEX, and I got different test failures (strange enough). I will report more in detail later if it might be helpful.

Not sure about which kind of tests I used, but my next try will be to build it manually to have full direct control over the build, and I will also run manually tests.

Stay tuned 😉

martin-frbg · 2023-04-13T13:06:16Z

Hmmm. Getting the utest failure now(?) on M1 as well, although this was supposed to be fixed. For the CGEMM, can you please try changing the register assignment for ALPHA_I in kernel/arm64/cgemm_kernel_8x4.S from "w18" to "w19" ?

martin-frbg · 2023-04-13T16:21:59Z

Hopefully all addressed by #4003 now ...

albestro · 2023-04-14T17:13:21Z

It took a bit of time to get back, but here I am. These are the notes I collected during my last tests.

TL;DR

🤔 building with clang@12 + [email protected] reports linking warnings, and various branches fail in similar (but different) ways.
👍 switching to clang@12 + [email protected] fixed linking warnings and all test passed in 0.3.23!
👎 unfortunately my reproducer still report the wrong result
👉 in openblas@develop (after v0.3.23 release) c_check needs to be fixed (missing space before "[" at line 41)
🤔 using -march=armv8.3-a fails in a different way TEST 33/36 dnrm2:/dnrm2_tiny [FAIL] ERR: test_dnrm2.c:65 expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)

Clang@12 + [email protected] (for fortran compiler)

I tried these three branches:

0.3.23 394a9fb
develop (Fix x86 detection error #4002) 6f759a9
Fix instabilities in CGEMM/CTRMM/DNRM2 on Apple M1/M2 under OSX #4003 f096a33

0.3.23

passing 'ARCH=arm64' 'TARGET=VORTEX' => march=armv8.3-a

 TEST 33/36 dnrm2:dnrm2_tiny [FAIL]
 ERR: test_dnrm2.c:65  expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)

without passing TARGET/ARCH/DYNAMIC_ARCH/... => march=armv8-a
- CGEMM fails
- cblas_cgemm fails

develop

missing space before ] at line 41 of c_check
warning: result of comparison of constant 3660830781 with expression of type 'int32_t' (aka 'int') is always false
without passing TARGET/ARCH/DYNAMIC_ARCH/... => march=armv8-a
- CSYRK fails
- cblas_cgemm fails

PR4003

missing space before ] at line 41 of c_check

without passing TARGET/ARCH/DYNAMIC_ARCH/... => march=armv8.3-a

 TEST 33/36 dnrm2:dnrm2_tiny [FAIL]
 ERR: test_dnrm2.c:65  expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)

then it stops, so no clue about cgemm

Linker warnings

In all builds above, by looking at the output I was able to spot a long list of link-time warnings.

ld: warning: dylib (/opt/homebrew/Cellar/gcc@11/11.3.0/lib/gcc/11/libgfortran.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: dylib (/opt/homebrew/Cellar/gcc@11/11.3.0/lib/gcc/11/libquadmath.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: could not create compact unwind for ___emutls_get_address: registers 23 and 24 not saved contiguously in frame
ld: warning: could not create compact unwind for _spotrf2_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgetrf2_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbbrd_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbcon_: registers 23 and 24 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbequ_: registers 72 and 73 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbrfs_: registers 27 and 28 not saved contiguously in frame
...

CLANG@12 + [email protected] (for fortran compiler)

Reading the first two lines above, I decided to give a try to a fresher version of the GCC, [email protected] (always provided by HomeBrew). I still get the first warning lines, but I don't get anymore the rest of linking warnings.

ld: warning: dylib (/opt/homebrew/Cellar/gcc/12.2.0/lib/gcc/current/libquadmath.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: dylib (/opt/homebrew/Cellar/gcc/12.2.0/lib/gcc/current/libgfortran.dylib) was built for newer macOS version (13.0) than being linked (11.0)

And, more importantly, [email protected] passed all tests! 🍾

Openblas test passes, but... not my reproducer

Unfortunately, I tried my reproducer, and despite openblas test results, it was not reporting the right result. ❌

What's next?

Next try will be to test #4003 but forcing arm-v8 instead of arm-v8.3a (otherwise I don't get openblas built), and see (if it passes internal tests, which is perfectly reasonable) what will be my reproducer results. I'll let you know next week.

In the meanwhile, if you have any additional test you'd like to do, e.g. linking warnings ring a bell to you, just let me know.

martin-frbg · 2023-04-14T21:06:22Z

Thank you, will fix the c_check regression in #4004 .
The "was built for newer macOS" linker warning is known to result from a mismatch between the Apple SDK in use and the MACOSX_DEPLOYMENT_TARGET environment variable. The latter is currently set to 11.0 in Makefile.system (to fix an earlier problem where the SDK itself apparently defaulted to an inappropriate version - PR #3670). From the message
you got, it looks like you need to set it to 13. (Not entirely sure yet if/how this depends on installed OS or SDK - on the gccfarm machine with kernel 21.6.0, the corresponding warning vanishes when the deployment target is set to 12.0)

Curious how/why the DNRM2 testcase still fails despite my attempt in #4003 to prevent the compiler from removing critical code. Anyway at the time the utest code is run, the build of the library should already be complete and it should be possible to get the other parts of the internal tests to execute by running make -C test and make -C ctest respectively (repeating any other arguments you gave to the initial make).

martin-frbg · 2023-04-14T21:35:07Z

BTW I can get the DNRM2 testcase to work (again) by moving my declaration of the volatile FLOAT sca from the top of the CNAME function in dznrm2_thunderx2t99.c down into the sca = fabs(scale); line (where I originally had it before cleaning up the patch for the PR). Pure compiler shenanigans...

martin-frbg · 2023-04-15T10:15:12Z

Modified #4003 accordingly to "fix" DNRM2 on "VORTEX" for good. (And incidentally, an "armv8-a" generic "ARMV8" build would use a different DNRM2 kernel that has not shown any problems with the "Inf vs. 0" corner case so far.)
As far as I could find out, the linker warnings about being unable to create a "compact unwind" should be harmless, but the only way to get rid of them appears to be by adding linker flags to suppress creation of a compact unwind and keep the "dwarf" unwind of the embedded debug information instead as seen in grain-lang/binaryen.ml#136 - not sure if this is worth the bother at this time.

albestro · 2023-04-17T09:48:58Z

Built #4003 (+ #4004 patch) with both compilers pairs

clang@12 + gcc@11
clang@12 + gcc@12

And:

👍🏻 without specifying target/arch, it now uses -march=armv8.3-a
✅ passes all tests
✅ passes the reproducer
✅ our linear algebra library now passess all tests
ℹ️ gcc@11 still reports warnings about unwind, so confirm it does not seem to generate any problem
ℹ️ export MACOSX_DEPLOYMENT_TARGET=13.0; fixes the warning about the mismatch

martin-frbg · 2023-04-18T15:18:14Z

Thanks for the feedback, I have merged #4003 now. (BTW the SciPy folks are also seeing some issues with their testsuite and unpatched 0.3.23 - not clear yet if it is the exact same problem or perhaps tiny FMA-related inaccuracies on the new hardware)

martin-frbg mentioned this issue Apr 10, 2023

Support Apple A15/M2 cpus through the existing VORTEX target #4000

Merged

martin-frbg mentioned this issue Apr 13, 2023

Fix instabilities in CGEMM/CTRMM/DNRM2 on Apple M1/M2 under OSX #4003

Merged

martin-frbg closed this as completed Apr 19, 2023

albestro mentioned this issue May 3, 2023

openblas: fix build for Apple M2 spack/spack#37402

Draft

ViralBShah mentioned this issue Jan 26, 2024

Non-deterministic mul! of ComplexF32 arrays under Apple Silicon JuliaLang/LinearAlgebra.jl#1051

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GEMM numerical errors with complex single-precision (cgemm) on Apple M2 #3995

GEMM numerical errors with complex single-precision (cgemm) on Apple M2 #3995

albestro commented Apr 7, 2023 •

edited

Loading

rasolca commented Apr 7, 2023

martin-frbg commented Apr 7, 2023

martin-frbg commented Apr 7, 2023 •

edited

Loading

albestro commented Apr 11, 2023

martin-frbg commented Apr 11, 2023

albestro commented Apr 11, 2023

martin-frbg commented Apr 11, 2023

martin-frbg commented Apr 12, 2023

albestro commented Apr 13, 2023

martin-frbg commented Apr 13, 2023

albestro commented Apr 13, 2023

martin-frbg commented Apr 13, 2023

martin-frbg commented Apr 13, 2023

albestro commented Apr 14, 2023

martin-frbg commented Apr 14, 2023

martin-frbg commented Apr 14, 2023 •

edited

Loading

martin-frbg commented Apr 15, 2023

albestro commented Apr 17, 2023

martin-frbg commented Apr 18, 2023

GEMM numerical errors with complex single-precision (cgemm) on Apple M2 #3995

GEMM numerical errors with complex single-precision (cgemm) on Apple M2 #3995

Comments

albestro commented Apr 7, 2023 • edited Loading

Reproducer

OpenBLAS test

rasolca commented Apr 7, 2023

martin-frbg commented Apr 7, 2023

martin-frbg commented Apr 7, 2023 • edited Loading

albestro commented Apr 11, 2023

martin-frbg commented Apr 11, 2023

albestro commented Apr 11, 2023

martin-frbg commented Apr 11, 2023

martin-frbg commented Apr 12, 2023

albestro commented Apr 13, 2023

martin-frbg commented Apr 13, 2023

albestro commented Apr 13, 2023

martin-frbg commented Apr 13, 2023

martin-frbg commented Apr 13, 2023

albestro commented Apr 14, 2023

TL;DR

Clang@12 + [email protected] (for fortran compiler)

0.3.23

develop

PR4003

Linker warnings

CLANG@12 + [email protected] (for fortran compiler)

Openblas test passes, but... not my reproducer

What's next?

martin-frbg commented Apr 14, 2023

martin-frbg commented Apr 14, 2023 • edited Loading

martin-frbg commented Apr 15, 2023

albestro commented Apr 17, 2023

martin-frbg commented Apr 18, 2023

albestro commented Apr 7, 2023 •

edited

Loading

martin-frbg commented Apr 7, 2023 •

edited

Loading

martin-frbg commented Apr 14, 2023 •

edited

Loading