Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEMM numerical errors with complex single-precision (cgemm) on Apple M2 #3995

Closed
albestro opened this issue Apr 7, 2023 · 19 comments
Closed

Comments

@albestro
Copy link

albestro commented Apr 7, 2023

On Apple M2 it seems there is a numerical error when using CGEMM, so just with single precision complex type. Actually, we found problems with CHER2K as well, but our speculation is that it probably shares code with GEMM.

Hereafter, a small reproducer and an excerpt of internal OpenBLAS tests where it is reported CGEMM failure.

Reproducer

Here a reproducer of the GEMM problem.

#include <complex>
#include <iostream>

extern "C" {
  void zgemm_(const char*, const char*,
      int*, int*, int*,
      std::complex<double>*,
      std::complex<double>*, int*,
      std::complex<double>*, int*,
      std::complex<double>*,
      std::complex<double>*, int*);

  void cgemm_(const char*, const char*,
      int*, int*, int*,
      std::complex<float>*,
      std::complex<float>*, int*,
      std::complex<float>*, int*,
      std::complex<float>*,
      std::complex<float>*, int*);
}

template <class T>
using gemm_func_t = void (*)(const char*, const char*,
    int*, int*, int*,
    std::complex<T>*,
    std::complex<T>*, int*,
    std::complex<T>*, int*,
    std::complex<T>*,
    std::complex<T>*, int*);

template <class T>
void test_gemm(gemm_func_t<T> gemm) {
  std::complex<T> a(1, 1);
  std::complex<T> b(1, 1);
  std::complex<T> c(1, 1);

  int m = 1, n = 1, k = 1;
  int ld = 1;

  std::complex<T> alpha{0.0, 1.0}, beta{0.0, 0.0};

  gemm("N", "N",
      &m, &n, &k,
      &alpha,
      &a, &ld,
      &b, &ld,
      &beta,
      &c, &ld);

  std::cout << c << "\n";
}

int main() {
  test_gemm<float>(&cgemm_);
  test_gemm<double>(&zgemm_);
}

Whose output on an Apple M2 is

$ ./example
(0,0)
(-2,0)

when it is supposed to give exactly the same result, i.e. (-2, 0).

OpenBLAS test

Here an excerpt of the output of OpenBLAS tests, where a failure related to CGEMM is reported (it was the only failure).

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (  -0.716881    ,  -0.369739    )  (  -0.716881    ,  -0.369739    )
       2  (   0.418750    ,    1.08548    )  (   0.418750    ,    1.08548    )
       3  (   0.211250    ,   -1.68483    )  (    1.21094    ,  -0.470845    )
 ******* CGEMM  FAILED ON CALL NUMBER:
   9674: CGEMM ('T','N',  3,  1, 31,( 0.7,-0.9), A, 32, B, 32,( 1.0, 0.0), C,  4).

Thanks to @rasolca for helping investigating this.

@rasolca
Copy link

rasolca commented Apr 7, 2023

Note: It looks like that cgemm ignores the imaginary part of alpha.

@martin-frbg
Copy link
Collaborator

Hmm that's weird, M1/M2 (VORTEX target) uses the same CGEMM kernel as Neoverse and most of the older Cortex A cpu lineup. At least the testsuite error should have shown up elsewhere before.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Apr 7, 2023

Can you please add which compiler (and version) you are using ? (And I assume you are trying a reasonably recent release, or the develop branch - do not get misled by the old master branch.) Our CI records no CGEMM error on M1, and I would assume the (documented parts of the) M2 architecture to be sufficiently similar.
Autodetection may be a different matter - can you please provide the output of the "hw.cpufamily" sysctl ? Unless you specified TARGET=VORTEX, you will probably have obtained a (mostly) generic ARMV8 build (which nevertheless should have a working CGEMM, given the commonalities mentioned above)

@albestro
Copy link
Author

Thanks @martin-frbg for having a look at that, and sorry for my incomplete report.

Currently, I built [email protected] (using spack) with [email protected] and [email protected] as Fortran compiler (both got via Homebrew).

Following up your other question about CPUFAMILY, I get this (?strange) value

$ sysctl hw.cpufamily
hw.cpufamily: -634136515

and about how spack builds openblas, I think it does not pass any specific TARGET, just ARCH=arm64 DYNAMIC_ARCH=1.

'make' '-j12' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'DYNAMIC_ARCH=1' 'DYNAMIC_OLDER=1' 'TARGET=GENERIC' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib' 'libs' 'netlib' 'shared'

@martin-frbg
Copy link
Collaborator

Thank you - the sysctl output looks like it overflowed - I have just pushed a PR with an id value I found on the 'net, but it would only be used for target autodetection in host-only buildd anyway, not DYNAMIC_ARCH.
(DYNAMIC_ARCH should fall back to generic ARMV8 when it encounters an unknown cpu - which should work at least in theory, maybe your compiler version is over-optimizing either the function itself or its test code). Did you do the build on the M2, or on some other hardware (which would influence compiler options used for the common parts of the code) ?

@albestro
Copy link
Author

I'm building on my M2 machine for my M2 machine. Since I'm using spack, I'm not fully aware of the options I should configure the build with.

For instance, if you think that DYNAMIC_ARCH should not be used, or you would propose a different configuration for M2, I can give it a try and eventually fix the spack package.

This is the logic in spack where it decides to use DYNAMIC_ARCH=1 TARGET=GENERIC, which might just be a fallback that we could improve.

@martin-frbg
Copy link
Collaborator

If you are only building for that machine, please try TARGET=VORTEX instead of DYNAMIC_ARCH=1 TARGET=GENERIC

@martin-frbg
Copy link
Collaborator

btw your build options worked fine on M1 with Apple clang 14.0.0 (clang-1400.0.29.202) and gfortran 12.2 from Homebrew
(this was on gcc104 in the GCC Compile Farm, a Mac Mini - unfortunately I do not have access to an M2-based system).

Ironically, building for "DYNAMIC_ARCH" on a Mac currently gets you everything except the "VORTEX" target, resulting in the runtime fallback to somewhat generic ARMV8 code (and even if "VORTEX" was there, it would not have been autoselected due to the unknown id)

@albestro
Copy link
Author

Sorry for the delay. I've temporarily changed the build recipe in spack to follow what you suggested about using ARCH=arm64 TARGET=VORTEX.

The differences in build lines are these

- 'make' '-j12' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'DYNAMIC_ARCH=1' 'DYNAMIC_OLDER=1' 'TARGET=GENERIC' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib' 'libs' 'netlib' 'shared'
+ 'make' '-j1' 'tests' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=VORTEX' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib'

But now we get more test failures.

~ spack install --test root openblas
[+] /usr (external perl-5.34.0-pphzgjefjymx3xzk2ihl72a7suutdknm)
==> Installing openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl
==> No binary for openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl found: installing from source
==> Using cached archive: /Users/ialberto/spack/var/spack/cache/_source-cache/archive/5d/5d9491d07168a5d00116cdc068a40022c3455bf9293c7cb86a65b1054d7e5114.tar.gz
==> No patches needed for openblas
==> openblas: Executing phase: 'edit'
==> openblas: Executing phase: 'build'
==> Error: ProcessError: Command exited with status 2:
    'make' '-j1' 'tests' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=VORTEX' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib'

5 errors found in build log:
     9204      ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
     9205                            EXPECTED RESULT                    COMPUTED RESULT
     9206            1  (  -0.764301    ,    1.64338    )  (   -1.60147    ,   0.377609    )
     9207            2  (   -1.26522    ,   0.713745    )  (   -1.05814    ,  -0.197428    )
     9208            3  (  -0.340585    ,  -0.117354    )  (  -0.307966    ,   0.307697    )
     9209           THESE ARE THE RESULTS FOR COLUMN   9
  >> 9210     ******* CGEMM  FAILED ON CALL NUMBER:
     9211       11601: CGEMM ('N','T',  3, 31, 31,( 0.7,-0.9), A,  4, B, 32,( 1.3,-1.1), C,  4).
     9212
     9213      CHEMM  PASSED THE COMPUTATIONAL TESTS (  1296 CALLS)
     9214
     9215      ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
     9216                            EXPECTED RESULT                    COMPUTED RESULT
     9217            1  (  -0.214247E-01,  -0.479739    )  (   0.224413    ,  -0.191207    )
     9218            2  (   0.518189    ,  -0.525011    )  (   0.449746    ,   0.532335E-01)
     9219           THESE ARE THE RESULTS FOR COLUMN   7
  >> 9220     ******* CSYMM  FAILED ON CALL NUMBER:
     9221         610: CSYMM ('R','L',  2,  7,( 0.7,-0.9), A,  8, B,  3,( 0.0, 0.0), C,  3)    .
     9222
     9223      CTRMM  PASSED THE COMPUTATIONAL TESTS (  2592 CALLS)
     9224
     9225      CTRSM  PASSED THE COMPUTATIONAL TESTS (  2592 CALLS)
     9226

     ...

     9243           11  (   0.887918    ,   -1.45399    )  (   0.887918    ,   -1.45399    )
     9244           12  (  -0.470259    ,  -0.537774    )  (  -0.470259    ,  -0.537774    )
     9245           13  (    1.05278    ,    2.97257    )  (    1.05279    ,    2.97257    )
     9246           14  (  -0.886690    ,   0.321344    )  (  -0.886690    ,   0.321344    )
     9247           15  (   0.996669    ,    1.55196    )  (   0.996669    ,    1.55196    )
     9248           THESE ARE THE RESULTS FOR COLUMN  17
  >> 9249     ******* CHER2K FAILED ON CALL NUMBER:
     9250        1295: CHER2K('L','C', 31, 31,( 0.7,-0.9), A, 32, B, 32, 1.0, C, 32)           .
     9251
     9252      CSYR2K PASSED THE COMPUTATIONAL TESTS (  1296 CALLS)
     9253
     9254      END OF TESTS
     9255     OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat3 < ./zblat3.dat

     ...

     11028    TEST 33/36 dnrm2:dnrm2_tiny [FAIL]
     11029      ERR: test_dnrm2.c:65  expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)
     11030    TEST 34/36 potrf:bug_695 [OK]
     11031    TEST 35/36 potrf:smoketest_trivial [OK]
     11032    TEST 36/36 kernel_regress:skx_avx [OK]
     11033    RESULTS: 36 tests (35 ok, 1 failed, 0 skipped) ran in 7 ms
  >> 11034    make[1]: *** [run_test] Error 1
  >> 11035    make: *** [tests] Error 2

See build log for details:
  /var/folders/n3/4fl85lys2s12bm0zf6xqcdmw0000gp/T/ialberto/spack-stage/spack-stage-openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl/spack-build-out.txt

@martin-frbg
Copy link
Collaborator

Thanks - this is somewhat surprising as it would suggest the M2 is markedly different from the M1 (unless it is the compiler playing optimization tricks on us - the CTRMM kernels should be exactly the same in both ARMV8 and VORTEX setups and the DNRM2 failure in the utest corner case does not show up on M1). Do you get any suspicious compiler warnings in the build log ? Oh, and do you get the CGEMM/CSYMM/CHER2K failures in both test and ctest ?

@albestro
Copy link
Author

I will check thoroughly in the next hours (hopefully by the end of tomorrow).

Just a quick feedback about a couple of tests I did just before your last comment that I didn't had the time to report.

In addition to previous tests with [email protected], I gave a try to [email protected] and [email protected] too with same options ARCH=arm64 TARGET=VORTEX, and I got different test failures (strange enough). I will report more in detail later if it might be helpful.

Not sure about which kind of tests I used, but my next try will be to build it manually to have full direct control over the build, and I will also run manually tests.

Stay tuned 😉

@martin-frbg
Copy link
Collaborator

Hmmm. Getting the utest failure now(?) on M1 as well, although this was supposed to be fixed. For the CGEMM, can you please try changing the register assignment for ALPHA_I in kernel/arm64/cgemm_kernel_8x4.S from "w18" to "w19" ?

@martin-frbg
Copy link
Collaborator

Hopefully all addressed by #4003 now ...

@albestro
Copy link
Author

It took a bit of time to get back, but here I am. These are the notes I collected during my last tests.

TL;DR

  • 🤔 building with clang@12 + [email protected] reports linking warnings, and various branches fail in similar (but different) ways.
  • 👍 switching to clang@12 + [email protected] fixed linking warnings and all test passed in 0.3.23!
  • 👎 unfortunately my reproducer still report the wrong result
  • 👉 in openblas@develop (after v0.3.23 release) c_check needs to be fixed (missing space before "[" at line 41)
  • 🤔 using -march=armv8.3-a fails in a different way TEST 33/36 dnrm2:/dnrm2_tiny [FAIL] ERR: test_dnrm2.c:65 expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)

Clang@12 + [email protected] (for fortran compiler)

I tried these three branches:

0.3.23

  • passing 'ARCH=arm64' 'TARGET=VORTEX' => march=armv8.3-a
     TEST 33/36 dnrm2:dnrm2_tiny [FAIL]
     ERR: test_dnrm2.c:65  expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)
  • without passing TARGET/ARCH/DYNAMIC_ARCH/... => march=armv8-a
    • CGEMM fails
    • cblas_cgemm fails

develop

  • missing space before ] at line 41 of c_check
  • warning: result of comparison of constant 3660830781 with expression of type 'int32_t' (aka 'int') is always false
  • without passing TARGET/ARCH/DYNAMIC_ARCH/... => march=armv8-a
    • CSYRK fails
    • cblas_cgemm fails

PR4003

  • missing space before ] at line 41 of c_check
  • without passing TARGET/ARCH/DYNAMIC_ARCH/... => march=armv8.3-a
     TEST 33/36 dnrm2:dnrm2_tiny [FAIL]
     ERR: test_dnrm2.c:65  expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)
    then it stops, so no clue about cgemm

Linker warnings

In all builds above, by looking at the output I was able to spot a long list of link-time warnings.

ld: warning: dylib (/opt/homebrew/Cellar/gcc@11/11.3.0/lib/gcc/11/libgfortran.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: dylib (/opt/homebrew/Cellar/gcc@11/11.3.0/lib/gcc/11/libquadmath.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: could not create compact unwind for ___emutls_get_address: registers 23 and 24 not saved contiguously in frame
ld: warning: could not create compact unwind for _spotrf2_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgetrf2_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbbrd_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbcon_: registers 23 and 24 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbequ_: registers 72 and 73 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbrfs_: registers 27 and 28 not saved contiguously in frame
...

CLANG@12 + [email protected] (for fortran compiler)

Reading the first two lines above, I decided to give a try to a fresher version of the GCC, [email protected] (always provided by HomeBrew). I still get the first warning lines, but I don't get anymore the rest of linking warnings.

ld: warning: dylib (/opt/homebrew/Cellar/gcc/12.2.0/lib/gcc/current/libquadmath.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: dylib (/opt/homebrew/Cellar/gcc/12.2.0/lib/gcc/current/libgfortran.dylib) was built for newer macOS version (13.0) than being linked (11.0)

And, more importantly, [email protected] passed all tests! 🍾

Openblas test passes, but... not my reproducer

Unfortunately, I tried my reproducer, and despite openblas test results, it was not reporting the right result. ❌

What's next?

Next try will be to test #4003 but forcing arm-v8 instead of arm-v8.3a (otherwise I don't get openblas built), and see (if it passes internal tests, which is perfectly reasonable) what will be my reproducer results. I'll let you know next week.

In the meanwhile, if you have any additional test you'd like to do, e.g. linking warnings ring a bell to you, just let me know.

@martin-frbg
Copy link
Collaborator

Thank you, will fix the c_check regression in #4004 .
The "was built for newer macOS" linker warning is known to result from a mismatch between the Apple SDK in use and the MACOSX_DEPLOYMENT_TARGET environment variable. The latter is currently set to 11.0 in Makefile.system (to fix an earlier problem where the SDK itself apparently defaulted to an inappropriate version - PR #3670). From the message
you got, it looks like you need to set it to 13. (Not entirely sure yet if/how this depends on installed OS or SDK - on the gccfarm machine with kernel 21.6.0, the corresponding warning vanishes when the deployment target is set to 12.0)

Curious how/why the DNRM2 testcase still fails despite my attempt in #4003 to prevent the compiler from removing critical code. Anyway at the time the utest code is run, the build of the library should already be complete and it should be possible to get the other parts of the internal tests to execute by running make -C test and make -C ctest respectively (repeating any other arguments you gave to the initial make).

@martin-frbg
Copy link
Collaborator

martin-frbg commented Apr 14, 2023

BTW I can get the DNRM2 testcase to work (again) by moving my declaration of the volatile FLOAT sca from the top of the CNAME function in dznrm2_thunderx2t99.c down into the sca = fabs(scale); line (where I originally had it before cleaning up the patch for the PR). Pure compiler shenanigans...

@martin-frbg
Copy link
Collaborator

Modified #4003 accordingly to "fix" DNRM2 on "VORTEX" for good. (And incidentally, an "armv8-a" generic "ARMV8" build would use a different DNRM2 kernel that has not shown any problems with the "Inf vs. 0" corner case so far.)
As far as I could find out, the linker warnings about being unable to create a "compact unwind" should be harmless, but the only way to get rid of them appears to be by adding linker flags to suppress creation of a compact unwind and keep the "dwarf" unwind of the embedded debug information instead as seen in grain-lang/binaryen.ml#136 - not sure if this is worth the bother at this time.

@albestro
Copy link
Author

Built #4003 (+ #4004 patch) with both compilers pairs

  • clang@12 + gcc@11
  • clang@12 + gcc@12

And:

  • 👍🏻 without specifying target/arch, it now uses -march=armv8.3-a
  • ✅ passes all tests
  • ✅ passes the reproducer
  • ✅ our linear algebra library now passess all tests
  • ℹ️ gcc@11 still reports warnings about unwind, so confirm it does not seem to generate any problem
  • ℹ️ export MACOSX_DEPLOYMENT_TARGET=13.0; fixes the warning about the mismatch

@martin-frbg
Copy link
Collaborator

Thanks for the feedback, I have merged #4003 now. (BTW the SciPy folks are also seeing some issues with their testsuite and unpatched 0.3.23 - not clear yet if it is the exact same problem or perhaps tiny FMA-related inaccuracies on the new hardware)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants