GEMM numerical errors with complex single-precision (cgemm) on Apple M2 #3995
Note: It looks like cgemm ignores the imaginary part of |
Hmm, that's weird: M1/M2 (VORTEX target) use the same CGEMM kernel as Neoverse and most of the older Cortex A CPU lineup. At the very least, the testsuite error should have shown up elsewhere before. |
Can you please add which compiler (and version) you are using ? (And I assume you are trying a reasonably recent release, or the |
Thanks @martin-frbg for having a look at that, and sorry for my incomplete report. Currently, I built
Following up on your other question about CPUFAMILY, I get this (strange?) value:
$ sysctl hw.cpufamily
hw.cpufamily: -634136515
As for how spack builds openblas, I think it does not pass any specific
'make' '-j12' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'DYNAMIC_ARCH=1' 'DYNAMIC_OLDER=1' 'TARGET=GENERIC' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib' 'libs' 'netlib' 'shared' |
Thank you - the sysctl output looks like it overflowed - I have just pushed a PR with an id value I found on the 'net, but it would only be used for target autodetection in host-only builds anyway, not DYNAMIC_ARCH. |
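For readers following along: the negative hw.cpufamily value is consistent with a 32-bit id being printed as a signed integer. A minimal sketch of reinterpreting it as unsigned, assuming a 32-bit two's-complement value (the helper name `as_unsigned32` is made up for illustration and is not part of OpenBLAS or sysctl):

```python
def as_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (two's complement)."""
    return value & 0xFFFFFFFF


# Value printed by `sysctl hw.cpufamily` earlier in this thread.
reported = -634136515

print(as_unsigned32(reported))       # -> 3660830781
print(hex(as_unsigned32(reported)))  # -> 0xda33d83d
```

The hex form is the one to compare against whatever id value the autodetection code expects.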
I'm building on my M2 machine for my M2 machine. Since I'm using spack, I'm not fully aware of the options I should configure the build with. For instance, if you think that This is the logic in spack where it decides to use |
If you are only building for that machine, please try |
btw your build options worked fine on M1 with Apple clang 14.0.0 (clang-1400.0.29.202) and gfortran 12.2 from Homebrew. Ironically, building for "DYNAMIC_ARCH" on a Mac currently gets you everything except the "VORTEX" target, resulting in the runtime fallback to somewhat generic ARMV8 code (and even if "VORTEX" were there, it would not have been autoselected due to the unknown id). |
Sorry for the delay. I've temporarily changed the build recipe in spack to follow what you suggested about using
The differences in build lines are these:
- 'make' '-j12' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'DYNAMIC_ARCH=1' 'DYNAMIC_OLDER=1' 'TARGET=GENERIC' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib' 'libs' 'netlib' 'shared'
+ 'make' '-j1' 'tests' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=VORTEX' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib'
But now we get more test failures.
➜ ~ spack install --test root openblas
[+] /usr (external perl-5.34.0-pphzgjefjymx3xzk2ihl72a7suutdknm)
==> Installing openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl
==> No binary for openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl found: installing from source
==> Using cached archive: /Users/ialberto/spack/var/spack/cache/_source-cache/archive/5d/5d9491d07168a5d00116cdc068a40022c3455bf9293c7cb86a65b1054d7e5114.tar.gz
==> No patches needed for openblas
==> openblas: Executing phase: 'edit'
==> openblas: Executing phase: 'build'
==> Error: ProcessError: Command exited with status 2:
'make' '-j1' 'tests' 'CC=/Users/ialberto/spack/lib/spack/env/clang/clang' 'FC=/Users/ialberto/spack/lib/spack/env/clang/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=VORTEX' 'USE_LOCKING=1' 'USE_OPENMP=0' 'USE_THREAD=0' 'TIMER=INT_CPU_TIME' 'RANLIB=ranlib'
5 errors found in build log:
9204 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
9205 EXPECTED RESULT COMPUTED RESULT
9206 1 ( -0.764301 , 1.64338 ) ( -1.60147 , 0.377609 )
9207 2 ( -1.26522 , 0.713745 ) ( -1.05814 , -0.197428 )
9208 3 ( -0.340585 , -0.117354 ) ( -0.307966 , 0.307697 )
9209 THESE ARE THE RESULTS FOR COLUMN 9
>> 9210 ******* CGEMM FAILED ON CALL NUMBER:
9211 11601: CGEMM ('N','T', 3, 31, 31,( 0.7,-0.9), A, 4, B, 32,( 1.3,-1.1), C, 4).
9212
9213 CHEMM PASSED THE COMPUTATIONAL TESTS ( 1296 CALLS)
9214
9215 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
9216 EXPECTED RESULT COMPUTED RESULT
9217 1 ( -0.214247E-01, -0.479739 ) ( 0.224413 , -0.191207 )
9218 2 ( 0.518189 , -0.525011 ) ( 0.449746 , 0.532335E-01)
9219 THESE ARE THE RESULTS FOR COLUMN 7
>> 9220 ******* CSYMM FAILED ON CALL NUMBER:
9221 610: CSYMM ('R','L', 2, 7,( 0.7,-0.9), A, 8, B, 3,( 0.0, 0.0), C, 3) .
9222
9223 CTRMM PASSED THE COMPUTATIONAL TESTS ( 2592 CALLS)
9224
9225 CTRSM PASSED THE COMPUTATIONAL TESTS ( 2592 CALLS)
9226
...
9243 11 ( 0.887918 , -1.45399 ) ( 0.887918 , -1.45399 )
9244 12 ( -0.470259 , -0.537774 ) ( -0.470259 , -0.537774 )
9245 13 ( 1.05278 , 2.97257 ) ( 1.05279 , 2.97257 )
9246 14 ( -0.886690 , 0.321344 ) ( -0.886690 , 0.321344 )
9247 15 ( 0.996669 , 1.55196 ) ( 0.996669 , 1.55196 )
9248 THESE ARE THE RESULTS FOR COLUMN 17
>> 9249 ******* CHER2K FAILED ON CALL NUMBER:
9250 1295: CHER2K('L','C', 31, 31,( 0.7,-0.9), A, 32, B, 32, 1.0, C, 32) .
9251
9252 CSYR2K PASSED THE COMPUTATIONAL TESTS ( 1296 CALLS)
9253
9254 END OF TESTS
9255 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat3 < ./zblat3.dat
...
11028 TEST 33/36 dnrm2:dnrm2_tiny [FAIL]
11029 ERR: test_dnrm2.c:65 expected 0.000e+00, got inf (diff -inf, tol 1.000e-13)
11030 TEST 34/36 potrf:bug_695 [OK]
11031 TEST 35/36 potrf:smoketest_trivial [OK]
11032 TEST 36/36 kernel_regress:skx_avx [OK]
11033 RESULTS: 36 tests (35 ok, 1 failed, 0 skipped) ran in 7 ms
>> 11034 make[1]: *** [run_test] Error 1
>> 11035 make: *** [tests] Error 2
See build log for details:
/var/folders/n3/4fl85lys2s12bm0zf6xqcdmw0000gp/T/ialberto/spack-stage/spack-stage-openblas-0.3.23-wnfb64533ja4uy7gcchpzk4jfgt65ezl/spack-build-out.txt |
Thanks - this is somewhat surprising as it would suggest the M2 is markedly different from the M1 (unless it is the compiler playing optimization tricks on us - the CTRMM kernels should be exactly the same in both ARMV8 and VORTEX setups and the DNRM2 failure in the utest corner case does not show up on M1). Do you get any suspicious compiler warnings in the build log ? Oh, and do you get the CGEMM/CSYMM/CHER2K failures in both test and ctest ? |
I will check thoroughly in the next hours (hopefully by the end of tomorrow). Just some quick feedback about a couple of tests I did just before your last comment that I didn't have the time to report. In addition to previous tests with Not sure about which kind of tests I used, but my next try will be to build it manually to have full direct control over the build, and I will also run the tests manually. Stay tuned 😉 |
Hmmm. Getting the utest failure now(?) on M1 as well, although this was supposed to be fixed. For the CGEMM, can you please try changing the register assignment for ALPHA_I in kernel/arm64/cgemm_kernel_8x4.S from "w18" to "w19" ? |
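For context, Apple's arm64 ABI documentation lists x18 as a reserved platform register, which may be why w18 was suspect here; that link is an inference, not something stated in the thread. A rough sketch of how one might scan an assembly source for uses of register 18, assuming `//` comments (the function name `find_reg18_uses` and the sample text are made up for illustration and are not the contents of the actual kernel file):

```python
import re

# Matches general-purpose register 18 in either width (x18 / w18) as a
# whole token, so that e.g. "w180" is not reported.
REG18 = re.compile(r"\b[wx]18\b")


def find_reg18_uses(asm_text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for lines whose code uses x18/w18."""
    hits = []
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("//", 1)[0]  # ignore trailing // comments
        if REG18.search(code):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical sample, not copied from kernel/arm64/cgemm_kernel_8x4.S.
sample = """\
#define ALPHA_R w17
#define ALPHA_I w18
fmov s0, w19 // w18 is mentioned only in a comment here
"""

for lineno, text in find_reg18_uses(sample):
    print(lineno, text)  # -> 2 #define ALPHA_I w18
```

Running something like this over the other arm64 kernels could show whether the same register is used elsewhere, though macro indirection would need manual review.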
Hopefully all addressed by #4003 now ... |
It took a bit of time to get back, but here I am. These are the notes I collected during my last tests. TL;DR
Clang@12 + GCC@11.3.0 (for fortran compiler)
I tried these three branches:
0.3.23
develop
PR4003
Linker warnings
In all builds above, I spotted a long list of link-time warnings in the output.
ld: warning: dylib (/opt/homebrew/Cellar/gcc@11/11.3.0/lib/gcc/11/libgfortran.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: dylib (/opt/homebrew/Cellar/gcc@11/11.3.0/lib/gcc/11/libquadmath.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: could not create compact unwind for ___emutls_get_address: registers 23 and 24 not saved contiguously in frame
ld: warning: could not create compact unwind for _spotrf2_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgetrf2_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbbrd_: registers 25 and 26 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbcon_: registers 23 and 24 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbequ_: registers 72 and 73 not saved contiguously in frame
ld: warning: could not create compact unwind for _sgbrfs_: registers 27 and 28 not saved contiguously in frame
...
CLANG@12 + GCC@12.2.0 (for fortran compiler)
Reading the first two warning lines above, I decided to try a newer version of GCC, GCC@12.2.0 (also provided by Homebrew). I still get the first two warnings, but the rest of the linker warnings are gone.
ld: warning: dylib (/opt/homebrew/Cellar/gcc/12.2.0/lib/gcc/current/libquadmath.dylib) was built for newer macOS version (13.0) than being linked (11.0)
ld: warning: dylib (/opt/homebrew/Cellar/gcc/12.2.0/lib/gcc/current/libgfortran.dylib) was built for newer macOS version (13.0) than being linked (11.0)
And, more importantly, GCC@12.2.0 passed all tests! 🍾
OpenBLAS tests pass, but... not my reproducer
Unfortunately, I tried my reproducer and, despite the OpenBLAS test results, it still did not report the right result. ❌
What's next?
The next step will be to test #4003 but forcing In the meanwhile, if there is any additional test you'd like me to run, e.g. if the linker warnings ring a bell, just let me know. |
Thank you, will fix the c_check regression in #4004 . Curious how/why the DNRM2 testcase still fails despite my attempt in #4003 to prevent the compiler from removing critical code. Anyway at the time the utest code is run, the build of the library should already be complete and it should be possible to get the other parts of the internal tests to execute by running |
BTW I can get the DNRM2 testcase to work (again) by moving my declaration of the |
Modified #4003 accordingly to "fix" DNRM2 on "VORTEX" for good. (And incidentally, an "armv8-a" generic "ARMV8" build would use a different DNRM2 kernel that has not shown any problems with the "Inf vs. 0" corner case so far.) |
Built #4003 (+ #4004 patch) with both compiler pairs
And:
|
Thanks for the feedback, I have merged #4003 now. (BTW the SciPy folks are also seeing some issues with their testsuite and unpatched 0.3.23 - not clear yet if it is the exact same problem or perhaps tiny FMA-related inaccuracies on the new hardware) |
On Apple M2 there seems to be a numerical error when using CGEMM, i.e. only with the single-precision complex type. Actually, we found problems with CHER2K as well, but our speculation is that it probably shares code with GEMM.
Hereafter, a small reproducer and an excerpt of the internal OpenBLAS tests where the CGEMM failure is reported.
Reproducer
Here is a reproducer of the GEMM problem.
Whose output on an Apple M2 is
when it is supposed to give exactly the same result, i.e. (-2, 0).
OpenBLAS test
Here is an excerpt of the output of the OpenBLAS tests, where a failure related to CGEMM is reported (it was the only failure).
Thanks to @rasolca for helping to investigate this.
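The reproducer code itself did not survive in this copy of the thread, and the following is not it. As a rough illustration of what a cross-check for a complex GEMM result could look like, here is a naive pure-Python reference for C = alpha*A*B + beta*C, assuming row-major lists of Python complex numbers and no transposition (the name `ref_cgemm` is made up for this sketch):

```python
def ref_cgemm(alpha, a, b, beta, c):
    """Naive reference for C = alpha * A @ B + beta * C on complex matrices.

    a is m x k, b is k x n, c is m x n, all as lists of rows.
    Returns a new m x n matrix; the inputs are not modified.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0j] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0j
            for p in range(k):
                acc += a[i][p] * b[p][j]  # full complex multiply-add
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


a = [[1 + 1j, 0j], [0j, 1j]]
b = [[1 - 1j, 2 + 0j], [3 + 0j, 1j]]
c = [[0j, 0j], [0j, 0j]]

result = ref_cgemm(1 + 0j, a, b, 0j, c)
print(result[1][1])  # 1j * 1j, so the real part is -1 and the imaginary part is 0
```

Comparing a library result against such a reference element by element, with a tolerance suited to single precision, would expose a kernel that drops or corrupts an imaginary part.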