Restored ArmSVE general storage case. (#708)

Details:
- Restored general storage case in armsve kernels.
- Reason for doing this: Though real `g`-storage is difficult to speed up,
  the `g`-codepath here can provide good support for transposed storage,
  i.e., it is at least good for `GEMM_UKR_SETUP_CT_AMBI`.
- By experience, this solution is only *a little* slower than in-reg
  transpose. Plus, in-reg transpose is only possible for a fixed VL in our
  case.
- (cherry picked from 4e18cd3)

Refined emacs handling of indentation. (#717)

Details:
- This refines the emacs autoformatting to be better in line with
  contribution guidelines.
- Removed a stray shebang in a .mk file which confuses emacs about the
  file mode, which should be makefile-mode. (emacs also removes stray
  whitespace at the ends of lines.)
- (cherry picked from 0ba6e9e)

Updated hpx namespace for make_count_shape. (#725)

Details:
- The hpx namespace for *counting_shape changed. This PR updates the use
  of counting_shape in blis to comply with the change in hpx.
- Co-authored-by: ctaylor <[email protected]>
- (cherry picked from 059f151)

Added an 'arm64' entry to `.travis.yml`. (#726)

Details:
- Added a new 'arm64' entry to the .travis.yml file in an attempt to get
  Travis CI to compile both NEON and SVE kernels, even if only NEON
  kernels are exercised in the testing. With this new 'arm64' entry, the
  'cortexa57' entry becomes redundant and may be removed. Thanks to RuQing
  Xu for this suggestion.
- Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in
  bli_family_arm64.h, which meant that the default value of 64 was being
  used. This caused a runtime consistency check to fail in bli_gks.c (in
  Travis CI), one which requires that
    mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE
  for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is defined
  as
    BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2
  This commit increases BLIS_SIMD_MAX_SIZE to 128 for the 'arm64'
  configuration, thus overriding the default and (hopefully) avoiding the
  aforementioned consistency check failures.
- Appended '|| cat ./output.testsuite' to all 'make' commands in
  travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion.
- Whitespace changes.
- (cherry picked from 0b421ef)

Redirect grep stderr to /dev/null. (#723)

Details:
- In common.mk, added a redirection of stderr to /dev/null for the grep
  command being used to gather a list of header files #included from
  bli_cntx_ref.c. The redirection is desirable because as of grep 3.8,
  regular expressions with "stray" backslashes trigger warnings [1]. But
  removing the backslash seems to break the BLIS build system when using
  pre-3.8 versions of grep, so this seems to be the easiest way to satisfy
  the BLIS build system for both pre- and post-3.8 grep environments.
  [1] https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html
- (cherry picked from b1d3fc7)

Added runtime selection of 'power' config family. (#718)

Details:
- Created a 'power' umbrella configuration family, which, when targeted
  at configure-time, will build both 'power9' and 'power10' subconfigs.
  (With this feature, a BLIS shared library could be compiled on a power9
  system and run on power10 and vice-versa. Unoptimized code will execute
  if it is linked and run on any other generic system.)
- This new configuration family will only work with gcc, since that is
  the only compiler supported by both power9 and power10 subconfigs in
  BLIS.
- Documented power9 and power10 as supported microarchitectures in the
  docs/HardwareSupport.md document.
- (cherry picked from e3d352f)

Define `BLIS_VERSION_STRING` in `blis.h`. (#720)

Details:
- Previously, the version string was communicated from configure to
  config.mk (via the config.mk.in template), where it was included via
  the top-level Makefile, where it was then used to define the
  preprocessor macro BLIS_VERSION_STRING via a command line argument to
  the compiler (via -D). This macro is then used within bli_info.c to
  initialize a static string which can then be queried via the
  bli_info_get_version_str() function. However, there are some
  applications that may find utility in being able to access the version
  string by inspecting the monolithic (flattened) blis.h header file that
  is created at compile time and installed alongside the library. This
  commit moves the definition of BLIS_VERSION_STRING into bli_config.h
  (via the bli_config.h.in template) so that it is embedded in blis.h.
  The version string is now available in three places:
  - the static/shared library, which is installed in the 'lib'
    subdirectory of the install prefix (query-able via the
    bli_info_get_version_str() function);
  - the config.mk makefile fragment, which is installed in the 'share'
    subdirectory of the install prefix (in the VERSION variable);
  - the blis.h header file, which is installed in the 'include'
    subdirectory of the install prefix (via the BLIS_VERSION_STRING macro
    constant).
  Thanks to Mohsen Aznaveh and Tim Davis for providing the idea for this
  change.
- CREDITS file update.
- (cherry picked from e730c68)

Typecast printf() args to avoid compiler warnings. (#716)

Details:
- In bli_thread_range_tlb.c, typecast integer arguments passed to
  printf() -- which are typically disabled unless debugging -- to type
  "long" to guarantee a match to the "%ld" format specifiers used in
  those calls. This avoids spurious warnings with certain compilers in
  certain toolchain environments, such as 32-bit RISC-V (rv32iv).
- (cherry picked from dc5d00a)

Use here-document for 'configure --help' output. (#714)

Details:
- Changed the configure script function that outputs "--help" text to do
  so via so-called "here-document" syntax for improved readability and
  maintainability. The change eliminates hundreds of echo statements and
  makes it easier to change existing configure options' help text, along
  with other benefits such as eliminating the need to escape double-quote
  characters (").
- (cherry picked from ecbcf40)

Merge tlb- and slab/rr-specific gemm macrokernels. (#711)

Details:
- Merged the tlb-specific gemm macrokernel (_var2b) with the
  slab/rr-specific one (var2) so that a single function can be compiled
  with either tlb or slab/rr support, depending on the values of the
  BLIS_ENABLE_JRIR_TLB, _SLAB, and _RR macros. This is done by
  incorporating information from both approaches: the start/end/inc for
  the JR and IR loops from slab or rr partitioning; and the number of
  assigned microtiles, plus the starting IR dimension offset for all
  iterations after the first (ir_next). With these changes, slab, rr, and
  tlb can all be parameterized by initializing a similar set of variables
  prior to the jr loop.
- Removed the wrap-around logic that sets the "b_next" field of the
  auxinfo_t struct, which executes during the last IR iteration of the
  last JR iteration. The potential benefit of this code is so minor (and
  hinges on the microkernel making use of the b_next field) that it's
  arguably not worth including. The code also does the wrong thing for
  some threads whenever JR_NT > 1, since only thread 0 (in the JR group)
  would even compute with the first micropanel of B.
- Re-expressed the definition of bli_is_last_iter_slrr so that slab and
  tlb use the same code rather than rr and tlb.
- Adjusted the initialization of the gemm control tree accordingly.
- (cherry picked from c334ec2)

Fixed mis-mapped instruction for VEXTRACTF64X2. (#713)

Details:
- This commit fixes a typo in the macro definition for the extended
  inline assembly macro VEXTRACTF64X2 in bli_x86_asm_macros.h. The macro
  was previously defined (incorrectly) in terms of the vextractf64x4
  instruction rather than vextractf64x2.
- CREDITS file update.
- (cherry picked from 5793a77)

Defined lt, lte, gt, gte + misc. other updates. (#712)

Details:
- Changed invertsc operation to be a non-destructive operation; that is,
  it now takes separate input and output operands. This change applies to
  both the object and typed APIs.
- Defined an alternative square root operation, sqrtrsc, which, when
  operating on complex scalars, assumes the imaginary part of the input
  to be zero.
- Changed the semantics of addm, subm, copym, axpym, scal2m, and xpbym so
  that when the source matrix has an implicit unit diagonal, the
  operation leaves the diagonal of the destination matrix untouched.
  Previously, the operations would interpret an implicit unit diagonal on
  the source matrix as a request to manifest the unit diagonal
  *explicitly* on output (either as something to copy in the case of
  copym, or something to compute with in the cases of addm, subm, axpym,
  scal2m, and xpbym). It turns out that this behavior was too cute by
  half and could cause unintended headaches for practical use cases.
  (This change in behavior also required small modifications to the trmv
  and trsv testsuite modules so that they would properly test matrices
  with unit diagonals.)
- Added missing dependencies for copym to gemv, ger, hemv, trmv, and trsv
  testsuite modules.
- Implemented level-0-like ltsc, ltesc, gtsc, gtesc operations in
  frame/util, which use lt, lte, gt, and gte level-0 scalar macros.
- Trivial variable rename in bli_part.c to harmonize with other variable
  naming conventions.
- (cherry picked from 16d2e9e)

Implement cntx_t pointer caching in gks. (#709)

Details:
- Refactored the gks cntx_t query functions so that: (1) there is a
  clearer pattern of similarity between functions that query a native
  context and those that query its induced (1m) counterpart; and (2)
  queried cntx_t pointers (for both native and induced cntx_t pointers)
  are cached (by default), or deep-queried upon each invocation,
  depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is defined.
- Refactored query-related functions in bli_arch.c to cache the queried
  arch_t value (by default), or deep-query the arch_t value upon each
  invocation, depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is
  defined.
- Tweaked the behavior of bli_gks_query_ind_cntx_impl() (formerly named
  bli_gks_query_ind_cntx()) so that the induced method cntx_t struct is
  repopulated each time the function is called. (It is still only
  allocated once on first call.) This was mostly done in preparation for
  some future in which the arch_t value might change at runtime. In such
  a scenario, the induced method context would need to be recalculated
  any time the native context changes.
- Added preprocessor logic to bli_config_macro_defs.h to handle enabling
  or disabling of cntx_t pointer caching (via BLIS_ENABLE_GKS_CACHING).
- For now, cntx_t pointer caching is enabled by default and does not
  correspond to any official configure option. Disabling can be done by
  inserting a #define for BLIS_DISABLE_GKS_CACHING into the appropriate
  bli_family_*.h header file within the configuration of interest.
- Thanks to Harihara Sudhan S (AMD) for suggesting that cntx_t pointers
  (and not just arch_t values) be cached.
- Comment updates.
- (cherry picked from 9a366b1)

Fixing type-mismatch errors in power10 sandbox (#701)

Details:
- This commit fixes a mismatch between the function type signature of
  bli_gemm_ex() required by BLIS and the version of the function defined
  within the power10 sandbox. It also performs typecasting upon calling
  bli_gemm_front() to attain type consistency with the type signature
  defined by BLIS for bli_gemm_front().
- (cherry picked from b895ec9)

Define new global scalar (obj_t) constants. (#703)

Details:
- This commit defines the following new global scalar constants:
  - BLIS_ONE_I: This constant encodes the imaginary unit.
  - BLIS_MINUS_ONE_I: This constant encodes the negative imaginary unit.
  - BLIS_NAN: This constant encodes a not-a-number value. Both real and
    imaginary parts are set to NaN for complex datatypes.
- (cherry picked from 38d88d5)

Disable power10 kernels other than sgemm, dgemm. (#705)

Details:
- There is a power10 sandbox which uses microkernels for datatypes other
  than float and double (or scomplex/dcomplex). In a regular
  power10-configured build (that is, with the sandbox disabled), there
  were compile errors for some of these other non-sgemm/non-dgemm
  microkernels. This commit protects those kernels with a new cpp macro
  guard (which is defined in sandbox/power10/bli_sandbox.h) that prevents
  that kernel code from being compiled for normal, non-sandbox power10
  builds.
- (cherry picked from cdb22b8)

Fix k = 0 edge case in power10 microkernels (#706)

Details:
- When power10 sgemm and dgemm microkernels are called with k = 0, they
  become caught in infinite loops and segfault. This is fixed now via an
  early exit in the case of k = 0.
- (cherry picked from d220f9c)

Fixed clang compiler warning in bli_l0_ft.h.

Details:
- Fixed a type redefinition in frame/0/bli_l0_ft.h that unintentionally
  slipped in with commit 02b5acd.
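
The printf() typecasting described for #716 can be illustrated with a minimal sketch. The type `sketch_dim_t` and the function `sketch_format_range` are hypothetical names chosen for illustration and do not appear in BLIS; the only point is that casting each integer argument to `long` keeps it matched to `%ld` regardless of the platform's native integer width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an integer type whose width may differ
   from that of "long" on some targets (e.g. 32-bit platforms). */
typedef int32_t sketch_dim_t;

/* Formats a start/end pair into buf. Each argument is cast to long so
   that it always matches the "%ld" conversion, whatever the width of
   sketch_dim_t happens to be. Returns the snprintf() result. */
static int sketch_format_range(char *buf, size_t len,
                               sketch_dim_t start, sketch_dim_t end)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "range: %ld to %ld",
                    (long)start, (long)end);
}
```

Without the casts, passing a 32-bit integer where `%ld` expects a `long` is a format/argument mismatch that compilers may warn about.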
flame · May 21, 2024 · 950d309 · 950d309
1 parent 8c29b37
commit 950d309
Showing 50 changed files with 1,316 additions and 1,275 deletions.
diff --git a/.dir-locals.el b/.dir-locals.el
@@ -1,11 +1,32 @@
-;; Emacs C mode formatting for the BLIS layout requirements.
-((c-mode . ((c-file-style . "stroustrup")
-	    (c-basic-offset . 4)
-	    (comment-start . "// ")
-	    (comment-end . "")
-	    (indent-tabs-mode . t)
-	    (tab-width . 4)
-	    (parens-require-spaces . nil)
-	    (require-final-newline . t)
-	    (eval add-hook `before-save-hook `delete-trailing-whitespace)
-	    )))
+;; Emacs formatting for the BLIS layout requirements.
+
+(
+ ;; Recognize *.mk files as Makefile fragments
+ (auto-mode-alist . (("\\.mk\\'" . makefile-mode)) )
+
+ ;; Makefiles require tabs and are almost always width 8
+ (makefile-mode . (
+                   (indent-tabs-mode . t)
+                   (tab-width . 8)
+                   )
+                )
+
+ ;; C code formatting roughly according to docs/CodingConventions.md
+ (c-mode . (
+            (c-file-style . "bsd")
+            (c-basic-offset . 4)
+            (comment-start . "// ")
+            (comment-end . "")
+            (parens-require-spaces . nil)
+            )
+         )
+
+ ;; Default formatting for all source files not overriden above
+ (prog-mode . (
+               (indent-tabs-mode . nil)
+               (tab-width . 4)
+               (require-final-newline . t)
+               (eval add-hook `before-save-hook `delete-trailing-whitespace)
+               )
+            )
+)
diff --git a/.travis.yml b/.travis.yml
@@ -62,6 +62,15 @@ matrix:
       CC=aarch64-linux-gnu-gcc-10 CXX=aarch64-linux-gnu-g++-10 \
       PACKAGES="gcc-10-aarch64-linux-gnu g++-10-aarch64-linux-gnu libc6-dev-arm64-cross qemu-system-arm qemu-user" \
       TESTSUITE_WRAPPER="qemu-aarch64 -cpu max,sve=true,sve512=true -L /usr/aarch64-linux-gnu/"
+  # arm64 build and fast testsuite (qemu)
+  # NOTE: This entry omits the -cpu flag so that while both NEON and SVE kernels
+  # are compiled, only NEON kernels will be tested. (h/t to RuQing Xu)
+  - os: linux
+    compiler: aarch64-linux-gnu-gcc-10
+    env: OOT=0 TEST=FAST SDE=0 THR="none" CONF="arm64" \
+      CC=aarch64-linux-gnu-gcc-10 CXX=aarch64-linux-gnu-g++-10 \
+      PACKAGES="gcc-10-aarch64-linux-gnu g++-10-aarch64-linux-gnu libc6-dev-arm64-cross qemu-system-arm qemu-user" \
+      TESTSUITE_WRAPPER="qemu-aarch64 -L /usr/aarch64-linux-gnu/"
 install:
 - if [ "$CC" = "gcc"  ] && [ "$TRAVIS_OS_NAME" = "linux" ]; then export CC="gcc-9"; fi
 - if [ -n "$PACKAGES" ] && [ "$TRAVIS_OS_NAME" = "linux" ]; then sudo apt-get install -y $PACKAGES; fi

diff --git a/CREDITS b/CREDITS
@@ -5,124 +5,128 @@ Acknowledgements
 
 The BLIS framework was originally authored by
 
-  Field Van Zee            @fgvanzee           (The University of Texas at Austin)
+  Field Van Zee            @fgvanzee                  (The University of Texas at Austin)
 
-but many others have contributed code and feedback, including
+but many others have contributed code, ideas, and feedback, including
 
-  Jay Acosta               @jay-acosta         (Oracle)
-  Sameer Agarwal           @sandwichmaker      (Google)
-  Murtaza Ali                                  (Texas Instruments)
-  Sajid Ali                @s-sajid-ali        (Northwestern University)
+  Jay Acosta               @jay-acosta                (Oracle)
+  Sameer Agarwal           @sandwichmaker             (Google)
+  Murtaza Ali                                         (Texas Instruments)
+  Sajid Ali                @s-sajid-ali               (Northwestern University)
   Erling Andersen          @erling-d-andersen
   Alex Arslan              @ararslan
-  Vernon Austel                                (IBM, T.J. Watson Research Center)
-  Satish Balay             @balay              (Argonne National Laboratory)
+  Vernon Austel                                       (IBM, T.J. Watson Research Center)
+  Mohsen Aznaveh           @Aznaveh                   (Texas A&M University)
+  Satish Balay             @balay                     (Argonne National Laboratory)
   Kihiro Bando             @bandokihiro
-  Matthew Brett            @matthew-brett      (University of Birmingham)
+  Matthew Brett            @matthew-brett             (University of Birmingham)
   Jérémie du Boisberranger @jeremiedbb
-  Jed Brown                @jedbrown           (Argonne National Laboratory)
+  Jed Brown                @jedbrown                  (Argonne National Laboratory)
   Robin Christ             @robinchrist
   Dilyn Corner             @dilyn-corner
-  Mat Cross                @matcross           (NAG)
+  Mat Cross                @matcross                  (NAG)
                            @decandia50
-  Daniël de Kok            @danieldk           (Explosion)
-  Kay Dewhurst             @jkd2016            (Max Planck Institute, Halle, Germany)
-  Jeff Diamond                                 (Oracle)
+  Harsh Dave               @HarshDave12               (AMD)
+  Tim Davis                @DrTimothyAldenDavis       (Texas A&M University)
+  Daniël de Kok            @danieldk                  (Explosion)
+  Kay Dewhurst             @jkd2016                   (Max Planck Institute, Halle, Germany)
+  Jeff Diamond                                        (Oracle)
   Johannes Dieterich       @iotamudelta
   Krzysztof Drewniak       @krzysz00
-  Marat Dukhan             @Maratyszcza        (Google)
-  Victor Eijkhout          @VictorEijkhout     (Texas Advanced Computing Center)
-  Evgeny Epifanovsky       @epifanovsky        (Q-Chem)
+  Marat Dukhan             @Maratyszcza               (Google)
+  Victor Eijkhout          @VictorEijkhout            (Texas Advanced Computing Center)
+  Evgeny Epifanovsky       @epifanovsky               (Q-Chem)
   Isuru Fernando           @isuruf
   Roman Gareev             @gareevroman
   Richard Goldschmidt      @SuperFluffy
   Chris Goodyer
   Alexander Grund          @Flamefire
-  John Gunnels             @jagunnels          (IBM, T.J. Watson Research Center)
+  John Gunnels             @jagunnels                 (IBM, T.J. Watson Research Center)
   Ali Emre Gülcü           @Lephar
-  Jeff Hammond             @jeffhammond        (Intel)
+  Jeff Hammond             @jeffhammond               (Intel)
   Jacob Gorm Hansen        @jacobgorm
-  Shivaprashanth H                             (Global Edge)
+  Shivaprashanth H                                    (Global Edge)
   Jean-Michel Hautbois     @jhautbois
   Ian Henriksen            @insertinterestingnamehere (The University of Texas at Austin)
-  Greg Henry                                   (Intel)
+  Greg Henry                                          (Intel)
   Minh Quan Ho             @hominhquan
   Matthew Honnibal         @honnibal
   Stefan Husmann           @stefanhusmann
-  Francisco Igual          @figual             (Universidad Complutense de Madrid)
+  Francisco Igual          @figual                    (Universidad Complutense de Madrid)
   Madeesh Kannan           @shadeMe
   Tony Kelman              @tkelman
-  Lee Killough             @leekillough        (Cray)
-  Mike Kistler             @mkistler           (IBM, Austin Research Laboratory)
-  Ivan Korostelev          @ivan23kor          (University of Alberta)
-  Kyungmin Lee             @kyungminlee        (Ohio State University)
+  Lee Killough             @leekillough               (Cray)
+  Mike Kistler             @mkistler                  (IBM, Austin Research Laboratory)
+  Ivan Korostelev          @ivan23kor                 (University of Alberta)
+  Kyungmin Lee             @kyungminlee               (Ohio State University)
   Michael Lehn             @michael-lehn
   Shmuel Levine            @ShmuelLevine
                            @lschork2
   Dave Love                @loveshack
-  Tze Meng Low                                 (The University of Texas at Austin)
-  Ye Luo                   @ye-luo             (Argonne National Laboratory)
-  Ricardo Magana           @magania            (Hewlett Packard Enterprise)
-  Madan mohan Manokar      @madanm3            (AMD)
+  Tze Meng Low                                        (The University of Texas at Austin)
+  Ye Luo                   @ye-luo                    (Argonne National Laboratory)
+  Ricardo Magana           @magania                   (Hewlett Packard Enterprise)
+  Madan mohan Manokar      @madanm3                   (AMD)
   Giorgos Margaritis
-  Bryan Marker             @bamarker           (The University of Texas at Austin)
-  Simon Lukas Märtens      @ACSimon33          (RWTH Aachen University)
-  Devin Matthews           @devinamatthews     (The University of Texas at Austin)
+  Bryan Marker             @bamarker                  (The University of Texas at Austin)
+  Simon Lukas Märtens      @ACSimon33                 (RWTH Aachen University)
+  Devin Matthews           @devinamatthews            (The University of Texas at Austin)
   Stefanos Mavros          @smavros
-  Mithun Mohan             @MithunMohanKadavil (AMD)
+  Mithun Mohan             @MithunMohanKadavil        (AMD)
                            @moon-chilled
   Ilknur Mustafazade       @Runkli
                            @nagsingh
-  Bhaskar Nallani          @BhaskarNallani     (AMD)
-  Stepan Nassyr            @stepannassyr       (Jülich Supercomputing Centre)
+  Bhaskar Nallani          @BhaskarNallani            (AMD)
+  Stepan Nassyr            @stepannassyr              (Jülich Supercomputing Centre)
   Nisanth M P              @nisanthmp
-  Nisanth Padinharepatt                        (AMD)
+  Nisanth Padinharepatt                               (AMD)
   Ajay Panyala             @ajaypanyala
-  Marc-Antoine Parent      @maparent           (Conversence)
-  Devangi Parikh           @dnparikh           (The University of Texas at Austin)
-  Elmar Peise              @elmar-peise        (RWTH-Aachen)
+  Marc-Antoine Parent      @maparent                  (Conversence)
+  Devangi Parikh           @dnparikh                  (The University of Texas at Austin)
+  Elmar Peise              @elmar-peise               (RWTH-Aachen)
   Clément Pernet           @ClementPernet
   Ilya Polkovnichenko
-  Jack Poulson             @poulson            (Stanford)
+  Jack Poulson             @poulson                   (Stanford)
   Mathieu Poumeyrol        @kali
-  Christos Psarras         @ChrisPsa           (RWTH Aachen University)
+  Christos Psarras         @ChrisPsa                  (RWTH Aachen University)
                            @pkubaj
                            @qnerd
   Michael Rader            @mrader1248
-  Pradeep Rao              @pradeeptrgit       (AMD)
+  Pradeep Rao              @pradeeptrgit              (AMD)
   Aleksei Rechinskii
-  Leick Robinson           @LeickR             (Oracle)
+  Leick Robinson           @LeickR                    (Oracle)
   Karl Rupp                @karlrupp
-  Paul Sandoz              @PaulSandoz         (Oracle)
-  Martin Schatz                                (The University of Texas at Austin)
+  Paul Sandoz              @PaulSandoz                (Oracle)
+  Martin Schatz                                       (The University of Texas at Austin)
   Nico Schlömer            @nschloe
   Rene Sitt
-  Tony Skjellum            @tonyskjellum       (The University of Tennessee at Chattanooga)
-  Mikhail Smelyanskiy                          (Intel, Parallel Computing Lab)
+  Tony Skjellum            @tonyskjellum              (The University of Tennessee at Chattanooga)
+  Mikhail Smelyanskiy                                 (Intel, Parallel Computing Lab)
   Nathaniel Smith          @njsmith
   Shaden Smith             @ShadenSmith
-  Tyler Smith              @tlrmchlsmth        (The University of Texas at Austin)
+  Tyler Smith              @tlrmchlsmth               (The University of Texas at Austin)
   Snehith                  @ArcadioN09
-  Paul Springer            @springer13         (RWTH Aachen University)
-  Adam J. Stewart          @adamjstewart       (University of Illinois at Urbana-Champaign)
+  Paul Springer            @springer13                (RWTH Aachen University)
+  Adam J. Stewart          @adamjstewart              (University of Illinois at Urbana-Champaign)
   Vladimir Sukarev
+  Harihara Sudhan S        @ihariharasudhan           (AMD)
   Chengguo Sun             @chengguosun
-  Santanu Thangaraj                            (AMD)
-  Nicholai Tukanov         @nicholaiTukanov    (The University of Texas at Austin)
-  Rhys Ulerich             @RhysU              (The University of Texas at Austin)
-  Robert van de Geijn      @rvdg               (The University of Texas at Austin)
-  Meghana Vankadari        @Meghana-vankadari  (AMD)
-  Kiran Varaganti          @kvaragan           (AMD)
-  Natalia Vassilieva                           (Hewlett Packard Enterprise)
+  Santanu Thangaraj                                   (AMD)
+  Nicholai Tukanov         @nicholaiTukanov           (The University of Texas at Austin)
+  Rhys Ulerich             @RhysU                     (The University of Texas at Austin)
+  Robert van de Geijn      @rvdg                      (The University of Texas at Austin)
+  Meghana Vankadari        @Meghana-vankadari         (AMD)
+  Kiran Varaganti          @kvaragan                  (AMD)
+  Natalia Vassilieva                                  (Hewlett Packard Enterprise)
                            @h-vetinari
-  Andrew Wildman           @awild82            (University of Washington)
-  Zhang Xianyi             @xianyi             (Chinese Academy of Sciences)
+  Andrew Wildman           @awild82                   (University of Washington)
+  Zhang Xianyi             @xianyi                    (Chinese Academy of Sciences)
   Benda Xu                 @heroxbd
-  Guodong Xu               @docularxu          (Linaro.org)
-  RuQing Xu                @xrq-phys           (The University of Tokyo)
+  Guodong Xu               @docularxu                 (Linaro.org)
+  RuQing Xu                @xrq-phys                  (The University of Tokyo)
   Costas Yamin             @cosstas
-  Chenhan Yu               @ChenhanYu          (The University of Texas at Austin)
-  Roman Yurchak            @rth                (Symerio)
+  Chenhan Yu               @ChenhanYu                 (The University of Texas at Austin)
+  Roman Yurchak            @rth                       (Symerio)
   Stefano Zampini          @stefanozampini
   M. Zhou                  @cdluminate
 

diff --git a/build/bli_config.h.in b/build/bli_config.h.in
@@ -45,6 +45,8 @@
 // Enabled kernel sets (kernel_list)
 @kernel_list_defines@
 
+#define BLIS_VERSION_STRING "@version@"
+
 #if @enable_system@
 #define BLIS_ENABLE_SYSTEM
 #else
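
With BLIS_VERSION_STRING now embedded in the installed header, an application can read the version without calling into the library. The following is a minimal sketch under stated assumptions: the fallback `#define` exists only so the example compiles without BLIS installed (a real application would get the macro from `#include "blis.h"`), and the helpers `sketch_version_major` and `sketch_header_version` are hypothetical. It also assumes, without confirmation from the source, that the string begins with a numeric component.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: a real application gets this macro from blis.h. The
   fallback value below is a placeholder, not a real BLIS version. */
#ifndef BLIS_VERSION_STRING
#define BLIS_VERSION_STRING "0.0.0"
#endif

/* Hypothetical helper: returns the leading integer of a version
   string, or -1 if no non-negative leading integer can be parsed. */
static long sketch_version_major(const char *version)
{
    assert(version != NULL);
    char *end = NULL;
    long major = strtol(version, &end, 10);
    return (end == version || major < 0) ? -1 : major;
}

/* Returns the version string that was visible at compile time. */
static const char *sketch_header_version(void)
{
    return BLIS_VERSION_STRING;
}
```

Note that this reports the version of the header the application was compiled against, which may differ from the version of the library it is linked with at runtime; the latter is what bli_info_get_version_str() returns.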

diff --git a/common.mk b/common.mk
@@ -101,7 +101,7 @@ get-noopt-cflags-for     = $(strip $(CFLAGS_PRESET) \
                                    $(call load-var-for,CLANGFLAGS,$(1)) \
                                    $(call load-var-for,CPPROCFLAGS,$(1)) \
                                    $(CTHREADFLAGS) \
-                                   $(CINCFLAGS) $(VERS_DEF) \
+                                   $(CINCFLAGS) \
                             )
 
 get-noopt-cxxflags-for   = $(strip $(CFLAGS_PRESET) \
@@ -113,7 +113,7 @@ get-noopt-cxxflags-for   = $(strip $(CFLAGS_PRESET) \
                                    $(call load-var-for,CPPROCFLAGS,$(1)) \
                                    $(CTHREADFLAGS) \
                                    $(CXXTHREADFLAGS) \
-                                   $(CINCFLAGS) $(VERS_DEF) \
+                                   $(CINCFLAGS) \
                             )
 
 get-refinit-cflags-for   = $(strip $(call load-var-for,COPTFLAGS,$(1)) \
@@ -534,6 +534,7 @@ GREP       := grep
 EGREP      := grep -E
 XARGS      := xargs
 INSTALL    := install -c
+DEVNULL    := /dev/null
 
 # Script for creating a monolithic header file.
 #FLATTEN_H  := $(DIST_PATH)/build/flatten-headers.sh
@@ -1193,7 +1194,18 @@ CBLAS_H_FLAT    := $(BASE_INC_PATH)/$(CBLAS_H)
 # files will be needed when compiling bli_cntx_ref.c with the monolithic header.
 ifeq ($(strip $(SHARE_PATH)),.)
 REF_KER_SRC     := $(DIST_PATH)/$(REFKERN_DIR)/bli_cntx_ref.c
-REF_KER_HEADERS := $(shell $(GREP) "\#include" $(REF_KER_SRC) | sed -e "s/\#include [\"<]\([a-zA-Z0-9\_\.\/\-]*\)[\">].*/\1/g" | $(GREP) -v $(BLIS_H))
+#
+# NOTE: A redirect to /dev/null has been added to the grep command below because
+# as of version 3.8, grep outputs warnings when encountering stray backslashes
+# in regular expressions [1]. Versions older than 3.8 not only do not complain,
+# but actually seem to *require* the backslash, perhaps because of the way we
+# are invoking grep via GNU make's shell command. WHEN DEBUGGING ANYTHING
+# INVOLVING THE MAKE VARIABLE BELOW, PLEASE CONSIDER TEMPORARILY REMOVING THE
+# REDIRECT TO /dev/null SO THAT YOU SEE ANY MESSAGES SENT TO STANDARD ERROR.
+#
+# [1] https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html
+#
+REF_KER_HEADERS := $(shell $(GREP) "\#include" $(REF_KER_SRC) 2> $(DEVNULL) | sed -e "s/\#include [\"<]\([a-zA-Z0-9\_\.\/\-]*\)[\">].*/\1/g" | $(GREP) -v $(BLIS_H))
 endif
 
 # Match each header found above with the path to that header, and then strip
@@ -1244,10 +1256,6 @@ BLIS_CONFIG_H   := ./bli_config.h
 # --- Special preprocessor macro definitions -----------------------------------
 #
 
-# Define a C preprocessor macro to communicate the current version so that it
-# can be embedded into the library and queried later.
-VERS_DEF       := -DBLIS_VERSION_STRING=\"$(VERSION)\"
-
 # Define a C preprocessor flag that is *only* defined when BLIS is being
 # compiled. (In other words, an application that #includes blis.h will not
 # get this cpp macro.)

diff --git a/config/arm64/bli_family_arm64.h b/config/arm64/bli_family_arm64.h
@@ -39,6 +39,8 @@
 // -- MEMORY ALLOCATION --------------------------------------------------------
 
 #define BLIS_SIMD_ALIGN_SIZE 16
+
+#define BLIS_SIMD_MAX_SIZE 128 // Note: The default is 64.
 #define BLIS_SIMD_MAX_NUM_REGISTERS 32
 
 // SVE-specific configs.
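
The reason for raising BLIS_SIMD_MAX_SIZE can be sketched as simple arithmetic, assuming the consistency check requires each microtile to fit within a stack buffer of BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2 bytes. The function names below are hypothetical, and the tile dimensions used in the usage note are illustrative values rather than actual arm64 blocksizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the value of BLIS_SIMD_MAX_NUM_REGISTERS shown in the diff. */
#define SKETCH_SIMD_MAX_NUM_REGISTERS 32

/* Stack buffer size in bytes for a given SIMD max size:
   num_registers * simd_max_size * 2. */
static size_t sketch_stack_buf_max_size(size_t simd_max_size)
{
    return (size_t)SKETCH_SIMD_MAX_NUM_REGISTERS * simd_max_size * 2;
}

/* Returns true if an mr x nr microtile of elements that are dt_size
   bytes each fits within the stack buffer. */
static bool sketch_tile_fits(size_t mr, size_t nr, size_t dt_size,
                             size_t simd_max_size)
{
    assert(simd_max_size > 0);
    return mr * nr * dt_size <= sketch_stack_buf_max_size(simd_max_size);
}
```

For example, a hypothetical 32 x 10 microtile of 16-byte elements needs 5120 bytes. That exceeds the 4096-byte buffer implied by the default of 64, but fits in the 8192-byte buffer implied by 128.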

diff --git a/config/old/newarch/make_defs.mk b/config/old/newarch/make_defs.mk
@@ -1,6 +1,6 @@
-#!/bin/bash
 #
-#  BLIS    
+#
+#  BLIS
 #  An object-based framework for developing high-performance BLAS-like
 #  libraries.
 #
@@ -47,7 +47,7 @@ CC             := gcc
 CC_VENDOR      := gcc
 endif
 
-# Enable IEEE Standard 1003.1-2004 (POSIX.1d). 
+# Enable IEEE Standard 1003.1-2004 (POSIX.1d).
 # NOTE: This is needed to enable posix_memalign().
 CPPROCFLAGS    := -D_POSIX_C_SOURCE=200112L
 CMISCFLAGS     := -std=c99
@@ -67,13 +67,13 @@ endif
 CKOPTFLAGS     := $(COPTFLAGS)
 
 ifeq ($(CC_VENDOR),gcc)
-CKVECFLAGS     := 
+CKVECFLAGS     :=
 else
 ifeq ($(CC_VENDOR),icc)
-CKVECFLAGS     := 
+CKVECFLAGS     :=
 else
 ifeq ($(CC_VENDOR),clang)
-CKVECFLAGS     := 
+CKVECFLAGS     :=
 else
 $(error gcc, icc, or clang is required for this configuration.)
 endif
@@ -83,4 +83,3 @@ endif
 # Store all of the variables here to new variables containing the
 # configuration name.
 $(eval $(call store-make-defs,$(THIS_CONFIG)))
-