Refactor the control tree infrastructure #710

devinamatthews · 2023-01-13T22:24:40Z

This PR adds major new functionality for instantiating BLAS-like operations using the BLIS framework.

A "plugin" architecture.
- Users are now able to register new kernels, kernel preferences, and block sizes at runtime, directly from user applications.
- Plugins can be created, configured, and built using only an installed version of BLIS, no source or source code changes required.
- Plugins support both reference and optimized kernels, as well as custom configuration-to-kernel set mappings.
- Building plugins (including reference and relevant optimized kernels) for enabled architectures or architecture families is automated, as is linking into the final library.
- The configure script is now installed as configure-plugin. In this mode, it can be used to initialize a plugin from a template including optional example code, and prepare a build system for compiling the plugin into a shared or static library.
- Additional configuration files, templates, and build system components are also installed to %prefix%/share/blis.
- The cntx_t struct now has extensible data structures for holding kernels, preferences, and blocksizes. These are based on a "stack" structure which contains a list of fixed-size data blocks. Adding a new entry (which may require allocating a new block or reallocating the block pointer array) requires locking, but looking up entries is lock-free and takes O(1) time.
- Kernels can depend on either 1 or 2 type parameters (e.g. mixed-precision packing requires 2). The func2_t struct supports the latter, but can be implicitly cast to func_t if only "diagonal" entries are needed. The number of type parameters can be inferred from the kernel ID for type-safety.
- Functions have been added to register new kernels, preferences, and blocksizes with the global kernel structure (gks). This creates corresponding entries in each allocated context and returns the next available ID. Plugins use this API to register user kernels, although the user is responsible for tracking the returned IDs for later lookup. Setting newly-registered reference kernels, as well as overriding these with optimized kernels, is done in exactly the same manner as in bli_cntx_init_ref and bli_cntx_init_<config>.
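
As a rough illustration of the "stack" structure described above, here is a minimal sketch of a list of fixed-size blocks with O(1) lookup by ID. All names (`stack_sketch_t`, `SKETCH_BLOCK_LEN`, and so on) and the block length are hypothetical and are not taken from the BLIS sources. The point it shows is that growing the stack only reallocates the array of block pointers, so existing entries never move. Locking is indicated only by comments, and what a lock-free reader needs while the pointer array is being reallocated is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch, not the BLIS implementation. */
#define SKETCH_BLOCK_LEN 32 /* entries per fixed-size block (assumed value) */

typedef struct { void *entry[SKETCH_BLOCK_LEN]; } block_sketch_t;

typedef struct
{
    block_sketch_t **blocks;     /* array of pointers to fixed-size blocks */
    size_t           num_blocks; /* number of allocated blocks */
    size_t           size;       /* number of registered entries */
} stack_sketch_t;

void stack_sketch_init(stack_sketch_t *s)
{
    s->blocks     = NULL;
    s->num_blocks = 0;
    s->size       = 0;
}

/* Append an entry and return its ID, or (size_t)-1 if allocation fails.
   A real implementation would hold a lock for the whole function. */
size_t stack_sketch_push(stack_sketch_t *s, void *value)
{
    size_t id  = s->size;
    size_t blk = id / SKETCH_BLOCK_LEN;

    if (blk == s->num_blocks)
    {
        /* Allocate a new block and grow only the block pointer array;
           the blocks themselves (and thus existing entries) stay put. */
        block_sketch_t *fresh = calloc(1, sizeof(*fresh));
        if (fresh == NULL) return (size_t)-1;

        block_sketch_t **grown =
            realloc(s->blocks, (s->num_blocks + 1) * sizeof(*grown));
        if (grown == NULL) { free(fresh); return (size_t)-1; }

        grown[blk]    = fresh;
        s->blocks     = grown;
        s->num_blocks = s->num_blocks + 1;
    }

    s->blocks[blk]->entry[id % SKETCH_BLOCK_LEN] = value;
    s->size = id + 1;
    return id;
}

/* O(1) lookup: one division and one remainder, no searching. */
void *stack_sketch_get(const stack_sketch_t *s, size_t id)
{
    assert(id < s->size);
    return s->blocks[id / SKETCH_BLOCK_LEN]->entry[id % SKETCH_BLOCK_LEN];
}

void stack_sketch_free(stack_sketch_t *s)
{
    for (size_t i = 0; i < s->num_blocks; i++) free(s->blocks[i]);
    free(s->blocks);
    stack_sketch_init(s);
}
```

The ID returned by `stack_sketch_push` plays the role of the kernel or blocksize ID that the caller has to keep for later lookup.
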
Restructuring of the control and thread control trees.
- The control tree has been substantially restructured to support more flexibility.
- The "default" control trees for gemm (also used for syrk/herk/syr2k/her2k/symm/hemm/trmm/trmm3) and trsm are now represented as a single structure containing all necessary control tree nodes and parameters.
- An API has been added to modify the default gemm/trsm control trees.
- This same API is used by the framework and packm/gemm/trsm variants to access specific control tree nodes.
- Users can alternatively create a custom control tree from scratch.
- The block sizes are now encoded directly in the control tree, rather than via loop IDs. The logic for adjusting block sizes for certain operations has been moved to the control tree initialization.
- Type information is encoded in the control tree to drive proper selection of packing and computational kernels provided by the user.
- The packing microkernel now receives an opaque "params" struct which is user-definable and can be used to pass additional information through the call stack.
- The auxinfo_t struct has been updated with a params field for opaque user data as well as the global offsets of the current microtile.
- The packm and gemm variants can be overridden by the user, and also receive an opaque params struct via the associated control tree node.
- The structure-aware packing kernel bli_packm_struc_cxk is no longer hard-coded to be called from the default packm variant, but can be overridden by the user. It also supports mixed-precision/mixed-domain natively now.
- The thread control tree is now created entirely up-front by inspecting the control tree. The required number of threads at each level is encoded in the control tree via loop IDs (actually a bitfield of loop IDs), although the ordering and number of such IDs is arbitrary. The logic for adjusting the number of threads at each level based on operation type (e.g. trmm) is now in the control tree initialization and expressed by combining loop IDs from multiple levels into a single level.
- The mem_t object containing the pack buffer pointer has been moved from the control tree to the thread control tree. The control tree is strictly const throughout the operation, and only a single copy is shared by all threads.
- The thread control tree node for packing has been changed so that there is no longer a "fake" node indicating a team of single threads. Instead, the number of threads and thread IDs in the "normal" thread control tree node are used. This change has also been made to the gemmsup thread control tree and packing variants, as well as to the gemmlike sandbox.
- Parameters controlling packing (e.g. inversion of the diagonal, direction, schema) are not stored directly in the control tree but in the opaque params struct. The packing control tree node and its default parameter struct are stored together in the "combined" gemm/trsm control tree structure and initialized as a unit. Users can update these parameters individually or substitute a custom packm variant and params struct.
- The "target" and "execution" datatypes have been removed from the obj_t struct and replaced by type information in the control tree.
- The "sub-node" and "sub-prenode" of a control tree node have been replaced by an arbitrary number of sub-nodes accessed by index. There is a hard cap on the number of sub-nodes (currently 2). Sub-nodes are added during control tree initialization, after creation/initialization of the parent node through an updated API.
- The L3 thread decorator has been significantly simplified and directly calls bli_l3_int. The control tree is created externally, and it is no longer necessary to alias matrices or set object pack schemas. Also, the rntm_t passed in may be NULL. Finally, family and scalar information is no longer needed here.
- bli_l3_int is now a simple inline function which extracts the next control tree node and variant and calls it.
- bli_*_front have been removed and inlined into the expert OAPI with significant simplification.
- 1m (or other induced method) no longer uses an alternative cntx_t.
- The pack_fn/ker_fn pointers and associated params fields on the obj_t were removed in favor of the present solution.
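
To make the node layout described above more concrete, here is a minimal sketch of a control tree node that carries a variant function pointer, an opaque params pointer, and a capped array of sub-nodes accessed by index. The type and function names (`cntl_sketch_t`, `cntl_sketch_attach`, `cntl_sketch_invoke`) are hypothetical, and the `trace` argument exists only so the example can show the visiting order. The invoke helper mirrors the idea of a simple inline function that extracts the variant from a node and calls it; it is not the actual bli_l3_int.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the BLIS control tree API. */
#define SKETCH_MAX_SUB_NODES 2 /* hard cap on sub-nodes, as in the PR text */

struct cntl_sketch;
typedef void (*variant_sketch_fn)(const struct cntl_sketch *node, int *trace);

typedef struct cntl_sketch
{
    variant_sketch_fn         variant; /* code to run at this node */
    const void               *params;  /* opaque, user-definable parameters */
    const struct cntl_sketch *sub_nodes[SKETCH_MAX_SUB_NODES];
    int                       num_sub_nodes;
} cntl_sketch_t;

/* Nodes are initialized in place rather than allocated and returned. */
void cntl_sketch_init(cntl_sketch_t *node, variant_sketch_fn variant,
                      const void *params)
{
    node->variant       = variant;
    node->params        = params;
    node->num_sub_nodes = 0;
    for (int i = 0; i < SKETCH_MAX_SUB_NODES; i++) node->sub_nodes[i] = NULL;
}

/* Attach a sub-node after the parent has been initialized.
   Returns the index of the new sub-node, or -1 if the cap is reached. */
int cntl_sketch_attach(cntl_sketch_t *parent, const cntl_sketch_t *child)
{
    if (parent->num_sub_nodes >= SKETCH_MAX_SUB_NODES) return -1;
    int idx = parent->num_sub_nodes;
    parent->sub_nodes[idx] = child;
    parent->num_sub_nodes  = idx + 1;
    return idx;
}

/* Extract the variant from the node and call it. The tree is const here. */
static inline void cntl_sketch_invoke(const cntl_sketch_t *node, int *trace)
{
    assert(node != NULL && node->variant != NULL);
    node->variant(node, trace);
}

/* Example variant: record this node's params value as a decimal digit,
   then descend into each sub-node in index order. */
void visit_variant_sketch(const cntl_sketch_t *node, int *trace)
{
    *trace = *trace * 10 + *(const int *)node->params;
    for (int i = 0; i < node->num_sub_nodes; i++)
        cntl_sketch_invoke(node->sub_nodes[i], trace);
}
```

A caller would initialize a parent node, attach for example a packing node and a macrokernel node, and then call `cntl_sketch_invoke` on the root. A third attach is rejected because of the cap.
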
Overhaul of variable substitution in configure.
- The configure script has been somewhat re-written to use a centralized mechanism for substituting variables into build system and other configuration files.
- All substitution variables go through the same pathway now, which necessitated some variable naming changes for variables which were named the same in e.g. Makefile and bli_config.h but with different definitions.
- CC and CXX variables can now contain spaces, e.g. g++ -std=c++17. This provides better support for integration with build tooling such as autotools.
Overhaul of packing kernels.
- Previously there were two packing kernels referenced in the cntx_t structure for MRxk and NRxk shaped micropanels, respectively. These have now been merged into one kernel which is responsible for packing any dense rectangular portion of either A or B.
- The packing kernel now receives information about the register block size (cdim_max) and duplication factor (the "broadcast-B" format, although this can also apply to the A matrix).
- The structure-aware packing kernel (bli_packm_struc_cxk, which is now user-overridable) also receives global offsets of the current micropanel within A or B.
- Explicit kernels for packing the diagonal blocks of triangular/symmetric/Hermitian matrices have been added to the cntx_t. This means that the bli_packm_struc_cxk "kernel" no longer needs to directly touch data (except to zero out some regions).
- bli_packm_struc_cxk has also been updated to work only in terms of fundamental elements (i.e., real datatypes) when computing offsets and when zeroing data, which greatly simplifies mixed-domain/1m packing.
- bli_packm_scalar has been updated to better support complex scalars in mixed-domain operations.
- Pack schemas for PACKED_ROW_PANELS* and PACKED_COL_PANELS* have been merged into simply PACKED_PANELS*. This reflects the merging of the packing kernels into a single generic kernel. There were only a very few places which needed the row/column information and this is now supplied by alternative means.
- Packing variants always behave "as if" the A matrix were being packed (i.e. the code assumes packing column-stored row panels). Packing of B is handled by applying an implicit or explicit transpose before packing. This change also applies to gemmsup.
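
For readers unfamiliar with the cdim/cdim_max/broadcast parameters mentioned above, here is a minimal single-precision sketch of what a merged reference-style packing kernel might do. The function name `packm_sketch` and its exact argument list are assumptions made for illustration and do not reproduce the real kernel signature. It packs a cdim x k dense block, scales by kappa, duplicates each element cdim_bcast times, and zero-fills rows from cdim up to cdim_max (zero-filling the edge is assumed here).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a generic packing kernel, not the BLIS kernel.
   a:   source block, element (i,j) at a[i*inca + j*lda]
   p:   packed micropanel, column j starts at p[j*ldp]
   ldp: must be at least cdim_max * cdim_bcast */
void packm_sketch(size_t cdim, size_t cdim_max, size_t cdim_bcast, size_t k,
                  float kappa,
                  const float *a, size_t inca, size_t lda,
                  float *p, size_t ldp)
{
    assert(cdim <= cdim_max);
    assert(ldp >= cdim_max * cdim_bcast);

    for (size_t j = 0; j < k; j++)
    {
        /* Copy, scale, and duplicate ("broadcast") the cdim valid rows. */
        for (size_t i = 0; i < cdim; i++)
            for (size_t d = 0; d < cdim_bcast; d++)
                p[j * ldp + i * cdim_bcast + d] = kappa * a[i * inca + j * lda];

        /* Zero-fill the remainder of the register block (edge case). */
        for (size_t i = cdim * cdim_bcast; i < cdim_max * cdim_bcast; i++)
            p[j * ldp + i] = 0.0f;
    }
}
```

Because the same routine handles both cdim == cdim_max and cdim < cdim_max, an optimized kernel that specializes on a particular cdim_max has to decide for itself whether it also covers the edge case or falls back to generic code.
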
Improved MD/MP support.
- All L3 operations (except trsm) now support full mixed-domain/mixed-precision operation.
- Explicit 1m packing kernels have been added in the cntx_t.
- An explicit 1m microkernel wrapper has been added to the cntx_t.
- An extra packing kernel for the "ro" format has been added, along with the pack_t enumeration value. This supports the packing for real*complex -> real, including potential scaling by a complex alpha, support for structured matrices, etc.
- Extra microkernel wrappers for mixed domain operations have been added to support the ccr (and by extension, crc), rcc, and crr cases. Notably this includes full support for general stride storage and complex alpha/beta.
- Packing kernels and gemm microkernels are now "templated" based on two type parameters rather than one. For packing this allows direct optimization of mixed-precision kernels, and for gemm microkernels this allows direct optimization of mixed-precision without writing to a temporary buffer. Reference packing kernels are directly instantiated for all mixes of precisions, while by default mixed-precision gemm microkernels are supported via a microkernel wrapper. The "old" way of specifying optimized kernels using a single type parameter works unchanged.
- alpha and beta are typecast appropriately to the computational or output datatype, respectively, and always to the complex domain. Scalar typecasting has also been added to gemmsup for safety.
- The gemm macrokernel doesn't have to do any typecasting anymore, as a microkernel wrapper or optimized mixed-precision/mixed-domain kernel now handles this.
- 1m and mixed-domain operations now always use a microkernel wrapper, rather than adjusting parameters in the gemm macrokernel.
- The gemmt macrokernel does still have to handle explicit write-back of microtiles which intersect the diagonal, although typecasting has already been performed.
- The gemmt_x_ker_var2, trmm_xx_ker_var2, and trsm_xx_ker_var2 functions have been removed. The appropriate macrokernel pointer is selected during control tree initialization.
- Real domain MR/NR are checked for even-ness based on the gemm microkernel's row preference in order to guarantee proper 1m and mixed-domain operation.
- Full range of mixed-domain/mixed-precision functionality tested in the testsuite (input.*.mixed).
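
The microkernel-wrapper idea above can be sketched as follows, assuming the wrapper works by computing into a temporary microtile and then accumulating into C with a typecast. The names `sgemm_ukr_sketch` and `gemm_ukr_sd_wrapper_sketch`, the 2x2 register block, and the argument lists are all hypothetical. Here alpha stays in the computational precision (float) and beta is in the output precision (double), matching the typecasting rule described in the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the BLIS microkernel API. */
#define SKETCH_MR 2
#define SKETCH_NR 2

/* Same-precision microkernel: C := beta*C + alpha*A*B, where A is a packed
   MR x k panel (column p at a[p*MR]) and B a packed k x NR panel. */
void sgemm_ukr_sketch(size_t k, float alpha, const float *a, const float *b,
                      float beta, float *c, size_t rs_c, size_t cs_c)
{
    for (size_t i = 0; i < SKETCH_MR; i++)
        for (size_t j = 0; j < SKETCH_NR; j++)
        {
            float ab = 0.0f;
            for (size_t p = 0; p < k; p++)
                ab += a[p * SKETCH_MR + i] * b[p * SKETCH_NR + j];

            float *cij = &c[i * rs_c + j * cs_c];
            /* beta == 0 overwrites C without reading it. */
            *cij = (beta == 0.0f) ? alpha * ab : beta * (*cij) + alpha * ab;
        }
}

/* Wrapper for float computation with double output: run the float
   microkernel into a temporary microtile, then typecast and accumulate. */
void gemm_ukr_sd_wrapper_sketch(size_t k, float alpha,
                                const float *a, const float *b,
                                double beta, double *c,
                                size_t rs_c, size_t cs_c)
{
    float ct[SKETCH_MR * SKETCH_NR]; /* column-major temporary microtile */

    assert(c != NULL);
    sgemm_ukr_sketch(k, alpha, a, b, 0.0f, ct, 1, SKETCH_MR);

    for (size_t i = 0; i < SKETCH_MR; i++)
        for (size_t j = 0; j < SKETCH_NR; j++)
        {
            double *cij = &c[i * rs_c + j * cs_c];
            *cij = beta * (*cij) + (double)ct[i + j * SKETCH_MR];
        }
}
```

An optimized two-type microkernel could skip the temporary and write the double-precision result directly, which is the case the two-parameter "templating" is meant to allow.
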
Other changes:
- The build system has been updated to support C++ source files throughout the framework. While the intent is not to add such files to BLIS itself, this supports plugins written in C++.
- Many instances of configuration-specific code have been simplified by introducing an INSERT_GENTCONF macro which instantiates a block of code for each enabled sub-configuration. The ConfigurationHowTo.md document has been updated to match.
- PASTEMACn/PASTECHn/PASTEF77n have been removed in favor of variadic macros which accept any number of arguments (up to a reasonable limit).
- The INSERT_GENTFUNC* macros have been updated to clean up mixed-precision and mixed-domain instantiations.
- bli_align_dim_to_mult has been updated to support rounding either up or down based on a flag.
- Checking for empty matrices and other early exits (L3 only) has been consolidated into a single utility function.
- The auxinfo_t struct is always passed as const.
- The new function bli_obj_alias_submatrix aliases a matrix while also resetting the root to NULL, offsets to zero (while adjusting the buffer), and applying any implicit transpose.
- L3 pruning functions now only check matrix structure to see what to do, not the operation family.
- gemmsup packing has been updated to use the "normal" pack buffer allocation routines.
- Remove duplicate checks for early return from gemmsup handler.
- bli_determine_blocksize has been significantly simplified.
- Partitioning packed panels is no longer allowed.
- Added bli_xxsame macros.
- Automated the calculation of info bit shifts and masks based on predefined bit sizes for various flags. This greatly simplifies reordering, adding, or removing flags from the info/info2 bitfields.
- Moved more BLIS_NUM_* macros into the corresponding enums as the last entry so that the value is automatically computed.
- Better const-correctness in some level0 scalar macros.
- Better mixed-precision support in some level0 scalar macros.
- Added a bli_axpbys_mxn macro.
- bli_thread_range_sub takes explicit thread ID and number of threads rather than a thrinfo_t node.
- "De-templated" BLIS gemmlike sandbox (specifically, bls_gemm_bp_var1 and bls_packm_var1).
- Combined bls_l3_packm_[ab] into one function with thin wrappers.
- Deleted bls_packm_var[23].
- Add a "termination tag" to the testsuite output so that make check-blis can accurately check for successful completion.
- Add a new function to centrally compute FLOPs for L3 operations in the testsuite.
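
As a small illustration of the flag-controlled rounding mentioned above for bli_align_dim_to_mult, here is a sketch of aligning a dimension to a multiple in either direction. The function name `align_dim_to_mult_sketch` and its signature are hypothetical and are not claimed to match the actual BLIS function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: round dim to a multiple of mult, either up or down. */
size_t align_dim_to_mult_sketch(size_t dim, size_t mult, bool round_up)
{
    assert(mult > 0);

    size_t down = (dim / mult) * mult; /* largest multiple <= dim */

    if (!round_up || down == dim) return down;
    return down + mult;                /* smallest multiple > dim */
}
```

For example, with a multiple of 4, a dimension of 10 rounds up to 12 and down to 8, while 12 is left unchanged in both directions.
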

Now warning-free on M1 Mac with gcc 12.

…thread decorator entry function.

Also, check the runtime if sup is enabled from within bli_gemmsup.

Don't store the *intended* packing schema in the objects, only the *actual* schema once the object is packed for real. This was necessary before to pass the schemas down to the thread decorator which then used that information to create the control tree. Since control tree creation is in bli_*_ex now this is no longer necessary. The schemas also have to be separately passed to bli_gemm_md which is kind of ugly, but that code is ripe for refactoring anyways.

- User-customizable fields will live on in another form.
- The only non-redundant functionality in bli_l3_int was attaching scalars to objects, which now has a helper function called from bli_*_ex. bli_l3_int still exists but is just a simple inline function so that callers don't have to unpack the variant function pointer themselves.

- BLIS_ONE_I: the imaginary unit
- BLIS_MINUS_ONE_I: the negative imaginary unit
- BLIS_NAN: a not-a-number value. Both real and imaginary parts are set to NaN for complex datatypes.

The problem occurs when there are at least two teams of threads packing different parts of a matrix, and where each team has at least two threads; call them team A and team B. The problematic sequence is:
1. The chief of team A checks out a block B and broadcasts the pointer to its teammates.
2. Team A completely packs its data and performs a barrier among its own threads.
3. Team A commences computing with the packed data.
4. The chief of team A finishes computing before its teammates, then calls bli_thrinfo_free on its thrinfo_t struct (which contains the mem_t object referencing the buffer B). This causes buffer B to be checked back in to the pba.
5. The chief of team B checks out the *same* block B that was just checked back in and broadcasts the pointer to its teammates.
6. DATA RACE: now the remaining threads of team A are reading *while* team B is writing to the same buffer B. If team B writes new data before team A is done computing, then an incorrect result is generated.

The solution is to place a global barrier before the call to bli_thrinfo_free at the end of the computation.

- Create structures to hold the entire gemm/trsm control tree.
- Control tree nodes are now initialized (in place) instead of created and returned.
- Packm control tree nodes have their parameters tucked into a struct along with the control tree node itself. We can cast back from the cntl_t* to this packm_cntl_t* object to read/write parameters. This same trick will be used for other control nodes in the future.
- Partially fix up unpackm while we're at it, although that code is not expected to actually work.
- Note that the code to allocate control tree nodes from an sba pool still exists (#ifdef'ed out) in case we want to have that as an option alongside stack allocation.
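
The cast from a generic node pointer back to the enclosing parameter struct relies on a standard C guarantee: a pointer to a struct can be converted to a pointer to its first member and back. A minimal sketch follows, with hypothetical type names (`base_node_sketch_t`, `packm_node_sketch_t`) and fields standing in for cntl_t and packm_cntl_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch, not the BLIS cntl_t / packm_cntl_t definitions. */
typedef struct
{
    int kind; /* stand-in for whatever the generic node stores */
} base_node_sketch_t;

typedef struct
{
    base_node_sketch_t node;        /* MUST be the first member */
    bool               invert_diag; /* example packing parameters */
    bool               reverse;
    int                schema;
} packm_node_sketch_t;

/* Initialize the node and its parameters together, in place. */
void packm_node_sketch_init(packm_node_sketch_t *pn, int schema)
{
    pn->node.kind   = 1;
    pn->invert_diag = false;
    pn->reverse     = false;
    pn->schema      = schema;
}

/* A variant only receives the generic node pointer, but can recover the
   enclosing struct because the node is its first member. */
int packm_variant_sketch_schema(const base_node_sketch_t *node)
{
    assert(node != NULL);
    const packm_node_sketch_t *pn = (const packm_node_sketch_t *)node;
    return pn->schema;
}
```

The cast is only valid when the generic node really is embedded at offset zero of the larger struct, so code using this pattern needs some way to know what kind of node it was handed.
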

In this phase, add the threading granularity (block size multiple) and weighted threading preference. These are now determined and set in the control tree initialization (gemm and trsm), and stored in a part_cntl_t struct. This information is now read in the gemm/trsm variants and passed directly to bli_thread_range_[mn]dim. Note that this change requires passing the matrix objects to the control tree initialization (along with the cntx_t). For now, the schemas are determined separately and also passed in; in the future the schemas will also be determined within the control tree initialization so that these functions have signatures much like bli_gemm_ex itself (except the rntm_t).

Combine all of the bli_thread_range(_weighted)?_[tlbr]2[tlbr] functions into one. This function takes additional parameters describing the dimension (M or N), direction, and use of weighting. This saves quite a bit of code. bli_thread_range_[mn]dim are also simple wrappers around bli_thread_range now. Also, make bli_thread_range take a simple dim_t for the blocksize rather than a num_t/blksz_t combination.
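
To illustrate what a single combined range function could look like, here is a sketch of the unweighted case only: it splits a dimension of length n among nt threads in units of a plain blocksize bf, with a flag for the direction. The name `thread_range_sketch`, its argument list, and the way leftover units are distributed are assumptions for illustration; the weighted (triangular) case and the M/N dimension selection are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of an unweighted thread range computation.
   Thread tid of nt receives the half-open range [*start, *end) of a
   dimension of length n, partitioned in multiples of bf. */
void thread_range_sketch(size_t tid, size_t nt, size_t n, size_t bf,
                         bool backward, size_t *start, size_t *end)
{
    assert(nt > 0 && bf > 0 && tid < nt);

    size_t n_units = (n + bf - 1) / bf; /* bf-sized units; last may be partial */
    size_t base    = n_units / nt;      /* units every thread gets */
    size_t extra   = n_units % nt;      /* first `extra` threads get one more */

    size_t u_start = tid * base + (tid < extra ? tid : extra);
    size_t u_end   = u_start + base + (tid < extra ? 1 : 0);

    size_t s = u_start * bf;
    size_t e = u_end * bf;
    if (s > n) s = n;                   /* clamp the partial last unit */
    if (e > n) e = n;

    if (backward)
    {
        /* Mirror the range so that thread 0 owns the end of the dimension. */
        size_t mirrored_s = n - e;
        size_t mirrored_e = n - s;
        s = mirrored_s;
        e = mirrored_e;
    }

    *start = s;
    *end   = e;
}
```

With n = 10, bf = 4, and two threads, the forward ranges are [0, 8) and [8, 10), and the backward ranges are [2, 10) and [0, 2).
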

Store specific b_alg and b_max values in the control tree nodes. bli_determine_blocksize now takes these directly instead of using the block size id. Adjustment of KC for triangular/symmetric matrices is handled up-front when constructing the control tree.
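
As a sketch of how b_alg and b_max might be used once they are passed in directly, the following shows the usual "algorithmic versus maximum blocksize" rule for the forward direction: take b_alg per iteration unless everything that remains fits within b_max, in which case take it all at once. The function names are hypothetical and the rule is stated as an assumption about the intended behavior, not as a copy of bli_determine_blocksize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: blocksize for the iteration starting at offset i of a
   dimension of length dim (forward direction only). */
size_t determine_blocksize_sketch(size_t i, size_t dim,
                                  size_t b_alg, size_t b_max)
{
    assert(i < dim && b_alg > 0 && b_alg <= b_max);

    size_t dim_left = dim - i;

    /* If the remainder fits in the maximum blocksize, do it in one go
       instead of leaving a small edge iteration. */
    return (dim_left <= b_max) ? dim_left : b_alg;
}

/* Count how many iterations a loop over dim would take with this rule. */
size_t count_iterations_sketch(size_t dim, size_t b_alg, size_t b_max)
{
    size_t iters = 0;
    for (size_t i = 0; i < dim; )
    {
        i += determine_blocksize_sketch(i, dim, b_alg, b_max);
        iters++;
    }
    return iters;
}
```

With b_alg = 256 and b_max = 320, a dimension of 576 is handled as 256 + 320 (two iterations) rather than 256 + 256 + 64.
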

…rol tree. Instead of looking at the operation family from the control tree, only look at which matrices are triangular.

Partitioning direction is now encoded directly into the control tree and simply read by the variants. Note that the direction of the loops for triangular matrices is the same as before, but the "other" dimension (where no data dependency occurs in trmm/trsm) is always forwards now.

The remaining places where this was used have been replaced by other parameters:
- In bli_gemm_blk_var3, the choice whether or not to clear beta uses a new `reset_beta` flag on part_cntl_t.
- The family is passed directly into bli_l3_decor (again).

…ding sub-prenodes) without any hard-coded logic for different operation families. Also potentially enable more than two sub-nodes.

The proper variant is now selected up-front during control tree construction.

This entailed making new control tree nodes for the gemm and trsm macro-kernels (gemmt and trmm use the gemm one). Since it is expected that there is a thrinfo_t node for BOTH the jr and ir levels of parallelism, a "dummy" empty control tree node is attached to the macrokernel node.

# Conflicts:
# frame/1m/packm/bli_packm_blk_var1.c
# frame/3/bli_l3_decor.c
# frame/3/bli_l3_sup_packm_var.c
# frame/3/bli_l3_thrinfo.c
# frame/3/gemm/bli_gemm_cntl.c
# frame/3/gemm/bli_gemm_ker_var2.c
# frame/3/gemmt/bli_gemmt_l_ker_var2.c
# frame/3/gemmt/bli_gemmt_u_ker_var2.c
# frame/3/gemmt/bli_gemmt_var.h
# frame/3/gemmt/bli_gemmt_x_ker_var2.c
# frame/3/trmm/bli_trmm_ll_ker_var2.c
# frame/3/trmm/bli_trmm_lu_ker_var2.c
# frame/3/trmm/bli_trmm_rl_ker_var2.c
# frame/3/trmm/bli_trmm_ru_ker_var2.c
# frame/3/trmm/bli_trmm_var.h
# frame/3/trmm/bli_trmm_xx_ker_var2.c
# frame/3/trsm/bli_trsm_ll_ker_var2.c
# frame/3/trsm/bli_trsm_lu_ker_var2.c
# frame/3/trsm/bli_trsm_xx_ker_var2.c
# frame/base/bli_rntm.c
# frame/thread/bli_thread.c
# frame/thread/bli_thread.h
# sandbox/gemmlike/bls_l3_packm_var1.c
# sandbox/gemmlike/bls_l3_packm_var2.c

devinamatthews · 2023-01-13T22:26:31Z

@fgvanzee this PR is phase 1: it puts all of the logic and parameters into the control tree but does not yet allow for user customization. It also does not cleanly handle induced methods and mixed-domain computation (actually the latter is broken but I'm not concerned ATM). This PR shouldn't be merged in its current form but I wanted to give you a chance to review the changes so far and make any necessary changes.

devinamatthews · 2023-01-13T22:27:57Z

Looks like it needs the symbol file updated too. Ugh.

…based types.

…ntrol_trees

# Conflicts:
# frame/base/bli_gks.c
# frame/include/bli_arch_config.h

…nges due to the addition of INSERT_GENTCONF.

…ls to use (only implemented in gemm for now).

…sm microkernels.

devinamatthews · 2024-02-15T17:50:37Z

@Aaron-Hutchinson can you check the changes I made to the SiFive x280 packing kernels in this PR? The Travis test passed but I wanted to make sure it looks OK.

devinamatthews · 2024-02-16T19:00:46Z

@fgvanzee this should be ready to merge now unless @Aaron-Hutchinson has any comments on SiFive x280 packing.

Aaron-Hutchinson · 2024-02-26T20:46:06Z

@devinamatthews Very sorry that I didn't see this sooner. I believe @myeh01 is far more familiar with these packing kernels, so I defer judgment to him. It looks ok to me.

CC: @nick-knight

myeh01 · 2024-02-26T22:48:50Z

kernels/sifive_x280/1m/bli_packm_sifive_x280_asm.c

+    float kappa_cast = *kappa;
+
+    // MRxk kernel
+    if (cdim == 7 && cdim_max == 7 && cdim_bcast == 1)


For our packing kernels, we would like to call the vectorized code when cdim < cdim_max as well.

Suggested change

-    if (cdim == 7 && cdim_max == 7 && cdim_bcast == 1)
+    if (cdim <= 7 && cdim_max == 7 && cdim_bcast == 1)

And similarly for the NRxk kernels. This should amount to a total of 8 such changes in this file.

myeh01 · 2024-02-26T22:51:35Z

@devinamatthews Thanks for the hard work! I saw only one thing, which is to call the vectorized packing kernels when cdim <= cdim_max (and not just when cdim == cdim_max). Otherwise, LGTM!

- Add support for configuring and building pre-initialized plugins (configure-plugin --build and make) out of the source tree.
- Fix various issues with C++-based plugins such as premature inclusion of blis.h, C++ language flags, predefined CXX variables with spaces, etc.

- Remove designated initializer syntax. This isn't officially supported until C++20.
- Put initializers in the order in which they are defined in the struct. Even with standard or extension support for designated initializers, initializing non-static members out-of-order is an error in C++.
- Remove the conditional code which uses `-1` as the default value of the `pack_buf` member of `mem_t` in C, but `BLIS_BUFFER_FOR_GEN_USE` in C++. Simply use the latter as a common-sense default.

# Conflicts: # frame/include/bli_type_defs.h

…correctly [ci skip].

# Conflicts: # frame/include/bli_type_defs.h

# Conflicts:
# frame/base/bli_rntm.h
# frame/include/bli_type_defs.h

…ous '=' by hand. This reverts commit 1aac15a, reversing changes made to d02de76.

- Add a "termination tag" to the testsuite output so that 'make check-blis' can accurately check for successful completion. - Add a new function to centrally compute FLOPs for level-3 operations in the testsuite. - (cherry picked from a49238e)
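The automated calculation of info bit shifts and masks mentioned in the description can be pictured with a small C sketch. The field names and bit widths below (`DEMO_DT_BITS` and so on) are hypothetical and are not the actual BLIS info/info2 layout. The sketch only shows the general idea: each shift is derived from the previous field's shift and width, so resizing or reordering a field needs no manual renumbering.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths in bits; NOT the actual BLIS layout. */
#define DEMO_DT_BITS     3
#define DEMO_TRANS_BITS  2
#define DEMO_UPLO_BITS   3

/* Each shift is the previous field's shift plus that field's width. */
#define DEMO_DT_SHIFT     0
#define DEMO_TRANS_SHIFT  (DEMO_DT_SHIFT + DEMO_DT_BITS)
#define DEMO_UPLO_SHIFT   (DEMO_TRANS_SHIFT + DEMO_TRANS_BITS)

/* A mask is `bits` consecutive ones moved up to the field's shift. */
#define DEMO_MASK(bits, shift) ((((uint32_t)1 << (bits)) - 1u) << (shift))
#define DEMO_DT_MASK     DEMO_MASK(DEMO_DT_BITS, DEMO_DT_SHIFT)
#define DEMO_TRANS_MASK  DEMO_MASK(DEMO_TRANS_BITS, DEMO_TRANS_SHIFT)
#define DEMO_UPLO_MASK   DEMO_MASK(DEMO_UPLO_BITS, DEMO_UPLO_SHIFT)

/* Store `val` in the field described by (mask, shift), leaving other bits alone. */
static inline uint32_t demo_set_field(uint32_t info, uint32_t mask,
                                      unsigned shift, uint32_t val)
{
    /* The value must fit inside the field. */
    assert(((val << shift) & ~mask) == 0);
    return (info & ~mask) | (val << shift);
}

/* Read back the field described by (mask, shift). */
static inline uint32_t demo_get_field(uint32_t info, uint32_t mask, unsigned shift)
{
    return (info & mask) >> shift;
}
```

With these widths, changing `DEMO_DT_BITS` from 3 to 4 moves every later shift and mask automatically.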

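The description notes that bli_align_dim_to_mult() now rounds either up or down based on a flag. The helper below is a hypothetical stand-in that illustrates that behavior. Its name, parameter order, and types are assumptions and do not reproduce the BLIS signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: round `dim` to a multiple of `mult`.
   round_up == true  -> smallest multiple >= dim
   round_up == false -> largest multiple  <= dim */
static inline size_t demo_align_dim_to_mult(size_t dim, size_t mult, bool round_up)
{
    assert(mult > 0);

    size_t rem = dim % mult;

    /* Already aligned: nothing to do in either direction. */
    if (rem == 0) return dim;

    return round_up ? dim + (mult - rem) : dim - rem;
}
```

For example, with a multiple of 4, a dimension of 10 becomes 12 when rounding up and 8 when rounding down.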
devinamatthews added 23 commits December 11, 2022 18:48

Get rid of warnings.

9484d05

Now warning-free on M1 Mac with gcc 12.

Move control tree creation back into bli_*_front and simplify the l3 thread decorator entry function.

4c2e2a9

Merge bli_*_ex and bli_*_front for l3 operations.

846ed67

Consolidate checks for zero dimensions, etc. into one function.

267df0b

Also, check the runtime if sup is enabled from within bli_gemmsup.

Simplify object aliasing in l3 expert OAPIs.

559cc3f

Move initialization of rntm ways of parallelism to l3 thread decorator.

9f7d3a7

Add new constants.

5914bd5

- BLIS_ONE_I: the imaginary unit
- BLIS_MINUS_ONE_I: the negative imaginary unit
- BLIS_NAN: a not-a-number value. Both real and imaginary parts are set to NaN for complex datatypes.
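As a rough illustration of the values these constants represent, the sketch below builds them with C99 complex types. It is not how BLIS defines or stores the constants, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <math.h>
#include <stdbool.h>

/* Build a complex value from its parts. C99 guarantees that a complex
   number is laid out like an array of two reals: {real, imaginary}. */
static inline double complex demo_make(double re, double im)
{
    double complex z;
    double *parts = (double *)&z;
    parts[0] = re;
    parts[1] = im;
    return z;
}

static inline double demo_re(double complex z) { return ((double *)&z)[0]; }
static inline double demo_im(double complex z) { return ((double *)&z)[1]; }

/* Hypothetical stand-ins for the values described by the new constants. */
static inline double complex demo_one_i(void)       { return demo_make(0.0,  1.0); }
static inline double complex demo_minus_one_i(void) { return demo_make(0.0, -1.0); }
static inline double complex demo_nan(void)         { return demo_make(NAN, NAN); }

/* True only if BOTH parts are NaN (NaN is the only value unequal to itself). */
static inline bool demo_is_full_nan(double complex z)
{
    double re = demo_re(z), im = demo_im(z);
    return re != re && im != im;
}

/* Small self-check: squaring the imaginary unit gives -1. */
static inline void demo_self_test(void)
{
    double complex sq = demo_one_i() * demo_one_i();
    assert(demo_re(sq) == -1.0 && demo_im(sq) == 0.0);
    assert(demo_is_full_nan(demo_nan()));
}
```

The NaN check matters for complex types because a value with only one NaN part would not match the "both parts are NaN" description given for BLIS_NAN.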

Simplify the l3_prune functions and remove need to reference the control tree.

ff30f4f

Instead of looking at the operation family from the control tree, only look at which matrices are triangular.

Refactor cntl_t such that we can specify the desired threading (including sub-prenodes) without any hard-coded logic for different operation families.

fa0817e

Also potentially enable more than two sub-nodes.

Store the packm variant and kernel pointers in the control tree.

5b8bdaa

Move block size multiples for packing into control tree.

5e81537

Remove the "xx" macro-kernel variants for gemmt/trmm/trsm.

980a52f

The proper variant is now selected up-front during control tree construction.
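A minimal sketch of what "selected up-front" can look like, assuming a tree of nodes that each carry a function pointer. The types and names here (`demo_node_t`, `demo_select_macro`, and so on) are hypothetical and much simpler than the real cntl_t. The point is that the choice of macrokernel happens once at construction, and execution just calls whatever the node holds.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demo_node demo_node_t;
typedef int (*demo_var_fn)(const demo_node_t *node, int depth);

/* Hypothetical control tree node: a variant to run and the next node. */
struct demo_node
{
    demo_var_fn        var;  /* variant chosen at construction time */
    const demo_node_t *sub;  /* next node, or NULL at the leaf */
};

/* Run a node by calling the variant it stores; no per-operation logic here. */
static inline int demo_int(const demo_node_t *node, int depth)
{
    assert(node != NULL && node->var != NULL);
    return node->var(node, depth);
}

/* An interior variant: simply forwards to the sub-node one level deeper. */
static int demo_partition_var(const demo_node_t *node, int depth)
{
    return demo_int(node->sub, depth + 1);
}

/* Stand-in macrokernels; the return values just identify which one ran. */
static int demo_gemm_macro(const demo_node_t *node, int depth)  { (void)node; return 100 + depth; }
static int demo_lower_macro(const demo_node_t *node, int depth) { (void)node; return 200 + depth; }
static int demo_upper_macro(const demo_node_t *node, int depth) { (void)node; return 300 + depth; }

typedef enum { DEMO_GENERAL, DEMO_LOWER, DEMO_UPPER } demo_struc_t;

/* Pick the macrokernel once, based on matrix structure. */
static demo_var_fn demo_select_macro(demo_struc_t struc)
{
    switch (struc)
    {
        case DEMO_LOWER: return demo_lower_macro;
        case DEMO_UPPER: return demo_upper_macro;
        default:         return demo_gemm_macro;
    }
}

/* Build a two-level tree: a partitioning node above a macrokernel leaf. */
static void demo_build_tree(demo_node_t nodes[2], demo_struc_t struc)
{
    nodes[1].var = demo_select_macro(struc);
    nodes[1].sub = NULL;
    nodes[0].var = demo_partition_var;
    nodes[0].sub = &nodes[1];
}
```

Because the leaf already holds the right function pointer, no intermediate "xx" dispatcher is needed at execution time in this sketch.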

devinamatthews requested a review from fgvanzee January 13, 2023 22:24

devinamatthews added 4 commits January 18, 2023 16:32

Make gemmbp control tree the only option for gemm.

49f053f

Fix bug where assignment is used instead of equality testing.

c091ab0

Silence some annoying warnings.

092a86e

Add type-safety, type-agnosticism, and user parameters to packm kernels.

e0a2dd1

devinamatthews and others added 10 commits November 18, 2023 16:20

Add BLIS_ENABLE_STD_COMPLEX macro to explicitly request std::complex-based types.

9723262

Merge branch 'std-complex-fix' into new_control_trees

e322de2

Trivial changes, mostly whitespace.

5135103

Merge remote-tracking branch 'upstream/new_control_trees' into new_control_trees

704b7b8

Merge branch 'master' into new_control_trees

81f460a

# Conflicts:
#   frame/base/bli_gks.c
#   frame/include/bli_arch_config.h

Update documentation for adding new sub-configuration to reflect changes due to the addition of INSERT_GENTCONF.

ba7a472

Add C microtile offset information to auxinfo_t for custom microkernels to use (only implemented in gemm for now).

09c78e4

Update SiFive x280 packing kernels for new API.

7df41ed

Add const qualifier to auxinfo_t parameter of SiFive x280 gemm/gemmtrsm microkernels.

4e2ed2b

Fix include.

dc3e615

myeh01 reviewed Feb 26, 2024

devinamatthews and others added 7 commits March 5, 2024 13:21

Allow asm kernels in bli_packm_sifive_x280_asm.c to pack partial slices.

0c6d47e

Fix typos and clarify.

d88dbd4

Fix problem with overriding Makefile rules [ci skip].

ecbeab6

Merge branch 'no-designated-initializers' into new_control_trees

5a602ac

# Conflicts:
#   frame/include/bli_type_defs.h

The rcc and ccr reference microkernel wrapper files were named incorrectly [ci skip].

01bfc69

fgvanzee approved these changes Mar 26, 2024

devinamatthews added 4 commits March 26, 2024 16:17

Add comments [ci skip].

9bdfd94

Merge branch 'no-designated-initializers' into new_control_trees

d02de76

# Conflicts:
#   frame/include/bli_type_defs.h

Merge branch 'master' into new_control_trees

1aac15a

# Conflicts:
#   frame/base/bli_rntm.h
#   frame/include/bli_type_defs.h

Revert "Merge branch 'master' into new_control_trees" and fix extraneous '=' by hand.

a18252d

This reverts commit 1aac15a, reversing changes made to d02de76.

fgvanzee merged commit a49238e into master Apr 24, 2024
2 checks passed

fgvanzee deleted the new_control_trees branch April 25, 2024 19:56

devinamatthews commented Jan 13, 2023

devinamatthews commented Feb 15, 2024

devinamatthews commented Feb 16, 2024

Aaron-Hutchinson commented Feb 26, 2024

myeh01 Feb 26, 2024

myeh01 commented Feb 26, 2024

	if (cdim == 7 && cdim_max == 7 && cdim_bcast == 1)
	if (cdim <= 7 && cdim_max == 7 && cdim_bcast == 1)
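The two conditions above differ only in whether a micropanel with fewer than 7 active rows qualifies. The sketch below makes that difference explicit. It assumes `cdim` is the number of rows actually present, `cdim_max` is the register blocksize, and `cdim_bcast` is the duplication factor. The function name and the `allow_partial` flag are hypothetical and do not come from the SiFive x280 kernels.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate mirroring the two conditions:
   allow_partial == false  ->  cdim == 7 (full slices only)
   allow_partial == true   ->  cdim <= 7 (partial slices also qualify) */
static inline bool demo_takes_fast_path(int cdim, int cdim_max, int cdim_bcast,
                                        bool allow_partial)
{
    assert(cdim >= 0 && cdim_max > 0 && cdim_bcast > 0);

    bool rows_ok = allow_partial ? (cdim <= 7) : (cdim == 7);

    return rows_ok && cdim_max == 7 && cdim_bcast == 1;
}
```

Under these assumptions, an edge micropanel with `cdim == 5` is rejected by the `==` form and accepted by the `<=` form, while both forms still require `cdim_max == 7` and `cdim_bcast == 1`.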
