-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable user-customization of packing #549
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
… call gemmt instead. The induced methods for gemmt are currently missing but I imagine that is easy to fill in.
1. Removed bli_packm_blk_var1_md and merged with bli_packm_blk_var1. 2. Simplified packing code under the assumption that the matrices are passed as A and B^T (i.e., that k is always the row dimension). This primarily affects bli_packm_blk_var1 and the bli_packm_struc_cxk functions. To compensate, explicit transposition is done as appropriate in bli_(gemm|trsm)_pack[ab]. CAVEAT: explicit transitions of B and Bp are also needed in the gemm and gemmtrsm ukr tests. 3. Explicitly remove any offsets in the user-provided obj_t's. Thus the offsets represent only the partitioning by the framework.
…_packm_blk_var1. gemm/gemmtrsm/trsm ukr tests required an overhaul to make use of the modified packing kernel.
bli_gemm_packab and bli_trsm_packab came along for the ride.
All threads need to make a copy of the master thread's local mem_t struct, but the master thread may exit before this is complete. Fixed by putting the thread barrier in the right place. Also, fixed a problem with the max panel length and work partitioning for triangular matrices.
…ar matrices correctly, so pass BLIS_GENERAL in this case.
@hominhquan yes, things got reordered and so we ended up with a race condition. I think now it is pretty compartmentalized and modular so that a custom packing kernel could use |
Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
There was a data race when simulating application-level threading: the `test_done` field of the operation structures was set by all threads on a shared object. The threading driver now copies all of the operation structures to be local to each thread.
The packing buffer for B was being released too early. The fix is a bit ugly because it really does need to get release at that point for all but the last repeat, but later on the last one.
STOP IT!
This reverts commit ef64cdf.
# Conflicts: # frame/1m/packm/bli_packm_alloc.c # frame/1m/packm/bli_packm_blk_var1.c # frame/1m/packm/bli_packm_init.c # frame/1m/packm/bli_packm_struc_cxk_3mis.c # frame/1m/packm/bli_packm_struc_cxk_3mis.h # frame/1m/packm/bli_packm_struc_cxk_4mi.c # frame/1m/packm/bli_packm_struc_cxk_4mi.h # frame/1m/packm/bli_packm_struc_cxk_rih.c # frame/1m/packm/bli_packm_struc_cxk_rih.h # frame/3/bli_l3.h # frame/3/bli_l3_check.c # frame/3/bli_l3_check.h # frame/3/bli_l3_ind.c # frame/3/bli_l3_oapi.c # frame/3/bli_l3_oapi.h # frame/3/bli_l3_oapi_ba.c # frame/3/bli_l3_oapi_ex.c # frame/3/bli_l3_tapi_ex.h # frame/3/gemm/bli_gemm_md.c # frame/3/gemmt/bli_gemmt_front.c # frame/3/her2k/bli_her2k_front.c # frame/3/herk/bli_herk_front.c # frame/3/syr2k/bli_syr2k_front.c # frame/3/syrk/bli_syrk_front.c # frame/ind/oapi/bli_l3_3m4m1m_oapi.c # frame/ind/oapi/bli_l3_ind_oapi.c # frame/ind/oapi/bli_l3_ind_oapi.h # frame/ind/oapi/bli_l3_nat_oapi.c # frame/ind/tapi/bli_l3_ind_tapi.c # kernels/armsve/3/armsve_asm_macros_scomplex.h # testsuite/input.general.fast
@fgvanzee ready to review |
devinamatthews
changed the title
Enable user-customized of packing
Enable user-customization of packing
Nov 12, 2021
Details: - Fixed bli_obj_swap() so that it correctly handles the root fields after obj_t contents are swapped. - Renamed bli_obj_remove_offs() to bli_obj_reset_origin() and changed the function to reset the object's root field. - Comment and whitespace updates.
I think we're done. I'll compile a monolithic commit message and then squash. |
Details: - Removed calls to bli_obj_set_as_root() within the level-3 _front() functions since the setting of the aliases' root fields is now handled by bli_obj_reset_origin(). (The calls to bli_obj_set_as_root() in gemm_front() has already been removed.)
dzambare
pushed a commit
to Meghana-vankadari/blis
that referenced
this pull request
Jan 6, 2022
Details: - Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and .ker_params. These fields store pointers to functions and data that will allow the user to more flexibly create custom operations while recycling BLIS's existing partitioning infrastructure. - Updated typed API to packm variant and structure-aware kernels to replace the diagonal offset with panel offsets, and changed strides of both C and P to inc/ldim semantics. Updated object API to the packm variant to include rntm_t*. - Removed the packm variant function pointer from the packm cntl_t node definition since it has been replaced by the .pack_fn pointer in the obj_t. - Updated bli_packm_int() to read the new packm variant function pointer from the obj_t and call it instead of from the cntl_t node. - Moved some of the logic of bli_l3_packm.c to a new file, bli_packm_alloc.c. - Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers instead of typed pointers, allowing a single function to be used regardless of datatype. This obviated having a separate implementation in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a new function, bli_packm_scalar(). - Employed a new standard whereby right-hand matrix operands ("B") are always packed as column-stored row panels -- that is, identically to that of left-hand matrix operands ("A"). This means that while we pack matrix A normally, we actually pack B in a transposed state. This allowed us to simplify a lot of code throughout the framework, and also affected some of the logic in bli_l3_packa() and _packb(). - Simplified bli_packm_init.c in light of the new B^T convention described above. bli_packm_init()--which is now called from within bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns a bool that indicates whether packing should be performed (or skipped). - Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(), which, among other things, defaults the new .pack_fn field of the obj_t to bli_packm_blk_var1() if the field is NULL. - Defined a new function, bli_obj_reset_origin(), which permanently refocuses the view of an object so that it "forgets" any offsets from its original pointer. This function also sets the object's root field to itself. Calls to bli_obj_reset_origin() for each matrix operand appear in the _front() functions, after the obj_t's are aliased. This resetting of the underlying matrices' origins is needed in preparation for more advanced features from within custom packm kernels. - Redefined bli_pba_rntm_set_pba() from a regular function to a static inline function. - Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use libblis_test_pobj_create() to create local packed objects. Previously, these packed objects were created by calling lower-level functions.
fgvanzee
added a commit
that referenced
this pull request
Oct 24, 2022
Details: - Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and .ker_params. These fields store pointers to functions and data that will allow the user to more flexibly create custom operations while recycling BLIS's existing partitioning infrastructure. - Updated typed API to packm variant and structure-aware kernels to replace the diagonal offset with panel offsets, and changed strides of both C and P to inc/ldim semantics. Updated object API to the packm variant to include rntm_t*. - Removed the packm variant function pointer from the packm cntl_t node definition since it has been replaced by the .pack_fn pointer in the obj_t. - Updated bli_packm_int() to read the new packm variant function pointer from the obj_t and call it instead of from the cntl_t node. - Moved some of the logic of bli_l3_packm.c to a new file, bli_packm_alloc.c. - Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers instead of typed pointers, allowing a single function to be used regardless of datatype. This obviated having a separate implementation in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a new function, bli_packm_scalar(). - Employed a new standard whereby right-hand matrix operands ("B") are always packed as column-stored row panels -- that is, identically to that of left-hand matrix operands ("A"). This means that while we pack matrix A normally, we actually pack B in a transposed state. This allowed us to simplify a lot of code throughout the framework, and also affected some of the logic in bli_l3_packa() and _packb(). - Simplified bli_packm_init.c in light of the new B^T convention described above. bli_packm_init()--which is now called from within bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns a bool that indicates whether packing should be performed (or skipped). - Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(), which, among other things, defaults the new .pack_fn field of the obj_t to bli_packm_blk_var1() if the field is NULL. - Defined a new function, bli_obj_reset_origin(), which permanently refocuses the view of an object so that it "forgets" any offsets from its original pointer. This function also sets the object's root field to itself. Calls to bli_obj_reset_origin() for each matrix operand appear in the _front() functions, after the obj_t's are aliased. This resetting of the underlying matrices' origins is needed in preparation for more advanced features from within custom packm kernels. - Redefined bli_pba_rntm_set_pba() from a regular function to a static inline function. - Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use libblis_test_pobj_create() to create local packed objects. Previously, these packed objects were created by calling lower-level functions. - (cherry picked from commit cf7d616)
fgvanzee
added a commit
that referenced
this pull request
Nov 3, 2022
Details: - Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and .ker_params. These fields store pointers to functions and data that will allow the user to more flexibly create custom operations while recycling BLIS's existing partitioning infrastructure. - Updated typed API to packm variant and structure-aware kernels to replace the diagonal offset with panel offsets, and changed strides of both C and P to inc/ldim semantics. Updated object API to the packm variant to include rntm_t*. - Removed the packm variant function pointer from the packm cntl_t node definition since it has been replaced by the .pack_fn pointer in the obj_t. - Updated bli_packm_int() to read the new packm variant function pointer from the obj_t and call it instead of from the cntl_t node. - Moved some of the logic of bli_l3_packm.c to a new file, bli_packm_alloc.c. - Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers instead of typed pointers, allowing a single function to be used regardless of datatype. This obviated having a separate implementation in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a new function, bli_packm_scalar(). - Employed a new standard whereby right-hand matrix operands ("B") are always packed as column-stored row panels -- that is, identically to that of left-hand matrix operands ("A"). This means that while we pack matrix A normally, we actually pack B in a transposed state. This allowed us to simplify a lot of code throughout the framework, and also affected some of the logic in bli_l3_packa() and _packb(). - Simplified bli_packm_init.c in light of the new B^T convention described above. bli_packm_init()--which is now called from within bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns a bool that indicates whether packing should be performed (or skipped). - Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(), which, among other things, defaults the new .pack_fn field of the obj_t to bli_packm_blk_var1() if the field is NULL. - Defined a new function, bli_obj_reset_origin(), which permanently refocuses the view of an object so that it "forgets" any offsets from its original pointer. This function also sets the object's root field to itself. Calls to bli_obj_reset_origin() for each matrix operand appear in the _front() functions, after the obj_t's are aliased. This resetting of the underlying matrices' origins is needed in preparation for more advanced features from within custom packm kernels. - Redefined bli_pba_rntm_set_pba() from a regular function to a static inline function. - Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use libblis_test_pobj_create() to create local packed objects. Previously, these packed objects were created by calling lower-level functions. - (cherry picked from commit cf7d616) Fixed trmm[3]/trsm performance bug in cf7d616. (#685) Details: - Fixed a performance bug in the packing of micropanels that intersect the diagonal of triangular matrices (i.e., those found in trmm, trmm3, and trsm). This bug was introduced in cf7d616 and stemmed from an ill-formed boolean conditional expression in bli_packm_blk_var1(). This conditional would chose when to use round-robin parallel work allocation, but checked for the triangularity of the submatrix being packed while failing also to check for whether the current micropanel actually intersected the diagonal. The net result of this bug was that *all* micropanels of a triangular matrix, no matter where the upanels resided within the matrix, were assigned to threads via a round-robin policy. This affected some microarchitectures and threading configurations much worse than others, but it seems that overall the effect was universally negative, likely because of the reduced spatial locality during the packing with round-robin. Thanks to Leick Robinson for his tireless efforts in helping track down this issue. - (cherry picked from commit 872898d)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds functionality to override the "default" packing microkernel (which packs one MRK or NRK micropanel) and variant (kernel, which packs an entire matrix block/panel). A void* field on obj_t allows users to pass arbitrary additional parameters to packing microkernels and variants.