Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable user-customization of packing #549

Merged
merged 28 commits into from
Dec 2, 2021
Merged

Conversation

devinamatthews
Copy link
Member

This PR adds functionality to override the "default" packing microkernel (which packs one MRK or NRK micropanel) and variant (kernel, which packs an entire matrix block/panel). A void* field on obj_t allows users to pass arbitrary additional parameters to packing microkernels and variants.

… call gemmt instead.

The induced methods for gemmt are currently missing but I imagine that is easy to fill in.
1. Removed bli_packm_blk_var1_md and merged with bli_packm_blk_var1.
2. Simplified packing code under the assumption that the matrices are passed as A and B^T (i.e., that k is always the row dimension). This primarily affects bli_packm_blk_var1 and the bli_packm_struc_cxk functions. To compensate, explicit transposition is done as appropriate in bli_(gemm|trsm)_pack[ab]. CAVEAT: explicit transitions of B and Bp are also needed in the gemm and gemmtrsm ukr tests.
3. Explicitly remove any offsets in the user-provided obj_t's. Thus the offsets represent only the partitioning by the framework.
…_packm_blk_var1.

gemm/gemmtrsm/trsm ukr tests required an overhaul to make use of the modified packing kernel.
bli_gemm_packab and bli_trsm_packab came along for the ride.
All threads need to make a copy of the master thread's local mem_t struct, but the master thread may exit before this is complete. Fixed by putting the thread barrier in the right place. Also, fixed a problem with the max panel length and work partitioning for triangular matrices.
…ar matrices correctly, so pass BLIS_GENERAL in this case.
@devinamatthews
Copy link
Member Author

@hominhquan yes, things got reordered and so we ended up with a race condition. I think now it is pretty compartmentalized and modular so that a custom packing kernel could use bli_packm_init() without taking any extra precautions.

Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
There was a data race when simulating application-level threading: the `test_done` field of the operation structures was set by all threads on a shared object. The threading driver now copies all of the operation structures to be local to each thread.
The packing buffer for B was being released too early. The fix is a bit ugly because it really does need to get release at that point for all but the last repeat, but later on the last one.
@devinamatthews
Copy link
Member Author

Finally. @fgvanzee this is ready to review when you're done with #531.

# Conflicts:
#	frame/1m/packm/bli_packm_alloc.c
#	frame/1m/packm/bli_packm_blk_var1.c
#	frame/1m/packm/bli_packm_init.c
#	frame/1m/packm/bli_packm_struc_cxk_3mis.c
#	frame/1m/packm/bli_packm_struc_cxk_3mis.h
#	frame/1m/packm/bli_packm_struc_cxk_4mi.c
#	frame/1m/packm/bli_packm_struc_cxk_4mi.h
#	frame/1m/packm/bli_packm_struc_cxk_rih.c
#	frame/1m/packm/bli_packm_struc_cxk_rih.h
#	frame/3/bli_l3.h
#	frame/3/bli_l3_check.c
#	frame/3/bli_l3_check.h
#	frame/3/bli_l3_ind.c
#	frame/3/bli_l3_oapi.c
#	frame/3/bli_l3_oapi.h
#	frame/3/bli_l3_oapi_ba.c
#	frame/3/bli_l3_oapi_ex.c
#	frame/3/bli_l3_tapi_ex.h
#	frame/3/gemm/bli_gemm_md.c
#	frame/3/gemmt/bli_gemmt_front.c
#	frame/3/her2k/bli_her2k_front.c
#	frame/3/herk/bli_herk_front.c
#	frame/3/syr2k/bli_syr2k_front.c
#	frame/3/syrk/bli_syrk_front.c
#	frame/ind/oapi/bli_l3_3m4m1m_oapi.c
#	frame/ind/oapi/bli_l3_ind_oapi.c
#	frame/ind/oapi/bli_l3_ind_oapi.h
#	frame/ind/oapi/bli_l3_nat_oapi.c
#	frame/ind/tapi/bli_l3_ind_tapi.c
#	kernels/armsve/3/armsve_asm_macros_scomplex.h
#	testsuite/input.general.fast
@devinamatthews
Copy link
Member Author

@fgvanzee ready to review

@devinamatthews devinamatthews changed the title Enable user-customized of packing Enable user-customization of packing Nov 12, 2021
Details:
- Fixed bli_obj_swap() so that it correctly handles the root fields
  after obj_t contents are swapped.
- Renamed bli_obj_remove_offs() to bli_obj_reset_origin() and changed
  the function to reset the object's root field.
- Comment and whitespace updates.
@fgvanzee
Copy link
Member

I think we're done. I'll compile a monolithic commit message and then squash.

Details:
- Removed calls to bli_obj_set_as_root() within the level-3 _front()
  functions since the setting of the aliases' root fields is now
  handled by bli_obj_reset_origin(). (The calls to bli_obj_set_as_root()
  in gemm_front() has already been removed.)
@fgvanzee fgvanzee merged commit cf7d616 into master Dec 2, 2021
@devinamatthews devinamatthews deleted the obj_t_makeover_phase1 branch December 7, 2021 21:29
dzambare pushed a commit to Meghana-vankadari/blis that referenced this pull request Jan 6, 2022
Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
  .ker_params. These fields store pointers to functions and data that
  will allow the user to more flexibly create custom operations while  
  recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to 
  replace the diagonal offset with panel offsets, and changed strides 
  of both C and P to inc/ldim semantics. Updated object API to the packm
  variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
  definition since it has been replaced by the .pack_fn pointer in the 
  obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
  from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
  bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
  instead of typed pointers, allowing a single function to be used
  regardless of datatype. This obviated having a separate implementation
  in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a 
  new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
  always packed as column-stored row panels -- that is, identically to 
  that of left-hand matrix operands ("A"). This means that while we pack
  matrix A normally, we actually pack B in a transposed state. This
  allowed us to simplify a lot of code throughout the framework, and
  also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
  described above. bli_packm_init()--which is now called from within
  bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
  a bool that indicates whether packing should be performed (or
  skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
  which, among other things, defaults the new .pack_fn field of the 
  obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
  refocuses the view of an object so that it "forgets" any offsets from 
  its original pointer. This function also sets the object's root field 
  to itself. Calls to bli_obj_reset_origin() for each matrix operand
  appear in the _front() functions, after the obj_t's are aliased. This
  resetting of the underlying matrices' origins is needed in preparation
  for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static 
  inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
  libblis_test_pobj_create() to create local packed objects. Previously,
  these packed objects were created by calling lower-level functions.
fgvanzee added a commit that referenced this pull request Oct 24, 2022
Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
  .ker_params. These fields store pointers to functions and data that
  will allow the user to more flexibly create custom operations while
  recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to
  replace the diagonal offset with panel offsets, and changed strides
  of both C and P to inc/ldim semantics. Updated object API to the packm
  variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
  definition since it has been replaced by the .pack_fn pointer in the
  obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
  from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
  bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
  instead of typed pointers, allowing a single function to be used
  regardless of datatype. This obviated having a separate implementation
  in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a
  new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
  always packed as column-stored row panels -- that is, identically to
  that of left-hand matrix operands ("A"). This means that while we pack
  matrix A normally, we actually pack B in a transposed state. This
  allowed us to simplify a lot of code throughout the framework, and
  also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
  described above. bli_packm_init()--which is now called from within
  bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
  a bool that indicates whether packing should be performed (or
  skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
  which, among other things, defaults the new .pack_fn field of the
  obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
  refocuses the view of an object so that it "forgets" any offsets from
  its original pointer. This function also sets the object's root field
  to itself. Calls to bli_obj_reset_origin() for each matrix operand
  appear in the _front() functions, after the obj_t's are aliased. This
  resetting of the underlying matrices' origins is needed in preparation
  for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static
  inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
  libblis_test_pobj_create() to create local packed objects. Previously,
  these packed objects were created by calling lower-level functions.
- (cherry picked from commit cf7d616)
fgvanzee added a commit that referenced this pull request Nov 3, 2022
Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
  .ker_params. These fields store pointers to functions and data that
  will allow the user to more flexibly create custom operations while
  recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to
  replace the diagonal offset with panel offsets, and changed strides
  of both C and P to inc/ldim semantics. Updated object API to the packm
  variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
  definition since it has been replaced by the .pack_fn pointer in the
  obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
  from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
  bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
  instead of typed pointers, allowing a single function to be used
  regardless of datatype. This obviated having a separate implementation
  in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a
  new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
  always packed as column-stored row panels -- that is, identically to
  that of left-hand matrix operands ("A"). This means that while we pack
  matrix A normally, we actually pack B in a transposed state. This
  allowed us to simplify a lot of code throughout the framework, and
  also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
  described above. bli_packm_init()--which is now called from within
  bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
  a bool that indicates whether packing should be performed (or
  skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
  which, among other things, defaults the new .pack_fn field of the
  obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
  refocuses the view of an object so that it "forgets" any offsets from
  its original pointer. This function also sets the object's root field
  to itself. Calls to bli_obj_reset_origin() for each matrix operand
  appear in the _front() functions, after the obj_t's are aliased. This
  resetting of the underlying matrices' origins is needed in preparation
  for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static
  inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
  libblis_test_pobj_create() to create local packed objects. Previously,
  these packed objects were created by calling lower-level functions.
- (cherry picked from commit cf7d616)

Fixed trmm[3]/trsm performance bug in cf7d616. (#685)

Details:
- Fixed a performance bug in the packing of micropanels that intersect
  the diagonal of triangular matrices (i.e., those found in trmm, trmm3,
  and trsm). This bug was introduced in cf7d616 and stemmed from an
  ill-formed boolean conditional expression in bli_packm_blk_var1().
  This conditional would chose when to use round-robin parallel work
  allocation, but checked for the triangularity of the submatrix being
  packed while failing also to check for whether the current micropanel
  actually intersected the diagonal. The net result of this bug was that
  *all* micropanels of a triangular matrix, no matter where the upanels
  resided within the matrix, were assigned to threads via a round-robin
  policy. This affected some microarchitectures and threading
  configurations much worse than others, but it seems that overall the
  effect was universally negative, likely because of the reduced spatial
  locality during the packing with round-robin. Thanks to Leick Robinson
  for his tireless efforts in helping track down this issue.
- (cherry picked from commit 872898d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants