Omnibus PR - Oct 2023 (#678)
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
  commits that affect non-trivial aspects of BLIS. The major highlights:
  - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
    cntl_t) to the thrinfo_t object. This allows the rntm_t to be
    effectively const (although it is sometimes copied internally and
    modified to reflect different ways of parallelism). Moving the mem_t
    sets the stage for sharing a global control tree amongst all
    threads.
  - De-templatized the macrokernels for gemmt, trmm, and trsm to match
    the macrokernel for gemm, which has been de-templatized since
    54fa28b.
  - Reimplemented bli_l3_determine_kc() by separating out the logic for
    adjusting KC based on MR/NR for triangular A and/or B into a new
    function, bli_l3_adjust_kc(). For now, this function is still called
    from bli_l3_determine_kc(), but in the future we plan to have it
    called once when constructing the control tree.
  - Refactored the level-3 thread decorator into two parts:
    - One part deals only with launching threads, each one calling a
      generic thread entry function. This code resides in frame/thread
      and constitutes the definition of bli_thread_launch(). Note that
      it is specific to the threading implementation (OpenMP, pthreads,
      single, etc.)
    - The other part deals with passing the matrix operands and related
      information into bli_thread_launch(). This is the "l3 decorator"
      and now resides in frame/3. It is agnostic to the threading
      implementation.
  - Modified the "level" of the thread control tree passed in at each
    operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
    passed a communicator representing the active thread teams that
    would share the available work. Now, the *parent* thread comm is
    passed in. The operation then grabs the child comm and uses it to
    partition the work. The notable difference is in bli_trsm_blk_var1(),
    where there are now two child nodes for this single operation (i.e.
    the thread control tree is split one level above where the control
    tree is). The sub-prenode is used for the trsm subproblem while the
    normal sub-node is used for the gemm part. Importantly, the parent
    comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
  added back in the future when the control tree's creation is moved so
  that it happens much sooner (provided that bli_*_front() have not been
  absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
  for consistency. This includes _num_threads(), _thread_id(), _n_way(),
  _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
  _am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
  misspelled.
- (cherry picked from commit aeb5f0c)
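
As a rough illustration of the parent-comm highlight above, the sketch below shows an operation that is handed the parent node and picks the child it needs for each phase. Every name in it (`toy_thrinfo_t`, `toy_range()`, `toy_trsm_var1()`) is hypothetical and not a BLIS API. The contiguous range split is an assumption made for the example, and the barrier is only marked by a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for a thrinfo_t node. */
typedef struct toy_thrinfo_s
{
    int n_way;    /* number of ways the work is split at this node */
    int work_id;  /* which share this thread owns, in [0, n_way) */
    struct toy_thrinfo_s* sub_prenode; /* child for the trsm subproblem */
    struct toy_thrinfo_s* sub_node;    /* child for the gemm part */
} toy_thrinfo_t;

/* Split [0, n) into n_way contiguous chunks and return this thread's chunk. */
static void toy_range( const toy_thrinfo_t* child, int n, int* start, int* end )
{
    assert( child != NULL && child->n_way > 0 );
    assert( 0 <= child->work_id && child->work_id < child->n_way );

    const int chunk = ( n + child->n_way - 1 ) / child->n_way;
    int s = child->work_id * chunk;
    int e = s + chunk;
    if ( s > n ) s = n;
    if ( e > n ) e = n;
    *start = s;
    *end   = e;
}

/* The operation receives the PARENT node and selects a child per phase. */
static void toy_trsm_var1( const toy_thrinfo_t* parent, int m_trsm, int m_gemm,
                           int* trsm_rows, int* gemm_rows )
{
    int s, e;

    /* Phase 1: the trsm subproblem is partitioned with the sub-prenode. */
    toy_range( parent->sub_prenode, m_trsm, &s, &e );
    *trsm_rows = e - s;

    /* A real implementation would barrier on the parent's comm here, so
       that every thread finishes phase 1 before any starts phase 2. */

    /* Phase 2: the gemm part is partitioned with the normal sub-node. */
    toy_range( parent->sub_node, m_gemm, &s, &e );
    *gemm_rows = e - s;
}
```

The point of the shape is that the two phases can be split differently (two separate children) while still sharing one synchronization point, which belongs to the parent.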

Fixed performance bug caused by redundant packing. (#680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
  packing the same (rather than separate) micropanels. This bug was
  caused by different parts of the code using the num_threads/thread_id
  field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
  standardize on the latter and provide a "fake" thrinfo_t sub-prenode
  in the thrinfo tree which consists of single-member thread teams. The
  node holding the single multi-threaded team is still required, since
  it and only it can be used to perform barriers and broadcasts (e.g.
  of the packed buffer pointer).
- (cherry picked from commit 29f79f0)
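
One plausible reading of this bug can be sketched in a few lines. The struct and helpers below are hypothetical stand-ins, not BLIS code. The sketch assumes that the single multi-threaded team node reports n_way = 1 and work_id = 0 to every member, while the single-member-team node reports n_way equal to the thread count and work_id equal to the thread id.

```c
#include <assert.h>

/* Hypothetical stand-in for the thrinfo_t fields named above. */
typedef struct
{
    int num_threads; /* size of this node's thread team */
    int thread_id;   /* id within that team */
    int n_way;       /* number of ways the work is split at this node */
    int work_id;     /* which share of the work this thread owns */
} toy_thrinfo_t;

/* Number of micropanels one thread packs when panels are dealt out
   round-robin by n_way/work_id. */
static int toy_panels_packed( const toy_thrinfo_t* t, int n_panels )
{
    assert( t->n_way > 0 && 0 <= t->work_id && t->work_id < t->n_way );

    int count = 0;
    for ( int p = t->work_id; p < n_panels; p += t->n_way ) ++count;
    return count;
}

/* Total packing work done by nt threads. If use_prenode is 0, each thread
   queries the multi-threaded team node (assumed n_way = 1, work_id = 0).
   Otherwise each thread queries a single-member-team node with
   n_way = nt and work_id = its thread id. */
static int toy_total_packed( int nt, int n_panels, int use_prenode )
{
    int total = 0;
    for ( int tid = 0; tid < nt; ++tid )
    {
        toy_thrinfo_t team    = { nt, tid, 1,  0   };
        toy_thrinfo_t prenode = { 1,  0,   nt, tid };
        total += toy_panels_packed( use_prenode ? &prenode : &team, n_panels );
    }
    return total;
}
```

With 4 threads and 8 micropanels, `toy_total_packed( 4, 8, 0 )` counts 32 packs (every thread packs every panel), while `toy_total_packed( 4, 8, 1 )` counts 8 (each panel is packed once).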
fgvanzee committed Nov 20, 2023
1 parent 751d0a1 commit d7c32a2
Showing 215 changed files with 5,166 additions and 11,099 deletions.
10 changes: 5 additions & 5 deletions addon/gemmd/attic/bao_gemmd_bp_var2.c
@@ -386,8 +386,8 @@ void PASTECH2(bao_,ch,varname) \
/* Query the number of threads and thread ids for the JR loop.
NOTE: These values are only needed when computing the next
micropanel of B. */ \
-const dim_t jr_nt = bli_thread_n_way( thread_jr ); \
-const dim_t jr_tid = bli_thread_work_id( thread_jr ); \
+const dim_t jr_nt = bli_thrinfo_n_way( thread_jr ); \
+const dim_t jr_tid = bli_thrinfo_work_id( thread_jr ); \
\
/* Compute number of primary and leftover components of the JR loop. */ \
dim_t jr_iter = ( nc_cur + NR - 1 ) / NR; \
@@ -416,8 +416,8 @@ void PASTECH2(bao_,ch,varname) \
/* Query the number of threads and thread ids for the IR loop.
NOTE: These values are only needed when computing the next
micropanel of A. */ \
-const dim_t ir_nt = bli_thread_n_way( thread_ir ); \
-const dim_t ir_tid = bli_thread_work_id( thread_ir ); \
+const dim_t ir_nt = bli_thrinfo_n_way( thread_ir ); \
+const dim_t ir_tid = bli_thrinfo_work_id( thread_ir ); \
\
/* Compute number of primary and leftover components of the IR loop. */ \
dim_t ir_iter = ( mc_cur + MR - 1 ) / MR; \
@@ -476,7 +476,7 @@ void PASTECH2(bao_,ch,varname) \
/* This barrier is needed to prevent threads from starting to pack
the next row panel of B before the current row panel is fully
computed upon. */ \
-bli_thread_barrier( thread_pb ); \
+bli_thrinfo_barrier( thread_pb ); \
} \
} \
\
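
The JR and IR hunks above count loop iterations with the usual integer ceiling-division idiom, `( dim + b - 1 ) / b`. A minimal sketch of that arithmetic follows. The helper names are hypothetical, plain `int` is used instead of `dim_t` so the block stands alone, and the leftover computation is an assumption, since the hunks do not show it.

```c
#include <assert.h>

/* Number of b-sized micropanels needed to cover dim elements, i.e.
   ceil(dim / b), using the same integer idiom as the loops above. */
static int toy_num_iters( int dim, int b )
{
    assert( dim >= 0 && b > 0 );
    return ( dim + b - 1 ) / b;
}

/* Size of the final, partial micropanel, or 0 if b divides dim evenly.
   This helper is an assumption of the sketch; the hunks above do not
   show how the leftover is computed. */
static int toy_leftover( int dim, int b )
{
    assert( dim >= 0 && b > 0 );
    return dim % b;
}
```

For example, with `nc_cur = 100` and `NR = 6`, `toy_num_iters( 100, 6 )` is 17 and `toy_leftover( 100, 6 )` is 4: sixteen full micropanels plus one partial micropanel of width 4.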
10 changes: 5 additions & 5 deletions addon/gemmd/bao_gemmd_bp_var1.c
@@ -370,8 +370,8 @@ void PASTECH2(bao_,ch,varname) \
/* Query the number of threads and thread ids for the JR loop.
NOTE: These values are only needed when computing the next
micropanel of B. */ \
-const dim_t jr_nt = bli_thread_n_way( thread_jr ); \
-const dim_t jr_tid = bli_thread_work_id( thread_jr ); \
+const dim_t jr_nt = bli_thrinfo_n_way( thread_jr ); \
+const dim_t jr_tid = bli_thrinfo_work_id( thread_jr ); \
\
/* Compute number of primary and leftover components of the JR loop. */ \
dim_t jr_iter = ( nc_cur + NR - 1 ) / NR; \
@@ -400,8 +400,8 @@ void PASTECH2(bao_,ch,varname) \
/* Query the number of threads and thread ids for the IR loop.
NOTE: These values are only needed when computing the next
micropanel of A. */ \
-const dim_t ir_nt = bli_thread_n_way( thread_ir ); \
-const dim_t ir_tid = bli_thread_work_id( thread_ir ); \
+const dim_t ir_nt = bli_thrinfo_n_way( thread_ir ); \
+const dim_t ir_tid = bli_thrinfo_work_id( thread_ir ); \
\
/* Compute number of primary and leftover components of the IR loop. */ \
dim_t ir_iter = ( mc_cur + MR - 1 ) / MR; \
@@ -458,7 +458,7 @@ void PASTECH2(bao_,ch,varname) \
/* This barrier is needed to prevent threads from starting to pack
the next row panel of B before the current row panel is fully
computed upon. */ \
-bli_thread_barrier( rntm, thread_pb ); \
+bli_thrinfo_barrier( thread_pb ); \
} \
} \
\
10 changes: 5 additions & 5 deletions addon/gemmd/bao_l3_packm_a.c
@@ -61,7 +61,7 @@ void PASTECH2(bao_,ch,opname) \
\
/* Barrier to make sure all threads are caught up and ready to begin the
packm stage. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
\
/* Compute the size of the memory block needed. */ \
siz_t size_needed = sizeof( ctype ) * m_pack * k_pack; \
@@ -90,7 +90,7 @@ void PASTECH2(bao_,ch,opname) \
\
/* Broadcast the address of the chief thread's passed-in mem_t to all
threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
\
/* Non-chief threads: Copy the contents of the chief thread's
passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -139,7 +139,7 @@ void PASTECH2(bao_,ch,opname) \
\
/* Broadcast the address of the chief thread's passed-in mem_t
to all threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
\
/* Non-chief threads: Copy the contents of the chief thread's
passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -313,13 +313,13 @@ void PASTECH2(bao_,ch,opname) \
d, incd, \
a, rs_a, cs_a, \
*p, *rs_p, *cs_p, \
-pd_p, *ps_p, \
+pd_p, *ps_p, \
cntx, \
thread \
); \
\
/* Barrier so that packing is done before computation. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
}

//INSERT_GENTFUNC_BASIC0( packm_a )
10 changes: 5 additions & 5 deletions addon/gemmd/bao_l3_packm_b.c
@@ -61,7 +61,7 @@ void PASTECH2(bao_,ch,opname) \
\
/* Barrier to make sure all threads are caught up and ready to begin the
packm stage. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
\
/* Compute the size of the memory block needed. */ \
siz_t size_needed = sizeof( ctype ) * k_pack * n_pack; \
@@ -90,7 +90,7 @@ void PASTECH2(bao_,ch,opname) \
\
/* Broadcast the address of the chief thread's passed-in mem_t to all
threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
\
/* Non-chief threads: Copy the contents of the chief thread's
passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -139,7 +139,7 @@ void PASTECH2(bao_,ch,opname) \
\
/* Broadcast the address of the chief thread's passed-in mem_t
to all threads. */ \
-mem_t* mem_p = bli_thread_broadcast( rntm, thread, mem ); \
+mem_t* mem_p = bli_thrinfo_broadcast( thread, mem ); \
\
/* Non-chief threads: Copy the contents of the chief thread's
passed-in mem_t to the passed-in mem_t for this thread. (The
@@ -313,13 +313,13 @@ void PASTECH2(bao_,ch,opname) \
d, incd, \
b, rs_b, cs_b, \
*p, *rs_p, *cs_p, \
-pd_p, *ps_p, \
+pd_p, *ps_p, \
cntx, \
thread \
); \
\
/* Barrier so that packing is done before computation. */ \
-bli_thread_barrier( rntm, thread ); \
+bli_thrinfo_barrier( thread ); \
}

//INSERT_GENTFUNC_BASIC0( packm_b )
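
For out-of-tree code that still uses the old spellings, the rename shown in the hunks above could be bridged with a small shim. This is only a sketch: the `thrinfo_t` and `rntm_t` definitions and the function bodies below are toy placeholders so the example compiles without BLIS, and the macros are hypothetical (the commit itself does not provide them).

```c
#include <assert.h>

/* Toy stand-ins so the sketch compiles without BLIS. In real code these
   would come from blis.h; the fields and bodies here are placeholders. */
typedef struct { int n_way; int work_id; int barriers_hit; } thrinfo_t;
typedef struct { int unused; } rntm_t;

static int  bli_thrinfo_n_way( const thrinfo_t* t )   { return t->n_way; }
static int  bli_thrinfo_work_id( const thrinfo_t* t ) { return t->work_id; }
static void bli_thrinfo_barrier( thrinfo_t* t )       { t->barriers_hit++; }

/* Hypothetical shim: map the old spellings onto the new names. The rntm
   argument is evaluated and discarded, since the new barrier no longer
   takes it. */
#define bli_thread_n_way( t )         bli_thrinfo_n_way( t )
#define bli_thread_work_id( t )       bli_thrinfo_work_id( t )
#define bli_thread_barrier( rntm, t ) ( ( void )( rntm ), bli_thrinfo_barrier( t ) )
```

Note that the attic file in this commit called the old barrier with a single argument, so a fixed two-argument macro like this one would not cover every historical call site.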
4 changes: 2 additions & 2 deletions addon/gemmd/bao_l3_packm_var1.c
@@ -127,8 +127,8 @@ void PASTECH2(bao_,ch,varname) \
\
/* Query the number of threads and thread ids from the current thread's
packm thrinfo_t node. */ \
-const dim_t nt = bli_thread_n_way( thread ); \
-const dim_t tid = bli_thread_work_id( thread ); \
+const dim_t nt = bli_thrinfo_n_way( thread ); \
+const dim_t tid = bli_thrinfo_work_id( thread ); \
\
/* Suppress warnings in case tid isn't used (ie: as in slab partitioning). */ \
( void )nt; \
4 changes: 2 additions & 2 deletions addon/gemmd/bao_l3_packm_var2.c
@@ -127,8 +127,8 @@ void PASTECH2(bao_,ch,varname) \
\
/* Query the number of threads and thread ids from the current thread's
packm thrinfo_t node. */ \
-const dim_t nt = bli_thread_n_way( thread ); \
-const dim_t tid = bli_thread_work_id( thread ); \
+const dim_t nt = bli_thrinfo_n_way( thread ); \
+const dim_t tid = bli_thrinfo_work_id( thread ); \
\
/* Suppress warnings in case tid isn't used (ie: as in slab partitioning). */ \
( void )nt; \