Changes to thrinfo_t. #673
devinamatthews commented Oct 5, 2022
- Move pba and sba pool pointers into the thrinfo_t from the rntm_t. This means we can treat the rntm_t as const (except where the ways of threading have to be modified after entry).
- Move the mem_t member from cntl_t to the thrinfo_t. This means we can treat the control tree as const during execution.
- Construct the entire thread control tree in the thread decorator. This means we can avoid the extra complexity of growing the tree as we go, and that the rntm_t is no longer needed below the level of the thread decorator.
- Slightly change the semantics of the thrinfo_t members to better match TCI to ease a future port. The only real practical ramifications of this are that 1) there is now an "extra" node at the beginning of the control tree which represents the initial thread communicator state, and 2) there is some complication with the TRSM pre-node: the splitting of threads in the IC loop doesn't happen for the pre-node (TRSM part), but the split in the control and thread info trees happens below this level. To fix this we construct the first node of the pre-node sub-tree from the KC loop thread communicator and shift any IC parallelism to the JR loop.
- Constructing the thread info tree for GEMMSUP is still rather unsatisfactory given that the threading strategy can change (requiring a re-build of the thread info tree) after the thread decorator due to transposition and/or selection of panel-block vs. block-panel algorithms. It would be nice to make these decisions earlier so that we can construct the tree only once.
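As a rough illustration of building the whole thread info tree up front from a read-only runtime object, here is a minimal sketch. The types `runtime_t` and `thr_node_t`, the loop order, and the functions `thr_tree_create` and `thr_tree_free` are hypothetical stand-ins invented for this example, not the actual `rntm_t`/`thrinfo_t` API. The root node plays the role of the "extra" node holding the initial communicator state.

```c
#include <assert.h>
#include <stdlib.h>

#define N_LOOPS 5 /* hypothetical loop order: JC, PC, IC, JR, IR */

/* Hypothetical stand-in for the runtime object; only read, never modified. */
typedef struct {
    int ways[N_LOOPS]; /* ways of parallelism per loop, each >= 1 */
} runtime_t;

/* Hypothetical stand-in for one thread-info node. */
typedef struct thr_node {
    int n_way;   /* number of groups the parent communicator was split into */
    int work_id; /* which group this thread belongs to */
    int n_comm;  /* threads in this node's communicator */
    int comm_id; /* this thread's rank within that communicator */
    struct thr_node *child;
} thr_node_t;

void thr_tree_free(thr_node_t *node)
{
    while (node) {
        thr_node_t *next = node->child;
        free(node);
        node = next;
    }
}

/* Build the full chain of nodes for global thread id `tid` in one pass. */
thr_node_t *thr_tree_create(const runtime_t *rt, int tid)
{
    int n_comm = 1;
    for (int i = 0; i < N_LOOPS; i++) {
        assert(rt->ways[i] >= 1);
        n_comm *= rt->ways[i];
    }
    assert(tid >= 0 && tid < n_comm);
    int comm_id = tid;

    /* Root: the initial communicator containing every thread, not yet split. */
    thr_node_t *root = calloc(1, sizeof *root);
    if (!root) return NULL;
    root->n_way = 1;
    root->n_comm = n_comm;
    root->comm_id = comm_id;

    thr_node_t *parent = root;
    for (int i = 0; i < N_LOOPS; i++) {
        thr_node_t *node = calloc(1, sizeof *node);
        if (!node) { thr_tree_free(root); return NULL; }
        n_comm /= rt->ways[i];            /* size of each sub-communicator */
        node->n_way = rt->ways[i];
        node->work_id = comm_id / n_comm; /* group index within the split */
        comm_id %= n_comm;                /* rank inside the chosen group */
        node->n_comm = n_comm;
        node->comm_id = comm_id;
        parent->child = node;
        parent = node;
    }
    return root;
}
```

Because the function takes `const runtime_t *`, nothing below the tree-construction step needs write access to the runtime object in this sketch.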
These are defined in sub-configuration-specific header files, which are only included by reference kernels.
All kernels have been combined into a single array (level-1v/1f, (un)packm, level-3, and sup), and similarly with preferences (only ukr row-storage preferences for now) and block sizes (which now include sup thresholds and block sizes). These changes are necessary for future support of user-defined kernels. The context initialization functions used by bli_cntx_init_* have also been reworked to use a sentinel instead of an explicit count in order to prevent errors. Note that mostly these changes make the cntx_t code oblivious to BLAS level, but some l3-specific functions remain for compatibility.
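The sentinel idea can be sketched with a small variadic function. Everything here (`context_t`, the `KER_*` ids, `context_set_kernels`, and the placeholder kernels) is hypothetical and only meant to show the pattern of ending the argument list with a sentinel rather than passing a count that can drift out of sync with the arguments.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

typedef int (*kernel_fn)(void);

/* Hypothetical kernel ids; KER_END is the sentinel that ends the list. */
enum { KER_ADDV = 0, KER_DOTV, KER_GEMM, KER_COUNT, KER_END = -1 };

/* Hypothetical context: one flat array of kernels, indexed by id. */
typedef struct {
    kernel_fn kers[KER_COUNT];
} context_t;

/* Placeholder kernels for the example. */
int ref_addv(void) { return 1; }
int ref_gemm(void) { return 3; }

/* Arguments after `cx` are (id, function) pairs followed by KER_END.
 * Returns the number of kernels registered, or -1 on an invalid id. */
int context_set_kernels(context_t *cx, ...)
{
    assert(cx != NULL);
    va_list ap;
    int n = 0;

    va_start(ap, cx);
    for (;;) {
        int id = va_arg(ap, int);
        if (id == KER_END) break;          /* sentinel reached: stop reading */
        kernel_fn fn = va_arg(ap, kernel_fn);
        if (id < 0 || id >= KER_COUNT) { n = -1; break; }
        cx->kers[id] = fn;
        n++;
    }
    va_end(ap);
    return n;
}
```

A call then looks like `context_set_kernels(&cx, KER_GEMM, ref_gemm, KER_ADDV, ref_addv, KER_END);`. Adding or removing a pair needs no separate count to be updated.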
1. The generic gemm kernel breaks on armsve because there is no compile-time MR/NR. The reference gemm kernel has been modified to detect this and fall back to a "dumb" version. 2. For some reason, adding an optimization for writing back full microtiles in row-major storage to the reference gemm kernel results in a segfault for armv7a/gcc-9.3. I can't tell if I'm doing something wrong or if there is a compiler bug. This optimization has been removed for the time being.
…vailable as macros. The array of reference packing kernels (0--31) is replaced by exactly two kernels for each config/datatype combination, one to pack MRxK micropanels and one to pack NRxK micropanels. *IMPORTANT*: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). The replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.
This change also includes a new level-0 macro: set0s_edge, which helps to simplify the packm kernels.
- bli_packm_struc_cxk has been completely rewritten to combine nat/1m execution and use a special packing kernel for diagonal blocks. - *all* reference kernels now respect broadcast packing for A and/or B. This works for all l3 operations (even trsm!) and with 1m.
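To make the broadcast ("bb") packing idea concrete, here is a minimal sketch of packing one MRxK micropanel in which every element is replicated `bb` times and rows past the edge are zero-filled. The function name `packm_bcast`, its signature, and the exact packed layout are assumptions made for illustration; they are not taken from the reference kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Pack an m x k block of `a` (row stride rs, column stride cs) into `p`
 * as a micropanel with mr rows per column, each element repeated bb times.
 * Assumed layout: element (i, l) occupies p[(l*mr + i)*bb .. +bb-1].
 * Rows m..mr-1 are filled with zeros (the edge case). */
void packm_bcast(size_t mr, size_t bb, size_t m, size_t k,
                 const double *a, size_t rs, size_t cs, double *p)
{
    assert(m <= mr && bb >= 1);
    for (size_t l = 0; l < k; l++) {
        for (size_t i = 0; i < mr; i++) {
            /* Zero-fill the edge rows so the microkernel can ignore m. */
            double v = (i < m) ? a[i * rs + l * cs] : 0.0;
            for (size_t r = 0; r < bb; r++)
                p[(l * mr + i) * bb + r] = v;
        }
    }
}
```

With `bb == 1` this reduces to ordinary micropanel packing, which is why a single kernel can cover both cases in this sketch.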
# Conflicts: # ref_kernels/3/bli_gemmtrsm_ref.c # ref_kernels/ind/bli_gemmtrsm1m_ref.c
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions.
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions. [ci skip]
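The failure mode is easy to reproduce in isolation. The sketch below uses hypothetical names (`method_t`, `select_ukr_buggy`, `select_ukr_fixed`, and two placeholder microkernels); it is not the code that was fixed. Recent versions of GCC and Clang can flag the buggy variant when compiled with `-Wimplicit-fallthrough`.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { METHOD_NAT = 0, METHOD_1M } method_t;
typedef int (*ukr_fn)(void);

/* Placeholder microkernels; the return value identifies which one ran. */
int gemm_nat_ukr(void) { return 0; }
int gemm_1m_ukr(void)  { return 1; }

/* Buggy: without `break`, METHOD_1M falls through and is overwritten. */
ukr_fn select_ukr_buggy(method_t m)
{
    ukr_fn f = NULL;
    switch (m) {
    case METHOD_1M:
        f = gemm_1m_ukr;
    case METHOD_NAT:
        f = gemm_nat_ukr;
    }
    return f;
}

/* Fixed: each case ends with `break`, so the selection sticks. */
ukr_fn select_ukr_fixed(method_t m)
{
    ukr_fn f = NULL;
    switch (m) {
    case METHOD_1M:
        f = gemm_1m_ukr;
        break;
    case METHOD_NAT:
        f = gemm_nat_ukr;
        break;
    default:
        assert(0 && "unknown method");
        break;
    }
    return f;
}
```

Calling `select_ukr_buggy(METHOD_1M)` silently returns the native kernel, which is the kind of "works, but not with the intended kernel" behavior described above.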
Beta (as the scalar attached to C) was not seen as reset to 1 after the first iteration of the pc loop, as the wrong pointer was passed to bli_gemm_int.
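Why beta has to become 1 after the first pass over the k dimension can be seen in a small standalone sketch. The function `gemm_kblocked` is a hypothetical, unoptimized stand-in for the pc loop, not `bli_gemm_int`: it computes C = beta*C + A*B with k split into blocks of size `kc`.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) = beta * C + A (m x k) * B (k x n), all row-major.
 * The k dimension is processed in blocks of kc, like a pc loop.
 * Assumes k > 0 so that beta is applied exactly once. */
void gemm_kblocked(size_t m, size_t n, size_t k, size_t kc,
                   const double *a, const double *b,
                   double beta, double *c)
{
    assert(k > 0 && kc > 0);
    double beta_use = beta; /* scalar applied to C in the current block */

    for (size_t p0 = 0; p0 < k; p0 += kc) {
        size_t kb = (k - p0 < kc) ? (k - p0) : kc;
        for (size_t i = 0; i < m; i++) {
            for (size_t j = 0; j < n; j++) {
                double acc = 0.0;
                for (size_t p = p0; p < p0 + kb; p++)
                    acc += a[i * k + p] * b[p * n + j];
                c[i * n + j] = beta_use * c[i * n + j] + acc;
            }
        }
        /* Later blocks accumulate into C; scaling again would be wrong. */
        beta_use = 1.0;
    }
}
```

If `beta_use` were not reset, a call with `beta = 0.5` and two k-blocks would scale the first block's contribution by 0.5 as well, giving a wrong result whenever beta is not 1.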
# Conflicts: # config/a64fx/bli_cntx_init_a64fx.c # config/armsve/bli_cntx_init_armsve.c # config/bgq/bli_cntx_init_bgq.c # config/bulldozer/bli_cntx_init_bulldozer.c # config/cortexa15/bli_cntx_init_cortexa15.c # config/cortexa53/bli_cntx_init_cortexa53.c # config/cortexa57/bli_cntx_init_cortexa57.c # config/cortexa9/bli_cntx_init_cortexa9.c # config/excavator/bli_cntx_init_excavator.c # config/firestorm/bli_cntx_init_firestorm.c # config/haswell/bli_cntx_init_haswell.c # config/knc/bli_cntx_init_knc.c # config/knl/bli_cntx_init_knl.c # config/penryn/bli_cntx_init_penryn.c # config/piledriver/bli_cntx_init_piledriver.c # config/power10/bli_cntx_init_power10.c # config/power7/bli_cntx_init_power7.c # config/power9/bli_cntx_init_power9.c # config/sandybridge/bli_cntx_init_sandybridge.c # config/skx/bli_cntx_init_skx.c # config/steamroller/bli_cntx_init_steamroller.c # config/template/bli_cntx_init_template.c # config/thunderx2/bli_cntx_init_thunderx2.c # config/zen/bli_cntx_init_zen.c # config/zen2/bli_cntx_init_zen2.c # config/zen3/bli_cntx_init_zen3.c # frame/0/bli_l0_check.h # frame/0/bli_l0_oapi.c # frame/0/bli_l0_oapi.h # frame/0/bli_l0_tapi.h # frame/0/copysc/bli_copysc.c # frame/1/bli_l1v_oapi.h # frame/1/bli_l1v_tapi.c # frame/1/bli_l1v_tapi.h # frame/1d/bli_l1d_ft.h # frame/1d/bli_l1d_oapi.c # frame/1d/bli_l1d_oapi.h # frame/1d/bli_l1d_tapi.c # frame/1d/bli_l1d_tapi.h # frame/1f/bli_l1f_check.c # frame/1f/bli_l1f_check.h # frame/1f/bli_l1f_ft.h # frame/1f/bli_l1f_oapi.c # frame/1f/bli_l1f_oapi.h # frame/1f/bli_l1f_tapi.c # frame/1f/bli_l1f_tapi.h # frame/1m/bli_l1m_ft.h # frame/1m/bli_l1m_oapi.c # frame/1m/bli_l1m_oapi.h # frame/1m/bli_l1m_oft_var.h # frame/1m/bli_l1m_tapi.c # frame/1m/bli_l1m_tapi.h # frame/1m/packm/bli_packm_alloc.c # frame/1m/packm/bli_packm_alloc.h # frame/1m/packm/bli_packm_blk_var1.c # frame/1m/packm/bli_packm_blk_var1.h # frame/1m/packm/bli_packm_cntl.h # frame/1m/packm/bli_packm_init.c # frame/1m/packm/bli_packm_init.h # 
frame/1m/packm/bli_packm_int.c # frame/1m/packm/bli_packm_int.h # frame/1m/unpackm/bli_unpackm_blk_var1.c # frame/1m/unpackm/bli_unpackm_int.c # frame/2/bli_l2_check.c # frame/2/bli_l2_check.h # frame/2/bli_l2_ft.h # frame/2/bli_l2_oapi.c # frame/2/bli_l2_oapi.h # frame/2/bli_l2_tapi.c # frame/2/bli_l2_tapi.h # frame/3/bli_l3_blocksize.c # frame/3/bli_l3_blocksize.h # frame/3/bli_l3_cntl.c # frame/3/bli_l3_direct.h # frame/3/bli_l3_int.c # frame/3/bli_l3_int.h # frame/3/bli_l3_oapi.c # frame/3/bli_l3_oapi.h # frame/3/bli_l3_oapi_ex.c # frame/3/bli_l3_oapi_ex.h # frame/3/bli_l3_oft.h # frame/3/bli_l3_oft_var.h # frame/3/bli_l3_packab.c # frame/3/bli_l3_packab.h # frame/3/bli_l3_sup.c # frame/3/bli_l3_sup.h # frame/3/bli_l3_sup_oft.h # frame/3/bli_l3_sup_packm_a.c # frame/3/bli_l3_sup_packm_a.h # frame/3/bli_l3_sup_packm_b.c # frame/3/bli_l3_sup_packm_b.h # frame/3/bli_l3_sup_packm_var.c # frame/3/bli_l3_sup_packm_var.h # frame/3/bli_l3_sup_var1n2m.c # frame/3/bli_l3_sup_vars.h # frame/3/bli_l3_tapi_ex.c # frame/3/bli_l3_tapi_ex.h # frame/3/gemm/bli_gemm_blk_var1.c # frame/3/gemm/bli_gemm_blk_var2.c # frame/3/gemm/bli_gemm_blk_var3.c # frame/3/gemm/bli_gemm_front.c # frame/3/gemm/bli_gemm_front.h # frame/3/gemm/bli_gemm_ker_var2.c # frame/3/gemm/bli_gemm_md.c # frame/3/gemm/bli_gemm_md.h # frame/3/gemm/bli_gemm_var.h # frame/3/gemmt/bli_gemmt_front.c # frame/3/gemmt/bli_gemmt_front.h # frame/3/gemmt/bli_gemmt_l_ker_var2.c # frame/3/gemmt/bli_gemmt_u_ker_var2.c # frame/3/gemmt/bli_gemmt_var.h # frame/3/gemmt/bli_gemmt_x_ker_var2.c # frame/3/hemm/bli_hemm_front.c # frame/3/hemm/bli_hemm_front.h # frame/3/symm/bli_symm_front.c # frame/3/symm/bli_symm_front.h # frame/3/trmm/bli_trmm_front.c # frame/3/trmm/bli_trmm_front.h # frame/3/trmm/bli_trmm_ll_ker_var2.c # frame/3/trmm/bli_trmm_lu_ker_var2.c # frame/3/trmm/bli_trmm_rl_ker_var2.c # frame/3/trmm/bli_trmm_ru_ker_var2.c # frame/3/trmm/bli_trmm_var.h # frame/3/trmm/bli_trmm_xx_ker_var2.c # frame/3/trmm3/bli_trmm3_front.c 
# frame/3/trmm3/bli_trmm3_front.h # frame/3/trsm/bli_trsm_blk_var1.c # frame/3/trsm/bli_trsm_blk_var2.c # frame/3/trsm/bli_trsm_blk_var3.c # frame/3/trsm/bli_trsm_front.c # frame/3/trsm/bli_trsm_front.h # frame/3/trsm/bli_trsm_ll_ker_var2.c # frame/3/trsm/bli_trsm_lu_ker_var2.c # frame/3/trsm/bli_trsm_rl_ker_var2.c # frame/3/trsm/bli_trsm_ru_ker_var2.c # frame/3/trsm/bli_trsm_var.h # frame/3/trsm/bli_trsm_xx_ker_var2.c # frame/base/bli_blksz.c # frame/base/bli_blksz.h # frame/base/bli_cntl.h # frame/base/bli_cntx.c # frame/base/bli_cntx.h # frame/base/bli_env.c # frame/base/bli_gks.c # frame/base/bli_gks.h # frame/base/bli_ind.h # frame/base/bli_info.c # frame/base/bli_obj_scalar.c # frame/base/bli_obj_scalar.h # frame/base/bli_pba.c # frame/base/bli_rntm.h # frame/base/bli_sba.c # frame/base/bli_sba.h # frame/base/bli_setgetijm.c # frame/base/check/bli_obj_check.c # frame/base/check/bli_obj_check.h # frame/include/bli_oapi_ex.h # frame/include/bli_obj_macro_defs.h # frame/include/bli_tapi_ex.h # frame/include/bli_type_defs.h # frame/thread/bli_l3_decor.h # frame/thread/bli_l3_decor_openmp.c # frame/thread/bli_l3_decor_pthreads.c # frame/thread/bli_l3_decor_single.c # frame/thread/bli_l3_sup_decor.h # frame/thread/bli_l3_sup_decor_openmp.c # frame/thread/bli_l3_sup_decor_pthreads.c # frame/thread/bli_l3_sup_decor_single.c # frame/thread/bli_thread.c # frame/thread/bli_thread.h # frame/thread/bli_thrinfo.c # frame/thread/bli_thrinfo.h # frame/thread/bli_thrinfo_sup.c # frame/util/bli_util_check.c # frame/util/bli_util_check.h # frame/util/bli_util_oapi.c # frame/util/bli_util_oapi.h # kernels/zen/1/bli_copyv_zen_int.c # kernels/zen/1/bli_scalv_zen_int10.c # kernels/zen/1f/bli_axpyf_zen_int_4.c # kernels/zen/1f/bli_axpyf_zen_int_5.c # ref_kernels/1m/bli_packm_cxk_1er_ref.c # ref_kernels/3/bli_gemm_ref.c # ref_kernels/3/bli_gemmtrsm_ref.c # ref_kernels/bli_cntx_ref.c # ref_kernels/ind/bli_gemm1m_ref.c # ref_kernels/ind/bli_trsm1m_ref.c # testsuite/src/test_libblis.c
This enables better debugging since errors will show up based on the un-flattened filename and line number.
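One standard C mechanism that gives this effect is the `#line` directive, which tells the compiler which file name and line number to report for the code that follows. Whether the flattening script uses exactly this mechanism is an assumption here, and the path below is made up for the example.

```c
#include <assert.h>
#include <string.h>

int where_line(void);
const char *where_file(void);
void check_origin(void);

/* Pretend the next line came from line 120 of another (hypothetical) header. */
#line 120 "frame/hypothetical/original.h"
int where_line(void) { return __LINE__; }
const char *where_file(void) { return __FILE__; }

/* Diagnostics and macros now report the overridden location. */
void check_origin(void)
{
    assert(where_line() == 120);
    assert(strcmp(where_file(), "frame/hypothetical/original.h") == 0);
}
```

Compiler errors and debugger locations for code after the directive refer to the named file and line, rather than to a position in the single flattened header.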
# Conflicts: # build/flatten-headers.py # frame/3/bli_l3_sup_var1n2m.c
# Conflicts: # build/flatten-headers.py # frame/3/bli_l3_sup_packm.c # frame/3/bli_l3_sup_packm.h # frame/3/bli_l3_sup_packm_var.c # frame/3/bli_l3_sup_packm_var.h # frame/3/bli_l3_sup_var1n2m.c # frame/3/gemmt/bli_gemmt_front.c
1. Add a check for pool exhaustion when freeing blocks. This detects double-free and other bad conditions without segfault. 2. Make sure to copy *all* block pointers when growing the pool size. Previously, checked-out block pointers were not copied, leading to the presence of uninitialized data.
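Both points can be illustrated with a small stack-style block pool. This is a sketch under stated assumptions: `pool_t` and the `pool_*` functions are hypothetical and much simpler than the real pool code, but they show a check-in that refuses to run when nothing is checked out, and a grow step that copies every slot of the old pointer array, including the slots of checked-out blocks.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Slots [0, top) belong to checked-out blocks (stored as NULL);
 * slots [top, num) hold blocks that are available. */
typedef struct {
    void **blocks;
    size_t top, num, block_size;
} pool_t;

int pool_grow(pool_t *p, size_t extra)
{
    size_t new_num = p->num + extra;
    void **nb = malloc(new_num * sizeof *nb);
    if (!nb) return -1;
    /* Copy *all* old slots, not only the available ones, so that no
     * entry of the new array is left uninitialized. */
    if (p->num) memcpy(nb, p->blocks, p->num * sizeof *nb);
    for (size_t i = p->num; i < new_num; i++) {
        nb[i] = malloc(p->block_size);
        if (!nb[i]) {
            while (i > p->num) free(nb[--i]);
            free(nb);
            return -1;
        }
    }
    free(p->blocks);
    p->blocks = nb;
    p->num = new_num;
    return 0;
}

int pool_init(pool_t *p, size_t num, size_t block_size)
{
    assert(num > 0 && block_size > 0);
    p->blocks = NULL;
    p->top = p->num = 0;
    p->block_size = block_size;
    return pool_grow(p, num);
}

void *pool_checkout(pool_t *p)
{
    if (p->top == p->num && pool_grow(p, p->num) != 0) return NULL;
    void *blk = p->blocks[p->top];
    p->blocks[p->top++] = NULL;
    return blk;
}

/* Returns -1 instead of corrupting memory when nothing is checked out,
 * which is what happens on a double free. */
int pool_checkin(pool_t *p, void *blk)
{
    if (p->top == 0) return -1;
    p->blocks[--p->top] = blk;
    return 0;
}

void pool_destroy(pool_t *p)
{
    assert(p->top == 0); /* every block must have been checked back in */
    for (size_t i = 0; i < p->num; i++) free(p->blocks[i]);
    free(p->blocks);
    p->blocks = NULL;
    p->num = 0;
}
```

In this sketch a second `pool_checkin` of the same block after the pool is full again returns `-1`, so the caller gets an error code rather than a segfault.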
This option (disabled by default) enables compiling and linking with the Address Sanitizer library (ASan), via the -fsanitize=address flag supported by clang, gcc, and probably others. This flag is included for all files *except* optimized kernels, since it usually requires an extra register, which violates the constraints for many gemm microkernels.
Reinstate check for checked-out blocks upon finalization. A flag has been added to indicate that the pool is actually under reinitialization (where checked-out blocks are OK), which temporarily disables the check. A memory leak where blocks are not checked back in is now correctly detected upon exit.
# Conflicts: # Makefile # common.mk # configure # frame/3/bli_l3_oapi_ex.c # frame/3/bli_l3_sup_packm.c # frame/3/bli_l3_sup_packm.h # frame/3/bli_l3_sup_ref.c # frame/3/bli_l3_sup_var1n2m.c # frame/base/bli_pool.c # frame/base/bli_rntm.h # frame/thread/bli_l3_decor.h # frame/thread/bli_l3_decor_openmp.c # frame/thread/bli_l3_decor_pthreads.c # frame/thread/bli_l3_decor_single.c # frame/thread/bli_l3_sup_decor.h # frame/thread/bli_l3_sup_decor_openmp.c # frame/thread/bli_l3_sup_decor_pthreads.c # frame/thread/bli_l3_sup_decor_single.c # frame/thread/bli_thrcomm.h # frame/thread/bli_thrcomm_openmp.c # frame/thread/bli_thrcomm_pthreads.c # frame/thread/bli_thrcomm_single.c # frame/thread/bli_thread.c # frame/thread/bli_thrinfo.c # frame/thread/bli_thrinfo.h # frame/thread/bli_thrinfo_sup.c
@fgvanzee I think the Windows build is failing because some of the symbol names have changed. Is there an easy way to regenerate the symbols file?
Very maybe! I'll see what I can come up with.
LOL. Guess what I found? Might need some updates, but it also might work as-is. Lines 35 to 53 in 054a774
Also, update symbols definition file for Windows. It seems this file was quite out-of-date.
Yes, worked like a charm. Should be fixed now.
Oh boy, the sandbox is broken. I guess I'll fix it.
@fgvanzee All fixed now.
@devinamatthews I was able to get non-deterministic
At least that probably answers your question of when the FWIW, I observed both native (
Yes I figured it was a submarine bug. Will fix.