Changes to thrinfo_t. #673

Closed
wants to merge 42 commits into from

devinamatthews
Member

  • Move pba and sba pool pointers into the thrinfo_t from the rntm_t. This means we can treat the rntm_t as const (except where the ways of threading have to be modified after entry).
  • Move the mem_t member from cntl_t to the thrinfo_t. This means we can treat the control tree as const during execution.
  • Construct the entire thread control tree in the thread decorator. This means we can avoid the extra complexity of growing the tree as we go, and that the rntm_t is no longer needed below the level of the thread decorator.
  • Slightly change the semantics of the thrinfo_t members to better match TCI to ease a future port. The only real practical ramifications of this are that 1) there is now an "extra" node at the beginning of the control tree which represents the initial thread communicator state, and 2) there is some complication with the TRSM pre-node: the splitting of threads in the IC loop doesn't happen for the pre-node (TRSM part), but the split in the control and thread info trees happens below this level. To fix this we construct the first node of the pre-node sub-tree from the KC loop thread communicator and shift any IC parallelism to the JR loop.
  • Constructing the thread info tree for GEMMSUP is still rather unsatisfactory given that the threading strategy can change (requiring a re-build of the thread info tree) after the thread decorator due to transposition and/or selection of panel-block vs. block-panel algorithms. It would be nice to make these decisions earlier so that we can construct the tree only once.
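To illustrate the first two bullets, here is a minimal sketch of the general pattern: the control tree is walked through a `const` pointer while every write goes to a per-thread tree. The types `ctl_node` and `thr_node` and the function `walk_trees` are hypothetical, heavily simplified stand-ins invented for this example; they are not the actual `cntl_t` or `thrinfo_t` definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real cntl_t / thrinfo_t. */
typedef struct ctl_node {
    int                    loop_id;   /* which loop this node drives */
    const struct ctl_node *sub_node;  /* next level down, or NULL at a leaf */
} ctl_node;

typedef struct thr_node {
    int              thread_id;  /* id within this level's communicator */
    void            *pack_buf;   /* per-thread buffer handle (mutable state) */
    struct thr_node *sub_node;   /* next level down, or NULL at a leaf */
} thr_node;

/* Walk both trees in lockstep. The control tree is only read; every write
   goes to the thread-info tree. Returns the number of levels visited. */
int walk_trees(const ctl_node *ctl, thr_node *thr, void *buf)
{
    int depth = 0;
    assert(buf != NULL);
    while (ctl != NULL && thr != NULL) {
        thr->pack_buf = buf;   /* a write through ctl would not compile */
        ctl = ctl->sub_node;
        thr = thr->sub_node;
        depth++;
    }
    return depth;
}
```

Because the shared tree is never written during execution, one instance can in principle be read by all threads without synchronization, while each thread owns its own `thr_node` chain.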

devinamatthews and others added 30 commits January 31, 2022 10:47
These are defined in sub-configuration-specific header files, which are only included by reference kernels.
The gemm reference kernel now uses the configuration-dependent BLIS_MR_x/BLIS_NR_x macros to control unrolling, rather than fixed values. This fixes #259 and replaces PR #547.
All kernels have been combined into a single array (level-1v/1f, (un)packm, level-3, and sup), and similarly with preferences (only ukr row-storage preferences for now) and block sizes (which now include sup thresholds and block sizes). These changes are necessary for future support of user-defined kernels. The context initialization functions used by bli_cntx_init_* have also been reworked to use a sentinel instead of an explicit count in order to prevent errors. Note that mostly these changes make the cntx_t code oblivious to BLAS level, but some l3-specific functions remain for compatibility.
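The sentinel idea mentioned above can be sketched as follows. This is only an illustration of the technique, not the BLIS API: `set_kernels`, `KER_ID_END`, `void_fp`, and the two dummy kernels are names made up for this example. The variadic list is read as (id, function pointer) pairs until the sentinel is seen, so there is no separate count that can drift out of sync with the argument list.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

enum { KER_ID_END = -1 };        /* hypothetical sentinel value */

typedef void (*void_fp)(void);   /* generic kernel function pointer */

/* Dummy kernels, present only so the sketch has something to register. */
void kernel_a(void) {}
void kernel_b(void) {}

/* Store (id, function pointer) pairs into table[] until KER_ID_END is read.
   Returns the number of entries stored; out-of-range ids are skipped. */
int set_kernels(void_fp table[], int table_len, ...)
{
    va_list args;
    int count = 0;

    assert(table != NULL);
    va_start(args, table_len);
    for (;;) {
        int id = va_arg(args, int);
        if (id == KER_ID_END) break;           /* sentinel: stop reading */
        void_fp fp = va_arg(args, void_fp);    /* second half of the pair */
        if (id >= 0 && id < table_len) {
            table[id] = fp;
            count++;
        }
    }
    va_end(args);
    return count;
}
```

A call such as `set_kernels(table, 4, 0, kernel_a, 2, kernel_b, KER_ID_END)` registers two kernels. The trade-off is that forgetting the sentinel is undefined behavior, whereas a wrong explicit count silently reads too few or too many arguments.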
1. The generic gemm kernel breaks on armsve because there is no
   compile-time MR/NR. The reference gemm kernel has been modified
   to detect this and fall back to a "dumb" version.
2. For some reason, adding an optimization for writing back full
   microtiles in row-major storage to the reference gemm kernel
   results in a segfault for armv7a/gcc-9.3. I can't tell if I'm
   doing something wrong or if there is a compiler bug. This
   optimization has been removed for the time being.
…vailable as macros.

The array of reference packing kernels (0--31) is replaced by exactly two kernels for each config/datatype combination, one to pack MRxK micropanels and one to pack NRxK micropanels. *IMPORTANT*: the "bb" reference kernels have been merged into the "standard" kernels (packm [incl. 1er and unpackm], gemm, trsm, gemmtrsm). The broadcast replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 need testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro.
This change also includes a new level-0 macro: set0s_edge, which helps to simplify the packm kernels.
- bli_packm_struc_cxk has been completely rewritten to combine nat/1m execution and use a special packing kernel for diagonal blocks.
- *all* reference kernels now respect broadcast packing for A and/or B. This works for all l3 operations (even trsm!) and with 1m.
# Conflicts:
#	ref_kernels/3/bli_gemmtrsm_ref.c
#	ref_kernels/ind/bli_gemmtrsm1m_ref.c
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions.
Due to missing `break`s in a switch statement (warn me, dammit!), the virtual gemm ukernels were not getting set to the optimized versions. [ci skip]
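For readers unfamiliar with this class of bug, here is a minimal sketch of how missing `break`s cause the last assignment in a `switch` to win. The enum values and function names are hypothetical and unrelated to the actual BLIS code that was fixed.

```c
/* Hypothetical datatype and kernel identifiers, for illustration only. */
typedef enum { DT_FLOAT, DT_DOUBLE, DT_OTHER } dtype;
enum { KER_REF = 0, KER_OPT_S = 1, KER_OPT_D = 2 };

/* Buggy: without `break`, control falls through every later case, so the
   assignment under `default` always runs last and overwrites the result. */
int select_kernel_buggy(dtype dt)
{
    int ker = KER_REF;
    switch (dt) {
        case DT_FLOAT:  ker = KER_OPT_S;
        case DT_DOUBLE: ker = KER_OPT_D;
        default:        ker = KER_REF;
    }
    return ker;
}

/* Fixed: each case ends with `break`, so the optimized choice survives. */
int select_kernel_fixed(dtype dt)
{
    int ker = KER_REF;
    switch (dt) {
        case DT_FLOAT:  ker = KER_OPT_S; break;
        case DT_DOUBLE: ker = KER_OPT_D; break;
        default:        ker = KER_REF;   break;
    }
    return ker;
}
```

On the "warn me" point: GCC (version 7 and later) provides `-Wimplicit-fallthrough`, which is enabled by `-Wextra` and should flag the buggy version above.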
Beta (as the scalar attached to C) was not seen as reset to 1 after the first iteration of the pc loop, as the wrong pointer was passed to bli_gemm_int.
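Why beta has to become 1 after the first pc iteration can be shown with a naive k-blocked matrix multiply. This is a simplified sketch under my own assumptions (row-major storage, the hypothetical function `gemm_kblocked`, no packing or microkernels); it is not how `bli_gemm_int` is structured.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) := beta * C + A (m x k) * B (k x n), all row-major.
   The k dimension is processed in blocks of size kc (the "pc loop").
   Sketch only: assumes k > 0 and kc > 0. */
void gemm_kblocked(size_t m, size_t n, size_t k, size_t kc, double beta,
                   const double *a, const double *b, double *c)
{
    double beta_use = beta;   /* scalar applied to C in this iteration */

    assert(k > 0 && kc > 0);
    for (size_t pc = 0; pc < k; pc += kc) {
        size_t kb = (k - pc < kc) ? (k - pc) : kc;   /* current block size */
        for (size_t i = 0; i < m; i++) {
            for (size_t j = 0; j < n; j++) {
                double acc = 0.0;
                for (size_t p = pc; p < pc + kb; p++)
                    acc += a[i * k + p] * b[p * n + j];
                c[i * n + j] = beta_use * c[i * n + j] + acc;
            }
        }
        /* C now already contains beta * C_original plus a partial product,
           so later iterations must accumulate with a scalar of 1. */
        beta_use = 1.0;
    }
}
```

With m = n = 1, k = 4, kc = 2, A = [1 2 3 4], B = [1 1 1 1]^T, C = 10 and beta = 0.5, the correct result is 0.5 * 10 + 10 = 15. Re-applying beta in the second iteration would instead give 0.5 * 8 + 7 = 11.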
# Conflicts:
#	config/a64fx/bli_cntx_init_a64fx.c
#	config/armsve/bli_cntx_init_armsve.c
#	config/bgq/bli_cntx_init_bgq.c
#	config/bulldozer/bli_cntx_init_bulldozer.c
#	config/cortexa15/bli_cntx_init_cortexa15.c
#	config/cortexa53/bli_cntx_init_cortexa53.c
#	config/cortexa57/bli_cntx_init_cortexa57.c
#	config/cortexa9/bli_cntx_init_cortexa9.c
#	config/excavator/bli_cntx_init_excavator.c
#	config/firestorm/bli_cntx_init_firestorm.c
#	config/haswell/bli_cntx_init_haswell.c
#	config/knc/bli_cntx_init_knc.c
#	config/knl/bli_cntx_init_knl.c
#	config/penryn/bli_cntx_init_penryn.c
#	config/piledriver/bli_cntx_init_piledriver.c
#	config/power10/bli_cntx_init_power10.c
#	config/power7/bli_cntx_init_power7.c
#	config/power9/bli_cntx_init_power9.c
#	config/sandybridge/bli_cntx_init_sandybridge.c
#	config/skx/bli_cntx_init_skx.c
#	config/steamroller/bli_cntx_init_steamroller.c
#	config/template/bli_cntx_init_template.c
#	config/thunderx2/bli_cntx_init_thunderx2.c
#	config/zen/bli_cntx_init_zen.c
#	config/zen2/bli_cntx_init_zen2.c
#	config/zen3/bli_cntx_init_zen3.c
#	frame/0/bli_l0_check.h
#	frame/0/bli_l0_oapi.c
#	frame/0/bli_l0_oapi.h
#	frame/0/bli_l0_tapi.h
#	frame/0/copysc/bli_copysc.c
#	frame/1/bli_l1v_oapi.h
#	frame/1/bli_l1v_tapi.c
#	frame/1/bli_l1v_tapi.h
#	frame/1d/bli_l1d_ft.h
#	frame/1d/bli_l1d_oapi.c
#	frame/1d/bli_l1d_oapi.h
#	frame/1d/bli_l1d_tapi.c
#	frame/1d/bli_l1d_tapi.h
#	frame/1f/bli_l1f_check.c
#	frame/1f/bli_l1f_check.h
#	frame/1f/bli_l1f_ft.h
#	frame/1f/bli_l1f_oapi.c
#	frame/1f/bli_l1f_oapi.h
#	frame/1f/bli_l1f_tapi.c
#	frame/1f/bli_l1f_tapi.h
#	frame/1m/bli_l1m_ft.h
#	frame/1m/bli_l1m_oapi.c
#	frame/1m/bli_l1m_oapi.h
#	frame/1m/bli_l1m_oft_var.h
#	frame/1m/bli_l1m_tapi.c
#	frame/1m/bli_l1m_tapi.h
#	frame/1m/packm/bli_packm_alloc.c
#	frame/1m/packm/bli_packm_alloc.h
#	frame/1m/packm/bli_packm_blk_var1.c
#	frame/1m/packm/bli_packm_blk_var1.h
#	frame/1m/packm/bli_packm_cntl.h
#	frame/1m/packm/bli_packm_init.c
#	frame/1m/packm/bli_packm_init.h
#	frame/1m/packm/bli_packm_int.c
#	frame/1m/packm/bli_packm_int.h
#	frame/1m/unpackm/bli_unpackm_blk_var1.c
#	frame/1m/unpackm/bli_unpackm_int.c
#	frame/2/bli_l2_check.c
#	frame/2/bli_l2_check.h
#	frame/2/bli_l2_ft.h
#	frame/2/bli_l2_oapi.c
#	frame/2/bli_l2_oapi.h
#	frame/2/bli_l2_tapi.c
#	frame/2/bli_l2_tapi.h
#	frame/3/bli_l3_blocksize.c
#	frame/3/bli_l3_blocksize.h
#	frame/3/bli_l3_cntl.c
#	frame/3/bli_l3_direct.h
#	frame/3/bli_l3_int.c
#	frame/3/bli_l3_int.h
#	frame/3/bli_l3_oapi.c
#	frame/3/bli_l3_oapi.h
#	frame/3/bli_l3_oapi_ex.c
#	frame/3/bli_l3_oapi_ex.h
#	frame/3/bli_l3_oft.h
#	frame/3/bli_l3_oft_var.h
#	frame/3/bli_l3_packab.c
#	frame/3/bli_l3_packab.h
#	frame/3/bli_l3_sup.c
#	frame/3/bli_l3_sup.h
#	frame/3/bli_l3_sup_oft.h
#	frame/3/bli_l3_sup_packm_a.c
#	frame/3/bli_l3_sup_packm_a.h
#	frame/3/bli_l3_sup_packm_b.c
#	frame/3/bli_l3_sup_packm_b.h
#	frame/3/bli_l3_sup_packm_var.c
#	frame/3/bli_l3_sup_packm_var.h
#	frame/3/bli_l3_sup_var1n2m.c
#	frame/3/bli_l3_sup_vars.h
#	frame/3/bli_l3_tapi_ex.c
#	frame/3/bli_l3_tapi_ex.h
#	frame/3/gemm/bli_gemm_blk_var1.c
#	frame/3/gemm/bli_gemm_blk_var2.c
#	frame/3/gemm/bli_gemm_blk_var3.c
#	frame/3/gemm/bli_gemm_front.c
#	frame/3/gemm/bli_gemm_front.h
#	frame/3/gemm/bli_gemm_ker_var2.c
#	frame/3/gemm/bli_gemm_md.c
#	frame/3/gemm/bli_gemm_md.h
#	frame/3/gemm/bli_gemm_var.h
#	frame/3/gemmt/bli_gemmt_front.c
#	frame/3/gemmt/bli_gemmt_front.h
#	frame/3/gemmt/bli_gemmt_l_ker_var2.c
#	frame/3/gemmt/bli_gemmt_u_ker_var2.c
#	frame/3/gemmt/bli_gemmt_var.h
#	frame/3/gemmt/bli_gemmt_x_ker_var2.c
#	frame/3/hemm/bli_hemm_front.c
#	frame/3/hemm/bli_hemm_front.h
#	frame/3/symm/bli_symm_front.c
#	frame/3/symm/bli_symm_front.h
#	frame/3/trmm/bli_trmm_front.c
#	frame/3/trmm/bli_trmm_front.h
#	frame/3/trmm/bli_trmm_ll_ker_var2.c
#	frame/3/trmm/bli_trmm_lu_ker_var2.c
#	frame/3/trmm/bli_trmm_rl_ker_var2.c
#	frame/3/trmm/bli_trmm_ru_ker_var2.c
#	frame/3/trmm/bli_trmm_var.h
#	frame/3/trmm/bli_trmm_xx_ker_var2.c
#	frame/3/trmm3/bli_trmm3_front.c
#	frame/3/trmm3/bli_trmm3_front.h
#	frame/3/trsm/bli_trsm_blk_var1.c
#	frame/3/trsm/bli_trsm_blk_var2.c
#	frame/3/trsm/bli_trsm_blk_var3.c
#	frame/3/trsm/bli_trsm_front.c
#	frame/3/trsm/bli_trsm_front.h
#	frame/3/trsm/bli_trsm_ll_ker_var2.c
#	frame/3/trsm/bli_trsm_lu_ker_var2.c
#	frame/3/trsm/bli_trsm_rl_ker_var2.c
#	frame/3/trsm/bli_trsm_ru_ker_var2.c
#	frame/3/trsm/bli_trsm_var.h
#	frame/3/trsm/bli_trsm_xx_ker_var2.c
#	frame/base/bli_blksz.c
#	frame/base/bli_blksz.h
#	frame/base/bli_cntl.h
#	frame/base/bli_cntx.c
#	frame/base/bli_cntx.h
#	frame/base/bli_env.c
#	frame/base/bli_gks.c
#	frame/base/bli_gks.h
#	frame/base/bli_ind.h
#	frame/base/bli_info.c
#	frame/base/bli_obj_scalar.c
#	frame/base/bli_obj_scalar.h
#	frame/base/bli_pba.c
#	frame/base/bli_rntm.h
#	frame/base/bli_sba.c
#	frame/base/bli_sba.h
#	frame/base/bli_setgetijm.c
#	frame/base/check/bli_obj_check.c
#	frame/base/check/bli_obj_check.h
#	frame/include/bli_oapi_ex.h
#	frame/include/bli_obj_macro_defs.h
#	frame/include/bli_tapi_ex.h
#	frame/include/bli_type_defs.h
#	frame/thread/bli_l3_decor.h
#	frame/thread/bli_l3_decor_openmp.c
#	frame/thread/bli_l3_decor_pthreads.c
#	frame/thread/bli_l3_decor_single.c
#	frame/thread/bli_l3_sup_decor.h
#	frame/thread/bli_l3_sup_decor_openmp.c
#	frame/thread/bli_l3_sup_decor_pthreads.c
#	frame/thread/bli_l3_sup_decor_single.c
#	frame/thread/bli_thread.c
#	frame/thread/bli_thread.h
#	frame/thread/bli_thrinfo.c
#	frame/thread/bli_thrinfo.h
#	frame/thread/bli_thrinfo_sup.c
#	frame/util/bli_util_check.c
#	frame/util/bli_util_check.h
#	frame/util/bli_util_oapi.c
#	frame/util/bli_util_oapi.h
#	kernels/zen/1/bli_copyv_zen_int.c
#	kernels/zen/1/bli_scalv_zen_int10.c
#	kernels/zen/1f/bli_axpyf_zen_int_4.c
#	kernels/zen/1f/bli_axpyf_zen_int_5.c
#	ref_kernels/1m/bli_packm_cxk_1er_ref.c
#	ref_kernels/3/bli_gemm_ref.c
#	ref_kernels/3/bli_gemmtrsm_ref.c
#	ref_kernels/bli_cntx_ref.c
#	ref_kernels/ind/bli_gemm1m_ref.c
#	ref_kernels/ind/bli_trsm1m_ref.c
#	testsuite/src/test_libblis.c
This enables better debugging since errors will show up based on the un-flattened filename and line number.
# Conflicts:
#	build/flatten-headers.py
#	frame/3/bli_l3_sup_var1n2m.c
# Conflicts:
#	build/flatten-headers.py
#	frame/3/bli_l3_sup_packm.c
#	frame/3/bli_l3_sup_packm.h
#	frame/3/bli_l3_sup_packm_var.c
#	frame/3/bli_l3_sup_packm_var.h
#	frame/3/bli_l3_sup_var1n2m.c
#	frame/3/gemmt/bli_gemmt_front.c
1. Add a check for pool exhaustion when freeing blocks. This detects double-free and other bad conditions without segfault.
2. Make sure to copy *all* block pointers when growing the pool size. Previously, checked-out block pointers were not copied, leading to the presence of uninitialized data.
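Both fixes can be sketched on a toy pool. The layout below (an array of block pointers with a `top` index separating checked-out slots from available ones) is an assumption made for this example, and all names are hypothetical; the real pool code may differ in its details.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    void  **blocks;  /* blocks[0..top) checked out, blocks[top..len) available */
    size_t  len;     /* total number of slots */
    size_t  top;     /* index of the next available block */
} pool;

/* Grow to new_len slots. ALL old slots are copied, including the
   checked-out range [0, top), so no slot is left uninitialized. */
int pool_grow(pool *p, size_t new_len, size_t block_size)
{
    if (new_len <= p->len) return 0;
    void **nb = malloc(new_len * sizeof *nb);
    if (nb == NULL) return -1;
    if (p->len > 0) memcpy(nb, p->blocks, p->len * sizeof *nb);
    for (size_t i = p->len; i < new_len; i++) {
        nb[i] = malloc(block_size);
        if (nb[i] == NULL) {                 /* undo the partial growth */
            while (i > p->len) free(nb[--i]);
            free(nb);
            return -1;
        }
    }
    free(p->blocks);
    p->blocks = nb;
    p->len = new_len;
    return 0;
}

/* Returns NULL when every block is checked out (caller may grow). */
void *pool_checkout(pool *p)
{
    return (p->top == p->len) ? NULL : p->blocks[p->top++];
}

/* Returns -1 instead of corrupting memory if nothing is checked out,
   which is what happens on a double free once the pool is full again. */
int pool_checkin(pool *p, void *block)
{
    if (p->top == 0) return -1;
    p->blocks[--p->top] = block;
    return 0;
}

/* Frees the available blocks and the slot array; checked-out blocks
   remain owned by whoever holds them. */
void pool_destroy(pool *p)
{
    for (size_t i = p->top; i < p->len; i++) free(p->blocks[i]);
    free(p->blocks);
    p->blocks = NULL;
    p->len = p->top = 0;
}
```

Note that this check only catches a double free once the pool is otherwise full; a double free while other blocks are still checked out would need additional bookkeeping.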
This option (disabled by default) enables compiling and linking with the Address Sanitizer library (ASan), via the -fsanitize=address flag supported by clang, gcc, and probably others. This flag is included for all files *except* optimized kernels, since it usually requires an extra register which violates the constraints for many gemm microkernels.
Reinstate check for checked-out blocks upon finalization. A flag has been added to indicate that the pool is actually under reinitialization (where checked-out blocks are OK), which temporarily disables the check. A memory leak where blocks are not checked back in is now correctly detected upon exit.
# Conflicts:
#	Makefile
#	common.mk
#	configure
#	frame/3/bli_l3_oapi_ex.c
#	frame/3/bli_l3_sup_packm.c
#	frame/3/bli_l3_sup_packm.h
#	frame/3/bli_l3_sup_ref.c
#	frame/3/bli_l3_sup_var1n2m.c
#	frame/base/bli_pool.c
#	frame/base/bli_rntm.h
#	frame/thread/bli_l3_decor.h
#	frame/thread/bli_l3_decor_openmp.c
#	frame/thread/bli_l3_decor_pthreads.c
#	frame/thread/bli_l3_decor_single.c
#	frame/thread/bli_l3_sup_decor.h
#	frame/thread/bli_l3_sup_decor_openmp.c
#	frame/thread/bli_l3_sup_decor_pthreads.c
#	frame/thread/bli_l3_sup_decor_single.c
#	frame/thread/bli_thrcomm.h
#	frame/thread/bli_thrcomm_openmp.c
#	frame/thread/bli_thrcomm_pthreads.c
#	frame/thread/bli_thrcomm_single.c
#	frame/thread/bli_thread.c
#	frame/thread/bli_thrinfo.c
#	frame/thread/bli_thrinfo.h
#	frame/thread/bli_thrinfo_sup.c
@devinamatthews
Member Author

@fgvanzee I think the Windows build is failing because some of the symbol names have changed. Is there an easy way to regenerate the symbols file?

@fgvanzee
Member

fgvanzee commented Oct 5, 2022

@fgvanzee I think the Windows build is failing because some of the symbol names have changed. Is there an easy way to regenerate the symbols file?

Very maybe! I'll see what I can come up with.

@fgvanzee
Member

fgvanzee commented Oct 5, 2022

LOL. Guess what I found?

Might need some updates, but it also might work as-is.

#
# This script regenerates a list of symbols for use when building
# Windows-compatible DLLs. We assume that this script will be run after
# running configure as:
#
# ./configure --enable-cblas haswell
#
# and compiling BLIS normally. (Notice that we also prune out all
# haswell/zen-related context initialization and reference kernels.)
#
libblis='lib/haswell/libblis.so'
symfile='build/libblis-symbols.def'
echo "EXPORTS" > def.exports
#nm -g ${libblis} | grep -o " D BLIS_.*" | cut -f2- "-dD" > def.blis_const
nm -g ${libblis} | grep -o " T bli_.*" | cut -f2- "-dT" > def.blis
nm -g ${libblis} | grep -o " T bla_.*" | cut -f2- "-dT" > def.blis_bla
nm -g ${libblis} | grep -o " T cblas_.*" | cut -f2- "-dT" > def.blis_cblas

Also, update symbols definition file for Windows. It seems this file was quite out-of-date.
@devinamatthews
Member Author

Yes, worked like a charm. Should be fixed now.

@devinamatthews
Member Author

Oh boy, the sandbox is broken. I guess I'll fix it.

@devinamatthews
Member Author

@fgvanzee All fixed now.

@fgvanzee
Member

fgvanzee commented Oct 18, 2022

@devinamatthews I was able to get non-deterministic trsm failures just by:

  1. ./configure -t openmp auto; make
  2. Increasing the problem size in the input.general.fast from 100 to 300.
  3. Setting export GOMP_CPU_AFFINITY="0-3" and export BLIS_NUM_THREADS=4 on my four-core KabyLake.
  4. Running make check.

At least that probably answers your question of when the trsm issues began? 😕

FWIW, I observed both native (dtrsm) and complex 1m failures.

@devinamatthews
Member Author

Yes I figured it was a submarine bug. Will fix.

devinamatthews added a commit that referenced this pull request Oct 23, 2022
@devinamatthews devinamatthews deleted the thrinfo_changes2 branch October 28, 2022 16:49