Merge pull request #373 from ValeevGroup/evaleev/feature/tensor-memory-profile-and-trace

Evaleev/feature/tensor memory profile and trace
evaleev authored Oct 25, 2022
2 parents b525c3f + 5d99f65 commit c42361c
Showing 16 changed files with 157 additions and 129 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/ci.yml
@@ -32,6 +32,10 @@ jobs:
steps:
- uses: actions/checkout@v2

- uses: maxim-lobanov/setup-xcode@v1
with:
xcode-version: '<14'

- name: Host system info
shell: bash
run: cmake -P ${{github.workspace}}/ci/host_system_info.cmake
13 changes: 8 additions & 5 deletions CMakeLists.txt
@@ -235,12 +235,15 @@ endif()
set(INTEGER4 TRUE CACHE BOOL "If TRUE, use integer*4 Fortran integers in BLAS calls. Otherwise use integer*8.")
mark_as_advanced(INTEGER4)

# Set the CPU L1 cache line size.
set(VECTOR_ALIGNMENT "16" CACHE STRING "Set the vector alignment in memory (DO NOT CHANGE THIS VALUE UNLESS YOU KNOW WHAT YOU ARE DOING)")
mark_as_advanced(VECTOR_ALIGNMENT)
set(TILEDARRAY_ALIGNMENT ${VECTOR_ALIGNMENT})
# Set the align size
include(DetectAlignSize)
if (NOT DEFINED CACHE{TA_ALIGN_SIZE})
set(TA_ALIGN_SIZE "${TA_ALIGN_SIZE_DETECTED}" CACHE STRING "Set the default alignment of data buffers used by array tiles (DO NOT CHANGE THIS VALUE UNLESS YOU KNOW WHAT YOU ARE DOING)")
endif()
mark_as_advanced(TA_ALIGN_SIZE)
set(TILEDARRAY_ALIGN_SIZE ${TA_ALIGN_SIZE})

# Set the vectory.
# Set the CPU L1 cache line size.
set(CACHE_LINE_SIZE "64" CACHE STRING "Set the CPU L1 cache line size in bytes (DO NOT CHANGE THIS VALUE UNLESS YOU KNOW WHAT YOU ARE DOING)")
mark_as_advanced(CACHE_LINE_SIZE)
set(TILEDARRAY_CACHELINE_SIZE ${CACHE_LINE_SIZE})
14 changes: 7 additions & 7 deletions INSTALL.md
@@ -42,7 +42,7 @@ Both methods are supported. However, for most users we _strongly_ recommend to b
- Boost.Range: header-only, *only used for unit testing*
- [BTAS](http://github.com/ValeevGroup/BTAS), tag fba66ad9881ab29ea8df49ac6a6006cab3fb3ce5 . If usable BTAS installation is not found, TiledArray will download and compile
BTAS from source. *This is the recommended way to compile BTAS for all users*.
- [MADNESS](https://github.com/m-a-d-n-e-s-s/madness), tag 66b199a08bf5f33b1565811fc202a051ec1b0fbb .
- [MADNESS](https://github.com/m-a-d-n-e-s-s/madness), tag 40d2e38414179a8ebce508c7339fcee21244ffc6 .
Only the MADworld runtime and BLAS/LAPACK C API component of MADNESS is used by TiledArray.
If usable MADNESS installation is not found, TiledArray will download and compile
MADNESS from source. *This is the recommended way to compile MADNESS for all users*.
@@ -393,13 +393,13 @@ directory with:

## Advanced configure options:

The following CMake cache variables are tuning parameters. You should only
modify these values if you know the values for your patricular system.
The following CMake cache variables are for performance tuning. You should only
modify these values if you know the values for your particular system.

* `VECTOR_ALIGNMENT` -- The alignment of memory for Tensor in bytes [Default=16]
* `CACHE_LINE_SIZE` -- The cache line size in bytes [Default=64]
* `TA_ALIGN_SIZE` -- The alignment of memory allocated by TA::Tensor (and other artifacts like TA::host_allocator), in bytes. [Default is platform-specific; 64 if no platform-specific value is detected]
* `TA_CACHE_LINE_SIZE` -- The cache line size in bytes [Default=64]

`VECTOR_ALIGNMENT` controls the alignment of Tensor data, and `CACHE_LINE_SIZE`
`TA_ALIGN_SIZE` controls the alignment of memory allocated for tiles, and `TA_CACHE_LINE_SIZE`
controls the size of automatic loop unrolling for tensor operations. TiledArray
does not currently use explicit vector instructions (i.e. intrinsics), but
the code is written in such a way that compilers can more easily autovectorize
@@ -416,7 +416,7 @@ support may be added.
* `TA_TTG` -- Set to `ON` to find or fetch the TTG library. [Default=OFF].
* `TA_SIGNED_1INDEX_TYPE` -- Set to `OFF` to use unsigned 1-index coordinate type (default for TiledArray 1.0.0-alpha.2 and older). The default is `ON`, which enables the use of negative indices in coordinates.
* `TA_MAX_SOO_RANK_METADATA` -- Specifies the maximum rank for which to use Small Object Optimization (hence, avoid the use of the heap) for metadata. The default is `8`.
* `TA_TENSOR_MEM_PROFILE` -- Set to `ON` to profile memory allocations in TA::Tensor.
* `TA_TENSOR_MEM_PROFILE` -- Set to `ON` to profile host memory allocations used by TA::Tensor. This causes the use of Umpire for host memory allocation. This also enables additional tracing facilities provided by Umpire; these can be controlled via [environment variable `UMPIRE_LOG_LEVEL`](https://umpire.readthedocs.io/en/develop/sphinx/features/logging_and_replay.html), but note that the default is to log Umpire info into a file rather than stdout.
* `TA_UT_CTEST_TIMEOUT` -- The value (in seconds) of the timeout to use for running the TA unit tests via CTest when building the `check`/`check-tiledarray` targets. The default timeout is 1500s.
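
To make concrete what an alignment setting such as `TA_ALIGN_SIZE` means for a data buffer, here is a minimal sketch using only standard C++17 aligned allocation. It is not TiledArray's allocator: `kAlignSize`, `is_aligned`, `allocate_aligned_doubles`, and `deallocate_aligned_doubles` are hypothetical names, and the value 64 is assumed purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for the configured alignment, in bytes (assumed value).
constexpr std::size_t kAlignSize = 64;

// True if the address of p is a multiple of `alignment`.
bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Allocate storage for n doubles whose first byte is kAlignSize-aligned.
double* allocate_aligned_doubles(std::size_t n) {
  return static_cast<double*>(
      ::operator new(n * sizeof(double), std::align_val_t{kAlignSize}));
}

// Memory from the aligned operator new must be released with the matching
// aligned operator delete.
void deallocate_aligned_doubles(double* p) {
  ::operator delete(p, std::align_val_t{kAlignSize});
}
```

A buffer obtained this way starts on a `kAlignSize`-byte boundary, which is the property an autovectorizing compiler can exploit when it is told about the alignment.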

# Build TiledArray
17 changes: 17 additions & 0 deletions cmake/modules/DetectAlignSize.cmake
@@ -0,0 +1,17 @@
# see https://stackoverflow.com/a/69952705 and https://gitlab.kitware.com/cmake/cmake/-/blob/master/Modules/CMakeDetermineCompilerABI.cmake

set(BIN "${CMAKE_PLATFORM_INFO_DIR}/cmake/modules/DetectAlignSize.bin")
try_compile(DETECT_ALIGN_SIZE_COMPILED
${CMAKE_BINARY_DIR}
SOURCES ${PROJECT_SOURCE_DIR}/cmake/modules/DetectAlignSize.cpp
CMAKE_FLAGS ${CMAKE_CXX_FLAGS}
COPY_FILE "${BIN}"
COPY_FILE_ERROR copy_error
OUTPUT_VARIABLE OUTPUT
)
if (DETECT_ALIGN_SIZE_COMPILED AND NOT copy_error)
file(STRINGS "${BIN}" data REGEX "INFO:align_size\\[[^]]*\\]")
if (data MATCHES "INFO:align_size\\[0*([^]]*)\\]")
set(TA_ALIGN_SIZE_DETECTED "${CMAKE_MATCH_1}" CACHE INTERNAL "")
endif()
endif()
39 changes: 39 additions & 0 deletions cmake/modules/DetectAlignSize.cpp
@@ -0,0 +1,39 @@
//
// Created by Eduard Valeyev on 10/18/22.
//

#if defined(__x86_64__)
// check AVX-512 first: compilers that define __AVX512F__ also define __AVX__
#if defined(__AVX512F__)
#define PREFERRED_ALIGN_SIZE 64
#elif defined(__AVX__)
#define PREFERRED_ALIGN_SIZE 32
#else // 64-bit x86 should have SSE
#define PREFERRED_ALIGN_SIZE 16
#endif
#elif (defined(__ARM_NEON__) || defined(__aarch64__) || defined(_M_ARM) || \
defined(_M_ARM64))
#define PREFERRED_ALIGN_SIZE 16
#elif defined(__VECTOR4DOUBLE__)
#define PREFERRED_ALIGN_SIZE 32
#endif

// else: default to typical cache line size
#ifndef PREFERRED_ALIGN_SIZE
#define PREFERRED_ALIGN_SIZE 64
#endif

/* Preferred align size, in bytes. */
const char info_align_size[] = {
/* clang-format off */
'I', 'N', 'F', 'O', ':', 'a', 'l', 'i', 'g', 'n', '_', 's', 'i', 'z',
'e', '[', ('0' + ((PREFERRED_ALIGN_SIZE / 10) % 10)), ('0' + (PREFERRED_ALIGN_SIZE % 10)), ']',
'\0'
/* clang-format on */
};

int main(int argc, char* argv[]) {
int require = 0;
require += info_align_size[argc];
(void)argv;
return require;
}
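
The detection above works by embedding a marker string of the form `INFO:align_size[NN]` in the compiled binary, which `DetectAlignSize.cmake` then extracts with `file(STRINGS ...)` and a regex that drops leading zeros. The following is a minimal sketch of that round trip in plain C++, assuming hypothetical helper names (`encode_two_digits`, `make_info_string`, `parse_align_size`) that do not appear in the commit.

```cpp
#include <cassert>
#include <string>

// Two-digit decimal encoding, mirroring the char array above:
// tens digit first, then ones digit.
std::string encode_two_digits(int n) {
  std::string digits;
  digits.push_back(static_cast<char>('0' + (n / 10) % 10));
  digits.push_back(static_cast<char>('0' + n % 10));
  return digits;
}

// Build the marker string that ends up embedded in the binary.
std::string make_info_string(int align_size) {
  return "INFO:align_size[" + encode_two_digits(align_size) + "]";
}

// Find the marker inside an arbitrary blob and return the number between the
// brackets (leading zeros are ignored), or -1 if no well-formed marker exists.
int parse_align_size(const std::string& blob) {
  const std::string tag = "INFO:align_size[";
  const auto begin = blob.find(tag);
  if (begin == std::string::npos) return -1;
  const auto open = begin + tag.size();
  const auto close = blob.find(']', open);
  if (close == std::string::npos || close == open) return -1;
  int value = 0;
  for (auto i = open; i < close; ++i) {
    if (blob[i] < '0' || blob[i] > '9') return -1;
    value = value * 10 + (blob[i] - '0');
  }
  return value;
}
```

One consequence of the two-digit encoding is that it only represents values below 100: in this sketch `encode_two_digits(128)` yields `"28"`. That is sufficient for the 16, 32, and 64 byte values selected above.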
12 changes: 12 additions & 0 deletions examples/dgemm/ta_dense.cpp
@@ -139,6 +139,18 @@ void gemm_(TiledArray::World& world, const TiledArray::TiledRange& trange,
world.gop.fence();
madness::print_meminfo(world.rank(), str);
}
#ifdef TA_TENSOR_MEM_PROFILE
{
world.gop.fence();
std::cout << str << ": TA::Tensor allocated "
<< umpire::ResourceManager::getInstance()
.getAllocator("HOST")
.getHighWatermark()
<< " bytes and used "
<< TA::hostEnv::instance()->host_allocator().getHighWatermark()
<< " bytes" << std::endl;
}
#endif
};

memtrace("start");
10 changes: 5 additions & 5 deletions external/versions.cmake
@@ -19,8 +19,8 @@ set(TA_INSTALL_EIGEN_PREVIOUS_VERSION 3.3.7)
set(TA_INSTALL_EIGEN_URL_HASH SHA256=b4c198460eba6f28d34894e3a5710998818515104d6e74e5cc331ce31e46e626)
set(TA_INSTALL_EIGEN_PREVIOUS_URL_HASH MD5=b9e98a200d2455f06db9c661c5610496)

set(TA_TRACKED_MADNESS_TAG 66b199a08bf5f33b1565811fc202a051ec1b0fbb)
set(TA_TRACKED_MADNESS_PREVIOUS_TAG c0df7338779d06df7eaff31644d508940a7cfd90)
set(TA_TRACKED_MADNESS_TAG 40d2e38414179a8ebce508c7339fcee21244ffc6)
set(TA_TRACKED_MADNESS_PREVIOUS_TAG 66b199a08bf5f33b1565811fc202a051ec1b0fbb)
set(TA_TRACKED_MADNESS_VERSION 0.10.1)
set(TA_TRACKED_MADNESS_PREVIOUS_VERSION 0.10.1)

@@ -39,6 +39,6 @@ set(TA_TRACKED_SCALAPACKPP_PREVIOUS_TAG bf17a7246af38d34523bd0099b01d9961d06d311
set(TA_TRACKED_RANGEV3_TAG 2e0591c57fce2aca6073ad6e4fdc50d841827864)
set(TA_TRACKED_RANGEV3_PREVIOUS_TAG dbdaa247a25a0daa24c68f1286a5693c72ea0006)

set(TA_TRACKED_TTG_URL https://github.com/therault/ttg.git)
set(TA_TRACKED_TTG_TAG bb5309a5224e2546a5316daf7fc5c143f450f17b)
set(TA_TRACKED_TTG_PREVIOUS_TAG 5107143b418384c44587c2776a9e87065d33d670)
set(TA_TRACKED_TTG_URL https://github.com/TESSEorg/ttg)
set(TA_TRACKED_TTG_TAG 1251bec25e07a74a05e5cd4cdec181a95a9baa66)
set(TA_TRACKED_TTG_PREVIOUS_TAG bb5309a5224e2546a5316daf7fc5c143f450f17b)
6 changes: 3 additions & 3 deletions src/TiledArray/config.h.in
@@ -66,7 +66,7 @@
#cmakedefine TILEDARRAY_HAS_LONG_LONG 1

/* Define the default alignment for arrays required by vector operations. */
#cmakedefine TILEDARRAY_ALIGNMENT @TILEDARRAY_ALIGNMENT@
#cmakedefine TILEDARRAY_ALIGN_SIZE @TILEDARRAY_ALIGN_SIZE@

/* Define the size of the CPU L1 cache lines. */
#cmakedefine TILEDARRAY_CACHELINE_SIZE @TILEDARRAY_CACHELINE_SIZE@
@@ -125,11 +125,11 @@
/* Add macro TILEDARRAY_ALIGNED_STORAGE which forces alignment of variables */
#if defined(__clang__) || defined(__GNUC__) || defined(__PGI) || defined(__IBMCPP__) || defined(__ARMCC_VERSION)

#define TILEDARRAY_ALIGNED_STORAGE __attribute__((aligned(TILEDARRAY_ALIGNMENT)))
#define TILEDARRAY_ALIGNED_STORAGE __attribute__((aligned(TILEDARRAY_ALIGN_SIZE)))

#elif (defined _MSC_VER)

#define TILEDARRAY_ALIGNED_STORAGE __declspec(align(TILEDARRAY_ALIGNMENT))
#define TILEDARRAY_ALIGNED_STORAGE __declspec(align(TILEDARRAY_ALIGN_SIZE))

#else

11 changes: 8 additions & 3 deletions src/TiledArray/external/umpire.h
@@ -74,15 +74,20 @@ class umpire_allocator_impl {

TA_ASSERT(umpalloc_);

result = static_cast<pointer>(umpalloc_->allocate(n * sizeof(T)));
// this, instead of umpalloc_->allocate(n*sizeof(T)), profiles memory use
// even if introspection is off
result = static_cast<pointer>(
umpalloc_->getAllocationStrategy()->allocate_internal(n * sizeof(T)));

return result;
}

/// deallocate um memory using umpire dynamic pool
void deallocate(pointer ptr, size_t) {
void deallocate(pointer ptr, size_t size) {
TA_ASSERT(umpalloc_);
umpalloc_->deallocate(ptr);
// this, instead of umpalloc_->deallocate(ptr, size), profiles memory use
// even if introspection is off
umpalloc_->getAllocationStrategy()->deallocate_internal(ptr, size);
}

const umpire::Allocator* umpire_allocator() const { return umpalloc_; }
10 changes: 8 additions & 2 deletions src/TiledArray/fwd.h
@@ -64,8 +64,14 @@ class DensePolicy;
class SparsePolicy;

// TiledArray Tensors
// can also use host_allocator<T> and std::allocator<T> for A
template <typename T, typename A = Eigen::aligned_allocator<T>>
// can use any standard-compliant allocator such as std::allocator<T>
template <typename T, typename A =
#ifndef TA_TENSOR_MEM_PROFILE
Eigen::aligned_allocator<T>
#else
host_allocator<T>
#endif
>
class Tensor;

typedef Tensor<double> TensorD;
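
The change above switches the default allocator of `Tensor` at compile time depending on whether `TA_TENSOR_MEM_PROFILE` is defined. A minimal sketch of the same idea, with the conditional moved into an alias template so the class declaration stays free of preprocessor branches, could look as follows. All names here (`aligned_allocator_sketch`, `profiling_allocator_sketch`, `SKETCH_TENSOR_MEM_PROFILE`, `default_tensor_allocator`, `TensorSketch`) are hypothetical stand-ins, not TiledArray or Eigen types.

```cpp
#include <memory>
#include <type_traits>

// Hypothetical stand-ins for the aligned allocator and the profiling
// (pool-backed) allocator; both simply reuse std::allocator here.
template <typename T>
struct aligned_allocator_sketch : std::allocator<T> {};
template <typename T>
struct profiling_allocator_sketch : std::allocator<T> {};

// Define this macro (e.g. on the compiler command line) to mimic configuring
// with memory profiling turned on. It is left undefined in this sketch.
// #define SKETCH_TENSOR_MEM_PROFILE 1

// The compile-time switch lives in one alias template.
template <typename T>
using default_tensor_allocator =
#ifndef SKETCH_TENSOR_MEM_PROFILE
    aligned_allocator_sketch<T>;
#else
    profiling_allocator_sketch<T>;
#endif

// The tensor-like class picks up whichever default was selected; callers can
// still pass any standard-compliant allocator explicitly.
template <typename T, typename A = default_tensor_allocator<T>>
class TensorSketch {
 public:
  using value_type = T;
  using allocator_type = A;
};

using TensorSketchD = TensorSketch<double>;
```

Whether an alias is preferable to the inline `#ifndef` is a style choice; the alias gives the selected default a name that other declarations can reuse.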
2 changes: 1 addition & 1 deletion src/TiledArray/host/allocator.h
@@ -58,7 +58,7 @@ class host_allocator_impl : public umpire_allocator_impl<T> {
template <typename T1, typename T2>
friend bool operator==(const host_allocator_impl<T1>& lhs,
const host_allocator_impl<T2>& rhs) noexcept;
}; // class host_allocator
}; // class host_allocator_impl

template <class T1, class T2>
bool operator==(const host_allocator_impl<T1>& lhs,
47 changes: 26 additions & 21 deletions src/TiledArray/host/env.h
@@ -28,6 +28,7 @@

// for memory management
#include <umpire/Umpire.hpp>
#include <umpire/strategy/AlignedAllocator.hpp>
#include <umpire/strategy/QuickPool.hpp>
#include <umpire/strategy/SizeLimiter.hpp>
#include <umpire/strategy/ThreadSafeAllocator.hpp>
@@ -42,11 +43,11 @@
namespace TiledArray {

/**
* hostEnv set up global environment
* hostEnv maintains the (host-side, as opposed to device-side) environment,
* such as memory allocators
*
* Singleton class
* \note this is a Singleton
*/

class hostEnv {
public:
~hostEnv() = default;
@@ -56,20 +57,26 @@ class hostEnv {
hostEnv& operator=(const hostEnv&) = delete;
hostEnv& operator=(hostEnv&&) = delete;

/// access the instance, if not initialized will be initialized using default
/// params
/// access the singleton instance; if not initialized will be
/// initialized via hostEnv::initialize() with the default params
static std::unique_ptr<hostEnv>& instance() {
if (!instance_accessor()) {
initialize(TiledArray::get_default_world());
initialize();
}
return instance_accessor();
}

/// initialize the instance using explicit params
static void initialize(World& world,
const std::uint64_t max_memory_size = (1ul << 40),
const std::uint64_t page_size = (1ul << 22)) {
// initialize only when not initialized
/// \param max_memory_size max amount of memory (bytes) that TiledArray
/// can use for storage of TA::Tensor objects (these by default
/// store DistArray tile data and (if sparse) shape) [default=2^40]
/// \param page_size memory added to the pool in chunks of at least
/// this size (bytes) [default=2^25]
static void initialize(const std::uint64_t max_memory_size = (1ul << 40),
const std::uint64_t page_size = (1ul << 25)) {
static std::mutex mtx; // to make initialize() thread-safe
std::scoped_lock lock{mtx};
// only the winner of the lock race gets to initialize
if (instance_accessor() == nullptr) {
// uncomment to debug umpire ops
//
@@ -80,26 +87,24 @@ class hostEnv {

auto& rm = umpire::ResourceManager::getInstance();

// turn off Umpire introspection for non-Debug builds
#ifndef NDEBUG
constexpr auto introspect = true;
#else
// N.B. we don't rely on Umpire introspection (even for profiling)
constexpr auto introspect = false;
#endif

// allocate zero memory for device pool, same grain for subsequent allocs
// use QuickPool for host memory allocation, with min grain of 1 page
auto host_size_limited_alloc =
rm.makeAllocator<umpire::strategy::SizeLimiter, introspect>(
"size_limited_alloc", rm.getAllocator("HOST"), max_memory_size);
"SizeLimited_HOST", rm.getAllocator("HOST"), max_memory_size);
auto host_dynamic_pool =
rm.makeAllocator<umpire::strategy::QuickPool, introspect>(
"HostDynamicPool", host_size_limited_alloc, 0, page_size);
auto thread_safe_host_dynamic_pool =
"QuickPool_SizeLimited_HOST", host_size_limited_alloc, page_size,
page_size, /* alignment */ TILEDARRAY_ALIGN_SIZE);
auto thread_safe_host_aligned_dynamic_pool =
rm.makeAllocator<umpire::strategy::ThreadSafeAllocator, introspect>(
"ThreadSafeHostDynamicPool", host_dynamic_pool);
"ThreadSafe_QuickPool_SizeLimited_HOST", host_dynamic_pool);

auto host_env = std::unique_ptr<hostEnv>(
new hostEnv(world, thread_safe_host_dynamic_pool));
new hostEnv(TiledArray::get_default_world(),
thread_safe_host_aligned_dynamic_pool));
instance_accessor() = std::move(host_env);
}
}
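
The `initialize()` change above guards one-time construction of the singleton with a function-local `static std::mutex`. The following is a minimal sketch of that pattern with the Umpire setup left out. `EnvSketch` and its members are hypothetical, and `instance_accessor()` is assumed to be a function-local static holder because its definition is not part of this diff. Unlike the diff, this sketch's `instance()` always goes through `initialize()`, so the null check only ever happens under the lock.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>

class EnvSketch {
 public:
  EnvSketch(const EnvSketch&) = delete;
  EnvSketch& operator=(const EnvSketch&) = delete;

  // Access the singleton; initializes with default parameters on first use.
  static std::unique_ptr<EnvSketch>& instance() {
    initialize();  // no-op if already initialized
    return instance_accessor();
  }

  // Initialize with explicit parameters; only the first call has any effect.
  static void initialize(std::uint64_t max_memory_size = (1ull << 40),
                         std::uint64_t page_size = (1ull << 25)) {
    static std::mutex mtx;  // serializes concurrent callers
    std::scoped_lock lock{mtx};
    // only the first caller to acquire the lock constructs the instance
    if (instance_accessor() == nullptr) {
      instance_accessor() = std::unique_ptr<EnvSketch>(
          new EnvSketch(max_memory_size, page_size));
    }
  }

  std::uint64_t max_memory_size() const { return max_memory_size_; }
  std::uint64_t page_size() const { return page_size_; }

 private:
  EnvSketch(std::uint64_t max_memory_size, std::uint64_t page_size)
      : max_memory_size_(max_memory_size), page_size_(page_size) {}

  // Assumed holder for the singleton pointer (function-local static).
  static std::unique_ptr<EnvSketch>& instance_accessor() {
    static std::unique_ptr<EnvSketch> holder;
    return holder;
  }

  std::uint64_t max_memory_size_;
  std::uint64_t page_size_;
};
```

With this shape, a caller that wants non-default limits has to call `initialize(...)` before anything touches `instance()`; later calls with different arguments are silently ignored.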
6 changes: 3 additions & 3 deletions src/TiledArray/math/linalg/ttg/util.h
@@ -229,7 +229,7 @@ auto make_writer_ttg(

auto keymap2 = [pmap = A.pmap_shared(),
range = A.trange().tiles_range()](const Key2& key) {
const auto IJ = range.ordinal({key.I, key.J});
const auto IJ = range.ordinal({key[0], key[1]});
return pmap->owner(IJ);
};

@@ -239,8 +239,8 @@ auto make_writer_ttg(
(Layout == lapack::Layout::ColMajor
? tile.rows()
: tile.cols())); // the code below only works if tile's LD == rows
const int I = key.I;
const int J = key.J;
const int I = key[0];
const int J = key[1];
auto rng = A.trange().make_tile_range({I, J});
if constexpr (Uplo != lapack::Uplo::General) {
if (I != J &&