Problem benchmarking OpenBLAS GEMM #973

Closed
pdrocaldeira opened this issue Jun 5, 2020 · 9 comments

@pdrocaldeira

I'm sorry if this is not a bug, but I think this is the place to ask for help.

I'm trying to use this library to measure a GEMM with OpenBLAS.
So, basically I have:

#include <benchmark/benchmark.h>
#include <cblas.h>

static void BM_GEMM(benchmark::State& state) {
  // Perform setup here: three n x n matrices, zero-initialized so the
  // GEMM never reads indeterminate values.
  const int n = static_cast<int>(state.range(0));
  float* A = new float[n * n]();
  float* B = new float[n * n]();
  float* C = new float[n * n]();
  openblas_set_num_threads(1);  // ask OpenBLAS for single-threaded execution
  for (auto _ : state) {
    // C = alpha * A * B^T + beta * C, column-major, leading dimension n
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                n, n, n,
                1.0f,  // alpha
                A, n,
                B, n,
                1.0f,  // beta
                C, n);
    benchmark::DoNotOptimize(C);
  }
  delete[] A;
  delete[] B;
  delete[] C;
}

Problem is, for any state.range(0) bigger than 15 I get this crash:

terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument

I suspect that this is thread related, but I've compiled OpenBLAS with every single-thread flag I could find. When I run this code under gdb with a breakpoint set, it runs perfectly.

I'm using a relatively old version, 168604d.

Here's the backtrace using gdb:

(gdb) backtrace
#8  0x0000000010008da0 in benchmark::State::StartKeepRunning() ()
#9  0x0000000010006efc in benchmark::State::end (this=0x7fffffffe330) at benchmark/include/benchmark/benchmark.h:768
#10 BM_GEMM(state=...) at benchmark.cpp:27
#11 0x000000001000ba0c in benchmark::internal::FunctionBenchmark::Run(benchmark::State&) ()
#12 0x0000000010050e80 in benchmark::internal::BenchmarkInstance::Run(unsigned long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*) const ()
#13 0x000000001002da3c in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, unsigned long, int, benchmark::internal::ThreadManager*) ()
#14 0x000000001002e4d4 in benchmark::internal::RunBenchmark(benchmark::internal::BenchmarkInstance const&, std::vector<benchmark::BenchmarkReporter::Run, std::allocator<benchmark::BenchmarkReporter::Run> >*) ()
#15 0x000000001000a680 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#16 0x0000000010007694 in main (argc=1, argv=0x7ffffffff0e8) at benchmark.cpp:81
@dmah42
Member

dmah42 commented Jun 5, 2020

This Stack Overflow question suggests it is related to a mutex being destroyed before a lock is attempted.

Can you post your build commands and logs? #67 was similar and was caused by pthread not being linked.

I don't see anything in the newer commits that might affect this issue.
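
For anyone hitting the same `std::system_error`, it can help to log the error code and category rather than only `what()`. The following is a minimal sketch and not part of the original report: `describe_system_error` and `make_invalid_argument_error` are hypothetical helpers, and the error is constructed by hand only to show what an "Invalid argument" (EINVAL) failure carries.

```cpp
#include <string>
#include <system_error>

// Hypothetical helper: format the category name and numeric code of a
// std::system_error so it can be logged next to what().
std::string describe_system_error(const std::system_error& e) {
  return std::string(e.code().category().name()) + ":" +
         std::to_string(e.code().value());
}

// Builds the same kind of error the crash reports ("Invalid argument").
// Assumption: in the real crash the error is thrown from inside a library
// call, not constructed by hand as it is here.
std::system_error make_invalid_argument_error() {
  return std::system_error(std::make_error_code(std::errc::invalid_argument));
}
```

Wrapping the benchmark's entry point in a `try`/`catch (const std::system_error&)` and printing `describe_system_error(e)` is one way to confirm which error code is actually being raised.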

@pdrocaldeira
Author

First thing, thanks a lot for this fast response.

"mutex being destroyed before a lock is attempted."

Yeah, I came across that one, which is why I'm trying to get a "no thread" execution. I even tried not linking pthreads, but that doesn't work.

This is my build log:

#Make a build directory to place the build output.
mkdir build && cd build && CXX=g++ cmake -DCMAKE_BUILD_TYPE=Release -DGOOGLETEST_PATH=../googletest ../benchmark && make CXX=/g++ -j8
-- The CXX compiler identification is GNU 9.2.1
-- Check for working CXX compiler: g++
-- Check for working CXX compiler: g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- git Version: v1.5.0-168604d8
-- Version: 1.5.0
-- Performing Test HAVE_CXX_FLAG_STD_CXX11
-- Performing Test HAVE_CXX_FLAG_STD_CXX11 - Success
-- Performing Test HAVE_CXX_FLAG_WALL
-- Performing Test HAVE_CXX_FLAG_WALL - Success
-- Performing Test HAVE_CXX_FLAG_WEXTRA
-- Performing Test HAVE_CXX_FLAG_WEXTRA - Success
-- Performing Test HAVE_CXX_FLAG_WSHADOW
-- Performing Test HAVE_CXX_FLAG_WSHADOW - Success
-- Performing Test HAVE_CXX_FLAG_WERROR
-- Performing Test HAVE_CXX_FLAG_WERROR - Success
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32 - Failed
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS - Success
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED - Success
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WD654
-- Performing Test HAVE_CXX_FLAG_WD654 - Failed
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY - Failed
-- Performing Test HAVE_CXX_FLAG_COVERAGE
-- Performing Test HAVE_CXX_FLAG_COVERAGE - Success
-- Performing Test HAVE_STD_REGEX
-- Performing Test HAVE_STD_REGEX
-- Performing Test HAVE_STD_REGEX -- success
-- Performing Test HAVE_GNU_POSIX_REGEX
-- Performing Test HAVE_GNU_POSIX_REGEX
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Performing Test HAVE_POSIX_REGEX
-- Performing Test HAVE_POSIX_REGEX
-- Performing Test HAVE_POSIX_REGEX -- success
-- Performing Test HAVE_STEADY_CLOCK
-- Performing Test HAVE_STEADY_CLOCK
-- Performing Test HAVE_STEADY_CLOCK -- success
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- Looking for Google Test sources
-- Looking for Google Test sources in /home/caldeira/workspace/performance/googletest
-- Found Google Test in /home/caldeira/workspace/performance/googletest
-- Configuring done
-- Generating done
-- Build files have been written to: /home/caldeira/workspace/performance/build/third_party/googletest
make[1]: Entering directory '/home/caldeira/workspace/performance/build/third_party/googletest'
make[2]: Entering directory '/home/caldeira/workspace/performance/build/third_party/googletest'
make[3]: Entering directory '/home/caldeira/workspace/performance/build/third_party/googletest'
Scanning dependencies of target googletest
make[3]: Leaving directory '/home/caldeira/workspace/performance/build/third_party/googletest'
make[3]: Entering directory '/home/caldeira/workspace/performance/build/third_party/googletest'
[ 11%] Creating directories for 'googletest'
[ 22%] No download step for 'googletest'
[ 33%] No patch step for 'googletest'
[ 44%] No update step for 'googletest'
[ 55%] No configure step for 'googletest'
[ 66%] No build step for 'googletest'
[ 77%] No install step for 'googletest'
[ 88%] No test step for 'googletest'
[100%] Completed 'googletest'
make[3]: Leaving directory '/home/caldeira/workspace/performance/build/third_party/googletest'
[100%] Built target googletest
make[2]: Leaving directory '/home/caldeira/workspace/performance/build/third_party/googletest'
make[1]: Leaving directory '/home/caldeira/workspace/performance/build/third_party/googletest'
-- The C compiler identification is GNU 9.2.1
-- Check for working C compiler: gcc
-- Check for working C compiler: gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PythonInterp: /usr/bin/python (found version "2.7.17") 
-- Performing Test BENCHMARK_HAS_O3_FLAG
-- Performing Test BENCHMARK_HAS_O3_FLAG - Success
-- Performing Test BENCHMARK_HAS_CXX03_FLAG
-- Performing Test BENCHMARK_HAS_CXX03_FLAG - Success
-- Performing Test BENCHMARK_HAS_WNO_ODR
-- Performing Test BENCHMARK_HAS_WNO_ODR - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /home/caldeira/workspace/performance/build

And this is how I'm compiling the benchmark:

g++ -O0 -g -I ../OpenBLAS/install/include benchmark.cpp -pthread -std=c++11 -isystem benchmark/include \
-Lbuild/src -lbenchmark -L../OpenBLAS/install/lib -lopenblas -o benchmark.out

Running under gdb I can see only a single thread, but I don't think that guarantees OpenBLAS won't interfere with the mutex. The same GEMM code runs fine outside Google Benchmark, and the benchmark also runs fine once the OpenBLAS call is removed. The only broken scenario is using the two together.
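
Timing the code outside Google Benchmark can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the report: `naive_sgemm_nt` is a stand-in for `cblas_sgemm` so the sketch has no OpenBLAS dependency, and `time_ns_per_call` and `time_naive_gemm` are hypothetical helper names.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Stand-in for cblas_sgemm (assumption, for illustration only):
// column-major C = A * B^T + C for n x n matrices.
void naive_sgemm_nt(int n, const float* A, const float* B, float* C) {
  for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) {
      const float b = B[j + k * n];  // (B^T)(k, j) == B(j, k)
      for (int i = 0; i < n; ++i) {
        C[i + j * n] += A[i + k * n] * b;
      }
    }
  }
}

// Hypothetical manual harness: average wall-clock nanoseconds per call.
template <typename F>
double time_ns_per_call(F&& f, int repetitions) {
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < repetitions; ++r) f();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count() /
         repetitions;
}

// Times the stand-in GEMM for one problem size n.
double time_naive_gemm(int n, int repetitions) {
  const std::size_t count = static_cast<std::size_t>(n) * n;
  std::vector<float> A(count, 1.0f), B(count, 1.0f), C(count, 0.0f);
  return time_ns_per_call(
      [&] { naive_sgemm_nt(n, A.data(), B.data(), C.data()); }, repetitions);
}
```

A harness like this lacks the iteration-count tuning and the `DoNotOptimize` safeguard that Google Benchmark provides, which is part of why getting the two libraries to work together is worth the effort.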

To be clear, I don't need an elegant solution. This is just for some personal tests, but I'd really like to use Google Benchmark instead of doing everything "manually". This library is too good to give up easily. 😄

Thanks (again) for your help.

@dmah42
Member

dmah42 commented Jun 5, 2020

I have a local version running that isn't crashing, which is really annoying.

@dmah42
Member

dmah42 commented Jun 5, 2020

I also set the number of OpenBLAS threads to 4 in both to see what happened.

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_GEMM_unique_ptr/4           406 ns          406 ns      1503340
BM_GEMM_unique_ptr/8           277 ns          277 ns      2588458
BM_GEMM_unique_ptr/64        13207 ns        13189 ns        50773
BM_GEMM_unique_ptr/512     5357378 ns      3606007 ns          325
BM_GEMM_unique_ptr/1024   31492686 ns     23356052 ns           27
BM_GEMM_raw_ptr/4              245 ns          237 ns      2249392
BM_GEMM_raw_ptr/8              376 ns          366 ns      2088436
BM_GEMM_raw_ptr/64           16301 ns        15617 ns        37961
BM_GEMM_raw_ptr/512        6392767 ns      4189606 ns          174
BM_GEMM_raw_ptr/1024      33148312 ns     22723908 ns           35
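
To put the timings above in more familiar terms, they can be converted to GFLOP/s. This is a minimal sketch assuming the usual 2·n³ operation count for a square GEMM and using the wall-clock "Time" column; `gemm_gflops` is a hypothetical helper, not part of Google Benchmark or OpenBLAS.

```cpp
#include <cstdint>

// GFLOP/s for an n x n x n GEMM, counting 2*n^3 floating-point operations
// (one multiply and one add per inner-product term).
double gemm_gflops(std::int64_t n, double elapsed_ns) {
  const double flops = 2.0 * static_cast<double>(n) * n * n;
  return flops / elapsed_ns;  // flops per nanosecond equals GFLOP/s
}
```

For example, `gemm_gflops(1024, 31492686.0)` for the BM_GEMM_unique_ptr/1024 row comes out at roughly 68 GFLOP/s.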

@pdrocaldeira
Author

Well, that is good news: now we know that they can work together out of the box.

You're using clang instead of gcc, but I don't think gcc is to blame, at least not at the moment.

What gets my attention is that you are using the default version of OpenBLAS.
Although I have an installed OpenBLAS, I would like to use a self-compiled version, as in -L../OpenBLAS/install/lib. This version is just the current master from OpenBLAS, without any modification.

My guess is that your OpenBLAS version has some kind of lock mechanism that mine does not.
Maybe something like USE_LOCKING=1, as described here: https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded

@dmah42
Member

dmah42 commented Jun 5, 2020

"How can I use OpenBLAS in multi-threaded applications?
If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following."

strongly suggests that the two won't work together unless OpenBLAS is compiled this way.

@pdrocaldeira
Author

"My guess is that your OpenBLAS version has some kind of lock mechanism that mine does not."

It turns out that the problem was exactly that. I got it working by compiling OpenBLAS with:

USE_THREAD=0
USE_LOCKING=1
USE_OPENMP=0
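
One way to double-check which kind of OpenBLAS build a program ends up linked against is to ask the library at run time. The sketch below only maps the integer reported by OpenBLAS's `openblas_get_parallel()` to a label; `parallel_mode_name` is a hypothetical helper, and the numeric values are assumed to match the `OPENBLAS_SEQUENTIAL`, `OPENBLAS_THREAD` and `OPENBLAS_OPENMP` constants in `cblas.h`, so check them against your header.

```cpp
#include <string>

// Maps the value returned by openblas_get_parallel() to a readable label.
// Assumed values: 0 = sequential, 1 = pthreads, 2 = OpenMP (see cblas.h).
std::string parallel_mode_name(int mode) {
  switch (mode) {
    case 0: return "sequential";
    case 1: return "pthreads";
    case 2: return "OpenMP";
    default: return "unknown";
  }
}
```

In a program that includes `cblas.h` and links against the self-compiled library, printing `parallel_mode_name(openblas_get_parallel())` would be expected to give "sequential" for a build made with the flags above.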

I'll leave it here for future reference; maybe it will help someone else. Thanks a lot for your time, @dominichamon, I really mean it :)

See you 😄

@dmah42
Member

dmah42 commented Jun 8, 2020

I'm so glad you got it. I'll close this, but it will still be available if folks search.

dmah42 closed this as completed Jun 8, 2020