
[BUG] Threads time taken report. #1834

Closed
manuelgustavo opened this issue Aug 6, 2024 · 12 comments

Comments

@manuelgustavo

Describe the bug
A benchmark whose body contains only a sleep_for(10s) gets its reported time divided by the number of threads.

System
Which OS, compiler, and compiler version are you using:

  • OS: Linux / Ubuntu 22.04
  • Compiler and version: clang-14

To reproduce
Steps to reproduce the behavior:

Create a benchmark whose body contains only a std::this_thread::sleep_for(10s).
Execute it using ->Threads(2).
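
A minimal repro might look like this sketch (the benchmark name BM_SleepOnly is illustrative):

#include <benchmark/benchmark.h>

#include <chrono>
#include <thread>

// Each iteration only sleeps; no CPU work is done.
static void BM_SleepOnly(benchmark::State& state) {
  for (auto _ : state) {
    std::this_thread::sleep_for(std::chrono::seconds(10));
  }
}
// Run the same body on 2 threads.
BENCHMARK(BM_SleepOnly)->Threads(2);

BENCHMARK_MAIN();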

Expected behavior
It should report 10s, but it reports only 5s.

Additional context

I understand that this issue may include several changes downstream.

However, I don't find it right that code which takes 1s to run on 1 thread (as per sleep_for(1s)) gets reported as 0.5s with 2 threads, 0.25s with 4 threads, and so on.

"A question for you: Would you agree that a code that just has a wait time of 1s should be reported that it took 1s to run, regardless of the number of threads?"

Digging deeper into the code:

The following function DoNIterations() will call RunInThread():

BenchmarkRunner::IterationResults BenchmarkRunner::DoNIterations() {
...
RunInThread(&b, iters, 0, manager.get(), perf_counters_measurement_ptr);

And RunInThread() will eventually post its results:

State st =
      b->Run(iters, thread_id, &timer, manager, perf_counters_measurement);
  BM_CHECK(st.skipped() || st.iterations() >= st.max_iterations)
      << "Benchmark returned before State::KeepRunning() returned false!";
  {
    MutexLock l(manager->GetBenchmarkMutex());
    internal::ThreadManager::Result& results = manager->results;
    results.iterations += st.iterations();
    results.cpu_time_used += timer.cpu_time_used();
    results.real_time_used += timer.real_time_used();
    results.manual_time_used += timer.manual_time_used();
    results.complexity_n += st.complexity_length_n();
    internal::Increment(&results.counters, st.counters);
  }

But then further down, in DoNIterations(), you have:

  // Adjust real/manual time stats since they were reported per thread.
  i.results.real_time_used /= b.threads();
  i.results.manual_time_used /= b.threads();
  // If we were measuring whole-process CPU usage, adjust the CPU time too.
  if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();

So you're dividing the times by the number of threads, but the number of iterations doesn't get divided, causing the discrepancy in measurements that use the ->Threads([greater than 1]) functionality.
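
To make the discrepancy concrete with worked numbers (following the code above): with 2 threads each running iters iterations of sleep_for(1s), each thread's timer accumulates ~iters seconds of real time, so results.real_time_used sums to ~2*iters seconds. The /= b.threads() adjustment brings that back down to ~iters seconds, but results.iterations still holds 2*iters, so the per-iteration time reported later is iters / (2*iters) = 0.5s.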

Since the averaging happens later with the total number of iterations, eliminating the time adjustment (all of the /= b.threads(); lines) from the code would be the best solution. Alternatively, this could be controlled by an optional parameter (for initial testing purposes).

Again, this would only be relevant for benchmarks that use the ->Threads() function.

@manuelgustavo manuelgustavo mentioned this issue Aug 6, 2024
@dmah42
Member

dmah42 commented Aug 6, 2024

i don't see how we can safely change this behaviour without breaking a lot of existing users.

@manuelgustavo
Author

manuelgustavo commented Aug 6, 2024

With an optional command-line argument to turn it on, you wouldn't break existing users.

I'm testing with the following code removed. Ideally it should be enabled / disabled by that command-line argument.

diff --git a/src/benchmark_runner.cc b/src/benchmark_runner.cc
index 3a8c307..9e3bbc0 100644
--- a/src/benchmark_runner.cc
+++ b/src/benchmark_runner.cc
@@ -289,18 +289,12 @@ BenchmarkRunner::IterationResults BenchmarkRunner::DoNIterations() {
   // And get rid of the manager.
   manager.reset();
 
-  // Adjust real/manual time stats since they were reported per thread.
-  i.results.real_time_used /= b.threads();
-  i.results.manual_time_used /= b.threads();
-  // If we were measuring whole-process CPU usage, adjust the CPU time too.
-  if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();
-
   BM_VLOG(2) << "Ran in " << i.results.cpu_time_used << "/"
              << i.results.real_time_used << "\n";
 
   // By using KeepRunningBatch a benchmark can iterate more times than
   // requested, so take the iteration count from i.results.
-  i.iters = i.results.iterations / b.threads();
+  i.iters = i.results.iterations;
 
   // Base decisions off of real time if requested by this benchmark.
   i.seconds = i.results.cpu_time_used;

I'm testing it now in our environment; I will post the results.
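
For illustration, a flag-gated variant might look like the sketch below. The flag name benchmark_report_full_thread_time is hypothetical, and the sketch assumes the BM_DEFINE_bool / FLAGS_ convention from src/commandlineflags.h (flag registration and help text omitted):

// Hypothetical flag; not part of the library today.
BM_DEFINE_bool(benchmark_report_full_thread_time, false);

// In BenchmarkRunner::DoNIterations(), keep the division behind the flag:
if (!FLAGS_benchmark_report_full_thread_time) {
  // Default: current behaviour, averaging real/manual time across threads.
  i.results.real_time_used /= b.threads();
  i.results.manual_time_used /= b.threads();
  // If we were measuring whole-process CPU usage, adjust the CPU time too.
  if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();
}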

@dmah42
Member

dmah42 commented Aug 6, 2024

i'm not thrilled about the idea of fundamentally changing how we report on threaded benchmarks with a flag. it's the sort of thing that should be an opinion (and it is today) rather than having to a) maintain two ways of doing the same thing and b) explain to users which they should use and when.

do you think this is so fundamental a problem that we should consider changing the default behaviour?

@manuelgustavo
Author

Yes, I believe it's a fundamental problem.

I was executing the benchmark on divide-and-conquer code on a 128-core processor (with hyper-threading / turbo boost disabled). As I dialed up the number of threads, I expected 128 threads to be roughly the "sweet spot", but the times were still drastically improving at 1024 threads.

So I went for manual creation and destruction of threads, and then I got the expected results.
The "CPU timers" section of user_guide.md states that the time measured is the time taken by the main thread. If you consider the sleep_for(1s) scenario, it's clear that what is reported is not the time taken by the main thread, so something was wrong in the calculation.
What seems wrong is that the reported number of iterations is multiplied by the number of threads.
Therefore, dividing the time taken only by the number of iterations (since they already include each thread's iterations) should produce the correct results.

Sorry for taking a long time... I will do the tests and report back in the next hour or so. (It's been a busy day.)

@manuelgustavo
Author

@dmah42,

I'm happy with the test results.
The sleep_for(1s) test after the changes (one can see that it's very stable regardless of the number of threads):

static void simple_run0(benchmark::State& state)
{
    for (auto _ : state)
    {
        std::this_thread::sleep_for(1s);
    }
}
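
The registration for this run isn't shown; based on the doubling thread counts in the output below, it was presumably something like this (assumption):

BENCHMARK(simple_run0)->ThreadRange(1, 1024);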
CPU Caches:
  L1 Data 48 KiB (x128)
  L1 Instruction 32 KiB (x128)
  L2 Unified 2048 KiB (x128)
  L3 Unified 327680 KiB (x2)
Load Average: 1.33, 5.90, 4.25
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
simple_run0/threads:1            1000110269 ns        21783 ns           10
simple_run0/threads:2            1000113257 ns        20000 ns           40
simple_run0/threads:4            1000130917 ns        22819 ns            4
simple_run0/threads:8            1000117407 ns        23134 ns            8
simple_run0/threads:16           1000171957 ns        23078 ns           16
simple_run0/threads:32           1000184530 ns        22492 ns           32
simple_run0/threads:64           1000242465 ns        25044 ns           64
simple_run0/threads:128          1000154692 ns        13733 ns          128
simple_run0/threads:256          1000116723 ns        11508 ns          256
simple_run0/threads:512          1000151504 ns        13930 ns          512
simple_run0/threads:1024         1000097065 ns        23422 ns         1024

A more CPU-intensive benchmark (Monte Carlo simulation):
The code needs more fine-tuning, but it can be seen that the number of threads doesn't have much effect past a certain threshold.

simple_run1/8192/16/threads:1         23921 ms        23919 ms            1
simple_run1/8192/16/threads:2         11999 ms        11999 ms            2
simple_run1/8192/16/threads:4          6021 ms         6021 ms            4
simple_run1/8192/16/threads:8          3020 ms         3020 ms            8
simple_run1/8192/16/threads:16         1519 ms         1519 ms           16
simple_run1/8192/16/threads:32          765 ms          765 ms           32
simple_run1/8192/16/threads:64          395 ms          388 ms           64
simple_run1/8192/16/threads:128         325 ms          199 ms          128
simple_run1/8192/16/threads:256         230 ms          103 ms          256
simple_run1/8192/16/threads:512         250 ms         54.4 ms          512
simple_run1/8192/16/threads:1024        244 ms         29.7 ms         1024

I believe the code change is in the right direction. At least for our purposes, we will have to use the forked/altered version.

@dmah42
Member

dmah42 commented Aug 6, 2024

are you using ->UseRealTime() when you register the benchmark?

@manuelgustavo
Author

No, in the latest run (above) I'm not using UseRealTime().
Though I've tried it before, both alone and combined with MeasureProcessCPUTime().
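
For reference, those variants are applied when the benchmark is registered, e.g. (illustrative; simple_run0 as defined above):

BENCHMARK(simple_run0)->Threads(2);                          // the latest run
BENCHMARK(simple_run0)->Threads(2)->UseRealTime();           // tried earlier
BENCHMARK(simple_run0)->Threads(2)->MeasureProcessCPUTime(); // also tried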

@dmah42
Member

dmah42 commented Aug 6, 2024

it's been a while since i looked at this, but iirc the real time is divided by threads, as are iterations. manual time is also divided by both iterations and threads. process CPU time does the same, but non-process CPU time doesn't.

ie, this is why the CPU column stays ~constant for your sleep test and shows improvement for your monte carlo test (though i'm curious why the CPU time continues to shrink for larger numbers of threads while the real time flattens out).

your change manipulates iterations, which i think is incorrect based on the comment -- ie we get an invalid iteration count when running in batches and need to correct it by taking it from results, and as you can see from your results the number of iterations of a single loop is multiplied by threads. changing this will mess up the prediction for when to stop growing iterations to find the right number. so we might want to rethink the division of time instead.

removing the thread division from the real time being stored in results gives similar results to what you have (1s timings for sleeping for 1s).

given the previous discussion on this, i wonder if we should add a column. i'm not sure what to call it though. maybe "per-thread time"?
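
For what it's worth, such a column could be computed from the raw per-thread sums, since both the accumulated time and the accumulated iteration count in results are sums across threads. A hypothetical sketch, not the library's implementation:

// Average wall time a single thread spends per iteration. Assumes `results`
// holds the raw sums across threads (i.e. values taken before the
// `/= b.threads()` adjustment): real_time_used is the sum of each thread's
// elapsed real time, iterations the sum of each thread's iteration count.
double PerThreadSecondsPerIteration(
    const benchmark::internal::ThreadManager::Result& results) {
  return results.real_time_used / static_cast<double>(results.iterations);
}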

@manuelgustavo
Author

Yes, I've noticed something was slightly off.
I also wondered why the CPU time was still shrinking...
And the extra "per-thread time" column would be perfect!

Thanks!

@dmah42
Member

dmah42 commented Aug 7, 2024

i'm going to see what it looks like to add this to reports.

@manuelgustavo
Author

manuelgustavo commented Aug 7, 2024 via email

dmah42 added a commit that referenced this issue Aug 7, 2024
See #1834 for detailed description of why this is useful.
dmah42 added a commit that referenced this issue Aug 7, 2024
See #1834 for detailed description of why this is useful.
@LebedevRI
Collaborator

#1836
