
[BUG] Threads time taken report. #1834

Closed
manuelgustavo opened this issue Aug 6, 2024 · 12 comments

Comments

@manuelgustavo

Describe the bug
A benchmark whose body contains only a sleep_for(10s) gets its reported time divided by the number of threads.

System
Which OS, compiler, and compiler version are you using:

  • OS: Linux / Ubuntu 22.04
  • Compiler and version: clang-14

To reproduce
Steps to reproduce the behavior:

Create a benchmark whose body contains only a std::this_thread::sleep_for(10s).
Execute it using ->Threads(2).
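
A minimal repro might look like this sketch (the benchmark name BM_SleepOnly is illustrative):

#include <benchmark/benchmark.h>

#include <chrono>
#include <thread>

// Each iteration only sleeps; no CPU work is done.
static void BM_SleepOnly(benchmark::State& state) {
  for (auto _ : state) {
    std::this_thread::sleep_for(std::chrono::seconds(10));
  }
}
// Run the same body on 2 threads.
BENCHMARK(BM_SleepOnly)->Threads(2);

BENCHMARK_MAIN();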

Expected behavior
It should report 10s, but it reports only 5s.

Additional context

I understand that this issue may include several changes downstream.

However, I don't find it right that code which takes 1s to run on 1 thread (as per sleep_for(1s)) gets reported as 0.5s with 2 threads, 0.25s with 4 threads, and so on.

"A question for you: Would you agree that a code that just has a wait time of 1s should be reported that it took 1s to run, regardless of the number of threads?"

Digging deeper into the code:

The following function DoNIterations() will call RunInThread():

BenchmarkRunner::IterationResults BenchmarkRunner::DoNIterations() {
...
RunInThread(&b, iters, 0, manager.get(), perf_counters_measurement_ptr);

And RunInThread() will eventually post its results:

State st =
      b->Run(iters, thread_id, &timer, manager, perf_counters_measurement);
  BM_CHECK(st.skipped() || st.iterations() >= st.max_iterations)
      << "Benchmark returned before State::KeepRunning() returned false!";
  {
    MutexLock l(manager->GetBenchmarkMutex());
    internal::ThreadManager::Result& results = manager->results;
    results.iterations += st.iterations();
    results.cpu_time_used += timer.cpu_time_used();
    results.real_time_used += timer.real_time_used();
    results.manual_time_used += timer.manual_time_used();
    results.complexity_n += st.complexity_length_n();
    internal::Increment(&results.counters, st.counters);
  }

But then further down, in DoNIterations(), you have:

  // Adjust real/manual time stats since they were reported per thread.
  i.results.real_time_used /= b.threads();
  i.results.manual_time_used /= b.threads();
  // If we were measuring whole-process CPU usage, adjust the CPU time too.
  if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();

So you're dividing the times by the number of threads, but the number of iterations doesn't get divided, causing the discrepancy in measurements that use the ->Threads([greater than 1]) functionality.
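
To make the discrepancy concrete with worked numbers (following the code above): with 2 threads each running iters iterations of sleep_for(1s), each thread's timer accumulates ~iters seconds of real time, so results.real_time_used sums to ~2*iters seconds. The /= b.threads() adjustment brings that back down to ~iters seconds, but results.iterations still holds 2*iters, so the per-iteration time reported later is iters / (2*iters) = 0.5s.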

Since the averaging happens later with the total number of iterations, eliminating the time adjustment (all of the /= b.threads(); lines) from the code would be the best solution. Alternatively, this could be controlled by an optional parameter (for initial testing purposes).

Again, this would only be relevant for benchmarks that use the ->Threads() function.

@manuelgustavo manuelgustavo mentioned this issue Aug 6, 2024
@dmah42
Member

dmah42 commented Aug 6, 2024

i don't see how we can safely change this behaviour without breaking a lot of existing users.

@manuelgustavo
Author

manuelgustavo commented Aug 6, 2024

With an optional command-line argument to turn it on, you wouldn't break existing users.

I'm testing with the following code removed. Ideally it should be enabled / disabled by that command-line argument.

diff --git a/src/benchmark_runner.cc b/src/benchmark_runner.cc
index 3a8c307..9e3bbc0 100644
--- a/src/benchmark_runner.cc
+++ b/src/benchmark_runner.cc
@@ -289,18 +289,12 @@ BenchmarkRunner::IterationResults BenchmarkRunner::DoNIterations() {
   // And get rid of the manager.
   manager.reset();
 
-  // Adjust real/manual time stats since they were reported per thread.
-  i.results.real_time_used /= b.threads();
-  i.results.manual_time_used /= b.threads();
-  // If we were measuring whole-process CPU usage, adjust the CPU time too.
-  if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();
-
   BM_VLOG(2) << "Ran in " << i.results.cpu_time_used << "/"
              << i.results.real_time_used << "\n";
 
   // By using KeepRunningBatch a benchmark can iterate more times than
   // requested, so take the iteration count from i.results.
-  i.iters = i.results.iterations / b.threads();
+  i.iters = i.results.iterations;
 
   // Base decisions off of real time if requested by this benchmark.
   i.seconds = i.results.cpu_time_used;

I'm testing it now in our environment; I will post the results.
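
For illustration, a flag-gated variant might look like the sketch below. The flag name benchmark_report_full_thread_time is hypothetical, and the sketch assumes the BM_DEFINE_bool / FLAGS_ convention from src/commandlineflags.h (flag registration and help text omitted):

// Hypothetical flag; not part of the library today.
BM_DEFINE_bool(benchmark_report_full_thread_time, false);

// In BenchmarkRunner::DoNIterations(), keep the division behind the flag:
if (!FLAGS_benchmark_report_full_thread_time) {
  // Default: current behaviour, averaging real/manual time across threads.
  i.results.real_time_used /= b.threads();
  i.results.manual_time_used /= b.threads();
  // If we were measuring whole-process CPU usage, adjust the CPU time too.
  if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();
}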

@dmah42
Member

dmah42 commented Aug 6, 2024

i'm not thrilled about the idea of fundamentally changing how we report on threaded benchmarks with a flag. it's the sort of thing that should be an opinion (and it is today) rather than having to a) maintain two ways of doing the same thing and b) explain to users which they should use and when.

do you think this is so fundamental a problem that we should consider changing the default behaviour?

@manuelgustavo
Author

Yes, I believe it's a fundamental problem.

I was executing the benchmark on divide-and-conquer code on a 128-core processor (with hyper-threading / turbo boost disabled). As I dialed up the number of threads, I expected 128 threads to be roughly the "sweet spot", but the times were still drastically improving at 1024 threads.

So I went for manual creation and destruction of threads, and then I got the expected results.
The "CPU timers" section of user_guide.md states that the time measured is the time taken by the main thread. If you consider the sleep_for(1s) scenario, it's clear that what is reported is not the time taken by the main thread, so something was wrong in the calculation.
What seems wrong is that the reported number of iterations is multiplied by the number of threads.
Therefore, dividing the time taken only by the number of iterations (since they already include each thread's iterations) should produce the correct results.

Sorry for taking a long time... I will do the tests and report back in the next hour or so. (It's been a busy day.)

@manuelgustavo
Author

@dmah42,

I'm happy with the test results.
The sleep_for(1s) test after the changes (one can see that it's very stable regardless of the number of threads):

static void simple_run0(benchmark::State& state)
{
    for (auto _ : state)
    {
        std::this_thread::sleep_for(1s);
    }
}
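
The registration for this run isn't shown; based on the doubling thread counts in the output below, it was presumably something like this (assumption):

BENCHMARK(simple_run0)->ThreadRange(1, 1024);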
CPU Caches:
  L1 Data 48 KiB (x128)
  L1 Instruction 32 KiB (x128)
  L2 Unified 2048 KiB (x128)
  L3 Unified 327680 KiB (x2)
Load Average: 1.33, 5.90, 4.25
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
simple_run0/threads:1            1000110269 ns        21783 ns           10
simple_run0/threads:2            1000113257 ns        20000 ns           40
simple_run0/threads:4            1000130917 ns        22819 ns            4
simple_run0/threads:8            1000117407 ns        23134 ns            8
simple_run0/threads:16           1000171957 ns        23078 ns           16
simple_run0/threads:32           1000184530 ns        22492 ns           32
simple_run0/threads:64           1000242465 ns        25044 ns           64
simple_run0/threads:128          1000154692 ns        13733 ns          128
simple_run0/threads:256          1000116723 ns        11508 ns          256
simple_run0/threads:512          1000151504 ns        13930 ns          512
simple_run0/threads:1024         1000097065 ns        23422 ns         1024

A more CPU-intensive benchmark (Monte Carlo simulation):
The code needs more fine-tuning, but it can be seen that the number of threads doesn't have much effect past a certain threshold.

simple_run1/8192/16/threads:1         23921 ms        23919 ms            1
simple_run1/8192/16/threads:2         11999 ms        11999 ms            2
simple_run1/8192/16/threads:4          6021 ms         6021 ms            4
simple_run1/8192/16/threads:8          3020 ms         3020 ms            8
simple_run1/8192/16/threads:16         1519 ms         1519 ms           16
simple_run1/8192/16/threads:32          765 ms          765 ms           32
simple_run1/8192/16/threads:64          395 ms          388 ms           64
simple_run1/8192/16/threads:128         325 ms          199 ms          128
simple_run1/8192/16/threads:256         230 ms          103 ms          256
simple_run1/8192/16/threads:512         250 ms         54.4 ms          512
simple_run1/8192/16/threads:1024        244 ms         29.7 ms         1024

I believe the code change is in the right direction. At least for our purposes, we will have to use the forked/altered version.

@dmah42
Member

dmah42 commented Aug 6, 2024

are you using ->UseRealTime() when you register the benchmark?

@manuelgustavo
Author

No, in the latest run (above) I'm not using UseRealTime().
Though I've tried it before, both alone and combined with MeasureProcessCPUTime().
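
For reference, those variants are applied when the benchmark is registered, e.g. (illustrative; simple_run0 as defined above):

BENCHMARK(simple_run0)->Threads(2);                          // the latest run
BENCHMARK(simple_run0)->Threads(2)->UseRealTime();           // tried earlier
BENCHMARK(simple_run0)->Threads(2)->MeasureProcessCPUTime(); // also tried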

@dmah42
Member

dmah42 commented Aug 6, 2024

it's been a while since i looked at this, but iirc the real time is divided by threads, as are iterations. manual time is also divided by both iterations and threads. process CPU time does the same, but non-process CPU time doesn't.

ie, this is why the CPU column stays ~constant for your sleep test and shows improvement for your monte carlo test (though i'm curious why the CPU time continues to shrink for larger numbers of threads while the real time flattens out).

your change manipulates iterations, which i think is incorrect based on the comment -- ie we get an invalid iteration count when running in batches and need to correct it by taking it from results, and as you can see from your results the number of iterations of a single loop is multiplied by threads. changing this will mess up the prediction for when to stop growing iterations to find the right number. so we might want to rethink the division of time instead.

removing the thread division from the real time being stored in results gives similar results to what you have (1s timings for sleeping for 1s).

given the previous discussion on this, i wonder if we should add a column. i'm not sure what to call it though. maybe "per-thread time"?
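
For what it's worth, such a column could be computed from the raw per-thread sums, since both the accumulated time and the accumulated iteration count in results are sums across threads. A hypothetical sketch, not the library's implementation:

// Average wall time a single thread spends per iteration. Assumes `results`
// holds the raw sums across threads (i.e. values taken before the
// `/= b.threads()` adjustment): real_time_used is the sum of each thread's
// elapsed real time, iterations the sum of each thread's iteration count.
double PerThreadSecondsPerIteration(
    const benchmark::internal::ThreadManager::Result& results) {
  return results.real_time_used / static_cast<double>(results.iterations);
}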

@manuelgustavo
Author

Yes, I've noticed something was slightly off.
I also wondered why the CPU time was still shrinking...
And the extra "per-thread time" column would be perfect!

Thanks!

@dmah42
Member

dmah42 commented Aug 7, 2024

i'm going to see what it looks like to add this to reports.

@manuelgustavo
Author

manuelgustavo commented Aug 7, 2024 via email

dmah42 added a commit that referenced this issue Aug 7, 2024
See #1834 for detailed description of why this is useful.
dmah42 added a commit that referenced this issue Aug 7, 2024
See #1834 for detailed description of why this is useful.
@LebedevRI
Collaborator

#1836
