
[BUG] #1833

Closed

manuelgustavo opened this issue Aug 5, 2024 · 4 comments

@manuelgustavo

Describe the bug
Multi-threaded benchmarks always report the time spent divided by the number of threads.

System
Which OS, compiler, and compiler version are you using:

  • OS: Linux (Ubuntu 22.04)
  • Compiler and version: clang-14

To reproduce
Steps to reproduce the behaviour:

Use the following code:

#include <benchmark/benchmark.h>

#include <chrono>
#include <thread>

using namespace std::chrono_literals;

static void simple_run(benchmark::State& state)
{
    for (auto _ : state)
    {
        std::this_thread::sleep_for(1s);
    }
}
BENCHMARK(simple_run)->ThreadRange(1, 8 << 7);
BENCHMARK_MAIN();

Sample output:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
simple_run/threads:1    1000144900 ns        39474 ns           10
simple_run/threads:2     500064675 ns        35772 ns           20
simple_run/threads:4     250033026 ns        33278 ns           40
simple_run/threads:8     125015169 ns        28046 ns           80
...

Expected behavior
I would expect the reported Time to always match the one in simple_run/threads:1, because the observed wall-clock time doesn't actually change.

Additional context

Investigating the code base (and testing), I found that BenchmarkReporter::Run::GetAdjustedRealTime() doesn't take the number of threads into account in its formula.

The following change does the job.

diff --git a/src/reporter.cc b/src/reporter.cc
index 076bc31..103d86e 100644
--- a/src/reporter.cc
+++ b/src/reporter.cc
@@ -105,7 +105,7 @@ std::string BenchmarkReporter::Run::benchmark_name() const {
 
 double BenchmarkReporter::Run::GetAdjustedRealTime() const {
   double new_time = real_accumulated_time * GetTimeUnitMultiplier(time_unit);
-  if (iterations != 0) new_time /= static_cast<double>(iterations);
+  if (iterations != 0) new_time /= (static_cast<double>(iterations) / static_cast<double>(threads));
   return new_time;
 }
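
To sanity-check the patch, here is a quick standalone sketch of the arithmetic (values approximated from the threads:2 row of the sample output above; variable names are mine, for illustration only):

#include <cstdio>

int main() {
  // threads:2 row: 2 threads x 10 iterations x ~1s of sleep, with the
  // real time already divided by the thread count inside RunInThread.
  const double real_accumulated_time = 10.0012935;  // seconds
  const double iterations = 20.0;                   // total across all threads
  const double threads = 2.0;
  const double multiplier = 1e9;                    // nanoseconds

  // Current formula: divide by the total iteration count.
  double current = real_accumulated_time * multiplier / iterations;
  // Patched formula: divide by the per-thread iteration count.
  double patched = real_accumulated_time * multiplier / (iterations / threads);

  std::printf("current: %.0f ns, patched: %.0f ns\n", current, patched);
  // Prints roughly: current: 500064675 ns, patched: 1000129350 ns (~1s).
}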

BTW: Someone reported something similar at https://groups.google.com/g/benchmark-discuss/c/bWwa0QcZenc.

Since the fix looks quite trivial, I have doubts about whether leaving the number of threads out of the formula is intentional, so I preferred to raise this issue for someone more experienced with the original code.

@LebedevRI
Collaborator

This is not a bug in the general sense; it's working as intended. But indeed, I'm not quite sure whether the intention is correct or not.

@dmah42
Member

dmah42 commented Aug 6, 2024

This is very much intended; yes, we could go either way on it, and this is the way we've chosen.

The Time is the perceived time the work takes; i.e., by adding more threads you would expect (if the work is parallelisable) that it would take less time. If, however, your algorithm was locking between threads, you may not see this, and that would be surprising.

Changing this would be very disruptive, so it's not something we can easily entertain, even if we wanted to.
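
For illustration, a rough sketch of the locking case (a hypothetical benchmark, not from this repo): serialising a fixed chunk of work behind a mutex keeps the reported Time roughly flat as the thread count grows, which is exactly the lack-of-speedup signal the current semantics surface.

#include <benchmark/benchmark.h>

#include <mutex>

static std::mutex mu;

static void serialized_run(benchmark::State& state) {
  for (auto _ : state) {
    std::lock_guard<std::mutex> lock(mu);  // threads contend here
    int sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i;  // fixed chunk of work
    benchmark::DoNotOptimize(sum);  // keep the loop from being optimised away
  }
}
// Unlike the sleep_for example, Time stays roughly constant as the thread
// count rises, instead of falling.
BENCHMARK(serialized_run)->ThreadRange(1, 8);
BENCHMARK_MAIN();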

@dmah42 closed this as completed Aug 6, 2024
@manuelgustavo
Author

Hi @dmah42,

Thanks for your quick answer. It's an honour to try to contribute to this great tool.

I understand that this issue may imply several changes downstream.

However, it doesn't seem right to me that code which explicitly takes 1s to run on 1 thread (as per sleep_for(1s)) gets reported as 0.5s on 2 threads, 0.25s on 4 threads, and so on.

A question for you: would you agree that code that simply waits for 1s should be reported as taking 1s to run, regardless of the number of threads?

Digging deeper into the code:

The following function DoNIterations() will call RunInThread():

BenchmarkRunner::IterationResults BenchmarkRunner::DoNIterations() {
  ...
  RunInThread(&b, iters, 0, manager.get(), perf_counters_measurement_ptr);

And RunInThread() will eventually post its results:

  State st =
      b->Run(iters, thread_id, &timer, manager, perf_counters_measurement);
  BM_CHECK(st.skipped() || st.iterations() >= st.max_iterations)
      << "Benchmark returned before State::KeepRunning() returned false!";
  {
    MutexLock l(manager->GetBenchmarkMutex());
    internal::ThreadManager::Result& results = manager->results;
    results.iterations += st.iterations();
    results.cpu_time_used += timer.cpu_time_used();
    results.real_time_used += timer.real_time_used();
    results.manual_time_used += timer.manual_time_used();
    results.complexity_n += st.complexity_length_n();
    internal::Increment(&results.counters, st.counters);
  }

But then further down in RunInThread you have:

  // Adjust real/manual time stats since they were reported per thread.
  i.results.real_time_used /= b.threads();
  i.results.manual_time_used /= b.threads();
  // If we were measuring whole-process CPU usage, adjust the CPU time too.
  if (b.measure_process_cpu_time()) i.results.cpu_time_used /= b.threads();

So the real/manual times are divided by the number of threads, but the number of iterations isn't, causing the discrepancy in measurements that use the ->Threads() functionality with more than one thread.

Since the averaging happens later over the total number of iterations, eliminating the time adjustment (all of the /= b.threads(); lines) from the code would be the best solution. This could also be made optional via a parameter (for initial testing purposes).

Again, this would only be relevant for benchmarks that use the ->Threads function.
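
To make the arithmetic concrete, a quick sketch (values approximated from the threads:2 case above; names mine) showing that both candidate fixes report the same per-iteration time:

#include <cstdio>

int main() {
  // 2 threads x 10 iterations x ~1s of sleep each.
  const double summed_real_time = 20.0;  // seconds, before any adjustment
  const double threads = 2.0;
  const double iterations = 20.0;        // total across all threads

  // Option A: keep the per-thread division in RunInThread and patch the
  // reporter to divide by per-thread iterations.
  double option_a = (summed_real_time / threads) / (iterations / threads);
  // Option B: drop the per-thread division and divide by total iterations.
  double option_b = summed_real_time / iterations;

  std::printf("A: %.2f s, B: %.2f s\n", option_a, option_b);  // both ~1.00 s
}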

Please let me know what you think.

Thanks,
Manuel

@manuelgustavo
Author

I've opened #1834 to add further information.
