
Fix reported time for multithreaded benchmarks #946

Closed
wants to merge 2 commits into from

Conversation

koct9i
Contributor

@koct9i koct9i commented Mar 15, 2020

In the report, real and manual time per iteration are erroneously divided by the thread count,
because the total time is already adjusted to per-thread while the iterations counter is not.
See #769

Process CPU time is special - each thread reports the time spent by all threads,
thus it actually must be divided by the thread count once to get the total CPU time and
a second time to get the per-thread CPU time.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@koct9i
Contributor Author

koct9i commented Mar 15, 2020

@googlebot I signed it!

@googlebot

CLAs look good, thanks!


@LebedevRI
Collaborator

Thank you for looking into this. With multi-threaded benchmarks, it is hard to tell whether
the currently-reported numbers are correct or not, as noted in the mentioned #769.

@koct9i
Contributor Author

koct9i commented Mar 15, 2020

The bug in the current reports is pretty obvious for a trivial benchmark which spins or sleeps for a fixed amount of time.
It seems the existing test/internal_threading_test.cc could be used for that, but it needs validation.
Writing checks for it using "output_test" is a pain =) I'll play with a custom reporter; that should be easier.

@koct9i koct9i force-pushed the fix-multithreaded-time-report branch 3 times, most recently from 9bad2b6 to bc9a6a2 Compare March 24, 2020 13:49
@koct9i koct9i force-pushed the fix-multithreaded-time-report branch 5 times, most recently from 4f5daec to c39ee1c Compare March 29, 2020 10:26
@koct9i koct9i force-pushed the fix-multithreaded-time-report branch from c39ee1c to d97af0a Compare March 30, 2020 11:57
@koct9i
Contributor Author

koct9i commented Mar 31, 2020

Oh, I get it - the spin-wait in the test watches real time. 🤦‍♂️

@koct9i koct9i force-pushed the fix-multithreaded-time-report branch 6 times, most recently from b4e206a to ffac743 Compare March 31, 2020 10:10
@koct9i
Contributor Author

koct9i commented Apr 7, 2020

Bump. Ready for review/merge.

@LebedevRI
Collaborator

LebedevRI left a comment

Now that the fresh release is out the door:

Overall, this looks good to me.
I believe this is moving in the right direction,
but i'm not all that familiar with ->Threads().

@dominichamon does this look obviously-wrong
given the initial ->Threads() design,
and previous discussions and #769?

Comment on lines +913 to +918
For multithreaded benchmarks the timer measures the average among the working threads.

If the benchmark itself uses threads internally, this measurement may not
be what you are looking for. Instead, there is a way to measure the total
CPU usage of the process, by all the threads. The resulting total CPU time
will also be divided among the working threads.
Collaborator

I wonder if this is too vague. Maybe this should mention that it is specifically about https://github.com/google/benchmark#multithreaded-benchmarks, not the threads inside of the workload itself.

@dmah42 dmah42 self-assigned this Jun 17, 2020
@dmah42
Member

dmah42 commented Jun 17, 2020

the code seems reasonable. The threads division has changed back and forth a few times (as i mentioned in #769).

i'm just running a diff between the head version and this change to see what the output looks like for tests.

@dmah42
Member

dmah42 commented Jun 17, 2020

This is the before

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_spin_empty/8/threads:4                           23.0 ns         90.2 ns      7967324
BM_spin_empty/512/threads:4                         1367 ns         4908 ns       158756
BM_spin_empty/8192/threads:4                       22378 ns        79529 ns         9364

This is the after

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_spin_empty/8/threads:4                            168 ns         95.0 ns     33131036
BM_spin_empty/512/threads:4                         9492 ns         4960 ns       512840
BM_spin_empty/8192/threads:4                      155608 ns        80357 ns        31952

Before this change, it's clear that the time is the end-to-end time given the parallelisation (23 ~= 90/4, 1300 ~= 4900/4, etc). In the latter case i find it hard to understand why we're ~doubling the reported time.

another example:

before

BM_pause_during                                      804 ns          796 ns       798884
BM_pause_during/threads:4                            355 ns         1221 ns       568668
BM_pause_during/real_time                            972 ns          970 ns       848250
BM_pause_during/real_time/threads:4                  361 ns         1250 ns      1890216

after

BM_pause_during                                     1248 ns         1188 ns       594011
BM_pause_during/threads:4                           2873 ns         1297 ns      2099924
BM_pause_during/real_time                           1326 ns         1196 ns       552044
BM_pause_during/real_time/threads:4                 2357 ns         1300 ns       975240

again i find it hard to understand why the threading appears to cause the measured time to double. the former is what i'd expect: ~4x improvement by using 4 threads.

@koct9i
Copy link
Contributor Author

koct9i commented Jun 18, 2020

My patch normalizes all times to 1 iteration in 1 thread. So, if the threads run the whole time, the "real" time should be equal to the "CPU" time or slightly bigger. With perfect parallelism the times should stay the same regardless of the thread count.

In your case the "real" time is much bigger than the "CPU" time, which means the threads were preempted. I guess you are using macOS? These artifacts were probably caused by its scheduler.

For Linux:

Before

BM_spin_empty/8                                     43.2 ns         43.2 ns     16180998
BM_spin_empty/512                                   2448 ns         2448 ns       285791
BM_spin_empty/8192                                 39142 ns        39138 ns        17843
BM_spin_empty/8/threads:8                           11.8 ns         94.0 ns      7782976
BM_spin_empty/512/threads:8                          708 ns         5638 ns       125112
BM_spin_empty/8192/threads:8                       11666 ns        92589 ns         7576

After

BM_spin_empty/8                                     46.8 ns         46.8 ns     14851377
BM_spin_empty/512                                   2638 ns         2638 ns       263970
BM_spin_empty/8192                                 42235 ns        42222 ns        16626
BM_spin_empty/8/threads:8                            109 ns          109 ns     54453880
BM_spin_empty/512/threads:8                         6212 ns         6201 ns       911128
BM_spin_empty/8192/threads:8                      101850 ns       101688 ns        55528

As you can see, in 'Before' the 'real' time was erroneously divided by 8 for the per-CPU threads.

Performance with per-CPU threads is around twice lower because this is a 4-core CPU with hyper-threading, and turbo boost cannot reach the top frequency when all cores are active.

koct9i added 2 commits June 18, 2020 12:50
Right now "real" and "manual" time are erroneously divided by the thread count,
while they are also divided by the iterations count, which already includes the thread count.

The iteration count and the measured time are reported by each thread and summed.
Thus total_time / total_iterations gives the average time per iteration.
The only special case is MeasureProcessCPUTime() - each thread reports the
time spent by all threads, thus the result must be divided by the thread count.

The duration of each run (repetition) is total_time / nr_threads,
or time_per_iteration * total_iterations / nr_threads.

Fixes: google#769

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Run 10 iterations of a 50ms busyloop (by thread CPU time) and
check the resulting times with absolute and relative accuracy of ±10%.

If the system does not have enough CPUs (1 for each thread + 1 for infrastructure),
then check only that the measured real time is not less than expected.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
@koct9i koct9i force-pushed the fix-multithreaded-time-report branch from fb4cc71 to 89c72bf Compare June 18, 2020 09:51
@dmah42
Member

dmah42 commented Jun 18, 2020

I think this change is correct. The wall-time (Time) and CPU time should match in terms of per-thread vs global. The benefit of threading should be visible when the real and CPU times don't change but the amount of processing (items per second, etc) increases. As noted, if threading does lead to a change in timing it indicates where the parallelism breaks down.

@krzikalla

krzikalla commented Aug 14, 2020

Since it is related, I want to add another problem to this issue: the real time currently measures the average wall-clock time of every single thread. It should rather measure the wall-clock time of the thread pack. That is, the actual real time is the time when the last thread enters FinishKeepRunning. Only by that means can you spot balancing issues.
Example: if one thread runs 1 sec and 7 threads run 1 ms each, then currently 0.125 sec real time is reported. This is not the wall-clock time! The wall-clock time is 1 sec.
Solution: one could move the barrier in FinishKeepRunning before PauseTiming, but that would tamper with the timing. Instead, don't sum up the real times, but use the maximum one at

results.real_time_used += timer.real_time_used();

Of course you must not divide the real time by num_threads then.

@dmah42
Member

dmah42 commented Aug 18, 2020

@krzikalla would it be ok to defer the wall-clock time change to a separate PR or do you want to solve that as part of this change?

@koct9i
Contributor Author

koct9i commented Aug 18, 2020

@krzikalla

Example: if one thread runs 1 sec and 7 threads run 1 ms each, then currently 0.125 sec real time is reported. This is not the wall-clock time! The wall-clock time is 1 sec.

This seems odd. If each thread did the same single iteration, this means the execution speed is unpredictable.
Instead of forging "sane" metrics, the benchmark should measure and report the deviation among iterations.

The present design of the benchmark loop could easily estimate the deviation among the per-thread averages.
To catch variations within a thread it would have to measure time periodically or compare progress.

@krzikalla

would it be ok to defer the wall-clock time change to a separate PR or do you want to solve that as part of this change?

It can be a separate PR. I just wanted to raise awareness for this issue and thought, it would be better to add it to this discussion here instead of creating a new one.

This seems odd. If each thread did the same single iteration, this means the execution speed is unpredictable.

Not necessarily. Other possible scenarios are poor balancing or a hit by Amdahl's law.

Instead of forging "sane" metrics, the benchmark should measure and report the deviation among iterations.

The present design of the benchmark loop could easily estimate the deviation among the per-thread averages.
To catch variations within a thread it would have to measure time periodically or compare progress.

Normally, you don't report deviations. You report wall-clock time, since that affects the usability of the code.

@dmah42
Member

dmah42 commented Apr 27, 2021

I'm returning to this, sorry for the break. Do you think there's more to do here or are you happy with it?

@dmah42 dmah42 added the incomplete work needed label May 6, 2021
@dmah42 dmah42 closed this May 30, 2021
@dmah42 dmah42 deleted the branch google:master May 30, 2021 08:53
@LebedevRI
Collaborator

This was fixed at the source in #1836.

robot-piglet added a commit to ytsaurus/ytsaurus that referenced this pull request Aug 23, 2024
As [#1836](google/benchmark#1836) has landed into upstream, there is no need to keep [#946](google/benchmark#946) as a patch.
d9d33ff20e1e7767759fc07ea97c1e661716f26a
robot-piglet added a commit to ytsaurus/ytsaurus-cpp-sdk that referenced this pull request Aug 23, 2024
As [#1836](google/benchmark#1836) has landed into upstream, there is no need to keep [#946](google/benchmark#946) as a patch.
d9d33ff20e1e7767759fc07ea97c1e661716f26a
robot-piglet added a commit to catboost/catboost that referenced this pull request Aug 23, 2024
As [#1836](google/benchmark#1836) has landed into upstream, there is no need to keep [#946](google/benchmark#946) as a patch.
d9d33ff20e1e7767759fc07ea97c1e661716f26a