
Fix reported time for multithreaded benchmarks #946

Closed
wants to merge 2 commits into from

Conversation

koct9i
Contributor

@koct9i koct9i commented Mar 15, 2020

In the report, real and manual time per iteration are erroneously divided by the thread count,
because the total time is already adjusted to per-thread while the iterations counter is not.
See #769

Process CPU time is special - each thread reports the time spent by all threads,
thus it actually must be divided by the thread count once to get the total CPU time and
a second time to get the per-thread CPU time.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@koct9i
Contributor Author

koct9i commented Mar 15, 2020

@googlebot I signed it!

@googlebot

CLAs look good, thanks!


@LebedevRI
Collaborator

Thank you for looking into this. With multi-threaded benchmarks, it is hard to tell whether
the currently-reported numbers are correct or not, as noted in the mentioned #769.

@koct9i
Contributor Author

koct9i commented Mar 15, 2020

The bug in the current reports is pretty obvious for a trivial benchmark which spins or sleeps for a fixed amount of time.
It seems the existing test/internal_threading_test.cc could be used for that, but it needs validation.
Writing checks for it using "output_test" is a pain =) I'll play with a custom reporter; that should be easier.

@koct9i koct9i force-pushed the fix-multithreaded-time-report branch 3 times, most recently from 9bad2b6 to bc9a6a2 Compare March 24, 2020 13:49
@koct9i koct9i force-pushed the fix-multithreaded-time-report branch 5 times, most recently from 4f5daec to c39ee1c Compare March 29, 2020 10:26
@koct9i koct9i force-pushed the fix-multithreaded-time-report branch from c39ee1c to d97af0a Compare March 30, 2020 11:57
@koct9i
Contributor Author

koct9i commented Mar 31, 2020

Oh, I get it - the spin-wait in the test watches real time. 🤦‍♂️

@koct9i koct9i force-pushed the fix-multithreaded-time-report branch 6 times, most recently from b4e206a to ffac743 Compare March 31, 2020 10:10
@koct9i
Contributor Author

koct9i commented Apr 7, 2020

Bump. Ready for review/merge.

@LebedevRI
Collaborator

LebedevRI left a comment

Now that the fresh release is out the door:

Overall, this looks good to me.
I believe this is moving in the right direction,
but i'm not all that familiar with ->Threads().

@dominichamon does this look obviously-wrong
given the initial ->Threads() design,
and previous discussions and #769?

Comment on lines +913 to +918
For multithreaded benchmarks the timer measures the average among the working threads.

If the benchmark itself uses threads internally, this measurement may not
be what you are looking for. Instead, there is a way to measure the total
CPU usage of the process, by all the threads. The resulting total CPU time
will also be divided among the working threads.
Collaborator

I wonder if this is too vague. Maybe this should mention that it is specifically about https://github.com/google/benchmark#multithreaded-benchmarks, not the threads inside of the workload itself.

@dmah42 dmah42 self-assigned this Jun 17, 2020
@dmah42
Member

dmah42 commented Jun 17, 2020

the code seems reasonable. The threads division has changed back and forth a few times (as i mentioned in #769).

i'm just running a diff between the head version and this change to see what the output looks like for tests.

@dmah42
Member

dmah42 commented Jun 17, 2020

This is the before

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_spin_empty/8/threads:4                           23.0 ns         90.2 ns      7967324
BM_spin_empty/512/threads:4                         1367 ns         4908 ns       158756
BM_spin_empty/8192/threads:4                       22378 ns        79529 ns         9364

This is the after

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
BM_spin_empty/8/threads:4                            168 ns         95.0 ns     33131036
BM_spin_empty/512/threads:4                         9492 ns         4960 ns       512840
BM_spin_empty/8192/threads:4                      155608 ns        80357 ns        31952

Before this change, it's clear that the time is the end-to-end time given the parallelisation (23 ~= 90/4, 1300 ~= 4900/4, etc). In the latter case i find it hard to understand why we're ~doubling the reported time.

another example:

before

BM_pause_during                                      804 ns          796 ns       798884
BM_pause_during/threads:4                            355 ns         1221 ns       568668
BM_pause_during/real_time                            972 ns          970 ns       848250
BM_pause_during/real_time/threads:4                  361 ns         1250 ns      1890216

after

BM_pause_during                                     1248 ns         1188 ns       594011
BM_pause_during/threads:4                           2873 ns         1297 ns      2099924
BM_pause_during/real_time                           1326 ns         1196 ns       552044
BM_pause_during/real_time/threads:4                 2357 ns         1300 ns       975240

again i find it hard to understand why the threading appears to cause the measured time to double. the former is what i'd expect: ~4x improvement by using 4 threads.

@koct9i
Copy link
Contributor Author

koct9i commented Jun 18, 2020

My patch normalizes all times to 1 iteration in 1 thread. So, if the threads run the whole time, the "real" time should be equal to the "CPU" time or slightly bigger. With perfect parallelism the times should stay the same regardless of the thread count.

In your case the "real" time is much bigger than the "CPU" time, which means the threads were preempted. I guess you are using macOS? These artifacts were probably caused by its scheduler.

For Linux:

Before

BM_spin_empty/8                                     43.2 ns         43.2 ns     16180998
BM_spin_empty/512                                   2448 ns         2448 ns       285791
BM_spin_empty/8192                                 39142 ns        39138 ns        17843
BM_spin_empty/8/threads:8                           11.8 ns         94.0 ns      7782976
BM_spin_empty/512/threads:8                          708 ns         5638 ns       125112
BM_spin_empty/8192/threads:8                       11666 ns        92589 ns         7576

After

BM_spin_empty/8                                     46.8 ns         46.8 ns     14851377
BM_spin_empty/512                                   2638 ns         2638 ns       263970
BM_spin_empty/8192                                 42235 ns        42222 ns        16626
BM_spin_empty/8/threads:8                            109 ns          109 ns     54453880
BM_spin_empty/512/threads:8                         6212 ns         6201 ns       911128
BM_spin_empty/8192/threads:8                      101850 ns       101688 ns        55528

As you can see, in 'Before' the 'real' time was erroneously divided by 8 for the per-CPU threads.

Performance with per-CPU threads is around twice lower because this is a 4-core CPU with hyper-threading, and turbo boost cannot reach the top frequency when all cores are active.

koct9i added 2 commits June 18, 2020 12:50
Right now "real" and "manual" time are erroneously divided by the thread count,
while they are also divided by the iterations count, which already includes the thread count.

The iteration count and the measured time are reported by each thread and summed.
Thus total_time / total_iterations gives the average time per iteration.
The only special case is MeasureProcessCPUTime() - each thread reports the
time spent by all threads, thus the result must be divided by the thread count.

The duration of each run (repetition) is total_time / nr_threads,
or time_per_iteration * total_iterations / nr_threads.

Fixes: google#769

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Run 10 iterations of a 50ms busyloop (by thread CPU time) and
check the resulting times with absolute and relative accuracy of ±10%.

If the system does not have enough CPUs (1 for each thread + 1 for infrastructure),
then check only that the measured real time is not less than expected.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
@koct9i koct9i force-pushed the fix-multithreaded-time-report branch from fb4cc71 to 89c72bf Compare June 18, 2020 09:51
@dmah42
Member

dmah42 commented Jun 18, 2020

I think this change is correct. The wall-time (Time) and CPU time should match in terms of per-thread vs global. The benefit of threading should be visible when the real and CPU times don't change but the amount of processing (items per second, etc) increases. As noted, if threading does lead to a change in timing it indicates where the parallelism breaks down.

@krzikalla

krzikalla commented Aug 14, 2020

Since it is related, I want to add another problem to this issue: the real time currently measures the average wall-clock time of every single thread. It should rather measure the wall-clock time of the thread pack. That is, the actual real time is the time when the last thread enters FinishKeepRunning. Only by that means can you spot balancing issues.
Example: if one thread runs 1 sec and 7 threads run 1 ms each, then currently 0.125 sec real time is reported. This is not the wall-clock time! The wall-clock time is 1 sec.
Solution: one could move the barrier in FinishKeepRunning before PauseTiming, but that would tamper with the timing. Instead, don't sum up the real times, but use the maximum one at

results.real_time_used += timer.real_time_used();

Of course you must not divide the real time by num_threads then.

@dmah42
Member

dmah42 commented Aug 18, 2020

@krzikalla would it be ok to defer the wall-clock time change to a separate PR or do you want to solve that as part of this change?

@koct9i
Contributor Author

koct9i commented Aug 18, 2020

@krzikalla

Example: if one thread runs 1 sec and 7 threads run 1 ms each, then currently 0.125 sec real time is reported. This is not the wall-clock time! The wall-clock time is 1 sec.

This seems odd. If each thread did the same single iteration, this means the execution speed is unpredictable.
Instead of forging "sane" metrics, the benchmark should measure and report the deviation among iterations.

The present design of the benchmark loop could easily estimate the deviation among the per-thread averages.
To catch variations within a thread it would have to measure time periodically or compare progress.

@krzikalla

would it be ok to defer the wall-clock time change to a separate PR or do you want to solve that as part of this change?

It can be a separate PR. I just wanted to raise awareness for this issue and thought, it would be better to add it to this discussion here instead of creating a new one.

This seems odd. If each thread did the same single iteration, this means the execution speed is unpredictable.

Not necessarily. Other possible scenarios are poor balancing or a hit by Amdahl's law.

Instead of forging "sane" metrics, the benchmark should measure and report the deviation among iterations.

The present design of the benchmark loop could easily estimate the deviation among the per-thread averages.
To catch variations within a thread it would have to measure time periodically or compare progress.

Normally, you don't report deviations. You report wall-clock time, since that affects the usability of the code.

@dmah42
Member

dmah42 commented Apr 27, 2021

I'm returning to this, sorry for the break. Do you think there's more to do here or are you happy with it?

@dmah42 dmah42 added the incomplete work needed label May 6, 2021
@dmah42 dmah42 closed this May 30, 2021
@dmah42 dmah42 deleted the branch google:master May 30, 2021 08:53
@LebedevRI
Collaborator

This was fixed at the source in #1836.

robot-piglet added a commit to ytsaurus/ytsaurus that referenced this pull request Aug 23, 2024
As [#1836](google/benchmark#1836) has landed into upstream, there is no need to keep [#946](google/benchmark#946) as a patch.
d9d33ff20e1e7767759fc07ea97c1e661716f26a
robot-piglet added a commit to ytsaurus/ytsaurus-cpp-sdk that referenced this pull request Aug 23, 2024
As [#1836](google/benchmark#1836) has landed into upstream, there is no need to keep [#946](google/benchmark#946) as a patch.
d9d33ff20e1e7767759fc07ea97c1e661716f26a
robot-piglet added a commit to catboost/catboost that referenced this pull request Aug 23, 2024
As [#1836](google/benchmark#1836) has landed into upstream, there is no need to keep [#946](google/benchmark#946) as a patch.
d9d33ff20e1e7767759fc07ea97c1e661716f26a