Fix reported time for multithreaded benchmarks #946
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here.
@googlebot I signed it!
CLAs look good, thanks!
Thank you for looking into this. With multi-threaded benchmarks, it is hard to tell whether …
The bug in the current reports is pretty obvious for a trivial benchmark which spins or sleeps for a fixed amount of time.
Oh, I get it: the spin-wait in the test watches real time. 🤦♂️
Bump. Ready for review/merge.
Now that the fresh release is out of the door: overall, this looks good to me. I believe this is moving in the right direction, but I'm not all that familiar with ->Threads(). @dominichamon, does this look obviously wrong given the initial ->Threads() design and previous discussions, including #769?
For multithreaded benchmarks the timer measures the average among the working threads.

If the benchmark itself uses threads internally, this measurement may not be what you are looking for. Instead, there is a way to measure the total CPU usage of the process, by all the threads. The resulting total CPU time will also be divided among the working threads.
I wonder if this is too vague. Maybe this should mention that it is specifically about https://github.com/google/benchmark#multithreaded-benchmarks, not the threads inside the workload itself.
The code seems reasonable. The threads division has changed back and forth a few times (as I mentioned in #769). I'm just running a diff between the head version and this change to see what the output looks like for the tests.
This is the before
This is the after
Before this change, it's clear that the time is the end-to-end time given the parallelisation (23 ~= 90/4, 1300 ~= 4900/4, etc.). In the latter case I find it hard to understand why we're roughly doubling the reported time. Another example: before
after
Again, I find it hard to understand why the threading appears to cause the measured time to double. The former is what I'd expect: a ~4x improvement from using 4 threads.
My patch normalizes all times to 1 iteration in 1 thread. So, if the threads run the whole time, then "real" time should be equal to "CPU" time or slightly bigger. For perfect parallelism the times should stay the same regardless of the thread count. In your case "real" time is much bigger than "CPU" time, which means the threads were preempted. I guess you are using macOS? These artifacts were probably caused by its scheduler. For Linux: Before
After
As you can see, in "Before" the "real" time was erroneously divided by 8 for per-CPU threads. Performance for per-CPU threads is around twice as low because this is a 4-core CPU with hyper-threading, and turbo boost cannot reach its top frequency when all cores are active.
Right now "real" and "manual" time are erroneously divided by the thread count, while they are also divided by the iteration count, which already includes the thread count. The iteration count and measured time are reported by each thread and summed. Thus total_time / total_iterations gives the average time per iteration. The only special case is MeasureProcessCPUTime() - each thread reports the time spent by all threads, so the result must be divided by the thread count. The duration of each run (repetition) is total_time / nr_threads, or time_per_iteration * total_iterations / nr_threads. Fixes: google#769 Signed-off-by: Konstantin Khlebnikov <[email protected]>
Run 10 iterations of a 50ms busy-loop (by thread CPU time) and check the resulting times with an absolute and relative accuracy of ±10%. If the system does not have enough CPUs (1 for each thread + 1 for infrastructure), then only check that the measured real time is not less than expected. Signed-off-by: Konstantin Khlebnikov <[email protected]>
I think this change is correct. The wall-time (Time) and CPU time should match in terms of per-thread vs global. The benefit of threading should be visible when the real and CPU times don't change but the amount of processing (items per second, etc.) increases. As noted, if threading does lead to a change in timing, it indicates where the parallelism breaks down.
Since it is related, I want to add another problem to this issue: the real time currently measures the average wall-clock time of every single thread. It should rather measure the wall-clock time of the whole thread pack. That is, the actual real time is the time when the last thread enters FinishKeepRunning. Only by that means can you spot balancing issues. benchmark/src/benchmark_runner.cc Line 127 in 1302d2c
@krzikalla would it be OK to defer the wall-clock time change to a separate PR, or do you want to solve that as part of this change?
This seems odd. If each thread did the same single iteration, then this means the execution speed is unpredictable. The present design of the benchmark loop could easily estimate the deviation among the averages for each thread.
It can be a separate PR. I just wanted to raise awareness of this issue and thought it would be better to add it to this discussion instead of creating a new one.
Not necessarily. Other possible scenarios are poor balancing or being hit by Amdahl's law.
Normally, you don't report deviations. You report wall-clock time, since that affects the usability of the code.
I'm returning to this, sorry for the break. Do you think there's more to do here, or are you happy with it?
This was fixed at the source in #1836.
As [#1836](google/benchmark#1836) has landed into upstream, there is no need to keep [#946](google/benchmark#946) as a patch. d9d33ff20e1e7767759fc07ea97c1e661716f26a
In the report, real and manual time per iteration are divided by the thread count
because the total time is adjusted to per-thread while the iteration counter is not.
See #769.
Process CPU time is special: each thread reports the time spent by all threads,
so it actually must be divided by the thread count once to get the total CPU time and
a second time to get the per-thread CPU time.