Openmp compatibility #763
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. "I signed it!") and we'll verify it. |
I signed it! (Signed CLA.) |
CLAs look good, thanks! |
Hmm, I would very much like this.
I've been thinking about this before, and I wonder if this is almost the right solution.
But before going any further, I have a question:
@dominichamon @EricWF do we agree that, for the normal case of Google Benchmark threads, ProcessCPUUsage() would be equivalent to the sum of the per-thread ThreadCPUUsage() values?
As in, are we afraid some other user-created thread will mess the measurements up? |
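For readers less familiar with the clocks under discussion, here is a minimal POSIX sketch (not library code) contrasting the whole-process CPU clock with the per-thread CPU clock that ProcessCPUUsage()/ThreadCPUUsage() conceptually correspond to; the helper names below are illustrative, not the library's.

```cpp
#include <time.h>   // POSIX clock_gettime
#include <cstdio>

// Whole-process CPU time: the sum over all threads in the process.
double ProcessCpuSeconds() {
  timespec ts;
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// CPU time of the calling thread only.
double ThreadCpuSeconds() {
  timespec ts;
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main() {
  std::printf("process CPU: %f s, this thread's CPU: %f s\n",
              ProcessCpuSeconds(), ThreadCpuSeconds());
}
```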
The patch seems reasonable to me. I ran it side-by-side with the master version and didn't see much difference in the basic tests (which use threads), so I might be missing where the benefit is. Could you demonstrate the problem with a code snippet or some output?
Looks like some tests need updating :) |
I created a project that demonstrates the need for this patch: https://github.com/bryan-lunt-supercomputing/gbench_threads_patch_need_demo The reason you see nothing different is that you are benchmarking things that are threaded by Google Benchmark itself. But Google Benchmark does not give correct results when you benchmark code that spins up and shuts down its own threads, for example any OpenMP code. |
If you tried to use the current Google Benchmark built-in threads to benchmark an OpenMP code, you'd get something much worse than just nonsense CPU times: each benchmark thread would call the function, which internally starts up its own thread pool. I've been using my own "num_threads" parameter, but if wishes were fishes, it would be nice to be able to ask for the total number of threads from within a benchmark, and to have something like "set_user_managed_threads", which results in only one Google Benchmark thread but lets the code inside ask how many threads were requested. That way we could have one and only one "threads" parameter, which I think would make interpreting the final output easier. |
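To make the failure mode concrete, here is a hedged sketch of the kind of benchmark being described (function names and data sizes are invented): the function under test owns its OpenMP parallelism, so it should be driven by exactly one benchmark thread.

```cpp
#include <benchmark/benchmark.h>

#include <vector>

// The function under test manages its own OpenMP thread pool.
static double OmpSum(const std::vector<double>& data) {
  double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < static_cast<long>(data.size()); ++i) sum += data[i];
  return sum;
}

static void BM_OmpSum(benchmark::State& state) {
  std::vector<double> data(1 << 20, 1.0);
  for (auto _ : state) {
    benchmark::DoNotOptimize(OmpSum(data));
  }
}

// Registered with a single benchmark thread; OpenMP provides the parallelism.
// Registering this with ->Threads(4) instead would make four benchmark threads
// each spin up their own OpenMP pool, and per-thread CPU timing would still
// miss the worker threads' time.
BENCHMARK(BM_OmpSum);

BENCHMARK_MAIN();
```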
Yep, that would make no sense in the general case. The current question remains (I have not heard back from @dominichamon / @EricWF): whether, for normal benchmark threads, the whole-process CPU time equals the sum of the per-thread CPU times.
No. Nothing like that should be done. |
Good point. |
Thanks for the motivating example, that helps. The current code supports the concept of threads that the library itself creates and manages.
Your motivation, and the more general user-thread issue, isn't something we've considered. We already have a 'manual timer' that tells the library to get out of the way, which sets a precedent for having a 'this is user-threaded' setting, but that's clunky. I think @LebedevRI is on to something already with #671 and other discussions about sorting out our timing so we are explicit about per-(benchmark)-thread time vs total process time. |
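For reference, the 'manual timer' precedent mentioned above looks roughly like this; a minimal sketch in which the timed region and benchmark name are placeholders.

```cpp
#include <benchmark/benchmark.h>

#include <chrono>

static void BM_ManuallyTimed(benchmark::State& state) {
  for (auto _ : state) {
    auto start = std::chrono::steady_clock::now();
    // ... user-managed work, possibly on user-managed threads ...
    auto end = std::chrono::steady_clock::now();
    // Report the measured time; the library does not time this itself.
    state.SetIterationTime(
        std::chrono::duration<double>(end - start).count());
  }
}
BENCHMARK(BM_ManuallyTimed)->UseManualTime();
```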
To give an example (from IRC): I personally think it would currently be possible to solve this by simply measuring this third timer; I do not think it is worth adding a third column to the console reporter. |
I have wanted this feature for years. I have code where the main thread does nothing but fork a bunch of threads that do CPU-intensive work and synchronize at the end. At the moment the output looks like this (the second parameter of the benchmark is the number of threads used):
The CPU time reported here is completely useless. I would really like a proper display of the total CPU time consumed by all threads, as that would give me a sense of the overhead of the multithreading. |
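A hedged sketch of that situation (names and workload invented): the main benchmark thread only forks and joins workers, so a main-thread-only CPU clock attributes almost none of the real work.

```cpp
#include <benchmark/benchmark.h>

#include <thread>
#include <vector>

// Invented CPU-bound workload.
static void BusyWork(double* out) {
  double x = 0.0;
  for (int i = 0; i < 1000000; ++i) x += i * 1e-9;
  *out = x;
}

static void BM_ForkJoin(benchmark::State& state) {
  const int num_workers = static_cast<int>(state.range(0));
  for (auto _ : state) {
    std::vector<double> results(num_workers);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i)
      workers.emplace_back(BusyWork, &results[i]);
    for (auto& t : workers) t.join();  // the main thread mostly just waits
    benchmark::DoNotOptimize(results.data());
  }
}
// The argument plays the role of the "number of threads" parameter above.
BENCHMARK(BM_ForkJoin)->Arg(2)->Arg(4)->Arg(8);
```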
Now I'm confused: does UseRealTime() get wall-clock time into the "Time" column, or not? In general, for parallel codes (by which I mean user-managed parallelism within the one function, as in my example) we are primarily interested in the wall-clock time and secondarily interested in the CPU time.
It seems like there is both a choice of which clock to measure and a choice of how to reduce it across threads here. |
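As a point of reference, this is roughly how UseRealTime() is applied to user-parallel code today; my reading of the documentation is that the "Time" column is wall-clock either way, and UseRealTime() makes that clock the one the library bases its decisions on.

```cpp
#include <benchmark/benchmark.h>

static void BM_UserParallel(benchmark::State& state) {
  for (auto _ : state) {
    // ... work parallelized inside the function (OpenMP, std::thread, ...) ...
    benchmark::ClobberMemory();
  }
}
// Iteration decisions follow the wall clock rather than the main thread's CPU
// clock, which is usually what you want when the function is user-parallel.
BENCHMARK(BM_UserParallel)->UseRealTime();
```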
I have now looked; while we could rename the "Time" column in reports to "Wall", I am not sure that alone addresses this.
Yes, I understand the needs; I have one such case myself.
It looks like I will need to help with this code. |
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request. Note to project maintainer: This is a terminal state; the Googlers can find more info about SignCLA and this PR. |
Updated with all the stuff!!1 Looks like cla bot loves such pushes :S |
So who has to sign off on the CLA now? |
Mine's signed; it's a 'known' glitch, I believe. |
It's up to me to check that the commits are all by one of the signed folks (the CLA bot doesn't like PRs from multiple people). It looks like this satisfies the original request, but at the increase in complexity of the API. Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (The answer might be no, but I should ask.) |
Yes.
Yes :(
The alternative would be to always use the whole-process CPU clock instead of the main-thread-only CPU clock when ->Threads() wasn't called.
Pros:
- no need for all that extra API
- no /process_time suffix in the benchmark name
Cons:
- Such a change will change the timer used by benchmarks that do not use ->Threads(). If any of these benchmarks used threads internally, the reported time will naturally be different. Do we know that it is what everyone affected actually wanted? Maybe they actually relied on the current timing method. Also, there will be no way to switch back to the previous (current) timer method.
The other alternative is likely again the good old 'custom timers'. But that feature is nowhere near. |
>> It looks like this satisfies the original request,
> Yes.

Hooray!

>> but at the increase in complexity of the API.
> Yes :(

Awww.

>> Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (the answer might be no, but I should ask).
> The alternative would be to always use the whole-process CPU clock instead of the main-thread-only CPU clock when ->Threads() wasn't called.

ok...

> Pros:
> - no need for all that extra API
> - no /process_time suffix in the benchmark name

I like these...

> Cons:
> - Such change will change the timer used by benchmarks that do not use ->Threads().

Yes, but will it actually be different for the majority of users?

> If any of these benchmarks used threads internally, the reported time will naturally be different.

It will, and more accurate (if we believe the premise of this PR, which I think we do).

> Do we know that it is what everyone affected actually wanted?

Maybe not, but what it's doing today is a bit silly.

> Maybe they actually used the current timing method.
> Also, there will be no way to switch back to the previous (current) timer method.

I might be ok with this as long as only people who are using custom threads inside their benchmark are affected, and we can make the case (as this PR does) that this is a better measurement.

> The other alternative is likely again the good old 'custom timers'. But that feature is nowhere near.
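To illustrate the compatibility concern in the Cons above, here is a hypothetical benchmark that never calls ->Threads() but keeps a helper thread busy; defaulting to the process CPU clock would fold that thread's time into the CPU column and change existing numbers.

```cpp
#include <benchmark/benchmark.h>

#include <atomic>
#include <thread>

static void BM_WithBackgroundThread(benchmark::State& state) {
  std::atomic<bool> stop{false};
  // Hypothetical helper thread that burns CPU for the whole benchmark run.
  std::thread helper([&stop] {
    while (!stop.load(std::memory_order_relaxed)) {
      // spin
    }
  });
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.iterations());
  }
  stop.store(true);
  helper.join();
}
// No ->Threads() here: under the proposed default, the helper's CPU time
// would now show up in the CPU column.
BENCHMARK(BM_WithBackgroundThread);
```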
src/benchmark_runner.cc (outdated diff)
```
@@ -82,7 +82,11 @@ BenchmarkReporter::Run CreateRunReport(
  } else {
    report.real_accumulated_time = results.real_time_used;
  }
  report.cpu_accumulated_time = results.cpu_time_used;
  if(b.threads <= 1){
```
Please stick to the surrounding style; it's Google style per clang-format, if you could please format it.
(You are looking at an old diff)
None.
I'm guessing you are talking about #763 (comment)? So I'm still not sure what the step after this PR will be. I'll try to elaborate on the possibilities I see: |
(Branch force-pushed from def9e0c to 0b2b11e.)
Rebased, should be good to go if it looks good to @dominichamon. |
After the next release, do the Tame variant above (which would require a command-line/benchmark flag).
I wish we had tracking for flags... if we saw that the flag in the Tame variant went unused, I'd push for this the release after. |
Thanks! |
YAY |
WOOHOO! |
This patch makes Google Benchmark compatible with OpenMP and other user-level thread management.
Until now, Google Benchmark would report only the CPU usage of the master thread if the code being benchmarked used OpenMP or otherwise spawned multiple threads internally.
This version reports the total process CPU usage if the number of Google Benchmark threads is set to <= 1, but reverts to the existing behaviour otherwise.
It may actually be preferable to report the total process CPU usage in all cases, but this is sufficient for my needs.
We have been using Google Benchmark in our parallel programming class; however, every term students are confused when the reported CPU time roughly matches the wall-clock time for parallelized code doing the same amount of work. This version is also advantageous because it can better demonstrate the overhead of threading: some tasks take more total CPU time when multi-threaded, and, occasionally, tasks may actually take less overall CPU time.
If my feature patch cannot be merged, I would like to request that the maintainers implement this functionality. It is very important to us.
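A sketch of how this could look from the user side once merged. The MeasureProcessCPUTime() spelling is an assumption on my part (suggested by the "/process_time" suffix discussed above), not something this description promises.

```cpp
#include <benchmark/benchmark.h>

#include <vector>

static void BM_OmpKernel(benchmark::State& state) {
  const int n = static_cast<int>(state.range(0));
  std::vector<double> v(n, 1.0);
  for (auto _ : state) {
#pragma omp parallel for
    for (int i = 0; i < n; ++i) v[i] = v[i] * 2.0 + 1.0;
    benchmark::DoNotOptimize(v.data());
    benchmark::ClobberMemory();
  }
}
// One benchmark thread drives the OpenMP kernel: Time stays wall-clock, and
// the CPU column would show whole-process CPU time.
// MeasureProcessCPUTime() is a guess at the option's name (see /process_time above).
BENCHMARK(BM_OmpKernel)->Arg(1 << 20)->UseRealTime()->MeasureProcessCPUTime();

BENCHMARK_MAIN();
```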