
Openmp compatibility #763

Merged
merged 16 commits into from
Apr 9, 2019

Conversation

@bryan-lunt (Contributor)

This patch makes Google Benchmark compatible with OpenMP and other user-level thread management.
Until now, Google Benchmark reported only the CPU usage of the master thread whenever the code being benchmarked used OpenMP or otherwise spawned multiple threads internally.

This version reports the total process CPU usage when the number of Google Benchmark threads is set to <= 1, but keeps the existing behaviour otherwise.

It may actually be preferable to report the total process CPU usage in all cases, but this is sufficient for my needs.

We have been using Google Benchmark in our parallel programming class; every term, students are confused when the reported CPU time merely mirrors the wall-clock time for parallelized code doing the same amount of work. This version is also advantageous because it better demonstrates the overhead of threading: some tasks take more total CPU time when multi-threaded, and some actually take less.

If my feature patch cannot be merged, I would like to request that the maintainers implement this. It is very important to us.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

@bryan-lunt (Contributor, Author)

I signed it! (Signed CLA.)

@googlebot

CLAs look good, thanks!

@LebedevRI self-requested a review February 11, 2019 07:41

@LebedevRI (Collaborator) left a comment

Hmm, I would very much like this.
I've been thinking about this before, and I wonder if this is almost the right solution.

But before going any further, I have a question:
@dominichamon @EricWF do we agree that for the normal case of Google Benchmark threads,
ThreadCPUUsage() would be equivalent to the sum of the per-thread ThreadCPUUsage() values?

@LebedevRI (Collaborator)

But before going any further, I have a question:
@dominichamon @EricWF do we agree that for the normal case of Google Benchmark threads,
ThreadCPUUsage() would be equivalent to the sum of the per-thread ThreadCPUUsage() values?

As in, are we afraid some other user-created thread will mess the measurements up?
Or that the results won't be equivalent due to some implementation detail of the timers
themselves (in the kernel, etc.)?

@dmah42 (Member)

dmah42 commented Feb 12, 2019

The patch seems reasonable to me. I ran it side-by-side with the master version and didn't see much difference in the basic tests (which use threads), so I might be missing where the benefit is. Could you demonstrate the problem with a code snippet or some output?

test/user_counters_test.cc:241: CheckAvgThreadsRate: Check `std::fabs((e.GetCounterAs< double >("foo")) - ((1. / e.DurationCPUTime()))) < ((0.001) * (1. / e.DurationCPUTime()))' failed.
test/user_counters_test.cc:241: BM_Counters_AvgThreadsRate/threads:2:
test/user_counters_test.cc:241: expected (double)foo=88.0367 to be EQ to 133.142
test/user_counters_test.cc:241: with tolerance of 0.133142 (0.1%), but delta was -45.1053 (-33.8776%)

Looks like some tests need updating :)

@bryan-lunt (Contributor, Author)

@dominichamon

I created a project that demonstrates the need for this patch:

https://github.com/bryan-lunt-supercomputing/gbench_threads_patch_need_demo

The reason you see nothing different is that you are benchmarking things that are threaded by Google Benchmark itself. Google Benchmark does not give correct results when you benchmark code that spins up and shuts down its own threads, for example any OpenMP code.

@bryan-lunt (Contributor, Author)

If you tried to use the current Google Benchmark built-in threads to benchmark an OpenMP code, you'd get something much worse than nonsense CPU times: each GB thread would call the function, which internally starts up its own thread pool.
So you'd have N_OMP_THREADS * N_GB_THREADS threads at once.

I've been using my own "num_threads" parameter, but if wishes were fishes, it would be nice to be able to query the total number of threads from within a benchmark, and to have something like "set_user_managed_threads" that results in only one Google Benchmark thread while still letting the benchmark ask how many threads were requested.

That way we could have one and only one "threads" parameter, which I think would make interpreting the final output easier.

@LebedevRI (Collaborator)

If you tried to use the current Google Benchmark built-in threads to benchmark an OpenMP code, you'd get something much worse than nonsense CPU times: each GB thread would call the function, which internally starts up its own thread pool.

Yep, that would make no sense in general case.

The current question remains (I have not heard back from @dominichamon / @EricWF): what
//should// the CPU timer report, the total time spent by all threads of the benchmark
function, or, as it currently does, the time spent by the "main" thread? It may be that there is no
single right solution, and that a switch like ->UseRealTime() (->UseProcessorTime()?) should be added.

...
That way, we could have one and only one "threads" parameter, which I think would make interpreting the final output easier.

No. Nothing like that should be done.
If user code uses its own threads, there is no sane way to generalize that and display
it in the report. There simply can be no guarantee of anything, unless you want to put arbitrary
restrictions on the user threading. E.g. part of the benchmark function may be limited to 4 threads,
and then a second part is not limited by the thread count.

@bryan-lunt (Contributor, Author)

No. Nothing like that should be done.
If user code uses its own threads, there is no sane way to generalize that and display
it in the report. There simply can be no guarantee of anything, unless you want to put arbitrary
restrictions on the user threading. E.g. part of the benchmark function may be limited to 4 threads,
and then a second part is not limited by the thread count.

Good point.

@dmah42 (Member)

dmah42 commented Feb 13, 2019

Thanks for the motivating example, that helps.

The current code supports the following concept:

  • user code can be run on a single thread, or on multiple threads in parallel by the benchmark framework
  • the benefit (and scaling limitations) of threading is then observed by the thread-0 (main thread) processing time staying roughly constant until we hit CPU pressure, while the items-per-second throughput increases.

Your motivation, and the more general user-thread issue, isn't something we've considered. We already have a 'manual timer' that tells the library to get out of the way, which sets a precedent for having a 'this is user threaded' setting, but that's clunky.

I think @LebedevRI is on to something already with #671 and other discussions about sorting out our timing so we are explicit about per-(benchmark)-thread time vs total process time.

@LebedevRI (Collaborator)

Your motivation, and the more general user thread issue, isn't something we've considered. We already have a 'manual timer' that tells the library to get out of the way, which sets a precedent for having a 'this is user threaded' setting, but that's clunky.

To give an example: (from IRC)

<LebedevRI> here, an example in the commit msg: https://github.com/darktable-org/rawspeed/commit/eda7972db1b469548b9bfdd054b5c58f5830e230#diff-e03d58ca0e0ab7c09683945bc5c6e825
<dhamon> in benchmark-threaded cases they also wouldn't be the same.
<LebedevRI> "Time" is wallclock, "CPU" is the time spent by main thread, "CPUTime,s" is the actual cpu time spent by all threads
<LebedevRI> iirc
<dhamon> and cpu/wallclock is, as expected with 8 threads, ~8.

I think @LebedevRI is on to something already with #671 and other discussions about sorting out our timing so we are explicit about per-(benchmark)-thread time vs total process time.

I currently think it would be possible to solve this by simply measuring this third timer
and using it instead of the "time of main thread" for the "CPU" column, in much the same way
as ->UseRealTime() and ->UseManualTime(), if the user requests it.

I do not think it is worth adding a third column to the console reporter.

@pleroy (Contributor)

pleroy commented Feb 13, 2019

I have wanted this feature for years. I have code where the main thread does nothing but fork a bunch of threads that do CPU-intensive work and synchronize at the end. At the moment the output looks like this (the second parameter of the benchmark is the number of threads used):

------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                       Time         CPU Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1 102495353 ns        0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/2  77841192 ns   156001 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/3  49749029 ns        0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/4  46776591 ns        0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/5  46512723 ns   156001 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

The CPU time reported here is completely useless. I would really like to have a proper display of the total CPU time consumed by all threads, as that would give me a sense of the overhead of the multithreading.

@bryan-lunt (Contributor, Author)

bryan-lunt commented Feb 13, 2019

Now I'm confused: does UseRealTime() put wall-clock time into the "Time" column, or not?

In general, for parallel codes (by which I mean user-managed parallelism within the one function, as in my example) we are primarily interested in the wallclock time and secondarily interested in the CPU time.

@bryan-lunt (Contributor, Author)

bryan-lunt commented Feb 13, 2019

It seems like there are two choices here: a measurement choice and a reduction choice.

  • Measurement choice: does an individual GB thread report its own CPU time, or the CPU time of itself plus all of its subthreads? (Or something else?)
  • Reduction choice: when putting together the final report, does the CPU time show the smallest, largest, mean, or sum of the individual GB threads' CPU values?

@LebedevRI (Collaborator)

I think @LebedevRI is on to something already with #671 and other discussions about sorting out our timing so we are explicit about per-(benchmark)-thread time vs total process time.

I have now looked, and while we could rename the "Time" column in reports to "Wall",
I do not think this would universally improve things, because the "manual time"
is displayed in that "Time" column, not the "CPU" column.
So I think this will, again, have to wait for custom timers.

Now I'm confused, does UseRealTime() get wallclock time into the "Time" column, or not?

UseRealTime() does not change which timer is displayed in which column.
UseRealTime() only tells the library to use the wallclock (instead of the "CPU" time,
i.e. the time of the main thread) to decide for how long to run the benchmark.

In general, for parallel codes (by which I mean user-managed parallelism within the one function, as in my example) we are primarily interested in the wallclock time and secondarily interested in the CPU time.

Yes, I understand the need; I have one such case myself.

It seems like there are two choices here: a measurement choice and a reduction choice.
...

It looks like I will need to help with this code.

@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

Googlers can find more info about SignCLA and this PR by following this link.

@LebedevRI (Collaborator)

Updated with all the stuff!!1 Looks like the CLA bot loves such pushes :S
Now here is an interesting question: now that I have worked on this code, can I still review the PR? :D
I'd guess not really; perhaps @dominichamon might have a moment.

@bryan-lunt (Contributor, Author)

So who has to sign off on the CLA now?

@LebedevRI (Collaborator)

So who has to sign off on the CLA now?

Mine's signed; it's a 'known' glitch, I believe.

@coveralls

coveralls commented Feb 17, 2019

Coverage Status

Coverage decreased (-48.5%) to 41.2% when pulling 5c9a54f on bryan-lunt-supercomputing:openmp-compatibility into 30bd6ea on google:master.

@dmah42 (Member)

dmah42 commented Feb 18, 2019

It's up to me to check that the commits are all by one of the signed folks (the CLA bot doesn't like PRs from multiple people).

It looks like this satisfies the original request, but at the cost of increased API complexity. Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (The answer might be no, but I should ask.)

@LebedevRI (Collaborator)

It looks like this satisfies the original request,

Yes.

but at the increase in complexity of the API.

Yes :(

Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (the answer might be no, but i should ask).

The alternative would be to always use the whole-process CPU clock instead of the
main-thread-only CPU clock when ->Threads() wasn't called.
Pros:

  • no need for all that extra API
  • no /process_time suffix in benchmark name.

Cons:

  • Such a change will change the timer used by benchmarks that do not use ->Threads().
    If any of these benchmarks use threads internally, the reported time will naturally be different.
    Do we know that this is what everyone affected actually wanted?
    Maybe they deliberately relied on the current timing method.
    Also, there will be no way to switch back to the previous (current) timer method.

The other alternative is likely, again, the good old 'custom timers'. But that feature is nowhere near ready.

@dmah42 (Member) left a comment

It looks like this satisfies the original request,

Yes.

Hooray!

but at the increase in complexity of the API.

Yes :(

Awww.

Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (the answer might be no, but i should ask).

The alternative would be to always use whole-process cpu clock instead of the
main-thread-only cpu clock when the ->Threads() wasn't called.

ok...

Pros:

  • no need for all that extra API
  • no /process_time suffix in benchmark name.

I like these...

Cons:

  • Such a change will change the timer used by benchmarks that do not use ->Threads().

Yes, but will it actually be different for the majority of users?

If any of these benchmarks used threads internally, the reported time will naturally be different.

It will, and more accurate (if we believe the premise of this PR, which i think we do).

Do we know that it is what everyone affected actually wanted?

Maybe not, but what it's doing today is a bit silly.

Maybe they actually used the current timing method.
Also, there will be no way to switch back to the previous (current) timer method.

I might be ok with this as long as only people who are using custom threads inside their benchmark are affected and we can make the case this PR does that this is a better measurement.

The other alternative is likely, again, the good old 'custom timers'. But that feature is nowhere near ready.

@@ -82,7 +82,11 @@ BenchmarkReporter::Run CreateRunReport(
} else {
report.real_accumulated_time = results.real_time_used;
}
report.cpu_accumulated_time = results.cpu_time_used;
if(b.threads <= 1){
Member


Please stick to the surrounding style; it's Google style per clang-format. Could you please format it?

Collaborator


(You are looking at an old diff)

@LebedevRI (Collaborator)

Not sure what these obvious changes are, but "getting ready for submit" sgtm.

None.
I'm just saying that this will introduce some new public knobs that we plan(?) on dropping the very next release.

I'm ok with either. It would be nicer for users, I think, if they could choose between the existing behaviour and the new behaviour and we just flipped the default between releases, but further up in this PR you suggest that reverting to the old behaviour once we introduce process timing is not possible.

I'm guessing you are talking about #763 (comment)?
That only talks specifically about dynamically deciding which CPU clock to use depending on whether ->Threads() was used or not, with no API functionality to switch between them.

So I'm still not sure what the step after this PR will be. I'll try to elaborate on the possibilities I see:

  • Keep everything as-is in this PR
  • Reverse the polarity.
    • Tame variant:
      • By default measure process clock.
      • Switch json/csv field name.
      • Don't add /process_time suffix by default.
      • If user asked to measure main-thread time, then add /thread_time suffix. (and write it into old csv/json field?)
    • Wild variant:
      • Revert this PR
      • simply switch the clock (from thread to process -wide)
      • Switch json/csv field name.
  • ???

@LebedevRI LebedevRI force-pushed the openmp-compatibility branch from def9e0c to 0b2b11e Compare April 8, 2019 09:46
@LebedevRI (Collaborator)

Rebased; should be good to go if it looks good to @dominichamon.

@dmah42 (Member)

dmah42 commented Apr 8, 2019

Not sure what these obvious changes are, but "getting ready for submit" sgtm.

None.
I'm just saying that this will introduce some new public knobs that we plan(?) on dropping the very next release.

I'm ok with either. It would be nicer for users, I think, if they could choose between the existing behaviour and the new behaviour and we just flipped the default between releases, but further up in this PR you suggest that reverting to the old behaviour once we introduce process timing is not possible.

I'm guessing you are talking about #763 (comment)?
That only talks specifically about dynamically deciding which CPU clock to use depending on whether ->Threads() was used or not, with no API functionality to switch between them.

So I'm still not sure what the step after this PR will be. I'll try to elaborate on the possibilities I see:

  • Keep everything as-is in this PR

  • Reverse the polarity.

    • Tame variant:

      • By default measure process clock.
      • Switch json/csv field name.
      • Don't add /process_time suffix by default.
      • If user asked to measure main-thread time, then add /thread_time suffix. (and write it into old csv/json field?)

After the next release, do the Tame variant above (which would require a command-line/benchmark flag).

  • Wild variant:

    • Revert this PR
    • simply switch the clock (from thread to process -wide)
    • Switch json/csv field name.
  • ???

I wish we had tracking for flags... if we saw that the flag in the Tame variant went unused, I'd push for this the release after.

README.md (resolved)
src/thread_timer.h (outdated, resolved)
src/thread_timer.h (outdated, resolved)
@dmah42 dmah42 merged commit 7a1c370 into google:master Apr 9, 2019
@dmah42 (Member)

dmah42 commented Apr 9, 2019

Thanks!

@LebedevRI (Collaborator)

YAY

@bryan-lunt (Contributor, Author)

WOOHOO!
Thanks everyone, especially @LebedevRI and @dominichamon .
