
Openmp compatibility #763

Merged
merged 16 commits into from
Apr 9, 2019

Conversation

@bryan-lunt (Contributor)

This patch makes Google Benchmark compatible with OpenMP and other user-level thread management.
Until now, Google Benchmark reported only the CPU usage of the master thread whenever the code being benchmarked used OpenMP or otherwise spawned multiple threads internally.

This version reports the total process CPU usage when the number of Google Benchmark threads is set to <= 1, but keeps the existing behaviour otherwise.

It may actually be preferable to report the total process CPU usage in all cases, but this is sufficient for my needs.

We have been using Google Benchmark in our parallel programming class; every term, students are confused when the reported CPU time merely mirrors the wall-clock time for parallelized code doing the same amount of work. This version is also advantageous because it better demonstrates the overhead of threading: some tasks take more total CPU time when multi-threaded, and some actually take less.

If my feature patch cannot be merged, I would like to request that the maintainers implement this. It is very important to us.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

@bryan-lunt (Contributor, Author)

I signed it! (Signed CLA.)

@googlebot

CLAs look good, thanks!

@LebedevRI self-requested a review February 11, 2019 07:41

@LebedevRI (Collaborator) left a comment

Hmm, I would very much like this.
I've been thinking about this before, and I wonder if this is almost the right solution.

But before going any further, I have a question:
@dominichamon @EricWF do we agree that for the normal case of Google Benchmark threads,
ThreadCPUUsage() would be equivalent to the sum of the per-thread ThreadCPUUsage() values?

@LebedevRI (Collaborator)

But before going any further, I have a question:
@dominichamon @EricWF do we agree that for the normal case of Google Benchmark threads,
ThreadCPUUsage() would be equivalent to the sum of the per-thread ThreadCPUUsage() values?

As in, are we afraid some other user-created thread will mess the measurements up?
Or that the results won't be equivalent due to some implementation detail of the timers
themselves (in the kernel, etc.)?

@dmah42 (Member)

dmah42 commented Feb 12, 2019

The patch seems reasonable to me. I ran it side-by-side with the master version and didn't see much difference in the basic tests (which use threads), so I might be missing where the benefit is. Could you demonstrate the problem with a code snippet or some output?

test/user_counters_test.cc:241: CheckAvgThreadsRate: Check `std::fabs((e.GetCounterAs< double >("foo")) - ((1. / e.DurationCPUTime()))) < ((0.001) * (1. / e.DurationCPUTime()))' failed.
test/user_counters_test.cc:241: BM_Counters_AvgThreadsRate/threads:2:
test/user_counters_test.cc:241: expected (double)foo=88.0367 to be EQ to 133.142
test/user_counters_test.cc:241: with tolerance of 0.133142 (0.1%), but delta was -45.1053 (-33.8776%)

Looks like some tests need updating :)

@bryan-lunt (Contributor, Author)

@dominichamon

I created a project that demonstrates the need for this patch:

https://github.com/bryan-lunt-supercomputing/gbench_threads_patch_need_demo

The reason you see nothing different is that you are benchmarking things that are threaded by Google Benchmark itself. Google Benchmark does not give correct results when you benchmark code that spins up and shuts down its own threads, for example any OpenMP code.

@bryan-lunt (Contributor, Author)

If you tried to use the current Google Benchmark built-in threads to benchmark an OpenMP code, you'd get something much worse than nonsense CPU times: each GB thread would call the function, which internally starts up its own thread pool.
So you'd have N_OMP_THREADS * N_GB_THREADS threads at once.

I've been using my own "num_threads" parameter, but if wishes were fishes, it would be nice to be able to query the total number of threads from within a benchmark, and to have something like "set_user_managed_threads" that results in only one Google Benchmark thread while still letting the benchmark ask how many threads were requested.

That way we could have one and only one "threads" parameter, which I think would make interpreting the final output easier.

@LebedevRI (Collaborator)

If you tried to use the current Google Benchmark built-in threads to benchmark an OpenMP code, you'd get something much worse than nonsense CPU times: each GB thread would call the function, which internally starts up its own thread pool.

Yep, that would make no sense in general case.

The current question remains (I have not heard back from @dominichamon / @EricWF): what
//should// the CPU timer report, the total time spent by all threads of the benchmark
function, or, as it currently does, the time spent by the "main" thread? It may be that there is no
single right solution, and that a switch like ->UseRealTime() (->UseProcessorTime()?) should be added.

...
That way, we could have one and only one "threads" parameter, which I think would make interpreting the final output easier.

No. Nothing like that should be done.
If user code uses its own threads, there is no sane way to generalize that and display
it in the report. There simply can be no guarantee of anything, unless you want to put arbitrary
restrictions on the user threading. E.g. part of the benchmark function may be limited to 4 threads,
and then a second part is not limited by the thread count.

@bryan-lunt (Contributor, Author)

No. Nothing like that should be done.
If user code uses its own threads, there is no sane way to generalize that and display
it in the report. There simply can be no guarantee of anything, unless you want to put arbitrary
restrictions on the user threading. E.g. part of the benchmark function may be limited to 4 threads,
and then a second part is not limited by the thread count.

Good point.

@dmah42 (Member)

dmah42 commented Feb 13, 2019

Thanks for the motivating example, that helps.

The current code supports the following concept:

  • user code can be run on a single thread, or on multiple threads in parallel by the benchmark framework
  • the benefit (and scaling limitations) of threading is then observed by the thread-0 (main thread) processing time staying roughly constant until we hit CPU pressure, while the items-per-second throughput increases.

Your motivation, and the more general user-thread issue, isn't something we've considered. We already have a 'manual timer' that tells the library to get out of the way, which sets a precedent for having a 'this is user threaded' setting, but that's clunky.

I think @LebedevRI is on to something already with #671 and other discussions about sorting out our timing so we are explicit about per-(benchmark)-thread time vs total process time.

@LebedevRI (Collaborator)

Your motivation, and the more general user thread issue, isn't something we've considered. We already have a 'manual timer' that tells the library to get out of the way, which sets a precedent for having a 'this is user threaded' setting, but that's clunky.

To give an example: (from IRC)

<LebedevRI> here, an example in the commit msg: https://github.com/darktable-org/rawspeed/commit/eda7972db1b469548b9bfdd054b5c58f5830e230#diff-e03d58ca0e0ab7c09683945bc5c6e825
<dhamon> in benchmark-threaded cases they also wouldn't be the same.
<LebedevRI> "Time" is wallclock, "CPU" is the time spent by main thread, "CPUTime,s" is the actual cpu time spent by all threads
<LebedevRI> iirc
<dhamon> and cpu/wallclock is, as expected with 8 threads, ~8.

I think @LebedevRI is on to something already with #671 and other discussions about sorting out our timing so we are explicit about per-(benchmark)-thread time vs total process time.

I currently think it would be possible to solve this by simply measuring this third timer
and using it instead of the "time of main thread" for the "CPU" column, in much the same way
as ->UseRealTime() and ->UseManualTime(), if the user requests it.

I do not think it is worth adding a third column to the console reporter.

@pleroy (Contributor)

pleroy commented Feb 13, 2019

I have wanted this feature for years. I have code where the main thread does nothing but fork a bunch of threads that do CPU-intensive work and synchronize at the end. At the moment the output looks like this (the second parameter of the benchmark is the number of threads used):

------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                       Time         CPU Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1 102495353 ns        0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/2  77841192 ns   156001 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/3  49749029 ns        0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/4  46776591 ns        0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/5  46512723 ns   156001 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

The CPU time reported here is completely useless. I would really like to have a proper display of the total CPU time consumed by all threads, as that would give me a sense of the overhead of the multithreading.

@bryan-lunt (Contributor, Author)

bryan-lunt commented Feb 13, 2019

Now I'm confused: does UseRealTime() put wall-clock time into the "Time" column, or not?

In general, for parallel codes (by which I mean user-managed parallelism within the one function, as in my example) we are primarily interested in the wallclock time and secondarily interested in the CPU time.

@bryan-lunt (Contributor, Author)

bryan-lunt commented Feb 13, 2019

It seems like there are two choices here: a measurement choice and a reduction choice.

  • Measurement choice: does an individual GB thread report its own CPU time, or the CPU time of itself plus all of its subthreads? (Or something else?)
  • Reduction choice: when putting together the final report, does the CPU time show the smallest, largest, mean, or sum of the individual GB threads' CPU values?

@LebedevRI (Collaborator)

I think @LebedevRI is on to something already with #671 and other discussions about sorting out our timing so we are explicit about per-(benchmark)-thread time vs total process time.

I have now looked, and while we could rename the "Time" column in reports to "Wall",
I do not think this would universally improve things, because the "manual time"
is displayed in that "Time" column, not the "CPU" column.
So I think this will, again, have to wait for custom timers.

Now I'm confused, does UseRealTime() get wallclock time into the "Time" column, or not?

UseRealTime() does not change which timer is displayed in which column.
UseRealTime() only tells the library to use the wallclock (instead of the "CPU" time,
i.e. the time of the main thread) to decide for how long to run the benchmark.

In general, for parallel codes (by which I mean user-managed parallelism within the one function, as in my example) we are primarily interested in the wallclock time and secondarily interested in the CPU time.

Yes, I understand the need; I have one such case myself.

It seems like there are two choices here: a measurement choice and a reduction choice.
...

It looks like I will need to help with this code.

@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

Googlers can find more info about SignCLA and this PR by following this link.

@LebedevRI (Collaborator)

Updated with all the stuff!!1 Looks like the CLA bot loves such pushes :S
Now here is an interesting question: now that I have worked on this code, can I still review the PR? :D
I'd guess not really; perhaps @dominichamon might have a moment.

@bryan-lunt (Contributor, Author)

So who has to sign off on the CLA now?

@LebedevRI (Collaborator)

So who has to sign off on the CLA now?

Mine's signed; it's a 'known' glitch, I believe.

@coveralls

coveralls commented Feb 17, 2019

Coverage Status

Coverage decreased (-48.5%) to 41.2% when pulling 5c9a54f on bryan-lunt-supercomputing:openmp-compatibility into 30bd6ea on google:master.

@dmah42 (Member)

dmah42 commented Feb 18, 2019

It's up to me to check that the commits are all by one of the signed folks (the CLA bot doesn't like PRs from multiple people).

It looks like this satisfies the original request, but at the cost of increased API complexity. Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (The answer might be no, but I should ask.)

@LebedevRI (Collaborator)

It looks like this satisfies the original request,

Yes.

but at the increase in complexity of the API.

Yes :(

Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (the answer might be no, but i should ask).

The alternative would be to always use the whole-process CPU clock instead of the
main-thread-only CPU clock when ->Threads() wasn't called.
Pros:

  • no need for all that extra API
  • no /process_time suffix in benchmark name.

Cons:

  • Such a change will change the timer used by benchmarks that do not use ->Threads().
    If any of these benchmarks use threads internally, the reported time will naturally be different.
    Do we know that this is what everyone affected actually wanted?
    Maybe they deliberately relied on the current timing method.
    Also, there will be no way to switch back to the previous (current) timer method.

The other alternative is likely, again, the good old 'custom timers'. But that feature is nowhere near ready.

@dmah42 (Member) left a comment

It looks like this satisfies the original request,

Yes.

Hooray!

but at the increase in complexity of the API.

Yes :(

Awww.

Is there a way we can change the default behaviour to be something sensible and expected without adding another boolean option and benchmark method? (the answer might be no, but i should ask).

The alternative would be to always use whole-process cpu clock instead of the
main-thread-only cpu clock when the ->Threads() wasn't called.

ok...

Pros:

  • no need for all that extra API
  • no /process_time suffix in benchmark name.

I like these...

Cons:

  • Such a change will change the timer used by benchmarks that do not use ->Threads().

Yes, but will it actually be different for the majority of users?

If any of these benchmarks used threads internally, the reported time will naturally be different.

It will, and more accurate (if we believe the premise of this PR, which i think we do).

Do we know that it is what everyone affected actually wanted?

Maybe not, but what it's doing today is a bit silly.

Maybe they actually used the current timing method.
Also, there will be no way to switch back to the previous (current) timer method.

I might be ok with this as long as only people who are using custom threads inside their benchmark are affected and we can make the case this PR does that this is a better measurement.

The other alternative is likely, again, the good old 'custom timers'. But that feature is nowhere near ready.

@@ -82,7 +82,11 @@ BenchmarkReporter::Run CreateRunReport(
} else {
report.real_accumulated_time = results.real_time_used;
}
report.cpu_accumulated_time = results.cpu_time_used;
if(b.threads <= 1){
Member


Please stick to the surrounding style; it's Google style per clang-format. Could you please format it?

Collaborator


(You are looking at an old diff)

@LebedevRI (Collaborator)

Not sure what these obvious changes are, but "getting ready for submit" sgtm.

None.
I'm just saying that this will introduce some new public knobs that we plan(?) on dropping the very next release.

I'm ok with either. It would be nicer for users, I think, if they could choose between the existing behaviour and the new behaviour and we just flipped the default between releases, but further up in this PR you suggest that reverting to the old behaviour once we introduce process timing is not possible.

I'm guessing you are talking about #763 (comment)?
That only talks specifically about dynamically deciding which CPU clock to use depending on whether ->Threads() was used or not, with no API functionality to switch between them.

So I'm still not sure what the step after this PR will be. I'll try to elaborate on the possibilities I see:

  • Keep everything as-is in this PR
  • Reverse the polarity.
    • Tame variant:
      • By default measure process clock.
      • Switch json/csv field name.
      • Don't add /process_time suffix by default.
      • If user asked to measure main-thread time, then add /thread_time suffix. (and write it into old csv/json field?)
    • Wild variant:
      • Revert this PR
      • simply switch the clock (from thread to process -wide)
      • Switch json/csv field name.
  • ???

@LebedevRI LebedevRI force-pushed the openmp-compatibility branch from def9e0c to 0b2b11e Compare April 8, 2019 09:46
@LebedevRI (Collaborator)

Rebased; should be good to go if it looks good to @dominichamon.

@dmah42 (Member)

dmah42 commented Apr 8, 2019

Not sure what these obvious changes are, but "getting ready for submit" sgtm.

None.
I'm just saying that this will introduce some new public knobs that we plan(?) on dropping the very next release.

I'm ok with either. It would be nicer for users, I think, if they could choose between the existing behaviour and the new behaviour and we just flipped the default between releases, but further up in this PR you suggest that reverting to the old behaviour once we introduce process timing is not possible.

I'm guessing you are talking about #763 (comment)?
That only talks specifically about dynamically deciding which CPU clock to use depending on whether ->Threads() was used or not, with no API functionality to switch between them.

So I'm still not sure what the step after this PR will be. I'll try to elaborate on the possibilities I see:

  • Keep everything as-is in this PR

  • Reverse the polarity.

    • Tame variant:

      • By default measure process clock.
      • Switch json/csv field name.
      • Don't add /process_time suffix by default.
      • If user asked to measure main-thread time, then add /thread_time suffix. (and write it into old csv/json field?)

After the next release, do the Tame variant above (which would require a command-line/benchmark flag).

  • Wild variant:

    • Revert this PR
    • simply switch the clock (from thread to process -wide)
    • Switch json/csv field name.
  • ???

I wish we had tracking for flags... if we saw that the flag in the Tame variant went unused, I'd push for this the release after.

README.md (resolved)
src/thread_timer.h (outdated, resolved)
src/thread_timer.h (outdated, resolved)
@dmah42 dmah42 merged commit 7a1c370 into google:master Apr 9, 2019
@dmah42 (Member)

dmah42 commented Apr 9, 2019

Thanks!

@LebedevRI (Collaborator)

YAY

@bryan-lunt (Contributor, Author)

WOOHOO!
Thanks everyone, especially @LebedevRI and @dominichamon .
