Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add min rel accuracy stopping criterion #1744

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

maartenarnst
Copy link

@maartenarnst maartenarnst commented Jan 25, 2024

This is work in progress to try to add a statistical stopping criterion. While the existing statistics features seek to characterise the variability between repetitions of the benchmark, the statistical stopping criterion in this PR seeks to ensure that within a single repetition, the estimate of the mean duration has become sufficiently statistically accurate.

This is a draft.

Copy link

google-cla bot commented Jan 25, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@maartenarnst maartenarnst reopened this Jan 25, 2024
@maartenarnst maartenarnst marked this pull request as draft January 25, 2024 15:03
Comment on lines 84 to 90
// Accumulated time so far (does not contain current slice if running_)
double real_time_used_ = 0;

double cpu_time_used_ = 0;
// Manually set iteration time. User sets this with SetIterationTime(seconds).
double manual_time_used_ = 0;
double manual_time_used2_ = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need to track every timer as-is and as as a sum-of-squares?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea is to look at the time durations as independent identically distributed samples of a random variable. And then use second-order statistics to estimate mean, variance, and accuracy of the estimate of the mean. So for this, we need indeed to keep track of the sum of the squares of the time durations.

i.seconds >=
GetMinTimeToApply() || // The elapsed time is large enough.
(((i.seconds >= GetMinTimeToApply()) &&
(b.use_manual_time() && !IsZero(GetMinRelAccuracy()) && (std::sqrt(i.seconds2 / i.iters - std::pow(i.seconds / i.iters, 2.)) / (i.seconds / i.iters) / sqrt(i.iters) <= GetMinRelAccuracy()))) || // The relative accuracy is enough.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, i think this needs a bit of refactoring...


BM_VLOG(3) << "Next iters: " << next_iters << ", " << multiplier << "\n";
return next_iters; // round up before conversion to integer.
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the changes to this function.
Does this even compile? multiplier is a local variable, and it seems unused now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @LebedevRI. Thanks for looking at the PR. I must admit that I meant to push it to my own fork, rather than to the Google repository. This is why the PR is in a very early draft stage. But it's here now :) So maybe that's indeed a good opportunity to work on it together.

For this particular change, it's an unintentional leftover from a wrong manipulation. I'll delete it.

@LebedevRI
Copy link
Collaborator

As you probably noticed, non-manual timers don't produce per-iteration times
(I'm somewhat planning on looking into that, but i don't expect the default to change),
so this fundamentally only works for manual time.

src/benchmark.cc Outdated Show resolved Hide resolved
@maartenarnst
Copy link
Author

Hi @LebedevRI. In the mean time, I have elaborated the PR more. It is now in a better state for discussion.

Questions/issues that @romintomasetti and I are thinking of are:

  • Is this feature of controlling the number of iterations through a statistical criterion of interest to you? Our main motivation is that it's not always easy to set a "min time" in advance to ensure that the result has converged. Even if it's ultimately not retained as a stopping criterion, reporting a measure of statistical accuracy may be of interest.

  • I agree that as it stands, the feature is limited to benchmarks that use manual timing. On our side, we use benchmark for GPU kernels, mainly with Kokkos, an HPC library that itself also uses benchmark for its performance tests. We use manual timing, as you also suggest in your user guide. This use case may be sufficiently widespread as a motivation. (It may be worth exploring whether calls to the iterator for the state might be a way of obtaining per-iteration times for benchmarks that don't use manual timing).

  • How should the new stopping criterion (statistical relative accuracy) interact with the existing stopping criterion (time or number of iterations)? In the current state of the PR, ShouldReportIterationResults uses an "AND" logic. It will return true if both criteria are reached. For both criteria, there is an opt-out/opt-in by setting the threshold to zero. So the current behaviour of benchmark is maintained, and this PR is mainly adding functionality.

  • Do you agree with the current way of implementing the feature? The current implementation keeps track of the sum of the squares of the per-iteration times. An alternative could be to store all the per-iteration times. An advantage would be that such an alternative implementation would allow your existing code for computing statistics to be reused. It would also be more flexible for future extensions, e.g. looking at statistics other than the mean and variance, such as detecting outliers. Drawbacks may be the increased storage requirement and needs to dynamically resize the storage space.

Many thanks in advance!

@LebedevRI
Copy link
Collaborator

LebedevRI commented Jan 26, 2024

  • Is this feature of controlling the number of iterations through a statistical criterion of interest to you? Our main motivation is that it's not always easy to set a "min time" in advance to ensure that the result has converged. Even if it's ultimately not retained as a stopping criterion, reporting a measure of statistical accuracy may be of interest.

I don't have an answer to this question. Following replies assume "yes", but i'm undecided.
(FWIW, at the very least, i find this experiment, at the very least, to be entertaining/educating.)

  • I agree that as it stands, the feature is limited to benchmarks that use manual timing. On our side, we use benchmark for GPU kernels, mainly with Kokkos, an HPC library that itself also uses benchmark for its
    performance tests. We use manual timing, as you also suggest in your user guide.

Yes, that use of manual time for GPU kernels is indeed very reasonable.

This use case may be sufficiently widespread as a motivation.

(It may be worth exploring whether calls to the iterator for the state might be a way of obtaining per-iteration times for benchmarks that don't use manual timing).

True per-iteration times are, a complicated topic.
The current library has been very intentionally designed
to prohibit that, and while, as i have said,
i was already planning on looking into that,
i'm not sure how that would look like.

  • How should the new stopping criterion (statistical relative accuracy) interact with the existing stopping criterion (time or number of iterations)? In the current state of the PR, ShouldReportIterationResults uses an "AND" logic. It will return true if both criteria are reached. For both criteria, there is an opt-out/opt-in by setting the threshold to zero. So the current behaviour of benchmark is maintained, and this PR is mainly adding functionality.

That's my biggest problem honestly. The existing lack of max-time limit
for manual time already seems problematic. What if the requested accuracy
is more precise than what is actually possible on a given machine?
(run out of time to reply here, may elaborate later)

  • Do you agree with the current way of implementing the feature? The current implementation keeps track of the sum of the squares of the per-iteration times. An alternative could be to store all the per-iteration times. An advantage would be that such an alternative implementation would allow your existing code for computing statistics to be reused. It would also be more flexible for future extensions, e.g. looking at statistics other than the mean and variance, such as detecting outliers. Drawbacks may be the increased storage requirement and needs to dynamically resize the storage space.

This feature does not need to store per-iteration times,
it only needs to compute standard deviation of all of the per-iteration times,
which can be computed as a "live stdev" the way you currently do it,
so i certainly agree with only just a new accumulator,
and would not agree with an alternative design.

As i have said, true per-iteration times is a complicated topic, and storing them doubly so.

@LebedevRI
Copy link
Collaborator

Also, BenchmarkRunner::DoOneRepetition() is wrong now,
we now can't re-use iteration count from one repetition to the others.
(now == with this accuracy check).

@maartenarnst
Copy link
Author

Also, BenchmarkRunner::DoOneRepetition() is wrong now,
we now can't re-use iteration count from one repetition to the others.
(now == with this accuracy check).

I'm not sure to understand entirely what you mean.

The condition

seems to be such that even after the changes in this PR, the behavior is that the first repetition sets the number of iterations and the next repetitions then reuse that number.

An issue may be that subsequent repetitions don't reach the required relative accuracy for the reused number of iterations. But isn't this issue also present for the current stopping criterion based on the time?

@LebedevRI
Copy link
Collaborator

The condition

seems to be such that even after the changes in this PR, the behavior is that the first repetition sets the number of iterations and the next repetitions then reuse that number.

An issue may be that subsequent repetitions don't reach the required relative accuracy for the reused number of iterations.

Precisely, yes. Presumably that is not what you want if you explicitly request convergence-based run-time.

But isn't this issue also present for the current stopping criterion based on the time?

Sure.

@LebedevRI
Copy link
Collaborator

(To be noted, i'm just pointing out that failure mode,
perhaps you consider such unexpected outcome to be acceptable.)

@maartenarnst
Copy link
Author

Hi @LebedevRI.

In the mean time, I've thought a bit more about it.

An answer may be to look at it as two steps. This PR could be a first step focused on introducing a new convergence criterion. And exploring changes to the behavior for multiple repetitions could be a follow-up.

If I do think now about the behavior for multiple repetitions, I'm not sure I would call the issue a failure mode.

From a statistical point of view, doing the repetitions using a fixed number of iterations may be quite sensible. An advantage would be that basic versions of theoretical results from statistics would give some support to the approach. Such as basic versions of the central limit theorem to characterize theoretically the variability of a partial sum of a fixed number of random variables.

The current behavior for multiple repetitions, and keeping it unchanged for the new convergence criterion, also appears quite clear and easy to understand. The first repetition fixes the number of iterations. It can be done via a maximum number of iterations, a minimum number of time to apply, and/or, now, a minimum relative accuracy. Then, in subsequent repetitions, the fixed number of iterations is reused. Perhaps doing the experiment of trying to write a piece of doc about the new convergence criterion could be a good next step.

Those would be my two cents, coming more from a background of statistics.

@maartenarnst maartenarnst force-pushed the min_rel_accuracy branch 2 times, most recently from 6947d6c to d3990ce Compare May 24, 2024 11:11
Clean up the initial commit

Further cleaning of initial commit. Add test.

Improvements to comments thanks to review

Reformat thanks to clang format.

Static cast to avoid conversion warning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants