Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some computations not included in reported total time of output CSVs #3089

Open
jtimonen opened this issue Nov 23, 2021 · 21 comments
Open

Some computations not included in reported total time of output CSVs #3089

jtimonen opened this issue Nov 23, 2021 · 21 comments

Comments

@jtimonen
Copy link

Summary:

Stan writes to the output csv file of each chain something like

#  Elapsed Time: 
#                0.008 seconds (Warm-up)
#                0.004 seconds (Sampling)
#                0.012 seconds (Total)

but this total time doesn't include all computations that have to be done with the model.

Description:

If we look at run_adaptive_sampler, we see that the times warm_delta_t and sample_delta_t measure just the runtime of util::generate_transitions() for warmup and sampling phases, respectively. Then they are written with writer.write_timing(warm_delta_t, sample_delta_t), meaning that the reported total time is their sum. So for example for hmc_static_diag_e_adapt() things that don't get included in the total time are:

A) Finding the initial starting point if not given
B) Calling sampler.init_stepsize() in run_adaptive_sampler()

There are also other things such as writing the CSV header and reading input but I listed these because these both include log probability and gradient evaluations with the model and therefore I think should be included in reported to measure model performance.

A) involves log probability and gradient evaluations until a point where they are finite is found (max 100 tries)
B) involves gradient evaluations and leapfrog steps inside a while loop until some condition is satisfied (https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/base_hmc.hpp)

I found this because I have ODE models where sampler.init_stepsize() can require millions of evaluations of the ODE function (confirmed with profiling) and therefore takes a substantial proportion of the actual wall time. I have sent an example to @rok-cesnovar.

Interfaces:

In cmdstanr, there is also a Sys.time() timing around the entire program call, so that measures the actual wall time correctly. The
$time() method of a fit object reports something like

$total
[1] 210.9202

$chains
  chain_id warmup sampling total
1        1  0.007    0.004 0.011
2        2  0.012    0.004 0.016

So the grand total time is correct, but total time of each chain doesn't involve the initialization of parameter values and step size.

Possible solutions

I am not sure what is the purpose of B) and why it is called even if step size if specified by user. To me this sounds like this is sort of warmup and could either be included in warmup time or then the reported time could have a separate "initialization" part that is included in total time of the chain.

Current Version:

v2.28.1

@bob-carpenter
Copy link
Contributor

@jtimonen. Thanks so much for the careful bug report.

For the timing reports, the only solutions that make sense to me are to (a) include all the time taken, or (b) just not try to report timing. It would be great if we could report log density/gradient evals for setting initial step size.

For (B), the user's initial stepsize is used as a starting point. I believe adaptation has to be turned off to have the stepsize not to be adapted. It would be nice if we could independently control what was adapted (as in being able to adapt step size and not metric or vice-versa).

@nsiccha
Copy link

nsiccha commented Nov 23, 2021

@bob-carpenter for the niche but important set of models with odes, the number of gradient evaluations is (potentially) much less informative than the runtime.

@rok-cesnovar
Copy link
Member

The fix for B) is simple if we decide to add the amount of time we spend in sampler.init_stepsize() as part of warmup. Would that be fine?

A) is probably less critical, though ~100 grad evaluations can still be noticeable, but not really that much.

@jtimonen
Copy link
Author

The fix for B) is simple if we decide to add the amount of time we spend in sampler.init_stepsize() as part of warmup

This would make sense, because for example CmdStan user guide says: "During warmup, the NUTS algorithm adjusts the HMC algorithm parameters metric and stepsize". And the fix would be just to move the start of warmup clock earlier in run_adaptive_sampler().

Including the time spent on A) requires editing more files. 100 grad evaluations doesn't sound like much but as pointed out by Niko, for models that involve adaptive iterative solvers (like ODE) the number of grad evaluations is not the best metric, as cost of one gradient evaluation is not constant. It depends on the parameter values and it is not uncommon that the initial evaluations take the longest time because difficult parameters are being proposed.

@yizhang-yiz
Copy link

Neither runtime or the No. of grad evaluations is a good measure of ode solver performance, tho I agree runtime is better if we have to pick one. To be more informative in ode models, we need

  • No. of grad evaluations for sampler performance.
  • Total runtime.
  • No. of RHS evaluations for explicit ODE solvers.
  • No. of nonlinear solver iterations for implicit ODE solvers.

@bob-carpenter
Copy link
Contributor

I'm totally OK with including total run time and more fine-grained ODE solver reports. I didn't realize until now that they had a lot of extra overhead, but of course that makes sense given that they're doing derivatives and iterative solving with varying numbers of steps internally.

Do the algebraic solvers also have this problem? Even some of our scalar functions have iterative algorithms behind them that can vary in complexity.

@yizhang-yiz
Copy link

Do the algebraic solvers also have this problem?

Yes.

@nsiccha
Copy link

nsiccha commented Nov 23, 2021

I'd find it great if there were some way to get more metrics about (e.g.) the ODE solution process. But, I think for now just closing the gap between reported runtime and actual wall time would be great!

@betanalpha
Copy link
Contributor

betanalpha commented Dec 21, 2021 via email

@bob-carpenter
Copy link
Contributor

the API guarantees for the timing were never well-defined in the first place.

That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.

We decided ages ago that unstructured text output isn't subject to backwards compatibility issues. It's not like it was going to reproduce in terms of ms anyway.

@jtimonen
Copy link
Author

jtimonen commented Dec 21, 2021

If this stage is taking millions of iterations there there’s a problem.

It's not taking millions of those stepsize refinement iterations, just millions of evaluations of the ODE function as I said. One grad evaluation can call it it max_num_steps times which for the ODE solvers is by default 1e6 or 1e9. My point here was just to demonstrate that this stepsize init procedure can sometimes take a substantial amount of time

EDIT: and by the ODE function I mean the function defining the right-hand side of the system, not the solver function, sorry if that wasn't clear

@jtimonen
Copy link
Author

jtimonen commented Dec 21, 2021

That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.

Yea, the problem is just that some people are using those reports. Like here: https://www.mdpi.com/2624-8611/3/4/48/htm . They say

For JAGS, the warmup run time [jags.model()] and the sampling run time [jags.samples()] were determined by the before-after difference of time stamps obtained from the function Sys.time(). For Stan, the warmup and sampling run times can be obtained from the console output of the function sampling() and are also retrievable from the returned results object (attributes(fitted@sim$samples[[1]])[["elapsed_time"]]

However, in addition to warmup and sampling, Stan’s sampling() function took another 215 seconds on average before returning the samples. Thus, users need to wait much longer for the results of the sampling process when using rstan instead of rjags.

I am not sure where rstan takes those times. Anyway the additional 215 here is probably some postprocessing.

Further, we encountered a relatively huge consumption of additional time besides warmup and sampling by rstan’s sampling() function. In fact, the additional time was about twice the time for all other steps combined. We are not sure what this function needs this additional time for. According to an anonymous reviewer, the additional time is partly due to calculating ESS and PSR. As this needs to be done for JAGS as well, fair software comparisons would also need to take this into account. Maybe future versions of rstan’s sampling() function may include an option to return the samples directly after sampling or explicitly report the time needed for additional calculations of convergence and precision statistics.

@jtimonen
Copy link
Author

Further, we encountered a relatively huge consumption of additional time besides warmup and sampling by rstan’s sampling() function. In fact, the additional time was about twice the time for all other steps combined. We are not sure what this function needs this additional time for. According to an anonymous reviewer, the additional time is partly due to calculating ESS and PSR. As this needs to be done for JAGS as well, fair software comparisons would also need to take this into account. Maybe future versions of rstan’s sampling() function may include an option to return the samples directly after sampling or explicitly report the time needed for additional calculations of convergence and precision statistics.

I would really like to know what the additional time here comes from. It seems to be a different additional time sink than what I originally started this thread for, but highlights that you need to understand quite a lot about the internals of Stan or your Stan interface to measure performance. If you want to just analyze the MCMC algorithm and compute ESS/second then I think that timing should be done so that it

  • includes the initial stepsize refinement, other warmup and sampling, but
  • does not include reading in data, or postprocessing computation of metrics like rhat.

Even then the performance measure will be affected a bit by whether you are writing the draws as output to CSV (like cmdstan) or storing them in memory (like rstan).

@jtimonen
Copy link
Author

That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.

One other thing is that if all the timing reports are removed, and one can only time the grand total time on their own, then it gets quite difficult to get information about how much time different chains took.

We decided ages ago that unstructured text output isn't subject to backwards compatibility issues.

CmdStanR is grepping that unstructured text and making structured output out of it: https://github.com/stan-dev/cmdstanr/blob/master/R/run.R#:~:text=if%20(state%20%3C%203%20%26%26%20grepl(%22Elapsed,%7D

So that is going to be a problem if it goes into CRAN as such and the output text of Stan CSVs changes.

@rok-cesnovar
Copy link
Member

We will just remove that if we decide to remove it from the services and go back to reporting total times per chain as it was originally, that is not a big issue. I don't think we want to remove these timings though, and maybe find a way to include them or report them separately.

affected a bit by whether you are writing the draws as output to CSV (like cmdstan)

The performance hit from the CSV output is a bit exaggerated in general in my opinion. On my 6-year old system, outputting 1000 samples of 10k parameters takes 4s. So that means that the total performance penalty for the entire model with num_samples=1000 is 4s. With 50k parameters is 20s. I think 4 seconds for a model with 10k parameters fall within measurement noise more often than not, at least for non-trivial models.

@jtimonen
Copy link
Author

We will just remove that if we decide to remove it from the services and go back to reporting total times per chain as it was originally, that is not a big issue<

But like if CmdStanR 1.0 is released soon to CRAN and then the public method fit$time() returns a list with total and chains (where chains has the warmup and sampling times) like it does now, and then a new Stan version comes which doesn't give the warmup and sampling times separately anymore, wouldn't CmdStanR technically have to go to 2.0 because it breaks compatibility? I actually don't know at all how things work if you have a CRAN package that has external system requirements.

The performance hit from the CSV output is a bit exaggerated in general in my opinion.

I don't think it is a big deal either currently, but just trying to list things that are not part of the MCMC algorithm itself and therefore depending on implementation / used libraries could in theory have an effect on performance metrics

@betanalpha
Copy link
Contributor

betanalpha commented Dec 24, 2021 via email

@jtimonen
Copy link
Author

jtimonen commented Jan 6, 2022

But what do you mean by “substantial” here? The state and sampler initialization require maybe hundreds of gradient evaluations for particularly difficult problems, but then that cost should be overwhelmed by the gradient evaluations needed for generating transitions during warmup and sampling.

The subtlety that ODE solvers, and other numerical solvers, introduce is the potential variation in the gradient evaluation time based on how ill-posed the ODE system is at each point. If the Markov chains are initialized in neighborhoods that are extremely ill-posed but the typical set contains much better-posed neighborhoods then the initialization cost can be much more substantial.

I've had cases where initialization takes something like 10-25% of total time with 2000 iterations. It has happened even when setting init=0 for a system where that point shouldn't be particularly ill-posed. The definition of neighborhood of the given initial parameters is quite fuzzy and this init_stepsize() function has something like

this->nom_epsilon_ = direction == 1 ? 2.0 * this->nom_epsilon_

which in my understanding means it can evolve the integrator repeatedly, each time doubling the nominal stepsize so it might go pretty far from the original point to some ill-posed region (I don't totally understand how direction is determined or motivation for it).

So yes bad initial points are a problem. But I am kind of suspecting that even with well-posed initial point and priors that have significant mass only on well-posed region, you cannot guarantee that you'd only ever need to do solve ODEs using well-posed parameters, if the system is ill-posed with some parameter combinations.

@jtimonen
Copy link
Author

jtimonen commented Jan 6, 2022

Oh and in that case I was able to reduce the time required by init_stepsize() from that 10%-25% to very minimal amount by setting step_size=0.1 instead of the default step_size=1.

@betanalpha
Copy link
Contributor

betanalpha commented Jan 7, 2022 via email

@jtimonen
Copy link
Author

jtimonen commented Jan 8, 2022

Thank you very much for the explanations!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants