Some computations not included in reported total time of output CSVs #3089
@jtimonen. Thanks so much for the careful bug report. For the timing reports, the only solutions that make sense to me are to (a) include all the time taken, or (b) just not try to report timing. It would be great if we could report the log density/gradient evals used for setting the initial step size. For the issue's item B), the user's initial stepsize is used as a starting point; I believe adaptation has to be turned off to keep the stepsize from being adapted. It would be nice if we could independently control what gets adapted (as in being able to adapt step size and not metric, or vice versa).
@bob-carpenter for the niche but important set of models with ODEs, the number of gradient evaluations is (potentially) much less informative than the runtime.
The fix for B) is simple if we decide to include it. Adding the amount of time we spend in A) is probably less critical, though ~100 grad evaluations can still be noticeable — but not really that much.
This would make sense, because for example the CmdStan user guide says: "During warmup, the NUTS algorithm adjusts the HMC algorithm parameters metric and stepsize". The fix would be just to move the start of the warmup clock earlier. Including the time spent on A) requires editing more files. 100 grad evaluations doesn't sound like much, but as pointed out by Niko, for models that involve adaptive iterative solvers (like ODEs) the number of grad evaluations is not the best metric, as the cost of one gradient evaluation is not constant: it depends on the parameter values, and it is not uncommon that the initial evaluations take the longest because difficult parameters are being proposed.
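As a rough illustration of that fix, here is a minimal C++ sketch of what folding the step size initialization into the warmup clock could look like. The names (`init_stepsize`, `generate_transitions`, `timed_warmup`) are simplified stand-ins for the real plumbing in `run_adaptive_sampler`, not Stan's actual code:

```cpp
#include <chrono>

// Stand-ins for the real sampler plumbing; illustration only.
void init_stepsize() { /* B) step size search */ }
void generate_transitions(int n) { /* warmup iterations */ }

// Returns warmup time in seconds, with B) now inside the clock.
double timed_warmup(int num_warmup) {
  auto start = std::chrono::steady_clock::now();
  init_stepsize();  // previously this ran before the clock started
  generate_transitions(num_warmup);
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}
```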
Neither runtime nor the number of grad evaluations is a good measure of ODE solver performance, though I agree runtime is better if we have to pick one. To be more informative in ODE models, we need more fine-grained solver-level reports.
I'm totally OK with including total run time and more fine-grained ODE solver reports. I didn't realize until now that they had a lot of extra overhead, but of course that makes sense given that they're doing derivatives and iterative solving with varying numbers of steps internally. Do the algebraic solvers also have this problem? Even some of our scalar functions have iterative algorithms behind them that can vary in complexity.
Yes.
I'd find it great if there were some way to get more metrics about (e.g.) the ODE solution process. But I think for now just closing the gap between reported runtime and actual wall time would be great!
The original intent of the timing was to capture what was happening during the warmup and main sampling iterations, once the state space and sampler had been initialized. Adding a timer around the initializations shouldn’t be a problem, although it would change the format of the CmdStan CSV output and break analysis code that parses the “Elapsed Time:” section. Another potential issue is how “Total” should be defined — if “Total” is changed to include all three components then it would technically break backwards compatibility, as performance metrics would change even though the sampler behavior hasn’t changed. The varied expectations people have for what these timings mean, however, suggest that the API guarantees for the timing were never well-defined in the first place.
I am not sure what the purpose of B) is and why it is called even if the step size is specified by the user. To me this sounds like a sort of warmup, and could either be included in the warmup time, or the reported time could have a separate "initialization" part that is included in the total time of the chain.
(B) initializes the symplectic integrator step size to a reasonable value before warmup iterations commence.
The “step size” argument provides only an initial guess for this procedure, not the final step size. This was implemented in the very first versions of Stan, I believe because so few understood the consequences of a given step size for a given model, and even the most informed guesses were often way off; indeed this is largely still the case!
If this stage is taking millions of iterations then there’s a problem. At each iteration the step size is multiplied by 2 or by 1/2; after 50 iterations of halving the step size should already be at the threshold of double precision, and at that point a single step shouldn’t change the system much unless the gradient is extremely poorly behaved. Millions of iterations would imply something really wrong is going on with the evaluation of the log density at small perturbations away from the initial point.
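For concreteness, if the step size starts at the default of 1 and is halved every iteration, then after 50 halvings

$$\epsilon = 2^{-50} \approx 8.9 \times 10^{-16},$$

which is already at the scale of double-precision machine epsilon, $2^{-52} \approx 2.2 \times 10^{-16}$.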
I’m not against adding a max number of iterations here and erroring out if the limit is hit, to avoid these poor behaviors.
That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports. We decided ages ago that unstructured text output isn't subject to backwards compatibility issues. It's not like it was going to reproduce in terms of ms anyway.
It's not taking millions of those stepsize refinement iterations, just millions of evaluations of the ODE function, as I said. One grad evaluation can call it up to max_num_steps times, which for the ODE solvers is by default 1e6 or 1e9. My point here was just to demonstrate that this stepsize init procedure can sometimes take a substantial amount of time. EDIT: by the ODE function I mean the function defining the right-hand side of the system, not the solver function; sorry if that wasn't clear.
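To put a rough worst-case bound on that (assuming on the order of 100 gradient evaluations for the step size search, as discussed above): if each gradient evaluation drives the solver to a default limit of $10^6$ steps, the right-hand-side function can be called up to

$$10^{2} \times 10^{6} = 10^{8}$$

times, so millions of evaluations during initialization alone is entirely plausible even far below that bound.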
Yeah, the problem is just that some people are using those reports. Like here: https://www.mdpi.com/2624-8611/3/4/48/htm . They say:
I am not sure where rstan takes those times. Anyway the additional 215 here is probably some postprocessing.
I would really like to know what the additional time here comes from. It seems to be a different additional time sink than what I originally started this thread for, but highlights that you need to understand quite a lot about the internals of Stan or your Stan interface to measure performance. If you want to just analyze the MCMC algorithm and compute ESS/second then I think that timing should be done so that it
Even then the performance measure will be affected a bit by whether you are writing the draws as output to CSV (like cmdstan) or storing them in memory (like rstan).
One other thing is that if all the timing reports are removed, and one can only time the grand total time on their own, then it gets quite difficult to get information about how much time different chains took.
CmdStanR is grepping that unstructured text and making structured output out of it: https://github.com/stan-dev/cmdstanr/blob/master/R/run.R#:~:text=if%20(state%20%3C%203%20%26%26%20grepl(%22Elapsed,%7D . So that is going to be a problem if it goes to CRAN as such and the output text of Stan CSVs changes.
We will just remove that if we decide to remove it from the services and go back to reporting total times per chain as it was originally; that is not a big issue. I don't think we want to remove these timings, though; maybe we can find a way to include them or report them separately.
The performance hit from the CSV output is a bit exaggerated in general, in my opinion. On my 6-year-old system, outputting 1000 samples of 10k parameters takes 4s. So the total performance penalty for the entire model with num_samples=1000 is 4s; with 50k parameters it is 20s. I think 4 seconds for a model with 10k parameters falls within measurement noise more often than not, at least for non-trivial models.
But like if CmdStanR 1.0 is released soon to CRAN, and then the public method fit$time() returns a list with total and chains (where chains has the warmup and sampling times) like it does now, and then a new Stan version comes which doesn't give the warmup and sampling times separately anymore, wouldn't CmdStanR technically have to go to 2.0 because it breaks compatibility?
I don't think it is a big deal either, currently; I am just trying to list things that are not part of the MCMC algorithm itself and that could therefore, depending on the implementation / libraries used, in theory have an effect on performance metrics.
It's not taking millions of those stepsize refinement iterations, just millions of evaluations of the ODE function, as I said. One grad evaluation can call it up to max_num_steps times, which for the ODE solvers is by default 1e6 or 1e9. My point here was just to demonstrate that this stepsize init procedure can sometimes take a substantial amount of time.
But what do you mean by “substantial” here? The state and sampler initialization require maybe hundreds of gradient evaluations for particularly difficult problems, but then that cost should be overwhelmed by the gradient evaluations needed for generating transitions during warmup and sampling. If one runs really short chains then this initialization can make a difference, but really short chains are also not recommended best practice. In other words, for default sampler settings the warmup and sampling timings capture the bulk of the total cost, and for non-standard settings other time quantifications (wrapping everything in a total time, using the profiler) can be used to identify other contributions.
The subtlety that ODE solvers, and other numerical solvers, introduce is the potential variation in the gradient evaluation time based on how ill-posed the ODE system is at each point. If the Markov chains are initialized in neighborhoods that are extremely ill-posed but the typical set contains much better-posed neighborhoods then the initialization cost can be much more substantial.
To be clear, I’m not against adding more timing information, although I also believe that that information can be deduced from other facilities that are already available. My main point here is that the warmup and sampling timings were always intended to capture only the time cost of running the Markov chains. Tweaking the output to be more precise (for example, replacing “Warmup”, “Sampling”, and “Total” with “Warmup Iterations”, “Main Sampling Iterations”, and “Total Iteration Time” or the like) might also be helpful.
That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.
One other thing is that if all the timing reports are removed, and one can only time the grand total time on their own, then it gets quite difficult to get information about how much time different chains took.
The sampler timings have always been intended for understanding sampler behavior, not overall algorithmic cost. Looking for variation between chains is perhaps their most useful application, but looking for timing differences between iterations of model optimization, or across computers, can also be really useful for gauging the overall change in sampler behavior. Also, it’s these timings that matter for the ESS / time metric, which quantifies sampler performance, not the performance of the entire library.
I am not sure where rstan takes those times.
I would really like to know what the additional time here comes from. It seems to be a different additional time sink than what I originally started this thread for, but highlights that you need to understand quite a lot about the internals of Stan or your Stan interface to measure performance. If you want to just analyze the MCMC algorithm and compute ESS/second then I think that timing should be done so that it
There are two issues, both exacerbated by the number of variables being captured in the output.
The actual packaging of the output into the stanfit object itself can be costly; I’m not exactly sure what the main bottleneck is (presumably there’s some transposing/rearranging being done that might cause the memory to start paging), but this has always been around. The delay becomes more noticeable the more variables there are.
Secondly, at some point RStan started running automated diagnostics, which initially included Rhat and ESS for each output variable but now include all of the modified Rhat and ESS calculations. Once there are many output variables these costs add up, especially for the modified diagnostics, which are very, very expensive (PyStan only checks the standard Rhat and ESS, and also caps the variables automatically analyzed at 1000, which can help, although the cap is pretty arbitrary). Unfortunately there’s no way to turn them off. Honestly this has been the most frustrating part about using RStan for nontrivial models over the last few versions.
We decided ages ago that unstructured text output isn't subject to backwards compatibility issues. It's not like it was going to reproduce in terms of ms anyway.
But like if CmdStanR 1.0 is released soon to CRAN and then the public method fit$time() returns a list with total and chains (where chains has the warmup and sampling times) like it does now, and then a new Stan version comes which doesn't give the warmup and sampling times separately anymore, wouldn't CmdStanR technically have to go to 2.0 because it breaks compatibility? I actually don't know at all how things work if you have a CRAN package that has external system requirements.
The problem with previous decisions about the CmdStan API guarantees is that CmdStanPy and CmdStanR have largely ignored them, which has baked in multiple dependencies on non-guaranteed CmdStan behavior. This is going to cause problems eventually.
I've had cases where initialization takes something like 10-25% of total time with 2000 iterations. It has happened even when setting init=0 for a system where that point shouldn't be particularly ill-posed. The definition of the neighborhood of the given initial parameters is however quite fuzzy, and this init_stepsize() function has a line that doubles or halves this->nom_epsilon_ (quoted in the reply below), which in my understanding means it can evolve the integrator repeatedly, each time doubling the nominal stepsize, so it might go pretty far from the original point to some ill-posed region (I don't totally understand how direction is determined or the motivation for it).
So yes, bad initial points are a problem. But I am kind of suspecting that even with a well-posed initial point and priors that have significant mass only on a well-posed region, you cannot guarantee that you'd only ever need to solve ODEs using well-posed parameters, if the system is ill-posed with some parameter combinations.
Oh, and in that case I was able to reduce the time required by init_stepsize() from that 10-25% to a very minimal amount by setting step_size=0.1 instead of the default step_size=1.
I've had cases where initialization takes something like 10-25% of total time with 2000 iterations. It has happened even when setting init=0 for a system where that point shouldn't be particularly ill-posed. The definition of neighborhood of the given initial parameters is however quite fuzzy and this init_stepsize() <https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/base_hmc.hpp> function has something like
this->nom_epsilon_ = direction == 1 ? 2.0 * this->nom_epsilon_
                                    : 0.5 * this->nom_epsilon_;
which in my understanding means it can evolve the integrator repeatedly, each time doubling the nominal stepsize, so it might go pretty far from the original point to some ill-posed region (I don't totally understand how direction is determined or the motivation for it).
`direction` is set on line 104.
Once an initial point, with finite log density and log density gradient, has been found the step size is initialized to find a reasonable starting value where the change in Hamiltonian after one step is around log(0.8). This would correspond to a Metropolis acceptance probability of transitioning to that next step of 0.8.
To do this, momenta, p, are sampled at the initial point, q, to give an initial point in phase space, z = (q, p). This initial phase space point is then evolved forward for one step with the symplectic integrator using the initial step size. At this new point the Hamiltonian is evaluated; if the change in Hamiltonian is smaller than log(0.8) then the initial stepsize is too large and needs to be reduced, but if the change in Hamiltonian is larger than log(0.8) then the initial stepsize is too small and needs to be increased. To find the right increase/decrease, the stepsize is doubled or halved repeatedly until the change in Hamiltonian after one step of the symplectic integrator from the initial point drops below/above log(0.8).
In other words, the log density is only ever evaluated one step away from the initial point. That said, if the sampled momenta are large, if the gradient at the initial point is large, or if the region of stability of an iterative solver used in the log density calculation is very narrow around the initial point, then those test evolutions might require expensive solves.
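Here is a self-contained toy version of that search for a one-dimensional standard normal target, where H(q, p) = q²/2 + p²/2. It mirrors the doubling/halving logic described above, including the log(0.8) threshold and a defensive iteration cap, but it is an illustration only, not Stan's actual init_stepsize() code (Stan also resamples the momenta each iteration, which this toy skips for determinism):

```cpp
#include <cmath>
#include <cstdio>

// Toy 1-D target: standard normal, H(q, p) = q^2/2 + p^2/2.
static double hamiltonian(double q, double p) {
  return 0.5 * q * q + 0.5 * p * p;
}

// One leapfrog step for the toy target (dU/dq = q).
static void leapfrog(double& q, double& p, double eps) {
  p -= 0.5 * eps * q;  // half-step momentum
  q += eps * p;        // full-step position
  p -= 0.5 * eps * q;  // half-step momentum
}

int main() {
  const double q0 = 3.0, p0 = 1.0;      // initial phase-space point
  const double target = std::log(0.8);  // threshold on the Hamiltonian change
  double eps = 1.0;                     // initial guess (cf. default step_size=1)

  double q = q0, p = p0;
  leapfrog(q, p, eps);
  double delta_H = hamiltonian(q0, p0) - hamiltonian(q, p);
  int direction = delta_H > target ? 1 : -1;  // double (+1) or halve (-1)

  for (int iter = 0; iter < 100; ++iter) {  // cap, as suggested earlier in the thread
    q = q0;
    p = p0;
    leapfrog(q, p, eps);
    delta_H = hamiltonian(q0, p0) - hamiltonian(q, p);
    // Stop once the comparison flips across the log(0.8) threshold.
    if ((direction == 1 && delta_H <= target)
        || (direction == -1 && delta_H >= target))
      break;
    eps = direction == 1 ? 2.0 * eps : 0.5 * eps;
  }
  std::printf("initial step size: %g (delta_H = %g)\n", eps, delta_H);
  return 0;
}
```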
So yes, bad initial points are a problem. But I am kind of suspecting that even with a well-posed initial point and priors that have significant mass only on a well-posed region, you cannot guarantee that you'd only ever need to solve ODEs using well-posed parameters, if the system is ill-posed with some parameter combinations.
Yeah, unless you can guarantee that the region connecting the initial point and the posterior typical set is well-posed for an iterative solver, you might have to tackle expensive solves during warmup. If the solver is well-posed within the extent of the prior, and the posterior contracts within that extent, then initializing from the prior should be sufficient. Stan’s default initialization, however, may not make this possible.
The fact that you’re seeing such expensive solves after only a step suggests that the well-posed region is either very small or intricately shaped.
Oh, and in that case I was able to reduce the time required by init_stepsize() from that 10-25% to a very minimal amount by setting step_size=0.1 instead of the default step_size=1.
Setting step_size=0.1 instead of step_size=1 changes the initial stepsize in the above procedure. That said, the adaptation procedure would still need only 4 iterations to drop an initial step size of 1 below 0.1 (1 → 0.5 → 0.25 → 0.125 → 0.0625); in other words, starting with 0.1 would save only four evaluations of the model. That suggests that the bulk of the expense is coming from just a few iterative solves!
Note that this example demonstrates why this stepsize initialization is so useful. If the sampler started with stepsize=1 then it would evaluate the model multiple times along the initial Hamiltonian trajectory, which would compound the expense if a single solve is already so expensive. By heuristically adapting the step size first, with single steps, the cost of the initial trajectories becomes much more manageable.
Anyways the conversation has drifted a bit from the original issue. The dynamic cost of adaptive solvers, all of which were introduced after the initial design of the interfaces, can cause the state and step size initialization routines to be a nontrivial contribution to the overall runtime. So long as all of the dependent interfaces adapt accordingly I don’t have any problem with adding a time estimate for the initialization procedure as well.
Thank you very much for the explanations!
Summary:
Stan writes to the output CSV file of each chain something like the following.
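(For reference, the comment block looks something like this in recent CmdStan versions; the numbers are illustrative.)

```
#  Elapsed Time: 13.2 seconds (Warm-up)
#                7.8 seconds (Sampling)
#                21.0 seconds (Total)
```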
This total time, however, doesn't include all the computations that have to be done with the model.
Description:
If we look at `run_adaptive_sampler`, we see that the times `warm_delta_t` and `sample_delta_t` measure just the runtime of `util::generate_transitions()` for the warmup and sampling phases, respectively. Then they are written with `writer.write_timing(warm_delta_t, sample_delta_t)`, meaning that the reported total time is their sum. So for example for `hmc_static_diag_e_adapt()`, things that don't get included in the total time are:
A) Finding the initial starting point if not given
B) Calling `sampler.init_stepsize()` in `run_adaptive_sampler()`
There are also other things, such as writing the CSV header and reading input, but I listed these two because they both include log probability and gradient evaluations with the model and therefore, I think, should be included in the reported time when measuring model performance.
A) involves log probability and gradient evaluations until a point where they are finite is found (max 100 tries)
B) involves gradient evaluations and leapfrog steps inside a while loop until some condition is satisfied (https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/base_hmc.hpp)
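For reference, the flow is roughly the following — a compilable but heavily simplified sketch with stand-in stub names, not the actual service code:

```cpp
#include <chrono>

// Stand-in stubs; the real calls live in stan::services::util.
void initialize_params() { /* A) search for finite log prob + gradient */ }
void init_stepsize() { /* B) step size search, see base_hmc.hpp */ }
void generate_transitions(int n) { /* warmup or sampling iterations */ }

int main() {
  using clock = std::chrono::steady_clock;

  initialize_params();  // A) runs outside any clock
  init_stepsize();      // B) runs outside any clock

  auto t0 = clock::now();
  generate_transitions(1000);  // warmup
  auto t1 = clock::now();
  generate_transitions(1000);  // sampling
  auto t2 = clock::now();

  double warm_delta_t = std::chrono::duration<double>(t1 - t0).count();
  double sample_delta_t = std::chrono::duration<double>(t2 - t1).count();
  // writer.write_timing(warm_delta_t, sample_delta_t):
  // the reported Total is warm_delta_t + sample_delta_t, excluding A) and B).
  (void)warm_delta_t;
  (void)sample_delta_t;
  return 0;
}
```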
I found this because I have ODE models where `sampler.init_stepsize()` can require millions of evaluations of the ODE function (confirmed with profiling) and therefore takes a substantial proportion of the actual wall time. I have sent an example to @rok-cesnovar.
Interfaces:
In cmdstanr, there is also a `Sys.time()` timing around the entire program call, so that measures the actual wall time correctly. The `$time()` method of a fit object reports the grand total along with per-chain warmup, sampling, and total times. So the grand total time is correct, but the total time of each chain doesn't include the initialization of parameter values and step size.
Possible solutions
I am not sure what the purpose of B) is and why it is called even if the step size is specified by the user. To me this sounds like a sort of warmup, and could either be included in the warmup time, or the reported time could have a separate "initialization" part that is included in the total time of the chain.
Current Version:
v2.28.1