# 2.26 timing/profiling meta issue #960
Thanks Ben. I will be opening all relevant PRs in a few days and then make another call for feedback. The Math PR should work standalone and can be done independently, while the cmdstan/stanc3 PRs will be interconnected.
Math PR open: stan-dev/math#2265. Both PRs are preliminary and await feedback before polishing and doc-ing. Cmdstan branch ready: https://github.com/stan-dev/cmdstan/tree/profiling

Example of use:

```stan
data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0,upper=1> y[N];
  matrix[N, K] X;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  {
    profile("normal_lpdf alpha");
    target += normal_lpdf(alpha | 0, 1);
  }
  {
    profile("normal_lpdf beta");
    target += normal_lpdf(beta | 0, 1);
  }
  {
    profile("bernoulli GLM");
    target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
  }
}
```

which after running returns:
@rok-cesnovar is there a Stan branch? I gotta clone these things using
Is it just develop?

It's the profiling branch on all repos.
I'm getting the wrong compiler:
The command on build is:
Ooo, but I found your stanc3 binaries over here: https://github.com/stan-dev/stanc3/actions/runs/431332425 and they work.
This is so clutch. I was playing around with a model and I put a timing thing in each block:
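The model itself didn't survive the formatting here; presumably it was something along these lines, with hypothetical data and profile names (a sketch, not the actual model from the thread):

```stan
data {
  int<lower=0> N;
  vector[N] x;
  int<lower=0, upper=1> y[N];
}
transformed data {
  vector[N] x_std;
  profile("tdata");
  x_std = (x - mean(x)) / sd(x);
}
parameters {
  real alpha;
  real beta;
}
transformed parameters {
  vector[N] eta;
  profile("tparams");
  eta = alpha + beta * x_std;
}
model {
  profile("model");
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(eta);
}
generated quantities {
  vector[N] p;
  profile("gq");
  p = inv_logit(eta);
}
```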
I don't think the reverse-pass timing in tdata should show anything. Something weird must be happening there. Also, what are the units on all these printouts? Total time spent? Or seconds per gradient evaluation? Does this include warmup?
Currently, tdata is not supposed to work. I need to add Math support for a profile that doesn't add a vari.
It's the total time spent in seconds; it includes everything.
Thanks for the first batch of feedback!
Fixed
Fixed; in tdata and generated quantities we don't create a vari now. Also added
I tarballed the latest version, now available here: https://github.com/rok-cesnovar/cmdstan/releases/download/profiling/cmdstan-profiling.tgz
I kinda think these should be per gradient. They should probably be split by warmup/sampling as well. Also, I want the distributions. Given we can have up to 1000x more gradients than lp evaluations, that really explodes the output size, especially on something we're trying to benchmark (which seems bad), but we could thin the output or something. This sounds like getting into Stan services lol.
I'll also be fine with summaries if that's what I get in the end, but while for some things we expect the per-gradient timing to be pretty constant (ODEs), for the higher-order functions there could be interesting variation.
If I time my transformed parameters block like:
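Presumably something like this, with the work wrapped in an inner block (a reconstruction; assumes N, alpha, beta, and x are declared in the usual blocks):

```stan
transformed parameters {
  vector[N] mu;
  {
    profile("transformed parameters");
    mu = alpha + beta * x;
  }
}
```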
And then like:
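Versus the profile placed directly at the block level (again a reconstruction, not the original snippet):

```stan
transformed parameters {
  vector[N] mu;
  profile("transformed parameters");
  mu = alpha + beta * x;
}
```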
I get different results. The second gives me timings for "transformed parameters" that are nearly as long as the whole inference took.
Sure, that is doable. There is more to discuss there, specifically what to do with the forward/reverse passes. We can have an unequal number of forward and reverse passes, which then brings us to a few questions:
Hm, that would be a bigger change and would also mean we have to do this through services. That probably means pushing it to the next release, which I am fine with if that is what we decide. Just curious, what is the added value of knowing the warmup/sampling split?
That is also doable, but it would have a larger memory footprint, or we would have to store the profile info in files, which again brings us to services, and profile then has a bit of a larger effect. I do feel these might be useful, yeah. Maybe we can do basic profiling the way it's proposed here and do a verbose one via services.
ODEs take different amounts of time in different parts of parameter space and warmup can spend a lot of time in weird places.
If it's too hard to give raw timing stats or split by warmup/sampling, I'm definitely in for doing some sort of summary output now and expanding it in the future. Just gotta make sure not to do something that's super backwards-incompatible or whatever.
Strongly agree.
This one is due to transformed parameters actually being generated together with everything in the model block. Will look at how we can get around this problem. I suspect the end of the TP block is going to be a bit of a challenge.
This is great. I just timed the different bits of a hierarchical model (something generated from brms here):
This is how the model block looked:

```stan
{
  profile("linear_model");
  mu = Intercept + Xc * b;
}
{
  profile("random effects");
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] = r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n] + r_3_1[J_3[n]] * Z_3_1[n]
            + r_4_1[J_4[n]] * Z_4_1[n] + r_5_1[J_5[n]] * Z_5_1[n] + r_6_1[J_6[n]] * Z_6_1[n]
            + r_7_1[J_7[n]] * Z_7_1[n];
  }
}
```

and then I switched it to use the glm functions, which combine the linear model and the likelihood bit:
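The follow-up snippet didn't survive here; assuming a Gaussian outcome, it would look roughly like the sketch below, with normal_id_glm_lpdf standing in for whatever GLM likelihood the model actually used (Y and sigma are assumed brms-style names):

```stan
vector[N] re;
{
  profile("random effects");
  for (n in 1:N) {
    // group-level terms only; the fixed-effect part moves into the GLM call
    re[n] = r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n]; // ... remaining terms
  }
}
{
  profile("glm");
  // fuses the linear model and the likelihood in a single call
  target += normal_id_glm_lpdf(Y | Xc, Intercept + re, b, sigma);
}
```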
Edit: edited to make the labels in the code and output consistent.
Can the timing be recursive? Like, I put a profile in my model block which contains other profiles. Now I get:
But could we get:
And then if there's a profile inside a function, it might show two different entries depending on where it's used. Like how profilers let you group the output times.
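For concreteness, the kind of nesting being asked about would be something like this (hypothetical names; assumes the usual declarations for alpha, beta, X, and y):

```stan
model {
  profile("model");  // outer profile covering the whole block
  {
    profile("priors");
    alpha ~ normal(0, 1);
  }
  {
    profile("likelihood");
    y ~ bernoulli_logit(alpha + X * beta);
  }
}
```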
Yes, the only limitation on where you can place profiles is that you can't have

```stan
{
  profile("name");
  fun1();
  profile("name");
  fun2();
}
```

so two "active" profiles with the same name. This is allowed:
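Presumably along these lines, reusing the same hypothetical fun1()/fun2():

```stan
{
  profile("name");
  fun1();
}
{
  profile("name");
  fun2();
}
```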
And "name" would be the sum of the two. You can also use it in a for loop.
Maybe, yeah.
This is so cool! Is the scope an RAII-like thing on the C++ side? The Stan transpiler should be able to check for active profiles and disallow duplicates. If the design goes onto design-docs, I'm happy to review. I'm also happy to write the doc if you'd like help with that.
Yes. This works great and is very clean, because blocks in Stan = blocks in C++. The only problem we have is that transformed parameters are not actually encapsulated in a block in C++, which means that a profile placed directly in the TP block, like this one, measures more than just the TP code:

```stan
transformed parameters {
  vector[N_1] r_1_1; // actual group-level effects
  profile("TP");
  r_1_1 = (sd_1[1] * (z_1[1]));
}
```
I opened a Math PR, stan-dev/math#2265, so you can see the Math side of things. It's not review-ready yet, because I did not go into doxygen and tests before we finalize what we actually want.
I am doing the first round of feedback to hash out any trivial stuff I missed and to get additional ideas. I will open a design doc in a few days. Will make sure to tag you.
Thank you! I will probably require help with the docs, as I am involved with a bunch of projects for 2.26 and might run out of time. And we want this doc-ed well.
For now, this triggers a runtime error (trying to re-activate an already active profile).
Hmm, I wonder if we should group the output by what block they're run in? Like transformed data gets run once with doubles, generated quantities gets run for every draw with doubles, transformed parameters is messy with the thing @nhuurre found, but the model block is every gradient. And so then per-gradient numbers are weird, cause not everything is a gradient. I guess per-call? I ran an ODE model with the intent to check if the other solvers were running faster:

```stan
functions {
  vector dz_dt(real t,       // time
               vector z,     // system state {prey, predator}
               real alpha,
               real beta,
               real gamma,
               real delta) {
    real u = z[1];
    real v = z[2];
    vector[2] dzdt;
    dzdt[1] = (alpha - beta * v) * u;
    dzdt[2] = (-gamma + delta * u) * v;
    return dzdt;
  }
}
data {
  int<lower = 0> N;           // number of measurement times
  real ts[N];                 // measurement times > 0
  real y_init[2];             // initial measured populations
  real<lower = 0> y[N, 2];    // measured populations
}
parameters {
  real<lower = 0> alpha;
  real<lower = 0> beta;
  real<lower = 0> gamma;
  real<lower = 0> delta;
  vector<lower = 0>[2] z_init;  // initial population
  real<lower = 0> sigma[2];     // measurement errors
}
transformed parameters {
  vector[2] z[N] = ode_rk45_tol(dz_dt, z_init, 0, ts,
                                1e-5, 1e-3, 500,
                                alpha, beta, gamma, delta);
}
model {
  {
    profile("priors");
    alpha ~ normal(1, 0.5);
    gamma ~ normal(1, 0.5);
    beta ~ normal(0.05, 0.05);
    delta ~ normal(0.05, 0.05);
    sigma ~ lognormal(-1, 1);
    z_init ~ lognormal(log(10), 1);
  }
  {
    profile("likelihood");
    for (k in 1:2) {
      y_init[k] ~ lognormal(log(z_init[k]), sigma[k]);
      y[ , k] ~ lognormal(log(z[ , k]), sigma[k]);
    }
  }
}
generated quantities {
  vector[2] zb[N];
  {
    profile("adams");
    zb = ode_adams_tol(dz_dt, z_init, 0, ts,
                       1e-5, 1e-3, 500,
                       alpha, beta, gamma, delta);
  }
  {
    profile("bdf");
    zb = ode_bdf_tol(dz_dt, z_init, 0, ts,
                     1e-5, 1e-3, 500,
                     alpha, beta, gamma, delta);
  }
  {
    profile("rk45");
    zb = ode_rk45_tol(dz_dt, z_init, 0, ts,
                      1e-5, 1e-3, 500,
                      alpha, beta, gamma, delta);
  }
}
```

Anyway, the idea is to see which solver is faster on the problem. The output is:
which definitely gives me the info I want, but all those numbers kinda come from different places, and so maybe they should be arranged a bit more.
Let's discuss this in the design doc. To me, the cumulative time is the easiest to process. For example, the per-run time could be the biggest for transformed data, but we know that in the grand scheme of things it's most likely negligible.
Hm, these both sound like good-to-haves, yes. But both of these can be achieved by the user labeling the profiles. So you could label GQ profiles as
Same for
Just label the likelihood as either " - likelihood" or "model - likelihood", and if the profiles are shown in the order they appear, both would be shown the way you want. We definitely need to order the profiles based on how they appear in the model. They are currently shown alphabetically, but that was not the intent; that is just the order they come out in when I iterate over the map.
Can we have the same tag in multiple blocks? Often the pieces I time are a combination of things going on in generated quantities, the model block, and transformed parameters, so I'm not sure it makes sense to summarize by block.
Yes, the current implementation would accumulate profiles with the same name.
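For example, reusing the Lotka-Volterra program from above, something like this would report a single "solver" entry whose time accumulates across the model block and generated quantities (a sketch; assumes the same data and functions as before):

```stan
model {
  {
    profile("solver");
    vector[2] z[N] = ode_rk45_tol(dz_dt, z_init, 0, ts,
                                  1e-5, 1e-3, 500,
                                  alpha, beta, gamma, delta);
    for (k in 1:2) {
      y_init[k] ~ lognormal(log(z_init[k]), sigma[k]);
      y[ , k] ~ lognormal(log(z[ , k]), sigma[k]);
    }
  }
}
generated quantities {
  vector[2] zb[N];
  {
    profile("solver");  // same name: time adds to the model-block profile
    zb = ode_rk45_tol(dz_dt, z_init, 0, ts,
                      1e-5, 1e-3, 500,
                      alpha, beta, gamma, delta);
  }
}
```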
The only thing that is currently bothering me is how we could profile the parameter transforms (lower, upper, ...). I don't have a clean solution for that at the moment, so I am leaning towards not supporting it at this time.
Well, you could just allow a profile statement in the parameters block--but I don't think transforms are ever going to be a performance bottleneck.
This is the output now:
I tried to format this like the current Stan summary output:
Notes:
Yeah, on thinking more, I think just reporting the total is the easiest thing to think about. That way it's the total time/memory/autodiff calls within chains and also the totals across chains. The autodiff information section -- there doesn't appear to be much information in here, cause the profile statements are not in loops or anything. If you time something in a loop you'll get some variation. I wonder where the likelihood is evaluated without AD? Also, I think we should write a separate stansummary for this stuff. I can help tomorrow. I think it should just have the
This is the output now:
Yes, except if an exception occurs, then there is no reverse pass.
Yes, but if we want to check how much time we spend per gradient eval, or how much of the chain stack is used per gradient, we need this info.
Agree. What should the default
We don't want to add that to the comments, in my opinion.
I think we should, yes. I like the timing/memory/AD splits.
I like the bin/profilesummary idea. I would maybe prefer this to be handled by stansummary, but we would then have to add comments to the CSV, which I would prefer not to do. So a separate one is fine, I guess.
If you indent code blocks you gotta indent every line, I think.
My gut reaction is none, but it would probably be confusing to add profile information and have it appear nowhere. (Edit: like, to get profiling output you have to add profiling commands and tell the interface something -- pretty annoying.) Let's go with
I still think we should ditch this column. We can keep it for the RC and check how people feel about it.
Sure, no meta information is fine with me.
Do
It is annoying to add a program, but it would be at least as annoying to try to detect CSV types or add another flag to stansummary.
Ok. It's always easier to add things than to remove them. Maybe I am just overthinking how useful ChainStack per gradient eval is. If we have doubts about whether we want it, it's best not to add it.
We can count reverse passes instead, so this is the exact number and it's easy to explain (no need to say "the number of reverse-pass chain calls except for exceptions"). If we're ever in a situation where forward-pass AD calls differ much from reverse-pass ones, things have gone off the rails in other ways, so just one of FwdAdCalls and RevAdCalls is necessary.
The stanc3 PR is in, the CmdStan PR is ready to be merged now, I believe, and the docs PR that adds basic information is also ready. Expanded docs will be added during the freeze.
Summary:
This is to track the progress of the timing/profiling project.
A prototype is implemented following the design here.
Stan: Add a virtual get_profile_data method stan#2997
Current version: v2.25.0