-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added regression syntax design doc #27
base: master
Are you sure you want to change the base?
Conversation
Tagging @enetsee who is also interested in this topic! |
Short version. If we stick to just lme4 stuff initially, there are two
cases;
- Don't marginalize out anything, in which case probably dense is enough
- Marginalize out stuff, in which case we need to be able to do sparse
matrix stuff (like in the lme4 paper for the Gaussian case, or similarly
with Laplace approximations).
This is an extremely cool idea!! and it would be very nice to our Python
users (but also to our R users, who currently translate from brms and then
modify, which is messy).
And obviously we can add cool things like GPs (dense) and spatial models
(often sparse), and smoothing splines (often sparse), and time series
models (sparse). It would be ideal if we could keep as much of the brms
notation as possible for these complex models...
…On Sat, 26 Sep 2020 at 19:08, Ben Bales ***@***.***> wrote:
I've been thinking about some sort of regression syntax in Stan. It's kind
of a follow up to Speeding up hierarchical models
<https://discourse.mc-stan.org/t/speeding-up-hierarchical-models/14783>.
When I say regression syntax I'm talking about the interfaces
(brms/rstanarm/lme4/glmer/lm/...) that take in something like the R
regression syntax (y ~ 1 + x + (1 | group) + ...) and do something with
it.
It's probably something that has come up before, but there's just so much
there it's hard to talk about it without writing a lot down. So I tried to
write down the bare minimum justification and implementation of a
regression syntax in Stan.
Stan is written to do regressions and more, so at face value this doesn't
seem that important but I'm trying to hit three points:
1. Make it possible to avoid design matrix libraries outside of Stan
2. Package up big computation to send it off to a GPU/accelerator
3. Build design matrices if you need them
Now that I got that written out I have some questions hoooray:
1.
@jgabry <https://github.com/jgabry> @paul-buerkner
<https://github.com/paul-buerkner> @bgoodri
<https://github.com/bgoodri> does avoiding external design matrices
buy us anything from an interface-writing perspective? I thought it would
when I wrote this, but I've never actually written a high level regression
interface and so I have my doubts.
2.
@SteveBronder <https://github.com/SteveBronder> @t4c1
<https://github.com/t4c1> @rok-cesnovar
<https://github.com/rok-cesnovar> is this syntax enough to pack things
up into big, GPU-friendly calculations?
3.
@dpsimpson <https://github.com/dpsimpson> is it actually useful to
build out design matrices in the way I've written here? I remember the
sparse stuff you talked about we do linear algebra on them, but maybe this
syntax isn't enough to build the matrices you need and you're left having
to pass them in from the outside anyway.
Rendered markdown is here
<https://github.com/bbbales2/design-docs/blob/feature/0026-regression-syntax/designs/0026-stan-regression-syntax.md>
------------------------------
You can view, comment on, or merge this pull request online at:
#27
Commit Summary
- Added regression syntax design doc
File Changes
- *A* designs/0026-stan-regression-syntax.md
<https://github.com/stan-dev/design-docs/pull/27/files#diff-efdbb699c90c4837f62d2dbb93124ad2>
(256)
Patch Links:
- https://github.com/stan-dev/design-docs/pull/27.patch
- https://github.com/stan-dev/design-docs/pull/27.diff
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#27>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADRICBWL4WYYLHDYG2VQFRDSHZX6VANCNFSM4R3FTBEA>
.
|
So this is just new syntax for something that is already possible? While I am more familiar with the existing syntax I guess that is fine if that syntax is closer to majority of Stan users. I have to say I do not see what is the point of the design matrix stuff. To me it just looks like forcing sparse algebra, which will probably be slower that dense version. Are there any advantages in terms of usability? Anyway, the dense should work fine with GPUs. With kernel generator it won't be hard to implement - we can just write something very similar to the vectorized form of the model. Although the CPU implementation will be hard to optimize, due to the version of Eigen we are using not supporting indexing with vectors. On the other hand we do not have any GPU implementations for sparse operations yet. |
From my point of view, this isn’t a speed thing. It’s shifting something
that is currently done separately for each interface (ie a thing that is
only available in RStan) into the language so that there is consistent
access to it.
It also lets us build a brms-like higher lever interface for all of the
versions of Stan, which should help users immensely.
A thing that is annoying is that that means the new syntax will need to
consistently support factors, which is the thing that model.matrix in R
does really well.
…On Mon, Sep 28, 2020 at 05:23 Tadej Ciglarič ***@***.***> wrote:
So this is just new syntax for something that is already possible? While I
am more familiar with the existing syntax I guess that is fine if that
syntax is closer to majority of Stan users.
I have to say I do not see what is the point of the design matrix stuff.
To me it just looks like forcing sparse algebra, which will probably be
slower that dense version. Are there any advantages in terms of usability?
Anyway, the dense should work fine with GPUs. With kernel generator it
won't be hard to implement - we can just write something very similar to
the vectorized form of the model. Although the CPU implementation will be
hard to optimize, due to the version of Eigen we are using not supporting
indexing with vectors. On the other hand we do not have any GPU
implementations for sparse operations yet.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#27 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADRICBWCVHAGSZPTNB5OVSDSIBIYVANCNFSM4R3FTBEA>
.
|
@seantalts - thanks for the nudge! @bbbales2, as Sean mentioned, I'm very interested in this topic and is a use case I had in mind when we were looking at the structure of the compiler. Rather than adding this as new syntax to Stan the language, the idea was to have an alternative frontend just for regression models which elaborated to core Stan or the intermediate representation we use in the compiler (i.e. we have an additional step where we recover anything that was left implicit in the terser language of regression). There is some really interesting work by Andy Gordon on a 'regression calculus' here (which is just a fancy way of describing a core language for regression along with its meaning) which I was keen on extending to add BRMS like functionality. I spoke with Andy Gordon about this briefly at StanCon in Cambridge but I haven't had too much time to follow up of late. There was also a presentation at StanCon of another front end language, Chronikis which was targeted at time series models and again, elaborated to Stan. In terms of motivation, I think we have to recognize that syntax is important: it facilitates the expression of recurring patterns and allows us to restrict expressiveness. With specialized front-end we can also introduce stricter semantics allowing us to pick up errors more easily. The very fact that BRMS has a large user base is evidence of the utility of this approach. I would be very keen to get involved with this from a language design perspective. |
I agree this is a very cool idea, and I am happy to provide my expertise with brms in making this possible. Essentially, I do see that a little bit like (a subset of) brms but directly in Stan with potentially much more opportunities to optimize the code on a lower level (brms does try to write optimized Stan code but of course has only access to the top-level of the Stan language itself). Additional advantages would be that we would get methods such as There are obvious challenges of course but also a few perhaps not as obvious ones, for example, how to deal with predicting new data (which is a quite relevant aspect in brms users' workflows) or how to handle predictions for new levels of the groupings factors. Ideally, we could even achieve some sort of marginalization not only for estimation but also for predictions (e.g., as discussed in paul-buerkner/brms#489). Some of these features may go to far for the Stan regression syntax but would definitely improve its usability a lot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple little comments. My main Qs are mostly around how we would use this syntax to optimize things on the backend. @t4c1 if you check out the lmer paper they go into details on why the sparse matrix representation can be more efficient for large numbers of random effects. Though I do wonder if there is a way we can keep this dense while being efficient
https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
|
||
A basic regression might be: | ||
``` | ||
y ~ 1 + x + (1 | group) + (x | group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the 1
in the random effects part what would the generated name be? Just like [group name]_intercept
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would depend on the package, but I'd expect something like that (and for it to be a vector with the number of elements of the group).
|
||
So if the following returns a length `N` vector: | ||
``` | ||
covariate .* (parameters | group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this always going to be a .*
? If so I think this would be easier to work with in the compiler and C++ code if we directly put the covariates in the () like
(covariate, parameters | group)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that seems good. I am nervous the .*
there was actually more specialized than a .*
in the wild might be, if that makes sense. The Fabular thing that @enetsee linked above does something like:
(covariate {parameters} | group)
Not opposed to something else either.
vector[N] mu = intercept + slope * x + | ||
(group1_intercepts | group1_idx) + x .* (group1_slopes | group1_idx) + | ||
(group12_intercepts | group1_idx, group2_idx) + | ||
x .* (group12_slopes | group1_idx, group2_idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this I'm not really sure exactly how we'd optimize this on the backend. I kind of like something similar to your original proposal for a hiearchy()
function for something like
vector[N] mu = intercept + slope * x +
hierarchy(f, group1_idx, group2_idx, group12_intercepts, x, group12_slopes)
Where f
is a function with a signature like vector f(real intercept, matrix x, vector slope)
that would be applied to each group (in this case the group is the combo of groups 1 and 2).
I wonder if we could do anything useful if we have a group_index
type in the stan language? Or could we give matrices an index? Maybe something like
group_index[N] group_idx1;
group_index[N] group_idx2;
// Anything after first two dims for a matrix are indices?
matrix[N, M, group_idx1, group_idx2] X;
// Or maybe a tag?
matrix[N, M] X index(group_idx1, group_idx2);
Though I think the issue with what I have here is that we would have to pay for a sort / unsort if the data was not presorted so we could iterate over the groups nicely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure exactly how we'd optimize this on the backend
I imagined this would be something recognizable by the compiler. Like, "oh, I have a bunch of design matrices, better lump them together and turn them into a big hierarchy
statement".
But also I think @t4c1 likes expressions and I could see that working too, so it kinda all comes together at runtime. As long as it gets into a big hierarchy
-like statement it's good.
group_index type in the stan language
we would have to pay for a sort / unsort if the data
Is the goal here to express a sparseness or to express the ordering of the matrix to make accesses efficient in some way?
Interesting! I'd love to be involved in some of this discussion. Also could someone tag Bob Carpenter (I'm not sure how to do that), as he and Matt Hoffman and I were talking a lot about these issues back in 2011. Also Jonah/BenG/Aki, because this stuff would be directly relevant to our forthcoming book on applied regression and multilevel models.
Andrew
… On Sep 29, 2020, at 2:23 AM, Paul-Christian Bürkner ***@***.***> wrote:
I agree this is a very cool idea, and I am happy to provide my expertise with brms in making this possible. Essentially, I do see that a little bit like (a subset of) brms but directly in Stan with potentially much more opportunities to optimize the code on a lower level (brms does try to write optimized Stan code but of course has only access to the top-level of the Stan language itself). Additional advantages would be that we would get methods such as posterior_predict and log_lik (using our R names right now) for these models directly in Stan, which is otherwise a little ugly to do in the generated quantities block.
There are obvious challenges of course but also a few perhaps not as obvious ones, for example, how to deal with predicting new data (which is a quite relevant aspect in brms users' workflows) or how to handle predictions for new levels of the groupings factors. Ideally, we could even achieve some sort of marginalization not only for estimation but also for predictions (e.g., as discussed in paul-buerkner/brms#489 <paul-buerkner/brms#489>). Some of these features may go to far for the Stan regression syntax but would definitely improve its usability a lot.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub <#27 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZYCUOKMQKTJLFGLGKMPHTSIF4PVANCNFSM4R3FTBEA>.
|
@dpsimpson Yeah I like the Separately I would like some sort of Stan library system with a package manager. I think that's the only practical way we're going to get all the fancy design matrices and stuff that would need built for splines, approx GPs, etc in Stan models. Pumping that stuff through Stan math would be too slow development-wise. I think if we had a more formal/centralized way for people to share their code, we'd see a lot of interesting code getting shared (and avoid everyone having to reinvent things, digging in forums for code snippets, etc.).
@t4c1 Yeah, having the design matrices and being able to evaluate/use them efficiently is a plus. Part of the goal is usability and part of the goal is to make it easier for the compiler/runtime to recognize that a bunch of calculations are related. A little of both.
@enetsee The Fabular thing was interesting. Yes this is very much what I want to see in Stan. I like how they talk about breaking the regressions out into different levels. I also like the variables being explicit, but I'm probably not considering some advantage of them being implicit. I had a look at Chronikis. I was kinda thinking about domain-specific Stan packages (alternative frontends like you're saying, and I talked to @imadmali about one). The problem with front end languages is I don't know compiler stuff and so the extent of compiling I want to do is Python format-strings sorta code generation. The discourse thread at the top (this one) has sortof a brute-force giant function that we could use for evaluating the right hand side of a hierarchical regression. This would be something we could compile a higher level language to. The downsides to this are it's not terribly easy to read the syntax, we don't get the design matrices themselves, and there's no hierarchy of regressions like in the regression calculus paper. I don't want to clutter the Stan language up with too much extra syntax, but I also wouldn't want to limit it to a super-conservative syntax either (cuz the intention is lots of people happily use Stan and the interface languages -- not filter everyone from Stan to interfaces).
@paul-buerkner Could we talk about the workflow stuff on this sometime? I am rather scared of forgetting something here. Maybe in a couple weeks after the 2.25 release (on the 19th)? @enetsee or @andrewgelman or anyone else if you want to come to this, that's great. I'll update the design doc with whatever we talk about. |
Happy to have a chat about this. Just let me know here or via email.
Am So., 4. Okt. 2020 um 23:57 Uhr schrieb Ben Bales <
[email protected]>:
… And obviously we can add cool things like GPs (dense) and spatial models
(often sparse), and smoothing splines (often sparse), and time series
models (sparse). It would be ideal if we could keep as much of the brms
notation as possible for these complex models...
@dpsimpson <https://github.com/dpsimpson> Yeah I like the y ~ gp(x) sorta
thing that brms does.
Separately I would like some sort of Stan library system with a package
manager. I think that's the only practical way we're going to get all the
fancy design matrices and stuff that would need built for splines, approx
GPs, etc in Stan models. Pumping that stuff through Stan math would be too
slow development-wise. I think if we had a more formal/centralized way for
people to share their code, we'd see a lot of interesting code getting
shared (and avoid everyone having to reinvent things, digging in forums for
code snippets, etc.).
So this is just new syntax for something that is already possible?
Are there any advantages in terms of usability?
@t4c1 <https://github.com/t4c1> Yeah, having the design matrices and
being able to evaluate/use them efficiently is a plus. Part of the goal is
usability and part of the goal is to make it easier for the
compiler/runtime to recognize that a bunch of calculations are related. A
little of both.
There is some really interesting work by Andy Gordon on a 'regression
calculus'
another front end language, Chronikis
@enetsee <https://github.com/enetsee> The Fabular thing was interesting.
Yes this is very much what I want to see in Stan. I like how they talk
about breaking the regressions out into different levels. I also like the
variables being explicit, but I'm probably not considering some advantage
of them being implicit.
I had a look at Chronikis. I was kinda thinking about domain-specific Stan
packages (alternative frontends like you're saying, and I talked to
@imadmali <https://github.com/imadmali> about one). The problem with
front end languages is I don't know compiler stuff and so the extent of
compiling I want to do is Python format-strings sorta code generation.
The discourse thread at the top (this one
<https://discourse.mc-stan.org/t/speeding-up-hierarchical-models/14783>)
has sortof a brute-force giant function that we could use for evaluating
the right hand side of a hierarchical regression. This would be something
we could compile a higher level language to.
The downsides to this are it's not terribly easy to read the syntax, we
don't get the design matrices themselves, and there's no hierarchy of
regressions like in the regression calculus paper. I don't want to clutter
the Stan language up with too much extra syntax, but I also wouldn't want
to limit it to a super-conservative syntax either (cuz the intention is
lots of people happily use Stan and the interface languages -- not filter
everyone from Stan to interfaces).
There are obvious challenges of course but also a few perhaps not as
obvious ones
@paul-buerkner <https://github.com/paul-buerkner> Could we talk about the
workflow stuff on this sometime? I am rather scared of forgetting something
here. Maybe in a couple weeks after the 2.25 release (on the 19th)?
@enetsee <https://github.com/enetsee> or @andrewgelman
<https://github.com/andrewgelman> or anyone else if you want to come to
this, that's great. I'll update the design doc with whatever we talk about.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#27 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCW2AG5DHNRNSUB6PU4PBLSJDVUHANCNFSM4R3FTBEA>
.
|
@paul-buerkner @enetsee @andrewgelman are you free any time in the next couple weeks to talk about the not-so-obvious challenges here? Edit: I'm available basically anytime. If there's no obvious other preferable time, before or after the Stan meeting Thursday is good. |
I would be flexible on most days as well. Feel free to suggest a time. |
@paul-buerkner let's plan on before the Stan meeting then. That's 2PM UTC Thursday (10AM EST) if I'm doing my math correctly. @enetsee if you want to come and that doesn't work, pick another time. Sounds like we're both flexible. I'll e-mail a call link before then. |
If you want to talk this week, would it also work after the Stan meeting? |
After is good too, so let's say 4PM UTC, noon EST |
Hi, just curious what happened with this? |
@andrewgelman had a meeting and I never updated the design doc. There were some things there worth writing down, but the thing that was nagging me when I was trying to update this were unresolved problems with defining parameters that might have different sizes as a function of data. Problems like, you have an interaction:
And then you have a model where in your data you don't have at least one of all the ages and educations. I sorta got stuck here. And if you don't solve this problem you still leave a lot of annoying shuffling of stuff to the outside. |
I'm not sure I talk about this problem in the design doc, but I think I am assuming that it's solved in some way, and that seems like a big assumption. |
I've been thinking about some sort of regression syntax in Stan. It's kind of a follow up to Speeding up hierarchical models. When I say regression syntax I'm talking about the interfaces (brms/rstanarm/lme4/glmer/lm/...) that take in something like the R regression syntax (
y ~ 1 + x + (1 | group) + ...
) and do something with it.It's probably something that has come up before, but there's just so much there it's hard to talk about it without writing a lot down. So I tried to write down the bare minimum justification and implementation of a regression syntax in Stan.
Stan is written to do regressions and more, so at face value this doesn't seem that important but I'm trying to hit three points:
Now that I got that written out I have some questions hoooray:
@jgabry @paul-buerkner @bgoodri does avoiding external design matrices buy us anything from an interface-writing perspective? I thought it would when I wrote this, but I've never actually written a high level regression interface and so I have my doubts.
@SteveBronder @t4c1 @rok-cesnovar is this syntax enough to pack things up into big, GPU-friendly calculations?
@dpsimpson is it actually useful to build out design matrices in the way I've written here? I remember the sparse stuff you talked about we do linear algebra on them, but maybe this syntax isn't enough to build the matrices you need and you're left having to pass them in from the outside anyway.
Rendered markdown is here