Adjoint ode design #37
Conversation
For anyone who wants to get their hands dirty with the running prototype of this, you need to:
An example model I used for benchmarks is in |
Alright, so this is how it worked for me via @wds15 's instructions:
And then this showed up, which also worked for me:

I've found this template for the design doc: I guess we want to keep to that template? Is a "Stan programmer" as mentioned in the template a user of Stan ("programming Stan models") or a developer of Stan? I guess a simple example comparing forward/adjoint would be helpful both for users and developers?

Anyhow, some comments:
What needs to be done? |
What we need to do in my view is
Did I miss any option to expose for the Ctl thing? Thanks for the other comments, will update the doc accordingly. |
Hmmm, I believe that, at least for the interpolation method, polynomial interpolation appears to be preferred. The manual (https://computing.llnl.gov/sites/default/files/public/cvs_guide-5.7.0.pdf) states:
As for the number of steps between checkpoints, there probably is no way to tell a priori which number is best. As for the tolerances, 🤷‍♂️. We'll need a suite of test cases for that. I do believe more and simpler (scalable) examples would be useful, not only to evaluate the use cases of the adjoint method. They could also be presented to users as examples of when which method performs well? |
I will switch to the polynomial method for the simpler method then. Maybe we even drop the option for the polynomial entirely?

For the steps between checkpoints: I tested with example values like 10, 50, 250, 500 and found that you see speedups when going from 10 to 250, but then it got flat. In the CVODES examples they use 100s (100 or 500) as I recall. Memory-wise I would not expect an issue here at all. ODE problems are small in a Bayesian context as I have seen it (or you will have to wait forever for your results).

Tolerances: I opt for making the chosen tolerance the loosest tolerance you get from this. So forward & backward have higher precision and the tolerance given is the one for the computed parameter gradients.

For now we need to come to the conclusion that this is all good enough to land in Stan. Maybe we mark the simple interface as "experimental" in the doc to warn people that we may change the underlying numeric details? Do you have any specific examples in mind before proceeding?

Tagging @charlesm93 who wanted to get his hands dirty on this. |
Section 1:
We should be consistent with what terms are kept in the scaling. Either N (M + 1) and 2 * N + M or N * M and N + M.
I think that it’s worth noting the overheads here, even just C_{sens} = N * (M + 1) + overhead_{sens} and C_{adj} = 2 * N + M + overhead_{adj}, and mentioning that the adjoint overhead is larger and can be limiting for small systems. This will also better prepare the reader for the discussion in Section 6.
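To make the proposed scaling concrete (my numbers, purely illustrative): for a system with N = 100 states and M = 200 parameters,

$$C_{\text{sens}} \sim N(M + 1) = 20100, \qquad C_{\text{adj}} \sim 2N + M = 400,$$

i.e. roughly a fifty-fold difference in the number of coupled states and one-dimensional quadratures to integrate, before accounting for the respective overheads.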
Section 3:
"the adjoints of the input operands are directly computed given the input adjoints" reads pretty awkwardly. Firstly, to avoid confusion I recommend being explicit and redundant and everywhere referring to "autodiff adjoints" when referring to the tangents/cotangents passed along the expression graph and "system adjoint" or "ODE adjoint" when referring to the backwards solve method for propagating the autodiff adjoints through the ODE solve. Secondly, using "input" to refer to both the operands and the autodiff adjoints is a bit confusing given that each refers to a different direction through the expression graph.
Section 4:
Derivations for general l(theta) can be hard to connect back to the autodiff circumstance; I recommend always mentioning that taking l(theta) = delta * y(t) will recover the adjoint-jacobian product needed for reverse mode autodiff.
I recommend noting that lambda is a vector, one component for each state, and the backwards solve (5) requires solving a coupled system of N equations (hence the cost of N). Each parameter sensitivity is one-dimensional, which is why each of the corresponding integrals can be solved with a straightforward, one-dimensional quadrature (it might also help to refer to "one-dimensional quadrature" everywhere instead of just "quadrature"); a sketch of the standard equations follows below.
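For reference, a standard form of the adjoint system (my sketch, following the CVODES conventions rather than the design doc's exact notation): for $\dot{y} = f(t, y, p)$ with a scalar objective $l(p) = g(y(T))$, the adjoint state $\lambda(t) \in \mathbb{R}^N$ solves the backward system

$$\dot{\lambda} = -\left(\frac{\partial f}{\partial y}\right)^{T} \lambda, \qquad \lambda(T) = \left(\frac{\partial g}{\partial y}\right)^{T}\Bigg|_{t = T},$$

and each of the M parameter sensitivities is then recovered by a one-dimensional quadrature,

$$\frac{dl}{dp_j} = \lambda(t_0)^{T} \frac{\partial y_0}{\partial p_j} + \int_{t_0}^{T} \lambda^{T} \frac{\partial f}{\partial p_j} \, dt, \qquad j = 1, \dots, M.$$

Taking g to be delta * y(T), as suggested above, turns this into exactly the adjoint-Jacobian product needed for reverse-mode autodiff.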
I think it would help to list the control parameters currently assumed by the existing integrate_ode_bdf signature.
I really don’t think that we understand these solvers well enough to implement a `ode_adjoint_tol` function with the initial feature. Even ignoring the danger of having some implicit defaults, this signature assumes that uniform tolerances across all of the solvers are useful, for which I haven’t seen any strong support. Users will be able to quickly try out the new function with recommended defaults for all of the control parameters that can be cut and pasted into the Stan program. This has the added benefit of communicating exactly what was run (to correlate with any performance results they share) instead of relying on the exact repo commit when the default parameters might be changing. In the end I don’t see any strong reason for including this signature.
Section 6:
Blanket statements like “ODEs with many states and parameters are currently out of scope for Stan” can be more confusing than useful (everyone has different computing resources and hence thresholds that define practically useful tools, “scope” is ambiguous, performance depends on not just the solve time but also the behavior of the likelihood functions as the number of states and parameters increases, etc). What’s wrong with just saying that the performance of the current methods scales poorly with the number of states and parameters, but the adjoint method offers better scaling? For a fixed computational budget the adjoint method may allow for larger systems to be fit, and for a fixed system it may speed up existing fits.
Section 7:
I think that it will be helpful to differentiate between prior uses of the adjoint method for specific functions and the adjoint method for general autodiff. The adjoint method for general functions is used all over the sciences but the requirement of a fixed function makes it too rigid for general purpose tools like Stan (indeed this is why we didn’t consider the adjoint method much earlier even though it was always in CVODES). Using the adjoint method for adjoint-Jacobian products, however, can be integrated into any reverse mode autodiff implementation. The application to autodiff is the really exciting new bit (the functionality built into tools like JAX for neural ODEs would be a recent prior art).
|
@betanalpha thanks for looking over it.
Wrt. the suggested |
@betanalpha ok, now I hopefully got your other comments straight as well. Thanks for looking into it. I am now talking of "autodiff adjoints" and "ODE adjoint states" as you suggest, which makes sense to hopefully get down the confusion level... I suggest to now launch an experimental phase with Stan users that runs until Easter. The goal is to collect some feedback on

the utility of the method
the utility of the tuning parameters (we can probably drop the interpolation_polynomial, for example... I think)
the utility of the ode_adjoint_tol call
We can collect feedback on the forum and in summarized form in the wiki. After Easter we then proceed with finishing this up and merging it to stan math during April such that it will be ready for the next May release. Sounds like a plan? Tagging @charlesm93, @bbbales2 , @rok-cesnovar & @funko-unko ... @yizhang-yiz |
Can we maybe just slap a warning on the
such that the user may just swap out the function call? Maybe with a notice to include this message if they post in the forums? This would discourage the use of the

PS: Clearly my opinion should have very little weight. |
The laid out plan works for me! Maybe just mention that Easter is the 4th of April. Long-term we might want to use tuples or, even better, structs (though I am not sure there is any work being done on structs) for default arguments. So the signature becomes:

vector[] ode_adjoint_tol_ctl(F f,
                             vector y0,
                             real t0, real[] times,
                             (real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
                              real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
                              int interpolation_polynomial, int solver_f, int solver_b),
                             T1 arg1, T2 arg2, ...)

and you could call it as:

ode_adjoint_tol_ctl(udf, y0, t0, times, default_ode_adjoint_args(), ...)

where default_ode_adjoint_args() would be a built-in function that would return a tuple with the default arguments. One could also do this:

transformed data {
  (real, vector, real, vector, real, real, int, int, int, int, int) args = default_ode_adjoint_args();
  args.7 = 500; // would set max_num_steps
}

Structs with named elements obviously work much better for this. |
Great, thanks.
That all sounds good, although again I think `ode_adjoint_tol` is premature at the moment. Introducing it offers the danger that users will not understand its limitations (it doesn’t matter how many warnings you put in, if you allow defaults people will use them and expect them to work) and we can always add it later without any real cost.
|
Could we not get the same effect using variadic arguments for all of the control parameters?
|
Currently the compiler cannot deduce where the control arguments stop and the UDF args start. For example:

vector[] ode_adjoint_tol_ctl(F f,
                             vector y0, real t0, real[] times,
                             real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
                             real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
                             int interpolation_polynomial, int solver_f, int solver_b)

vector[] ode_adjoint_tol_ctl(F f,
                             vector y0, real t0, real[] times,
                             real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
                             real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
                             int interpolation_polynomial, int arg1, int arg2)

are the same. We could add logic where we check what args f has, deduce based on that, and add the missing control args for the C++ call (the C++ compiler cannot do the same). But that would mean stanc3 would hold the ODE default values and I am not sure I like that. So doable, but not sure it's worth it. I agree with "if you allow defaults people will use them and expect them to work". |
Yeah, I guess we’d need named arguments.
Regardless, the defaults would have to be stored somewhere: either in each program, in the ODE function implementation, or in some default parameter object. Personally I’d prefer the first initially and then the second eventually, if only to reduce the code associated with the ODE implementations (i.e. default configurations are defined and documented in the ODE functions directly and no searching is required).
|
I just updated some more details. @betanalpha did I get your comments in as you had in mind? It would be good to get the design doc agreed so that it's ready to be merged by the end of this month. The only points left open at this stage should be settled during the experimental phase which ends at Easter. The questions open are
As it has been a sticking point to include the simpler

Pinging @bbbales2 @yizhang-yiz @rok-cesnovar @funko-unko @charlesm93 |
Mostly small comments. I'm still not comfortable with including a simpler function in the initial release. Users will not respect any "experimental feature" warnings and they will complain about changing behavior from version to version even if the behavior is explicitly documented as not stable. Even when they don't complain they often end up using the same function without realizing the difference, which is already a huge problem with

Section 1:

“severely costly to fit” -> “expensive to evaluate”. The autodiff method is independent of how hard ODE-based models are to fit.

ressources -> resources

This matters less in the design doc and more in the final user-facing doc, but given the heterogeneity in terminology across fields we should try to mention as many common terms as possible. For example not just “so-called forward-mode ODE method” but rather “forward-mode, direct, sensitivity”.

Section 2:

“susceptible infections & removed” -> “susceptible, infected, and recovered (SIR)”

Section 3:

“merely” -> “many”, although I think “more” by itself is sufficient here.

“in addition to solve M one-dimensional quadrature problems" -> "in addition to solve M one-dimensional quadrature problems that depend on the adjoint states". No harm in being extra clear here to avoid any misunderstanding about the solves being parallelizable or the like.

"We need to first collect some experience before a simpler version can be made available (if that is feasible at all)." I very much agree, but then why is a simpler version being considered for the initial implementation?

Are the quadrature tolerances different for each of the M integrations?

Section 4:

Use theta or p consistently to denote the parameters throughout.

g(t, y, p) will not equal l(p) for the autodiff application -- it will equal dy/dt = f(t, y, p) which allows us to propagate the autodiff adjoint information from any subsequent expression graph. |
Thanks @betanalpha !
I did change my mind in the process of all this. So I do think that we should supply a simplified version to make it easy for users to try this thing out (which is why there could be left-overs saying otherwise). Let's be pragmatic here; I suggest we proceed as follows: we don't provide the simpler version in the Stan language itself. Instead, we pack the simplified function call into the Stan user documentation in the form of a Stan function. ... unless another reviewer here can convince @betanalpha otherwise, of course (sounds unlikely ;) ). Is that reasonable?
Right now the quadrature tolerances are all the same... but I just checked, the CVODES interface allows specifying the absolute tolerance for each parameter separately (I thought that was not possible). Should we change that and make that also a vector for the absolute tolerances? That's an awful lot of stuff to specify then, and it's a bit inconvenient as the length of the vector changes whenever you add or drop a parameter... even worse is that the correspondence of absolute error tolerance to actual parameter is not so transparent in that we have a variadic interface, but the absolute tolerances would correspond to a "flattened" version of the variadic parameter pack... for the sake of simplifying at least something already now, I would leave things as they are right now and keep the same absolute tolerance for all parameters. Agree?
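To illustrate the flattening (a hypothetical example, not part of the proposal): if the variadic arguments of a call were real kappa, vector[2] beta, a per-parameter abs_tol_q would have to be a length-3 vector whose entries follow the flattened order of the pack:

// hypothetical: variadic arguments are (real kappa, vector[2] beta)
vector[3] abs_tol_q = [1e-6,          // tolerance for kappa
                       1e-8, 1e-8]';  // tolerances for beta[1], beta[2]

Adding or dropping a parameter silently shifts which tolerance applies to which parameter, which is the transparency problem described above.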
My intention is to keep "theta" as denoting a super-set of parameters and p as the subset of parameters needed for the ODE. I would like to keep the notation as is, since this sets apart notation in stan-math (theta) and notation in the CVODES manual (p).
Maybe not ideal as it is right now... I will have one more look, but I would like to make it as simple as possible. Possibly we just say that l(theta) is - as an example - a normal lpdf of an ODE state at time T. |
I agree, this way the default values are out in the open in the UDF code and you also get the option to switch to adjoint ODEs "easily". |
There is just one catch with putting the function into the manual: it would be a user-defined Stan function and we do not support variadic arguments to user-defined functions (or do we?). Anyway, we can still agree at a later time point on the simpler function being part of the language itself... unless any other reviewer here has different views. @betanalpha is that a plan? |
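To sketch what such a documentation-only wrapper could look like (all names, tolerance values, and integer encodings below are hypothetical placeholders, not recommendations; the argument types are fixed for a specific model since user-defined functions are not variadic):

functions {
  // hypothetical right-hand side of a specific two-state model
  vector my_rhs(real t, vector y, real theta1, real theta2) {
    vector[2] dydt;
    dydt[1] = -theta1 * y[1];
    dydt[2] = theta1 * y[1] - theta2 * y[2];
    return dydt;
  }
  // non-variadic wrapper; the "default" control values live openly in user code
  vector[] my_ode_adjoint(vector y0, real t0, real[] ts,
                          real theta1, real theta2) {
    return ode_adjoint_tol_ctl(my_rhs, y0, t0, ts,
                               1e-6, rep_vector(1e-6, 2),  // rel_tol_f, abs_tol_f
                               1e-6, rep_vector(1e-6, 2),  // rel_tol_b, abs_tol_b
                               1e-6, 1e-6,                 // rel_tol_q, abs_tol_q
                               10000,                      // max_num_steps
                               150,                        // num_checkpoints
                               2,                          // interpolation_polynomial
                               2, 2,                       // solver_f, solver_b
                               theta1, theta2);
  }
}

Users could then call my_ode_adjoint(y0, t0, ts, theta1, theta2) and edit the control values in place when they need to.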
I read through the design doc and I think overall it's very well written. The only comments I have are that I expect the adjoint method to work better than forward diff also in the small number of parameters, large number of states regime -- to be tested, I'll put something together -- and that one reference we could include is the work on Neural ODEs by Chen et al (https://arxiv.org/abs/1806.07366).

I agree with @betanalpha and @funko-unko that we should have per-parameter absolute tolerances for the quadrature step. This is consistent with other absolute tolerances we have, I don't think it's a burden to users who won't use it, and it can help us study adjoint methods for ODEs in a Bayesian context. It seems a reasonable application would be a case where we use varying absolute tolerances for the forward and backward solves. CVODES looks at different contexts and the Julia package is fairly novel -- though I'd be interested to hear their reasoning.

In my view, the most compelling reason for not having varying absolute tolerances right now is that it's more work for the developers and it'll delay the feature's release -- can someone confirm this? That's fair enough. We can release the version as is, but the design doc should acknowledge that varying absolute tolerance is a desirable extension, if not a top priority, and also that the design as is doesn't hinder this extension. This strikes me as an acceptable compromise.

I'd love some documentation on the benefits of scaling parameters in ODEs, some examples, and how well this works in practice! |
I believe part of the problem with settling the question whether or not these parameter-wise tolerances are needed is that there is no quick/easy way to estimate the impact of these tolerances, neither for us in the very general setting, nor for users with a specific model. I believe the only way we have right now to estimate this is to just run full HMC and see how the sampler performs, which will take "forever". I think it would help a lot if it were (easily or at all) possible to explore how the solver performs for samples from the prior, i.e. what tolerances are needed for a given (estimated) final error in the density and in the gradient. The first you can do in a slightly hacky way (a sketch below), but for the latter you have to jump through a thousand hoops. I strongly believe that this should be made very very (very) easy for the user. One way to do this would be to be able to pass a csv file to |
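For reference, the "slightly hacky way" for the density side might look something like this (my sketch, with hypothetical names; it compares a loose-tolerance solve against a tighter reference solve at each draw, so running it on prior-only draws gives a rough picture of the state error, though not of the gradient error):

generated quantities {
  // crude per-draw estimate of the solver error in the states
  real max_abs_err = 0;
  {
    vector[2] y_ref[n_ts] = ode_bdf_tol(my_rhs, y0, t0, ts,
                                        1e-10, 1e-10, 100000,  // tight reference solve
                                        theta1, theta2);
    vector[2] y_try[n_ts] = ode_bdf_tol(my_rhs, y0, t0, ts,
                                        1e-4, 1e-4, 100000,    // candidate tolerances
                                        theta1, theta2);
    for (i in 1:n_ts)
      max_abs_err = fmax(max_abs_err, max(fabs(y_ref[i] - y_try[i])));
  }
}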
At this point it would be great to really get to the point of rejecting (option 1) or accepting (option 2) with your post. Rationale as to why a specific option has been chosen is a great bonus, of course. The design doc already has a short discussion under "Drawbacks" about the lack of the per-parameter tolerances. Here are a few more points on this:
So far we do not even have any strong use-case for per-parameter tolerances for the parameters. Instead, we have seen that the existing system gives users huge speedups. |
What's the established voting procedure on design docs and on GitHub? |
I am not calling for an established formal ruling here. Instead I brought up the topic at the Stan Gathering last week, Thursday April 8th, and it was suggested to hold a vote here in the way I started it (at least this is how I understood the meeting outcome). To me that is reasonable given that
I hope that makes sense to everyone involved. |
In the past for these we had a one-week voting period where folks would use the GitHub review process to approve or request changes. If the majority approves then we merge the design doc. If we wanted anyone to be able to vote we could also just have a comment people could thumbs-up or down. I'm fine with either but prefer the review process so that all the tallies stay at the bottom of the page and dissents show up as |
Thanks @SteveBronder . Looks like this procedure is missing in the voting proposal discussion. I suggest we stick to the process in that proposal. |
I don't want to get the doc discussion off track so maybe this would be better to discuss in the general meeting Thursday. The actual review process in the design doc readme is a bit vague, which tbh I think is good actually as it lets the scope of the voting pool be decided by the reviewer(s). Whatever way people want to vote is cool and good by me |
We’re in an infinite loop and no progress will be made without a vote, but I want to point out a few things to say my piece.
Our current ODE solvers also do not allow for different tolerances for anything. They are all the same and no one has ever complained about it. There were discussions on extending that to by-state, but that did not happen for various reasons. In any case, all Stan programs using ODEs so far have been working nicely with just a single absolute tolerance for states and gradients. Thus, if this assumption is wrong by now, then this actually affects all our solvers and we should treat this separately from adjoint would be my view.
By this reasoning the adjoint signature should not have any vector tolerances. The fact that the non-variadic arguments expose vector tolerances, as was strongly motivated by stability issues that are starting to arise in more complicated ODE models that are outside of the typical CVODES scope, completely invalidates this line of reasoning.
The usability is severely complicated given one has a variadic interface. One would have to respect ordering of parameters and their storage order. That can be avoided if the user restricts to only using 1d array inputs, but still.
The adjoint solver requires a sophisticated configuration that can have strong effects on the solver stability that we do not yet fully understand (not only how to optimally configure but also how to diagnose poor configurations). Yes, we will always have to balance functionality with robustness, and we will never be able to release functionality that we understand completely. The original ODE solvers in Stan were developed when users were working with much simpler ODE systems, but contemporary applications, especially the large systems for which the adjoint method will be advertised, are much more complex.
Prioritizing simple function signatures to maximize ease of use externalizes the consequences of any poor solver configuration to users who are least prepared to investigate any problems that arise. A non-variadic signature would be a frustration but not an obstruction for the more experienced users working with complex ODE systems who would benefit from this speedup the most.
Indeed, my own resources are limited and in addition I do personally not think that the per-parameter thing is good for users as it complicates usability a lot.
Exposing _any_ of the solver configuration complicates usability.
Limiting what is exposed by forcing suboptimal defaults only shifts the complication from how easy it is to incorporate the function into a Stan program to how easy it is to diagnose and investigate stability problems in the model fit. Usability arguments like this would make a lot more sense if the decision was between exposing _no_ solver configuration options vs exposing _all_ solver configuration options, but the current decision is between exposing just the configuration options that are immediately compatible with the variadic signature versus all configuration options.
The per-parameter tolerance thing can be emulated - to some extent and with some work - using scaling and shifting of parameters and states. This won't work perfectly, but it is totally possible to do,
I would love to see any non-speculative guidance on how this can be accomplished. The quadrature tolerances will depend on how the parameter Jacobians evolve between the observed times _relative to_ the adjoint states. These differential behaviors cannot be controlled by constant, linear transformations of the system that leave the Jacobians invariant.
Finally, the code will be written in a way so that the per-parameter tolerance thing can be added at a later point - should it be deemed useful for problems we encounter (and deem these problems to be important enough to warrant an implementation).
So far we do not even have any strong use-case for per-parameter tolerances for the parameters. Instead, we have seen that the existing system gives users huge speedups.
Placing the burden of proof on one outcome but not the other by framing that outcome as the default and the other as alternative is entirely artificial here.
Right now every comparison of ODE solver performance has been complicated by the tolerance configurations — different solvers can leapfrog each other, and become more or less stable, with non-default tuning. There is no question that the adjoint method will be extremely useful in the right circumstances, but it also exposes even more configuration options that can drastically affect its performance, making the method more fragile.
State tolerance sensitivity has proven to be important for previous solvers, and the current signature exposes all of the state tolerances accordingly for both the forward and reverse solves. While the exposure of the full configuration is somewhat independent of introducing an adjoint solver, this will be the first solver that we’ve introduced since the community has really appreciated how sensitive performance can be to these configurations. This is all great so far.
The adjoint solver, however, also requires one-dimensional quadrature that parallels the reverse adjoint solve. No existing ODE method requires a similar quadrature and so there’s no local evidence from which we can draw about how important the quadrature tolerances might be. The closest analog is the 1D integrator exposed in Stan; although the quadrature algorithm isn’t exactly the same, the tolerances in the 1D integrator can be very important for getting reasonable results out. Beyond the lack of empirical evidence, no one has made any theoretical arguments regarding under what circumstances the adjoint quadrature would be robust or not, so we have little understanding of how fragile the quadrature might be and, more importantly, how to identify any problems arising from a poor configuration. The situation is very much parallel to the state tolerances and to me that places the burden of proof on any argument to _not_ include quadrature tolerances.
The only arguments for including solver tolerances but not quadrature tolerances — independent of design considerations — are based on extrapolating a few documentation examples in packages focusing on different use cases than Stan. What frustrates me here is that the current arguments are all based on maintaining the variadic signature, which was introduced without any consideration for parameter-dependent solver configurations. Instead of considering that fundamental design incompatibility might be an argument for limiting the use of variadic signatures when parameter-dependent solver configurations are relevant the variadic solver is being taken as mandatory and incompatible functionality being rejected by default.
I very much hope that the quadrature tolerances do not end up that important, or that there are straightforward ways to identify poor configurations of those tolerances so that users can at least understand when the adjoint functionality might not be stable enough for their use, so that simpler signatures are sufficient. I do not think we have anywhere near enough evidence to support a simpler signature, however, and exposing one prematurely shifts all risks to unprepared users. At the very least I hope that everyone voting yes here will accept the responsibility of monitoring the forums for problems and providing consistent user support to minimize the costs of a poor user experience.
I continue to advocate for a more careful rollout, first exposing only a signature that requires the full adjoint solver configuration to be tuned so that users understand the requirements of using the functionality and can experiment and provide feedback. If that feedback motivates a simpler signature then we can expose new functions to the language in the future. Forcing speculative signatures, on the other hand, risks deprecated functions that complicate documentation and user experience.
|
I have read through this thread and was present at the online meeting where the issues were clearly laid out and discussed by Sebastian and Charles, among others. |
As per discussion today with @charlesm93 I have added a small sentence to clarify that a vector absolute tolerance for the parameters can be added later to the implementation. That is, the C++ code should be written in a way to allow for an extension should this be deemed necessary at a later point. Otherwise no change. Great to see one vote here (even better that it is positive from my personal perspective). |
I vote yes. Reasons over here |
For what it's worth, I also vote yes. But I do see @betanalpha's point. |
BUMP. One more week to go for voting. So far we have 3 positive votes and no vote against it... though @betanalpha has concerns about the proposal, I have not read a clear vote for or against it. Tagging again interested people I am aware of: @charlesm93 @yizhang-yiz @bob-carpenter @rok-cesnovar @SteveBronder |
I also give a positive vote. Even though I am familiar with the underlying implementation and have worked on the Stanc3 side of things, I do have to acknowledge that I am by no means an expert in modeling with ODEs so put a lesser weight on my vote :) Comments:
vector[] ode_adjoint_tol_ctl(F f,
                             vector y0,
                             real t0, real[] times,
                             (real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
                              real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
                              int interpolation_polynomial, int solver_f, int solver_b),
                             (T1 arg1, T1 arg1_tol), (T2 arg2, T2 arg2_tol), ...)

So bottom line for me is: this seems useful (I trust the judgement of Sebastian, Charles and the rest), should not put a |
+1 I vote yes! |
I also vote yes. |
BUMP. 6 positive votes and no (definite) negative vote by now. Thanks for everyone taking the time to dive into this so far. Voting closes on Monday 26th April 2021. |
My vote: yes. |
BUMP. Last day for voting is today! 7 positive votes and no (definite) negative vote by now. |
I vehemently vote no.
The intent of this implementation was to expose all of the tolerances (and not give them defaults) so that advanced users would be forced to experiment and provide feedback. Prioritizing the variadic signature negates this intent, both in limiting what tolerances can be configured and allowing unexamined use of the method. The design fundamentally changed halfway through the review process solely to rationalize merging a feature.
There are zero principled arguments for why the quadrature tolerances, and only the quadrature tolerances, can be ignored, and no one has provided any strategy for how problems due to misconfigured quadrature tolerances can be, let alone will be, identified and studied.
Note also that in the history of Stan we have never actually deprecated a function from the language. Premature features will persist and users will continue to incorporate them into their Stan programs.
|
IMO allowing deprecation is a healthy thing to do in the long term. Maybe we should have a discussion on that at some point. |
The functions are deprecated in the sense that the compiler will warn the users that a function or some syntax (`#` is deprecated for comments, as is `<-`, etc.) is deprecated. We have not removed them as that would require a bump of the major version, and when that will happen requires a bit more planning and discussion than just a typical release cycle. There is no immediate plan on when we will do that. I have personally been slowly collecting things that will need to be removed once we bump (will open an issue somewhere at some point). Some already have an issue in one of the repos, others do not. So far collected:

Language:
Others:
EDIT: Let's not derail this thread, we can discuss that elsewhere if needed. Just wanted to point out that removing deprecated features is in the plans once we decide to bump. |
My understanding from the start was to expose a function signature which exposes a large set of tuning options - and we do have that now. In fact, I personally was for including an additional bdf variant, but in discussions over it I did concur with others that this functionality is not needed. Everything is a trade-off. There is no variadic vs per-parameter thing here, not at all. Also, I really do not see a need to deprecate at some point the proposed signature in favor of the more complicated one with per-parameter tolerances. Deprecating the simpler signature would only be needed if almost all cases we encounter require the more complicated signature - but that is really unlikely (time will tell). Let's see it positively: we are really making progress here! Solving large ODE systems in Stan will be faster for a very large class of problems (at least for those we benchmarked). |
The functions are deprecated in the sense that the compiler will warn the users that a function or some syntax (`#` is deprecated for comments, as is `<-`, etc.) is deprecated. We have not removed them as that would require a bump of the major version, and when that will happen requires a bit more planning and discussion than just a typical release cycle. There is no immediate plan on when we will do that.
Yes, a major version bump has been “planned” for many years now with political and logistical issues preventing any progress. Deprecation is absolutely a useful tool in software development but until we can demonstrate that we can actually implement deprecation as a development team its use to motivate aggressive feature releases is, in my opinion, weak.
There is no variadic vs per-parameter thing here, not at all.
Quoting Sebastian:
"Right now the quadrature tolerances are all the same... but I just checked, the CVODES interface allows to specify the absolute tolerance for each parameter separately (I thought that is not possible). Should we change that and make that also a vector for the absolute tolerances? That's an awful lot of stuff to specify then and it's a bit inconvenient as the length of the vector changes whenever you add or drop a parameter...even worse is that the correspondence of absolute error tolerance to actual parameter is not so transparent in that we have a variadic interface, but the absolute tolerances would correspond to a "flattened" version of the variadic parameter pack... for the sake of simplifying at least something already now, I would leave things as they are right now and keep the same absolute tolerance for all parameters. Agree?"
A single quadrature tolerance was used in the initial signature based on your misunderstanding of CVODES. Honest mistake. But reframing that initial signature as a default that’s “good enough” is not.
Also, I really do not see a need to deprecate at some point the proposed signature in favor for the more complicated one with per-parameter tolerances. Deprecating the simpler signature would only be needed if almost all cases we encounter require the more complicated signature - but that is really unlikely (time will tell).
The exact same argument holds for _any_ of the solver configuration arguments. Why are we exposing any of them? Why are we just now exposing vector tolerances? Why did everyone agree that a no-tolerance signature was a bad idea?
We’re starting to recognize that they have substantial influence on how the solvers behave and we have _no idea_ which are important. At best the choice of which tolerances are exposed and not exposed is arbitrary, at worst it’s based on forcing the variadic signature.
Let's see it positive: We are really making progress here! Solving large ODE systems in Stan will be faster for a very large class of problems (at least for those we benchmarked).
Introducing a new feature that is not fully understood is progress only when there are not poor consequences for users. Being able to fit larger problems is useful only if the solvers maintain their robustness in those larger problems, especially as prior specification becomes more and more subtle in higher-dimensional spaces.
I have no illusions of convincing anyone here; everyone is very optimistic that things will be fine based on little to no evidence. I again just hope that everyone voting “yes” because they speculate that there won’t be problems will take responsibility and do the work to review user feedback for stability problems due to the adjoint quadrature configuration and help users work through those issues.
|
Voting period is over as of today, 27th April 2021. Result:
Thus, the decision is that we accept this PR as is ... and merge it. Thanks for everyone taking the time to go through the material! The PR for this in stan-math is getting along very well. |
I don't have an opinion on the technical aspects of the design doc, but the discussion prompted me to think about the culture I would personally like to see in the Stan dev community (while others may obviously differ).

I think it is important to bear in mind that everybody here shares a big portion of values - we all want Stan to be useful, reliable, usable, fast, ... We all value the effort others put into improving Stan, we all want our perspectives to be treated seriously. Disagreements like this are IMHO rarely due to some deep fundamental differences. Instead they arise partly from genuine uncertainty about what a design choice will actually entail in the real world and partly from somewhat different weights each of us puts on individual values/goals and how we evaluate the uncertainty.

And I think this should be a reason for humility - some of the judgements any of us makes will be wrong. Some of our values/goals may turn out to lead to outcomes we didn't envision and don't like. Moreover, whether a decision will in fact turn out to be good in practice doesn't in my experience correlate neatly with the ability to convince others or to come up with good arguments. To me a consequence is that in most projects one should expect to rarely have all their requirements/expectations/... met - otherwise it would likely mean that the project only reflects the values of some of its team members.

This humility also IMHO entails that we should try to avoid looking at results of votes (or any decision) as personal wins or losses, or try to keep scores on who was right how many times, or engage in some form of "I told you so" when it turns out somebody was wrong. We share most of our goals and we all benefit if Stan works great and lose out when there are issues. And I felt it is important to remind ourselves of this from time to time. End of my two cents. |
Approving since votes were good!
This PR discusses how the adjoint ODE approach can be integrated with Stan math. The focus of the document is to
I am not sure how to present a rendered version of this. I used this R command to render the text:
Here is a rendered pdf of the document for easier reading:
0027-adjoint-ode.pdf
updated version 7. March 2021
0027-adjoint-ode.pdf
update 23rd March 2021
0027-adjoint-ode.pdf
update 7th April 2021
0027-adjoint-ode.pdf