Refactor Normal and MvNormal #2847
Conversation
Thanks for the PR; this is a random fail on Travis, never mind. Do you see any more code-style nitpicks?
Just PR'ing these as I come across them. If there's a preference for fewer commits, I'm happy to keep this open for a while in case more cross my path. I'm toying with the idea of harmonising all the covariance/precision/Cholesky parameter declarations into classes of their own, to unclutter model declarations and remove the code duplication. I'll PR that separately if it gets anywhere; however, there's a good chance I bump into other tidbits such as the above while doing so.
pymc3/distributions/multivariate.py (outdated)
try:
    self.chol_cov = cholesky(cov)
except ValueError:
    raise ValueError('cov must be two dimensional.') from None
This is causing an error in py2.7: the "raise ... from None" syntax only exists in Python 3.
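As a rough, hedged sketch of one Python 2/3-compatible alternative (using scipy's cholesky as a stand-in for the Theano op in the snippet above, and assuming a dependency on six is acceptable):

    import numpy as np
    import six
    from scipy import linalg

    def chol_or_raise(cov):
        # six.raise_from(exc, None) plays the role of `raise exc from None`,
        # which is a SyntaxError under Python 2.7
        try:
            return linalg.cholesky(np.asarray(cov), lower=True)
        except ValueError:
            six.raise_from(ValueError('cov must be two dimensional.'), None)

    chol_or_raise(np.eye(3))      # works
    # chol_or_raise(np.ones(3))   # raises ValueError('cov must be two dimensional.')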
pymc3/distributions/multivariate.py (outdated)
    dist, logdet, ok = self._quaddist_tau(delta)
else:
    # Use this when Theano#5908 is released.
Pretty sure theano#5908 is released. Is there a way to incorporate this comment? cc @aseyboldt
Ah, I just realized I had gotten it wrong: I thought MvNormalLogp was somewhere in Theano. Will have another go.
Thanks for the work; I've edited the PR title and comment.
One question about the Cholesky/logp implementation. If my understanding is correct, the decomposition of any positive definite (PD) matrix has positive diagonal entries, and Theano/Scipy returns a matrix of NaNs if the covariance matrix is not PD. This is different from the current implementation (which returns an identity matrix with only the [0, 0] entry set to NaN), but I think the difference is trivial. Hence checking the diagonal serves no purpose, and logp should simply return -inf if the Cholesky fails. Is that correct?

Edit: Having now spent a few hours with that bit of code, it does look like it was (somewhat) misconceived, in the sense that it applied logic pertaining to …
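For concreteness, here is a minimal numpy/scipy sketch of that behaviour (illustration only, not the Theano implementation in this PR; note that the Theano Cholesky op discussed here signals failure with NaNs rather than by raising):

    import numpy as np
    from scipy import linalg

    def mvnormal_logp_sketch(value, mu, cov):
        # If the covariance is not positive definite, the Cholesky
        # factorisation fails and the log density is treated as -inf.
        try:
            chol = linalg.cholesky(cov, lower=True)
        except linalg.LinAlgError:
            return -np.inf
        delta = value - mu
        z = linalg.solve_triangular(chol, delta, lower=True)
        quaddist = np.sum(z ** 2)                # squared Mahalanobis distance
        logdet = np.sum(np.log(np.diag(chol)))   # equals 0.5 * log|cov|
        k = len(mu)
        return -0.5 * quaddist - logdet - 0.5 * k * np.log(2 * np.pi)

    print(mvnormal_logp_sketch(np.zeros(2), np.zeros(2), np.eye(2)))   # finite
    print(mvnormal_logp_sketch(np.zeros(2), np.zeros(2), -np.eye(2)))  # -inf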
(not aiming for elegance at this point). TODO: same for tau-parameters. TBC: write gradient of MvStudentT.logp, to mirror this implementation.
Fullrank: Stab in the dark here - to check
Yes, the multivariate stuff does that. It looks simple at first, but turns into a never-ending rabbit hole. :-)
I agree. I think it is probably worth revisiting this; I remember quite significant speed-ups in some cases. But this PR is perfectly fine without it. A rebase or a new PR (whatever you prefer) is probably a good idea; that would make things a bit more manageable. Thanks for sticking with this.
pymc3/distributions/dist_math.py (outdated)
    # will all be NaN if the Cholesky was no-go, which is fine
    diag = tt.ExtractDiag(view=True)(cov)
else:
    diag = tt.ExtractDiag(view=True)(cov)
Good point to use view
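A quick numpy illustration of why that check works (the real code operates on Theano tensors; the names here are assumed): comparisons against NaN are False, so an all-NaN factor from a failed decomposition automatically fails the positivity test and the logp can be switched to -inf.

    import numpy as np

    good = np.linalg.cholesky(np.array([[2.0, 0.5], [0.5, 1.0]]))
    bad = np.full((2, 2), np.nan)     # what a failed factorisation leaves behind

    print(np.all(np.diag(good) > 0))  # True  -> proceed with the logp
    print(np.all(np.diag(bad) > 0))   # False -> switch the logp to -inf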
  shape = infer_shape(Xnew, kwargs.pop("shape", None))
- return pm.MvNormal(name, mu=mu, chol=chol, shape=shape, **kwargs)
+ return pm.MvNormal(name, mu=mu, cov=cov, shape=shape, **kwargs)
Does it matter @bwengals?
I don't think it does, but why move the Cholesky to MvNormal? I feel it makes the implementation a tiny(!) bit less clear from the standpoint of someone reading the code. Is there a benefit?
With the graphOp implementation (which is not implemented here after all, but could be at some point if it proves satisfactory), the benefit was that this operation happened in the subgraph, decluttering the total graph, speeding up compilation, and freeing up a tiny bit of memory in the process.
From a style point of view, explicitly calling the Cholesky decomposition at each call site (a good eight or so times?) might give the wrong impression that this is the preferred/optimal way to configure an Mv distribution. In truth, Quadbase takes care of the decomposition when given a cov, so either way of configuring it is equally fine. I find it more coherent to just pass whatever you're working with, as a best practice, since it's shorter and more intuitive.
This is entirely personal though, and has neither computational cost nor benefit. There's also the argument that stabilize only makes sense in the context of a Cholesky decomposition, and that as it stands here, that context is lost. So I don't mind reverting to the way it was in that regard.
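To make the two styles being discussed concrete, here is a small hedged sketch (toy values, not taken from the PR) showing that the GP code can either factorise explicitly and pass chol, or pass cov and let MvNormal do the decomposition internally:

    import numpy as np
    import pymc3 as pm

    cov = np.array([[1.0, 0.5], [0.5, 2.0]])
    mu = np.zeros(2)

    with pm.Model():
        # explicit factorisation at the call site ...
        x_chol = pm.MvNormal("x_chol", mu=mu, chol=np.linalg.cholesky(cov), shape=2)
        # ... or hand over the covariance and let the distribution decompose it
        x_cov = pm.MvNormal("x_cov", mu=mu, cov=cov, shape=2)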
Ah ok, it is fine with me either way. Just wanted to know what the reasoning was. Thanks!
Needs a rebase.
Indeed it does. Also, embarrassingly, I had neither tested nor considered multi-chain, and the function constructors don't pickle. So I've started on another approach, which I hope to PR soon.
@ferrine I also wanted your take on whether you see any cost to stabilizing the cov matrix in FullRankGroup, as it appears it can't always be decomposed, thus requiring otherwise unnecessary checks. IIRC, at initialisation it starts off with a zero matrix, which can't be decomposed.
Hope I've fixed the random Travis fails on the master branch in a recent PR.
I saw it, good point.
@gBokiau What's the status of this? Would be great to merge it. Needs a rebase.
Apologies for the delay. I have yet to find a simple alternative that circumvents the pickling/parallelisation issue caused by the function constructors I used here, while keeping things tidy. I might be overlooking a really obvious hack, though.

Also, I'm growing ever more convinced that there would be substantial value in a still more thorough rewrite of the Mv distributions in general. My general idea is that the 'kernels' of Mv functions (LKJ etc.) should compute the Mahalanobis distances themselves, instead of this being done in the parent Mv distribution. To illustrate, the goal would be to be able to do this: …

Ideally the dimensions of the kernel would be inferred automatically. Perhaps even this: … would be more 'natural' than declaring regular … More to the point: … would be a new way to declare …

In short, in terms of architecture, I think … Yet the implications are rather considerable. I'm struggling to find a promising starting point to get a proof of concept going.
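The code examples that originally accompanied this comment are not preserved. As a purely hypothetical illustration of the general idea (a kernel object that is factorised once at declaration and computes the Mahalanobis distance itself), none of the names below are PyMC3 API:

    import numpy as np

    class CovKernel:
        # hypothetical kernel object: factorised once, owns the distance computation
        def __init__(self, cov):
            self.chol = np.linalg.cholesky(np.asarray(cov))

        def mahalanobis(self, delta):
            # squared Mahalanobis distance via the stored Cholesky factor
            z = np.linalg.solve(self.chol, delta)
            return float(z @ z)

    kernel = CovKernel([[1.0, 0.4], [0.4, 1.0]])
    print(kernel.mahalanobis(np.array([1.0, -1.0])))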
Did you mean to close this @gBokiau ? |
Oops, that was accidental — but perhaps appropriate? |
I think it is a bit of a waste of all the hard work. However, I am not convinced that the distance should be computed in the kernel instead of the parent Mv distribution. I agree that the decomposition should be done only once, and I'm a bit surprised that it is currently computed twice (not the case if you pass …).
Correct, only with …
I see... hmm, this is a good point you raise. So what you are doing now is trying to combine the PD check with the logp computation by doing the decomposition midway? And you think it is better to do it when the cov is being declared? And I imagine that's what …
In that particular case, I think I would move towards having LKJCorr always save its results Cholesky-decomposed with unit diagonal (i.e. transformed), since they're being computed anyway and are not as costly to reverse-transform. However, it's not quite clear to me when/how one would use LKJCorr without a unit diagonal, and if/how to accommodate that use case. The more general case for having the kernels compute the distance is basically twofold.
First, inheritance would be a more elegant approach than the intricate switch logic used both here and in master. Second, a cov/corr kernel is decomposed as many times as it is used. While definitely a fringe case, one might model a mixture of multiple MvNormals with a shared covariance matrix but different means; certainly in the case of GPs it seems plausible that kernels would be reused in different RVs. Thankfully, I do think 'fixed' correlation matrices (i.e. np.arrays) are only decomposed during model computation, yet that would be worth double-checking as well.
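A hedged sketch of the shared-covariance case mentioned above (toy values; assumes the existing MvNormal API): factorising once and passing the factor avoids each distribution decomposing the same matrix again.

    import numpy as np
    import pymc3 as pm

    cov = np.array([[1.0, 0.3], [0.3, 1.0]])
    chol = np.linalg.cholesky(cov)   # decomposed a single time

    with pm.Model():
        mu1 = pm.Normal("mu1", 0.0, 1.0, shape=2)
        mu2 = pm.Normal("mu2", 5.0, 1.0, shape=2)
        y1 = pm.MvNormal("y1", mu=mu1, chol=chol, shape=2)
        y2 = pm.MvNormal("y2", mu=mu2, chol=chol, shape=2)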
@junpenglao not sure what you mean by 'midway', probably because I haven't yet figured it all out myself. Another way I look at it is that the GP kernels would: …
I'm the first to admit the implications are intricate. Additions and multiplications could be handled as they currently are for GP kernels, and some (most?) of these ops could be implemented directly on the Cholesky factors. There's also the matter of whether it would be worth implementing this on Wishart etc., or rather dropping those.
We can probably drop those, I don't think anyone uses those anymore (and they shouldn't).
In the same vein, I seem to recall mentions of computing GP kernels directly in "Cholesky space". Has anyone come across that? It's a tough one to find the right keywords to search for.
I don't think the current parametrization of …
@gBokiau How do you feel about this PR? I think it is still worth pursuing; maybe don't overcomplicate it, and clean it up so that the current tests pass?
@gBokiau I'm going to close this, as I think your other approach of making smaller PRs is the way to go here (let me know if you would like to continue here, however, and we can certainly reopen it).
Thanks @twiecki, sounds good. I intend to see shortly whether the logp approach fares better under @aseyboldt's new parallelism approach.
This PR refactors Normal and MvNormal: …