
Refactor Normal and MvNormal #2847

Closed · wants to merge 68 commits

Conversation

gBokiau
Contributor

@gBokiau gBokiau commented Feb 9, 2018

This PR refactors Normal and MvNormal:

  • clean up unused imports and outdated comments
  • switch to functions available in the Theano 1.0 release
  • use the (optimized) logp_sum where it makes sense

@ferrine
Member

ferrine commented Feb 9, 2018

Thanks for the PR; this is a random failure on Travis, never mind. Do you see any more code style nitpicks?

@gBokiau
Contributor Author

gBokiau commented Feb 9, 2018

Just PR'ing as I come across them. If there's a preference for fewer commits, I'm happy to keep this open for a while in case more cross my path.

I'm toying with the idea of harmonising all the covariance/precision/Cholesky parameter declarations into classes of their own, to unclutter model declaration and remove the code duplication. I'll PR separately if that gets anywhere; however, there's a good chance I'll bump into other tidbits such as the above while doing so.

try:
    self.chol_cov = cholesky(cov)
except ValueError:
    raise ValueError('cov must be two dimensional.') from None
Member

This is causing an error in py2.7.
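For reference, "raise ... from None" is Python 3-only syntax and will not even parse under 2.7. A possible 2/3-compatible variant is sketched below; the helper name and the use of six are assumptions for illustration, not what this PR settled on.

import six

def checked_cholesky(cholesky, cov):
    # Re-raise a clearer error; six.raise_from suppresses the chained
    # traceback on Python 3 and degrades to a plain raise on Python 2,
    # so the code parses and behaves sensibly on both.
    try:
        return cholesky(cov)
    except ValueError:
        six.raise_from(ValueError('cov must be two dimensional.'), None)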

    dist, logdet, ok = self._quaddist_tau(delta)
else:
    # Use this when Theano#5908 is released.
Member

Pretty sure Theano#5908 is released. Is there a way to incorporate this comment? cc @aseyboldt

@gBokiau gBokiau changed the title Remove declared but unused variables Clean up Normals Feb 10, 2018
@junpenglao junpenglao mentioned this pull request Feb 10, 2018
@junpenglao junpenglao added the WIP label Feb 10, 2018
@gBokiau
Contributor Author

gBokiau commented Feb 10, 2018

Ah, I just realized I had gotten it wrong: I thought MvNormalLogp was somewhere in Theano. Will have another go.
I also think there are a couple of places where we could leverage Theano views and in-place operations that are disabled by default in the convenience wrappers (e.g. in tt.diagonal). I have a suspicion a lot of solves could be done in place.
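For illustration, the view/in-place point boils down to something like this (a minimal sketch assuming the Theano 1.x ops referenced in this thread):

import theano.tensor as tt

cov = tt.matrix('cov')
diag_copy = tt.diagonal(cov)                 # convenience wrapper, copies by default
diag_view = tt.ExtractDiag(view=True)(cov)   # same op, but configured to return a view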

@junpenglao junpenglao changed the title Clean up Normals Refactor Normal and MvNormal Feb 10, 2018
@junpenglao
Member

Thanks for the work - I edited the PR title and comment.

@gBokiau
Contributor Author

gBokiau commented Feb 10, 2018

One question about the Cholesky/logp implementation. If my understanding is correct, the Cholesky factor of any positive definite (PD) matrix has positive diagonal entries. Theano/Scipy returns a matrix of NaNs if the covariance matrix is not PD. This is different from the current implementation -- i.e. it returns an identity matrix with only [0,0] set to NaN -- but I think the difference is trivial.

Hence checking the diagonal serves no purpose, and logp should return a matrix of -Inf if the Cholesky fails. Is that correct?


Edit: Having now spent a few hours with that bit of code, it does look like it was (somewhat) misconceived, in the sense that it applied logic pertaining to chol parameters whereas it was designed to handle cov. I've tried correcting that and implementing both parameters. I'm now off to test whether it works as expected (commits below if others want to have a go/look).
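A rough sketch of the logic being discussed, assuming Theano's Cholesky op with on_error='nan'; quaddist_logp is a stand-in for the actual quadratic-form computation, not a real pymc3 helper:

import numpy as np
import theano.tensor as tt
from theano.tensor.slinalg import Cholesky

cholesky = Cholesky(lower=True, on_error='nan')

def logp_or_neginf(cov, quaddist_logp):
    chol = cholesky(cov)
    diag = tt.ExtractDiag(view=True)(chol)
    # NaN > 0 evaluates to False, so a failed (all-NaN) factorization
    # fails the same check as a factor with non-positive diagonal entries
    ok = tt.all(diag > 0)
    return tt.switch(ok, quaddist_logp(chol), -np.inf)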

@aseyboldt
Member

aseyboldt commented Feb 13, 2018

As this was turning into a bit of a monster

Yes, the multivariate stuff does that. It looks simple at first, but turns into a never-ending rabbit hole. :-)

I've given up on using OpFromGraph for now

I agree. I think it is probably worth revisiting this; I remember quite significant speed-ups in some cases. But this PR is perfectly fine without it.

A rebase or new PR (whatever you prefer) is probably a good idea, that would make things a bit more manageable.

Thanks for sticking with this.

    # will all be NaN if the Cholesky was no-go, which is fine
    diag = tt.ExtractDiag(view=True)(cov)
else:
    diag = tt.ExtractDiag(view=True)(cov)
Member

Good point to use view

  shape = infer_shape(Xnew, kwargs.pop("shape", None))
- return pm.MvNormal(name, mu=mu, chol=chol, shape=shape, **kwargs)
+ return pm.MvNormal(name, mu=mu, cov=cov, shape=shape, **kwargs)
Member

Does it matter @bwengals?

Contributor

I don't think it does, but why move Cholesky to MvNormal? I feel it does make the implementation a tiny(!) bit less clear, from the standpoint of someone reading the code. Is there a benefit?

Contributor Author

With the graphOp implementation -- which is not implemented here after all, but could/might be at some point when and if it proves satisfactory -- the benefit was that this operation happened in the 'subgraph', thus decluttering the total graph, speeding up compilation and freeing up a tiny bit of memory in the process.

From a style point of view, explicitly calling the Cholesky decomposition in the source each time (a good 8 or so times?) might give the wrong impression that this is the preferred/optimal way to configure an Mv. In truth, Quadbase takes care of the decomposition when given a cov, so either way of configuring it is equally fine. I would find it more coherent to just pass whatever you're working with, as a best practice, since it's shorter and more intuitive.

This is entirely personal though, and has neither computational cost nor benefit. There's also the argument that stabilize only makes sense in the context of Cholesky decomposition, and that as it stands here, that context is lost. So I don't mind reverting to the way it was in that regard.

Contributor

Ah ok, it is fine with me either way. Just wanted to know what the reasoning was. Thanks!

@ferrine
Member

ferrine commented Feb 26, 2018

need rebase

@gBokiau
Contributor Author

gBokiau commented Feb 26, 2018

Indeed it does. Also, embarrassingly, I had neither tested nor considered multi-chain, and the function constructors don't pickle. So I've started another approach, which I hope to PR soon.

@gBokiau
Contributor Author

gBokiau commented Feb 26, 2018

@ferrine I also wanted your take on whether you saw any cost to stabilizing the cov matrix in FullRankGroup, as it appears it can't always be decomposed, thus requiring otherwise unnecessary checks. For instance, IIRC, at initialisation it starts off with a zero matrix that can't be decomposed.
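For context, "stabilizing" here means the usual jitter trick of adding a small constant to the diagonal so that a Cholesky factorization exists; roughly (a sketch, not the exact pymc3 helper):

import theano.tensor as tt

def stabilize(K, jitter=1e-6):
    # push the eigenvalues slightly away from zero so the matrix decomposes
    return K + jitter * tt.identity_like(K)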

@ferrine
Member

ferrine commented Feb 26, 2018

Hope I've fixed the random Travis failures on the master branch in a recent PR.

@ferrine
Member

ferrine commented Feb 26, 2018

I saw it, good point

@twiecki
Member

twiecki commented Mar 20, 2018

@gBokiau What's the status of this? Would be great to merge it. Needs a rebase.

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

Apologies for the delay. I have yet to find a simple alternative that circumvents the pickling/parallelisation issue caused by the function constructors I used here while keeping things tidy. I might be overlooking a really obvious hack though.

Also, I'm growing ever more convinced that there would be substantial value in a still more thorough rewrite of the Mv's in general. My general idea is that the 'kernels' of Mv distributions (LKJ etc.) should compute the Mahalanobis distances themselves, instead of this being done in the logp of the parent distribution (as is now the case) or through intermediaries (as in this PR).

To give an example, when using LKJCorr in an MvNormal, the procedure of adding the unit diagonal and computing the Cholesky factor is done twice: once here https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/multivariate.py#L1091, and once again when it is passed as a parameter and subsequently used in MvNormal.logp. Considering the cost of decomposing the matrix, this is hardly ideal.

To illustrate, the goal would be to be able to do this :

kernel = LKJCorr('kernel', eta=1)
observations = MvNormal('name', kernel=kernel, observed=data)

Ideally the dimensions of the kernel would be inferred automatically.

Perhaps even this:

spherical_kernel = SphericalKernel(a=2)
observations = MvNormal('name', kernel=spherical_kernel, observed=data)

would be more 'natural' than declaring regular Normal(sd=2) when you essentially mean an uncorrelated bivariate normal (but this is debatable).

More to the point:

precision_kernel = PrecisionKernel(np.array([[1., 2.],[2.,1.]]))
observations = MvNormal('name', kernel=precision_kernel, observed=data)

…would be a new way to declare tau, and would be what happens behind the scenes if that is how it is declared.

In short, in terms of architecture, I think quaddist should move from the distributions to the kernels. They would be a family of classes of their own, which can but are not required to be RVs.

Yet the implications are rather considerable. I'm struggling to find a promising starting point to get a proof of concept going.

@gBokiau gBokiau closed this Mar 30, 2018
@twiecki
Member

twiecki commented Mar 30, 2018

Did you mean to close this @gBokiau ?

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

Oops, that was accidental — but perhaps appropriate?

@junpenglao junpenglao reopened this Mar 30, 2018
@junpenglao
Member

I think it is a bit of a waste of all the hard work. However, I am not convinced that the distance should be computed in the kernel instead of the parent Mv distribution. I agree that the decomposition should be done only once, and I am a bit surprised that it is currently computed twice (not the case if you pass chol to MvNormal, right?)

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

not the case if you pass chol to MvNormal, right?

Correct, only with cov=packed(LKJCorr)+np.eye(k). This is because LKJCorr pseudo-computes the Cholesky when computing logp, to check whether the sampled values are P.D. The result should obviously be saved and reused by MvN/MvT instead.

@junpenglao
Member

junpenglao commented Mar 30, 2018

I see... hmm, this is a good point you raise. So what you are doing now is trying to combine the P.D. check with the logp computation by doing the decomposition midway? And you think it is better to do it when the cov is being declared?

And I imagine that's what SphericalKernel etc. are for? A class that makes sure the matrix is P.D. and also does the necessary decomposition. But what about if you are combining different kernels, such as in GPs?

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

In that particular case, I think I would move towards having LKJCorr always save its results Cholesky-decomposed with a unit diagonal (i.e. transformed), since they're being computed anyway and are not as costly to reverse-transform. However, it's not quite clear to me when/how one would use LKJCorr without a unit diagonal, and whether/how to accommodate that use case.

The more general case for having the kernels compute the distance is basically twofold.
First, we need three approaches:

  • precision-based: distance obtained by a simple dot product
  • Cholesky factor-based: distance obtained by a triangular solve
  • cov/corr-based: need to first check that it is P.D., then Cholesky-decompose, then proceed as above.

Inheritance would be a more elegant approach than the intricate switching logic used both here and in master.
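To make that concrete, here is a minimal numpy/scipy sketch of the inheritance idea; all class names are hypothetical and this is not how pymc3 is currently structured:

import numpy as np
from scipy import linalg


class BaseKernel(object):
    # each kernel knows how to compute its own quadratic (Mahalanobis) form
    def quaddist(self, delta):
        raise NotImplementedError


class PrecisionKernel(BaseKernel):
    def __init__(self, tau):
        self.tau = np.asarray(tau)

    def quaddist(self, delta):
        # precision-based: a simple dot product
        return np.einsum('...i,ij,...j->...', delta, self.tau, delta)


class CholeskyKernel(BaseKernel):
    def __init__(self, chol):
        self.chol = np.asarray(chol)

    def quaddist(self, delta):
        # Cholesky-based: one triangular solve, then a sum of squares
        sol = linalg.solve_triangular(self.chol, np.atleast_2d(delta).T, lower=True)
        return (sol ** 2).sum(axis=0)


class CovarianceKernel(CholeskyKernel):
    def __init__(self, cov):
        # cov-based: the P.D. check (cholesky raises if it fails) and the
        # decomposition happen exactly once, then the Cholesky path is reused
        super(CovarianceKernel, self).__init__(linalg.cholesky(cov, lower=True))


cov = np.array([[2.0, 0.3], [0.3, 1.0]])
delta = np.array([[0.5, -1.0]])
print(CovarianceKernel(cov).quaddist(delta))                # same value
print(PrecisionKernel(np.linalg.inv(cov)).quaddist(delta))  # either way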

Second, a cov/corr kernel is decomposed as many times as it is used. While definitely a fringe case, one might model a mixture between multiple MvN with a shared covariance matrix but different means. Certainly in the case of GP's it seems plausible that kernels would be reused in different RV's.

Thankfully, I do think 'fixed' correlation matrices (i.e. np.arrays) are only decomposed during model computation, yet that would be worth double-checking as well.

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

@junpenglao not sure what you mean by midway — probably because I haven't yet figured it all out myself.

Another way I look at it is that the GP kernels would :

  1. be computed 'later' (not in model declaration), presumably by the distribution, through draw_values() calls
  2. once drawn/sampled, return 'augmented' variables, so as to have transform, k, and quaddist methods, the former two so that the distribution can generate random samples, the latter so that it can compute logp. Plus the properties they already have, i.e. diagonal and full.
  3. under the hood, either be Cholesky or tau matrices, or implement/inherit methods to deliver those, which would be performed only once.

I'm the first to admit the implications are intricate. Additions and multiplications could be handled as they are currently for GP kernels. Some (most?) of these ops could be implemented directly on the Cholesky factors.

There's also the matter of whether it would be worth implementing this on Wishart etc., or rather dropping those.

@twiecki
Member

twiecki commented Mar 30, 2018

There's also the matter of whether it would be worth implementing this on Wishart etc., or rather dropping those.

We can probably drop those; I don't think anyone uses them anymore (and they shouldn't).

@gBokiau
Contributor Author

gBokiau commented Apr 6, 2018

In the same vein, my impression is that LKJCorr() offers no advantage over LKJCholeskyCov(sd=1). Discarding that as well would considerably simplify the spec and solve the problem above.

I seem to recall mentions of computing GP kernels directly in "Cholesky space". Has anyone come across that? It's a tough one to find the right keywords to search for.

@aseyboldt
Member

I don't think the current parametrization of LKJCorr is great, but it can't be replaced that easily with LKJCholeskyCov, because that only works if the variances are free random variables. So LKJCholeskyCov(sd=1) doesn't actually work. This is because we have one free variable per non-zero entry in the chol matrix, but no free variable for the variance (which is just the norm of one row of the chol matrix). If you want an LKJ correlation (rather than covariance) matrix, you could right now just use an arbitrary dist for the sd, and then rescale the covariance matrix to a correlation matrix and use that. Using a stick-breaking approach we could also parameterize the correlation matrix directly, but that isn't implemented.
Or maybe skip the whole Cholesky decomposition and go for an eigenvalue decomposition instead. There, we could use Givens rotations or Householder transforms of skew-symmetric matrices to parameterize the orthogonal matrix.
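For completeness, here is a sketch of the rescaling workaround described above, assuming the current pymc3 API (variable names are arbitrary):

import pymc3 as pm
import theano.tensor as tt

n = 3
with pm.Model():
    # covariance via LKJCholeskyCov with an (arbitrary) prior on the sds
    packed_L = pm.LKJCholeskyCov('packed_L', n=n, eta=2.0,
                                 sd_dist=pm.HalfCauchy.dist(2.5))
    L = pm.expand_packed_triangular(n, packed_L)
    cov = L.dot(L.T)
    # rescale the covariance to a correlation matrix
    sd = tt.sqrt(tt.diagonal(cov))
    corr = pm.Deterministic('corr', cov / sd[:, None] / sd[None, :])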

@junpenglao
Member

@gBokiau How do you feel about this PR? I think it is still worth pursuing - maybe don't overcomplicate it, and clean it up so that the current tests pass?

@gBokiau gBokiau mentioned this pull request Jun 5, 2018
@twiecki
Member

twiecki commented Jun 18, 2018

@gBokiau I'm going to close this, as I think your other approach of making smaller PRs is the way to go here (let me know if you would like to continue here, however, and we can certainly reopen it).

@twiecki twiecki closed this Jun 18, 2018
@gBokiau
Contributor Author

gBokiau commented Jun 18, 2018

Thanks @twiecki, sounds good. I intend to see whether the logp approach fares better under @aseyboldt's new parallelism approach shortly.
