
Refactor Normal and MvNormal #2847

Closed · wants to merge 68 commits

Conversation

gBokiau
Contributor

@gBokiau gBokiau commented Feb 9, 2018

This PR refactors Normal and MvNormal:

  • clean up unused imports and outdated comments
  • switch to functions available in the Theano 1.0 release
  • use the (optimized) logp_sum where it makes sense

@ferrine
Member

ferrine commented Feb 9, 2018

Thanks for the PR; this is a random failure on Travis, never mind. Do you see any more code style nitpicks?

@gBokiau
Contributor Author

gBokiau commented Feb 9, 2018

Just PR'ing as I come across them. If there's a preference for fewer commits, I'm happy to keep this open for a while in case more cross my path.

I'm toying with the idea of harmonising all the covariance/precision/Cholesky parameter declarations into classes of their own, to unclutter model declaration and remove the code duplication. I'll PR separately if that gets anywhere; however, there's a good chance I'll bump into other tidbits such as the above while doing so.

try:
    self.chol_cov = cholesky(cov)
except ValueError:
    raise ValueError('cov must be two dimensional.') from None
Member

This is causing an error in py2.7.
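For reference, "raise ... from None" is Python 3-only syntax and will not even parse under 2.7. A possible 2/3-compatible variant is sketched below; the helper name and the use of six are assumptions for illustration, not what this PR settled on.

import six

def checked_cholesky(cholesky, cov):
    # Re-raise a clearer error; six.raise_from suppresses the chained
    # traceback on Python 3 and degrades to a plain raise on Python 2,
    # so the code parses and behaves sensibly on both.
    try:
        return cholesky(cov)
    except ValueError:
        six.raise_from(ValueError('cov must be two dimensional.'), None)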

    dist, logdet, ok = self._quaddist_tau(delta)
else:
    # Use this when Theano#5908 is released.
Member

Pretty sure Theano#5908 is released. Is there a way to incorporate this comment? cc @aseyboldt

@gBokiau gBokiau changed the title Remove declared but unused variables Clean up Normals Feb 10, 2018
@junpenglao junpenglao mentioned this pull request Feb 10, 2018
@junpenglao junpenglao added the WIP label Feb 10, 2018
@gBokiau
Contributor Author

gBokiau commented Feb 10, 2018

Ah, I just realized I had gotten it wrong: I thought MvNormalLogp was somewhere in Theano. Will have another go.
I also think there are a couple of places where we could leverage Theano views and in-place operations that are disabled by default in the convenience wrappers (e.g. in tt.diagonal). I have a suspicion a lot of solves could be done in place.
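For illustration, the view/in-place point boils down to something like this (a minimal sketch assuming the Theano 1.x ops referenced in this thread):

import theano.tensor as tt

cov = tt.matrix('cov')
diag_copy = tt.diagonal(cov)                 # convenience wrapper, copies by default
diag_view = tt.ExtractDiag(view=True)(cov)   # same op, but configured to return a view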

@junpenglao junpenglao changed the title Clean up Normals Refactor Normal and MvNormal Feb 10, 2018
@junpenglao
Member

Thanks for the work - I edited the PR title and comment.

@gBokiau
Contributor Author

gBokiau commented Feb 10, 2018

One question about the Cholesky/logp implementation. If my understanding is correct, the Cholesky factor of any positive definite (PD) matrix has positive diagonal entries. Theano/Scipy returns a matrix of NaNs if the covariance matrix is not PD. This is different from the current implementation -- i.e. it returns an identity matrix with only [0,0] set to NaN -- but I think the difference is trivial.

Hence checking the diagonal serves no purpose, and logp should return a matrix of -Inf if the Cholesky fails. Is that correct?


Edit: Having now spent a few hours with that bit of code, it does look like it was (somewhat) misconceived, in the sense that it applied logic pertaining to chol parameters whereas it was designed to handle cov. I've tried correcting that and implementing both parameters. I'm now off to test whether it works as expected (commits below if others want to have a go/look).
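A rough sketch of the logic being discussed, assuming Theano's Cholesky op with on_error='nan'; quaddist_logp is a stand-in for the actual quadratic-form computation, not a real pymc3 helper:

import numpy as np
import theano.tensor as tt
from theano.tensor.slinalg import Cholesky

cholesky = Cholesky(lower=True, on_error='nan')

def logp_or_neginf(cov, quaddist_logp):
    chol = cholesky(cov)
    diag = tt.ExtractDiag(view=True)(chol)
    # NaN > 0 evaluates to False, so a failed (all-NaN) factorization
    # fails the same check as a factor with non-positive diagonal entries
    ok = tt.all(diag > 0)
    return tt.switch(ok, quaddist_logp(chol), -np.inf)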

@aseyboldt
Member

aseyboldt commented Feb 13, 2018

As this was turning into a bit of a monster

Yes, the multivariate stuff does that. It looks simple at first, but turns into a never-ending rabbit hole. :-)

I've given up on using OpFromGraph for now

I agree. I think it is probably worth revisiting this; I remember quite significant speed-ups in some cases. But this PR is perfectly fine without it.

A rebase or new PR (whatever you prefer) is probably a good idea, that would make things a bit more manageable.

Thanks for sticking with this.

    # will all be NaN if the Cholesky was no-go, which is fine
    diag = tt.ExtractDiag(view=True)(cov)
else:
    diag = tt.ExtractDiag(view=True)(cov)
Member

Good point to use view

  shape = infer_shape(Xnew, kwargs.pop("shape", None))
- return pm.MvNormal(name, mu=mu, chol=chol, shape=shape, **kwargs)
+ return pm.MvNormal(name, mu=mu, cov=cov, shape=shape, **kwargs)
Member

Does it matter @bwengals?

Contributor

I don't think it does, but why move Cholesky to MvNormal? I feel it does make the implementation a tiny(!) bit less clear, from the standpoint of someone reading the code. Is there a benefit?

Contributor Author

With the graphOp implementation -- which is not implemented here after all, but could/might be at some point when and if it proves satisfactory -- the benefit was that this operation happened in the 'subgraph', thus decluttering the total graph, speeding up compilation and freeing up a tiny bit of memory in the process.

From a style point of view, explicitly calling the Cholesky decomposition in the source each time (a good 8 or so times?) might give the wrong impression that this is the preferred/optimal way to configure an Mv. In truth, Quadbase takes care of the decomposition when given a cov, so either way of configuring it is equally fine. I would find it more coherent to just pass whatever you're working with, as a best practice, since it's shorter and more intuitive.

This is entirely personal though, and has neither computational cost nor benefit. There's also the argument that stabilize only makes sense in the context of Cholesky decomposition, and that as it stands here, that context is lost. So I don't mind reverting to the way it was in that regard.

Contributor

Ah ok, it is fine with me either way. Just wanted to know what the reasoning was. Thanks!

@ferrine
Member

ferrine commented Feb 26, 2018

need rebase

@gBokiau
Contributor Author

gBokiau commented Feb 26, 2018

Indeed it does. Also, embarrassingly, I had neither tested nor considered multi-chain, and the function constructors don't pickle. So I've started another approach, which I hope to PR soon.

@gBokiau
Contributor Author

gBokiau commented Feb 26, 2018

@ferrine I also wanted your take on whether you saw any cost to stabilizing the cov matrix in FullRankGroup, as it appears it can't always be decomposed, thus requiring otherwise unnecessary checks. For instance, IIRC, at initialisation it starts off with a zero matrix that can't be decomposed.
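For context, "stabilizing" here means the usual jitter trick of adding a small constant to the diagonal so that a Cholesky factorization exists; roughly (a sketch, not the exact pymc3 helper):

import theano.tensor as tt

def stabilize(K, jitter=1e-6):
    # push the eigenvalues slightly away from zero so the matrix decomposes
    return K + jitter * tt.identity_like(K)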

@ferrine
Member

ferrine commented Feb 26, 2018

Hope I've fixed the random Travis failures on the master branch in a recent PR.

@ferrine
Member

ferrine commented Feb 26, 2018

I saw it, good point

@twiecki
Member

twiecki commented Mar 20, 2018

@gBokiau What's the status of this? Would be great to merge it. Needs a rebase.

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

Apologies for the delay. I have yet to find a simple alternative that circumvents the pickling/parallelisation issue caused by the function constructors I used here while keeping things tidy. I might be overlooking a really obvious hack though.

Also, I'm growing ever more convinced that there would be substantial value in a still more thorough rewrite of the Mv's in general. My general idea is that the 'kernels' of Mv distributions (LKJ etc.) should compute the Mahalanobis distances themselves, instead of this being done in the logp of the parent distribution (as is now the case) or through intermediaries (as in this PR).

To give an example, when using LKJCorr in an MvNormal, the procedure of adding the unit diagonal and computing the Cholesky factor is done twice: once here https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/multivariate.py#L1091, and once again when it is passed as a parameter and subsequently used in MvNormal.logp. Considering the cost of decomposing the matrix, this is hardly ideal.

To illustrate, the goal would be to be able to do this :

kernel = LKJCorr('kernel', eta=1)
observations = MvNormal('name', kernel=kernel, observed=data)

Ideally the dimensions of the kernel would be inferred automatically.

Perhaps even this:

spherical_kernel = SphericalKernel(a=2)
observations = MvNormal('name', kernel=spherical_kernel, observed=data)

would be more 'natural' than declaring regular Normal(sd=2) when you essentially mean an uncorrelated bivariate normal (but this is debatable).

More to the point:

precision_kernel = PrecisionKernel(np.array([[1., 2.],[2.,1.]]))
observations = MvNormal('name', kernel=precision_kernel, observed=data)

…would be a new way to declare tau, and would be what happens behind the scenes if that is how it is declared.

In short, in terms of architecture, I think quaddist should move from the distributions to the kernels. They would be a family of classes of their own, which can but are not required to be RVs.

Yet the implications are rather considerable. I'm struggling to find a promising starting point to get a proof of concept going.

@gBokiau gBokiau closed this Mar 30, 2018
@twiecki
Member

twiecki commented Mar 30, 2018

Did you mean to close this @gBokiau ?

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

Oops, that was accidental — but perhaps appropriate?

@junpenglao junpenglao reopened this Mar 30, 2018
@junpenglao
Member

I think it is a bit of a waste of all the hard work. However, I am not convinced that the distance should be computed in the kernel instead of the parent Mv distribution. I agree that the decomposition should be done only once, and I am a bit surprised that it is currently computed twice (not the case if you pass chol to MvNormal, right?)

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

not the case if you pass chol to MvNormal, right?

Correct, only with cov=packed(LKJCorr)+np.eye(k). This is because LKJCorr pseudo-computes the Cholesky when computing logp, to check whether the sampled values are P.D. The result should obviously be saved and reused by MvN/MvT instead.

@junpenglao
Member

junpenglao commented Mar 30, 2018

I see... hmm, this is a good point you raise. So what you are doing now is trying to combine the P.D. check with the logp computation by doing the decomposition midway? And you think it is better to do it when the cov is being declared?

And I imagine that's what SphericalKernel etc. are for? A class that makes sure the matrix is P.D. and also does the necessary decomposition. But what about if you are combining different kernels, such as in GPs?

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

In that particular case, I think I would move towards having LKJCorr always save its results Cholesky-decomposed with a unit diagonal (i.e. transformed), since they're being computed anyway and are not as costly to reverse-transform. However, it's not quite clear to me when/how one would use LKJCorr without a unit diagonal, and whether/how to accommodate that use case.

The more general case for having the kernels compute the distance is basically twofold.
First, we need three approaches:

  • precision-based: distance obtained by a simple dot product
  • Cholesky factor-based: distance obtained by a triangular solve
  • cov/corr-based: need to first check that it is P.D., then Cholesky-decompose, then proceed as above.

Inheritance would be a more elegant approach than the intricate switching logic used both here and in master.
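To make that concrete, here is a minimal numpy/scipy sketch of the inheritance idea; all class names are hypothetical and this is not how pymc3 is currently structured:

import numpy as np
from scipy import linalg


class BaseKernel(object):
    # each kernel knows how to compute its own quadratic (Mahalanobis) form
    def quaddist(self, delta):
        raise NotImplementedError


class PrecisionKernel(BaseKernel):
    def __init__(self, tau):
        self.tau = np.asarray(tau)

    def quaddist(self, delta):
        # precision-based: a simple dot product
        return np.einsum('...i,ij,...j->...', delta, self.tau, delta)


class CholeskyKernel(BaseKernel):
    def __init__(self, chol):
        self.chol = np.asarray(chol)

    def quaddist(self, delta):
        # Cholesky-based: one triangular solve, then a sum of squares
        sol = linalg.solve_triangular(self.chol, np.atleast_2d(delta).T, lower=True)
        return (sol ** 2).sum(axis=0)


class CovarianceKernel(CholeskyKernel):
    def __init__(self, cov):
        # cov-based: the P.D. check (cholesky raises if it fails) and the
        # decomposition happen exactly once, then the Cholesky path is reused
        super(CovarianceKernel, self).__init__(linalg.cholesky(cov, lower=True))


cov = np.array([[2.0, 0.3], [0.3, 1.0]])
delta = np.array([[0.5, -1.0]])
print(CovarianceKernel(cov).quaddist(delta))                # same value
print(PrecisionKernel(np.linalg.inv(cov)).quaddist(delta))  # either way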

Second, a cov/corr kernel is decomposed as many times as it is used. While definitely a fringe case, one might model a mixture between multiple MvN with a shared covariance matrix but different means. Certainly in the case of GP's it seems plausible that kernels would be reused in different RV's.

Thankfully, I do think 'fixed' correlation matrices (i.e. np.arrays) are only decomposed during model computation, yet that would be worth double-checking as well.

@gBokiau
Contributor Author

gBokiau commented Mar 30, 2018

@junpenglao not sure what you mean by midway — probably because I haven't yet figured it all out myself.

Another way I look at it is that the GP kernels would :

  1. be computed 'later' (not in model declaration), presumably by the distribution, through draw_values() calls
  2. once drawn/sampled, return 'augmented' variables, so as to have transform, k, and quaddist methods, the former two so that the distribution can generate random samples, the latter so that it can compute logp. Plus the properties they already have, i.e. diagonal and full.
  3. under the hood, either be Cholesky or tau matrices, or implement/inherit methods to deliver those, which would be performed only once.

I'm the first to admit the implications are intricate. Additions and multiplications could be handled as they are currently for GP kernels. Some (most?) of these ops could be implemented directly on the Cholesky factors.

There's also the matter of whether it would be worth implementing this on Wishart etc., or rather dropping those.

@twiecki
Member

twiecki commented Mar 30, 2018

There's also the matter of whether it would be worth implementing this on Wishart etc., or rather dropping those.

We can probably drop those; I don't think anyone uses them anymore (and they shouldn't).

@gBokiau
Contributor Author

gBokiau commented Apr 6, 2018

In the same vein, my impression is that LKJCorr() offers no advantage over LKJCholeskyCov(sd=1). Discarding that as well would considerably simplify the spec and solve the problem above.

I seem to recall mentions of computing GP kernels directly in "Cholesky space". Has anyone come across that? It's a tough one to find the right keywords to search for.

@aseyboldt
Member

I don't think the current parametrization of LKJCorr is great, but it can't be replaced that easily with LKJCholeskyCov, because that only works if the variances are free random variables. So LKJCholeskyCov(sd=1) doesn't actually work. This is because we have one free variable per non-zero entry in the chol matrix, but no free variable for the variance (which is just the norm of one row of the chol matrix). If you want an LKJ correlation (rather than covariance) matrix, you could right now just use an arbitrary dist for the sd, and then rescale the covariance matrix to a correlation matrix and use that. Using a stick-breaking approach we could also parameterize the correlation matrix directly, but that isn't implemented.
Or maybe skip the whole Cholesky decomposition and go for an eigenvalue decomposition instead. There, we could use Givens rotations or Householder transforms of skew-symmetric matrices to parameterize the orthogonal matrix.
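For completeness, here is a sketch of the rescaling workaround described above, assuming the current pymc3 API (variable names are arbitrary):

import pymc3 as pm
import theano.tensor as tt

n = 3
with pm.Model():
    # covariance via LKJCholeskyCov with an (arbitrary) prior on the sds
    packed_L = pm.LKJCholeskyCov('packed_L', n=n, eta=2.0,
                                 sd_dist=pm.HalfCauchy.dist(2.5))
    L = pm.expand_packed_triangular(n, packed_L)
    cov = L.dot(L.T)
    # rescale the covariance to a correlation matrix
    sd = tt.sqrt(tt.diagonal(cov))
    corr = pm.Deterministic('corr', cov / sd[:, None] / sd[None, :])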

@junpenglao
Member

@gBokiau How do you feel about this PR? I think it is still worth pursuing - maybe don't overcomplicate it, and clean it up so that the current tests pass?

@gBokiau gBokiau mentioned this pull request Jun 5, 2018
@twiecki
Member

twiecki commented Jun 18, 2018

@gBokiau I'm going to close this, as I think your other approach of making smaller PRs is the way to go here (let me know if you would like to continue here, however, and we can certainly reopen it).

@twiecki twiecki closed this Jun 18, 2018
@gBokiau
Contributor Author

gBokiau commented Jun 18, 2018

Thanks @twiecki, sounds good. I intend to see whether the logp approach fares better under @aseyboldt's new parallelism approach shortly.
