Marked correlation function error estimation #471

Open
aphearin opened this issue Apr 18, 2016 · 6 comments

@aphearin
Contributor

The most common way to estimate errors in a marked correlation function (MCF) analysis is to repeatedly count marked pairs with scrambled marks. The current API makes this process very slow, because with each call to the MCF, pairs of points need to be found all over again. There should be a way to open each pair of cells only once. The naive way of doing this would be to pass in a multi-dimensional array of shape (num_pts, num_realizations), and simply loop over num_realizations for each pair of cells. However, memory constraints make this infeasible for reasonable sample sizes and numbers of realizations, so we will need to be clever.
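For concreteness, a minimal NumPy sketch of that naive multi-dimensional-array idea (the variable names are illustrative only, not part of the mock_observables API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; at 10^6 points x 10^3 realizations this array alone
# would occupy ~8 GB in double precision, which is the memory constraint above.
num_pts, num_realizations = 10_000, 100
marks = rng.random(num_pts)  # stand-in for the real marks

# Each column is an independent random permutation of the marks.
shuffled_marks = np.empty((num_pts, num_realizations))
for i in range(num_realizations):
    shuffled_marks[:, i] = rng.permutation(marks)
```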

@duncandc - now that I have finished giving the underlying mock_observables engine a rewrite, I think now is the time to implement this, so let's hash this out when we sit down later this week to port the new engine to the rest of the mock_observables functions.

CC @vcalderon2009 and @andrew-zentner , who have both requested this feature.

@aphearin aphearin self-assigned this Apr 18, 2016
@aphearin aphearin added this to the v0.2 milestone Apr 18, 2016
@aphearin aphearin modified the milestones: v0.3, v0.2 May 24, 2016
@aphearin aphearin modified the milestones: v0.4, v0.3 Jun 26, 2016
@aphearin aphearin modified the milestones: next-release, v0.4 Jul 25, 2016
@manodeep
Contributor

Count me in as another interested user :)

@aphearin
Contributor Author

@manodeep - are the memory constraints I mentioned here actually a big problem? The simplest solution is still to use a multi-dimensional array with an extra dimension for each shuffling of the marks. For 10^6 points and 10^3 shufflings, this doesn't seem out of the question, although it certainly is not a scalable solution. What do you think?

What makes this a hard problem is that the error estimation is usually done by shuffling. If indices could simply be selected at random for each pair of points, the problem would become very simple. But that amounts to sampling the marks with replacement rather than permuting them, so the same mark could be selected multiple times, which would not capture some aspects of the sample variance. This is probably important, but I have never tested the consequences.

@duncandc, @andrew-zentner @surhudm - any bright ideas?

@aphearin
Contributor Author

An alternative option would be to precompute all pairs once, storing the pair indices and distances in a sparse matrix. Then, in a separate subsequent step, loop over the number of shuffles: the first operation in each iteration is a fresh shuffle of the array of marks, after which the code loops over the sparse matrix of stored pairs.

That would be a tolerable amount of work, but I'm still not confident it would resolve the memory problems. It would certainly be a lot faster than repeatedly counting pairs, which is what the current code requires. So whether this is worth trying depends somewhat on how important the problem is. Does anyone have comments on whether the current method is unworkably slow?
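Something like the following sketch is what I have in mind (scipy.sparse is just one possible container; the pair indices are fabricated for illustration and none of this reflects the actual engine code):

```python
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(0)

npts, nshuffles = 1_000, 50
marks = rng.random(npts)

# Stand-ins for the (i, j) index pairs found once by a single pair-counting pass;
# in practice these would come from the cell-pair engine, restricted to r < rmax.
i_idx = rng.integers(0, npts, size=5_000)
j_idx = rng.integers(0, npts, size=5_000)

# Step 1: store the pairs once in a sparse matrix (distances could be stored the
# same way, so the radial binning also happens outside the shuffle loop).
pairs = coo_matrix((np.ones(i_idx.size), (i_idx, j_idx)), shape=(npts, npts))

# Step 2: loop over shuffles, re-permuting the marks each time and reusing the
# stored pairs instead of recounting them.
marked_sums = np.empty(nshuffles)
for k in range(nshuffles):
    shuffled = rng.permutation(marks)
    # weight each stored pair by the product of its shuffled marks
    marked_sums[k] = np.sum(shuffled[pairs.row] * shuffled[pairs.col] * pairs.data)
```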

@andrew-zentner
Contributor

@aphearin - Is memory really an issue? My antiquated fortran code (which Antonio used) works more or less as you describe above. Suppose I have 10^6 galaxies and want 10^3 shuffled samples (probably more than necessary if you are quoting 1 or 2 sigma regions, for which a few hundred samples give a good estimate). My fortran code works as follows: shuffle the marks of the real sample 10^3 times to create 10^3 samples with shuffled marks, then assign all 1001 marks (10^3 shuffled + 1 for the actual mark) to each object (i.e., each galaxy). Then I do the pair counting using a tree. Even in this case, my code uses only a few GB of memory, which is completely feasible for most users these days. Is there something that I misunderstand?
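For reference, a back-of-the-envelope check of those numbers (assuming one double-precision mark per galaxy per realization):

```python
# Memory to store 10^3 shuffled mark sets plus the original marks for 10^6 galaxies,
# assuming one double-precision (8-byte) mark per galaxy per realization.
num_galaxies = 10**6
num_mark_sets = 10**3 + 1  # 10^3 shuffles + 1 for the actual marks
bytes_per_mark = 8

total_gb = num_galaxies * num_mark_sets * bytes_per_mark / 1024**3
print(f"{total_gb:.1f} GB")  # ~7.5 GB, consistent with "a few GB" as described above
```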

@duncandc
Contributor

I certainly think the multi-dimensional array for the marks is the simplest solution. The sparse matrix required to store all pairs will become enormous. I think that is a bad direction to head...

In principle, I like the idea of the engine selecting random weights, but I think that would slow down the engine significantly. Also, the engine is often passed only a subset of the points (and therefore of the marks), especially when the calculation is parallelized. Selecting at random from a subset of the points is definitely not what you want to do. Working this out would require some actual thought.

@aphearin
Contributor Author

I'm exploring an algorithm without any memory increase at all. What I do is add one more inner loop where I randomly select an index from each weights array at each loop iteration. This way, I just randomly sample the input weights array on-the-fly, so there's zero increased memory load, and also only pairs within rmax get selected. Provided this algorithm is correct, it's the optimal solution.
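Schematically, the idea is something like the following pure-Python/NumPy stand-in (the real implementation lives in the Cython engine mentioned below; all names and pair indices here are illustrative, not the actual engine code):

```python
import numpy as np

rng = np.random.default_rng(0)

npts = 1_000
weights = rng.random(npts)

# Stand-ins for the pairs the engine finds within rmax; fabricated for illustration.
i_idx = rng.integers(0, npts, size=5_000)
j_idx = rng.integers(0, npts, size=5_000)

# The unrandomized marked pair sum uses the true pair indices.
true_sum = np.sum(weights[i_idx] * weights[j_idx])

nshuffles = 100
randomized_sums = np.empty(nshuffles)
for k in range(nshuffles):
    total = 0.0
    # Inner loop over the pairs found within rmax: instead of looking up the true
    # weights[i] * weights[j], draw random indices into the weights array on the fly,
    # i.e. sample the marks with replacement rather than building shuffled copies.
    for _ in range(i_idx.size):
        total += weights[rng.integers(npts)] * weights[rng.integers(npts)]
    randomized_sums[k] = total
```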

The only catch is that this requires random number generation on-the-fly in Cython. That is, you can't rely on NumPy for the random number generator anymore; it has to be done within the Cython layer. I had to develop code to do this for a separate project, which I've ported over here. @duncandc - if you haven't seen how to do this before, it's a handy trick, so have a look sometime at this engine, which is still under development. At the top of the module, the random_uniform function shows how to draw random uniforms in the Cython layer, and the cython_randint function shows how to draw random integers.

I'm actually not confident that this is a correct algorithm, although it is the same algorithm used by @andrew-zentner in The Immitigable Nature of Assembly Bias. I'll be sure to test it on a real science case before merging.
