Marked correlation function error estimation #471

Open
aphearin opened this issue Apr 18, 2016 · 6 comments

@aphearin
Contributor

The most common way to estimate errors in a marked correlation function (MCF) analysis is to repeatedly count marked pairs with scrambled marks. The current API makes this process very slow, because with each call to the MCF, pairs of points need to be found all over again. There should be a way to open each pair of cells only once. The naive way of doing this would be to pass in a multi-dimensional array of shape (num_pts, num_realizations), and simply loop over num_realizations for each pair of cells. However, memory constraints make this infeasible for reasonable sample sizes and numbers of realizations, so we will need to be clever.
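For concreteness, a minimal NumPy sketch of that naive multi-dimensional-array idea (the variable names are illustrative only, not part of the mock_observables API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; at 10^6 points x 10^3 realizations this array alone
# would occupy ~8 GB in double precision, which is the memory constraint above.
num_pts, num_realizations = 10_000, 100
marks = rng.random(num_pts)  # stand-in for the real marks

# Each column is an independent random permutation of the marks.
shuffled_marks = np.empty((num_pts, num_realizations))
for i in range(num_realizations):
    shuffled_marks[:, i] = rng.permutation(marks)
```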

@duncandc - now that I have finished giving the underlying mock_observables engine a rewrite, I think now is the time to implement this, so let's hash this out when we sit down later this week to port the new engine to the rest of the mock_observables functions.

CC @vcalderon2009 and @andrew-zentner , who have both requested this feature.

@aphearin aphearin self-assigned this Apr 18, 2016
@aphearin aphearin added this to the v0.2 milestone Apr 18, 2016
@aphearin aphearin modified the milestones: v0.3, v0.2 May 24, 2016
@aphearin aphearin modified the milestones: v0.4, v0.3 Jun 26, 2016
@aphearin aphearin modified the milestones: next-release, v0.4 Jul 25, 2016
@manodeep
Contributor

Count me in as another interested user :)

@aphearin
Contributor Author

@manodeep - are the memory constraints I mentioned here actually a big problem? The simplest solution is still to use a multi-dimensional array with an extra dimension for each shuffling of the marks. For 10^6 points and 10^3 shufflings, this doesn't seem out of the question, although it certainly is not a scalable solution. What do you think?

What makes this a hard problem is that the error estimation is usually done by shuffling. If indices could simply be selected at random for each pair of points, the problem would become very simple. But that amounts to sampling the marks with replacement rather than permuting them, so the same mark could be selected multiple times, which would not capture some aspects of the sample variance. This is probably important, but I have never tested the consequences.

@duncandc, @andrew-zentner @surhudm - any bright ideas?

@aphearin
Contributor Author

An alternative option would be to precompute all pairs once, storing the pair indices and distances in a sparse matrix. Then, in a separate subsequent step, loop over the number of shuffles: the first operation in each iteration is a fresh shuffle of the array of marks, after which the code loops over the sparse matrix of stored pairs.

That would be a tolerable amount of work, but I'm still not confident it would resolve the memory problems. It would certainly be a lot faster than repeatedly counting pairs, which is what the current code requires. So whether this is worth trying depends somewhat on how important the problem is. Does anyone have comments on whether the current method is unworkably slow?
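Something like the following sketch is what I have in mind (scipy.sparse is just one possible container; the pair indices are fabricated for illustration and none of this reflects the actual engine code):

```python
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(0)

npts, nshuffles = 1_000, 50
marks = rng.random(npts)

# Stand-ins for the (i, j) index pairs found once by a single pair-counting pass;
# in practice these would come from the cell-pair engine, restricted to r < rmax.
i_idx = rng.integers(0, npts, size=5_000)
j_idx = rng.integers(0, npts, size=5_000)

# Step 1: store the pairs once in a sparse matrix (distances could be stored the
# same way, so the radial binning also happens outside the shuffle loop).
pairs = coo_matrix((np.ones(i_idx.size), (i_idx, j_idx)), shape=(npts, npts))

# Step 2: loop over shuffles, re-permuting the marks each time and reusing the
# stored pairs instead of recounting them.
marked_sums = np.empty(nshuffles)
for k in range(nshuffles):
    shuffled = rng.permutation(marks)
    # weight each stored pair by the product of its shuffled marks
    marked_sums[k] = np.sum(shuffled[pairs.row] * shuffled[pairs.col] * pairs.data)
```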

@andrew-zentner
Contributor

@aphearin - Is memory really an issue? My antiquated fortran code (which Antonio used) works more or less as you describe above. Suppose I have 10^6 galaxies and want 10^3 shuffled samples (probably more than necessary if you are quoting 1 or 2 sigma regions, for which a few hundred samples give a good estimate). My fortran code works as follows: shuffle the marks of the real sample 10^3 times to create 10^3 samples with shuffled marks, then assign all 1001 marks (10^3 shuffled + 1 for the actual mark) to each object (i.e., each galaxy). Then I do the pair counting using a tree. Even in this case, my code uses only a few GB of memory, which is completely feasible for most users these days. Is there something that I misunderstand?
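For reference, a back-of-the-envelope check of those numbers (assuming one double-precision mark per galaxy per realization):

```python
# Memory to store 10^3 shuffled mark sets plus the original marks for 10^6 galaxies,
# assuming one double-precision (8-byte) mark per galaxy per realization.
num_galaxies = 10**6
num_mark_sets = 10**3 + 1  # 10^3 shuffles + 1 for the actual marks
bytes_per_mark = 8

total_gb = num_galaxies * num_mark_sets * bytes_per_mark / 1024**3
print(f"{total_gb:.1f} GB")  # ~7.5 GB, consistent with "a few GB" as described above
```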

@duncandc
Contributor

I certainly think the multi-dimensional array for the marks is the simplest solution. The sparse matrix required to store all pairs will become enormous. I think that is a bad direction to head...

In principle, I like the idea of the engine selecting random weights, but I think that would slow down the engine significantly. Also, the engine is often passed only a subset of the points (and therefore of the marks), especially when the calculation is parallelized. Selecting at random from a subset of the points is definitely not what you want to do. Working this out would require some actual thought.

@aphearin
Contributor Author

I'm exploring an algorithm without any memory increase at all. What I do is add one more inner loop where I randomly select an index from each weights array at each loop iteration. This way, I just randomly sample the input weights array on-the-fly, so there's zero increased memory load, and also only pairs within rmax get selected. Provided this algorithm is correct, it's the optimal solution.
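Schematically, the idea is something like the following pure-Python/NumPy stand-in (the real implementation lives in the Cython engine mentioned below; all names and pair indices here are illustrative, not the actual engine code):

```python
import numpy as np

rng = np.random.default_rng(0)

npts = 1_000
weights = rng.random(npts)

# Stand-ins for the pairs the engine finds within rmax; fabricated for illustration.
i_idx = rng.integers(0, npts, size=5_000)
j_idx = rng.integers(0, npts, size=5_000)

# The unrandomized marked pair sum uses the true pair indices.
true_sum = np.sum(weights[i_idx] * weights[j_idx])

nshuffles = 100
randomized_sums = np.empty(nshuffles)
for k in range(nshuffles):
    total = 0.0
    # Inner loop over the pairs found within rmax: instead of looking up the true
    # weights[i] * weights[j], draw random indices into the weights array on the fly,
    # i.e. sample the marks with replacement rather than building shuffled copies.
    for _ in range(i_idx.size):
        total += weights[rng.integers(npts)] * weights[rng.integers(npts)]
    randomized_sums[k] = total
```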

The only catch is that this requires random number generation on-the-fly in Cython. That is, you can't rely on NumPy for the random number generator anymore; it has to be done within the Cython layer. I had to develop code to do this for a separate project, which I've ported over here. @duncandc - if you haven't seen how to do this before, it's a handy trick, so have a look sometime at this engine, which is still under development. At the top of the module, the random_uniform function shows how to draw random uniforms in the Cython layer, and the cython_randint function shows how to draw random integers.

I'm actually not confident that this is a correct algorithm, although it is the same algorithm used by @andrew-zentner in The Immitigable Nature of Assembly Bias. I'll be sure to test it on a real science case before merging.
