Automatic binning #55

Closed
MuellerSeb opened this issue Jan 10, 2020 · 12 comments
@MuellerSeb
Member

MuellerSeb commented Jan 10, 2020

This is just a note, so we don't forget and can talk about it:

My proposal for default binning:

  • number of bins calculated by either:

    • (a) Sturges rule:

      n_bins = 2 * np.log2(n_points) + 1
      
    • (b) Rice rule:

      n_bins = np.power(n_points, 2 / 3)
      
  • maximal bin edge as a third of the box diameter:

diam = np.sqrt((x.max() - x.min())**2 + (y.max() - y.min())**2 + (z.max() - z.min())**2)
    bin_max = diam / 3
    
  • resulting bin sizes (quadratic growth):

    bins = np.linspace(0, bin_max**(1/2), n_bins)**2
    

Example: 30 points with diam = 900 and Sturges rule:

bins = np.linspace(0, (900/3)**0.5, int(np.around(2*np.log2(30) + 1))) ** 2
array([  0.,   3.,  12.,  27.,  48.,  75., 108., 147., 192., 243., 300.])
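A minimal sketch of how these pieces could fit together, as a hypothetical helper (not existing GSTools API), assuming pos is a (dim, n_points) array of coordinates:

import numpy as np

def auto_bins(pos, rule="sturges"):
    # hypothetical helper sketching the proposed default binning
    pos = np.atleast_2d(pos)
    n_points = pos.shape[1]
    if rule == "sturges":
        n_bins = int(np.around(2 * np.log2(n_points) + 1))
    else:  # "rice"
        n_bins = int(np.around(n_points ** (2.0 / 3.0)))
    # maximal bin edge: a third of the bounding-box diameter
    diam = np.sqrt(np.sum((pos.max(axis=1) - pos.min(axis=1)) ** 2))
    bin_max = diam / 3.0
    # quadratically growing bin edges (n_bins edges, as in the example above)
    return np.linspace(0, np.sqrt(bin_max), n_bins) ** 2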
@MuellerSeb added the enhancement and help wanted labels Jan 10, 2020
@MuellerSeb added this to the 1.2 milestone Jan 10, 2020
@banesullivan
Collaborator

@MuellerSeb, what is the d in

diam = ((x.max() - x.min())**d + (y.max() - y.min())**d + (z.max() - z.min())**d)**(1/d)

@MuellerSeb
Member Author

It should be a constant 2, since it is the Euclidean norm. I just mixed it up with the d-norm.

Comment was updated.

@banesullivan
Collaborator

@MuellerSeb, I did this for the example in #36 and it gives really great results. The variogram doesn't look all that pretty, but the results were compelling.

I will make a PR with this now

@MuellerSeb
Member Author

Thanks for your PR. @LSchueler is working on this at the moment, so I have closed your PR in favor of the re-written routines.

But it is good to hear that the scheme was working for you! That means we are on the right track.

@MuellerSeb modified the milestones: 1.2, 1.3 Mar 20, 2020
@MuellerSeb
Member Author

MuellerSeb commented Mar 28, 2020

@LSchueler: It would be nice to get some information on the variance within each bin during the variogram estimation. One could use a sequential update during the estimation, so we don't consume memory:

mean: x̄_{n+1} = x̄_n + (x_{n+1} - x̄_n) / (n+1)
var:  σ²_{n+1} = σ²_n + ((x_{n+1} - x̄_n)(x_{n+1} - x̄_{n+1}) - σ²_n) / (n+1)

Where the x_n are the samples within one bin.

We could later use this variance as a target function to estimate anisotropy and rotation by minimizing it. Or we could use this as weights during the fitting of a variogram function.
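For illustration, a minimal pure-Python sketch of such a streaming update for one bin (not the actual Cython routine); count, mean and var would be tracked per bin:

import numpy as np

def update_bin(count, mean, var, value):
    # one streaming update step for a bin's count, mean and (population) variance
    count += 1
    delta = value - mean
    mean_new = mean + delta / count
    var += (delta * (value - mean_new) - var) / count
    return count, mean_new, var

# example: feed a stream of values into one bin
count, mean, var = 0, 0.0, 0.0
for v in np.random.rand(1000):
    count, mean, var = update_bin(count, mean, var, v)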

@MuellerSeb
Member Author

Then we could get the following values for each bin:

  • bin center
  • estimated variogram value at the bin center
  • count of values within the bin
  • mean of values within the bin
  • variance of values within the bin

@TobiasGlaubach
Contributor

TobiasGlaubach commented Apr 28, 2020

Hi everyone,
first of all thanks for pointing me here @MuellerSeb.
Did you try the sequential update scheme somewhere before, @MuellerSeb? I tried it for a real-time sensor-data processing pipeline a few years ago and found that the numerical error accumulates much more quickly than with the standard approach. I am not sure, however, about the data types and data I used back then. Just saying that this is something to test when implementing.


EDIT:
I checked my old notes and found that the above statement only applies to the sequential calculation of floating (windowed) statistics with iterations in the millions.

Sorry about that!

@TobiasGlaubach
Contributor

Regarding your idea here: I think it's a good one, and I tried something similar before (the rotation and stretching part, not the variance-minimizing part). What I failed to accomplish (in my quick tests, without investing much time) was to properly de-stretch the data, since one would need to stretch based on variance. My problem was that very anisotropic and very sparse data (well drilling data) led to bad estimation results. I did not invest a lot of time in it, though.

Regarding the rotation:
SciPy has a fast and efficient implementation for this.

Generating a test model:

from scipy.spatial.transform import Rotation as R
import numpy as np
import gstools as gs
import matplotlib.pyplot as plt
x = y = np.arange(100)
model = gs.Exponential(dim=2, var=1, len_scale=[12.0, 3.0], angles=np.pi / 8)
srf = gs.SRF(model, seed=20170519)
srf.structured([x, y])
srf.plot()
val = srf.field.copy()
(gridx, gridy) = srf.pos

[figure: plot of the generated anisotropic SRF field]

Selecting a random unstructured subset of the data:

X, Y = np.meshgrid(gridx, gridy)
idx = np.random.choice(X.size, 5000)
x = X.flatten()[idx]
y = Y.flatten()[idx]
v = val.flatten()[idx]

And rotating it:

r = R.from_euler('z', 180-67, degrees=True)
xyz = np.stack((x, y, np.zeros_like(x))).T

xyzr = np.apply_along_axis(r.apply, 1, xyz)
f, axes = plt.subplots(1,2, figsize=(10,5))
axes[0].scatter(xyz[:,0], xyz[:,1], c=v, marker='.')
axes[1].scatter(xyzr[:,0], xyzr[:,1], c=v, marker='.')

[figure: scatter plots of the random subset before and after rotation]

I found that for my examples the rotation operation was more than fast enough.
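Side note: Rotation.apply also accepts an (N, 3) array of vectors directly, so the per-row apply_along_axis call could be replaced with a single vectorized call, which should give the same result:

xyzr = r.apply(xyz)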

@TobiasGlaubach
Contributor

TobiasGlaubach commented Apr 28, 2020

@LSchueler: It would be nice to get some information on the variance within each bin during the variogram estimation. One could use a sequential update during the estimation, so we don't consume memory:

mean: x̄_{n+1} = x̄_n + (x_{n+1} - x̄_n) / (n+1)
var:  σ²_{n+1} = σ²_n + ((x_{n+1} - x̄_n)(x_{n+1} - x̄_{n+1}) - σ²_n) / (n+1)

Where the x_n are the samples within one bin.

We could later use this variance as a target function to estimate anisotropy and rotation by minimizing it. Or we could use this as weights during the fitting of a variogram function.

I just did a quick test and found that there is a slight discrepancy between the formulas you proposed and the mean and var implementations of NumPy. The discrepancy seems to get smaller for higher n. Just putting this out as a note:

import numpy as np

for n in [100, 1000, 10000, 100000]:
    
    x = np.random.randn(n,1) * 3 + 1
    ninv = 1./float(n)
    print((n, ninv, np.mean(x)))
    
    x_var_i = 0
    x_dash_i = x[0]
    for i in range(1, n):
        x_dash_i_new = x_dash_i + 1/i * (x[i] - x_dash_i)
        x_var_i = x_var_i + 1/i * ( (x[i] - x_dash_i) * (x[i] - x_dash_i_new) - x_var_i )
        x_dash_i = x_dash_i_new
                                   
    print('mean: np.mean={}, x_dash_n={}, diff={}'.format(np.mean(x), x_dash_i[0], np.abs(np.mean(x) - x_dash_i)[0]))
    print(' var: np.var ={}, x_var_i ={}, diff={}'.format(np.var(x), x_var_i[0], np.abs(np.var(x) - x_var_i)[0]))
    print('')

yields:

(100, 0.01, 0.8750765908826976)
mean: np.mean=0.8750765908826976, x_dash_n=0.8851148621683292, diff=0.010038271285631617
 var: np.var =8.38949142543425, x_var_i =8.46415707402451, diff=0.07466564859026015

(1000, 0.001, 0.924959624466884)
mean: np.mean=0.924959624466884, x_dash_n=0.9251304266394927, diff=0.00017080217260878605
 var: np.var =9.688353388604213, x_var_i =9.69802226666209, diff=0.00966887805787664

(10000, 0.0001, 0.9895868039181663)
mean: np.mean=0.9895868039181663, x_dash_n=0.9891618020657509, diff=0.00042500185241545196
 var: np.var =8.964186993437021, x_var_i =8.963277236041739, diff=0.00090975739528254

(100000, 1e-05, 0.9966201846514295)
mean: np.mean=0.9966201846514295, x_dash_n=0.9965993949967373, diff=2.078965469221039e-05
 var: np.var =8.981013033043954, x_var_i =8.981059623098185, diff=4.6590054230577493e-05

@MuellerSeb
Member Author

MuellerSeb commented Apr 28, 2020

@TobiasGlaubach: Indeed, they should match. The advantage of the sequential formulation is that you can update the current variance value as new data comes in. Since the Cython routines loop over all pairs of points and then decide which bin to put each combination into, it is more convenient to use a sequential update of the variance value, so we don't have to store all values to calculate the variance afterwards (memory efficiency).

I guess the difference in the variance formulation comes from the use of population vs. sample variance: https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

where the only difference is the divisor:

  • n (population variance)
  • n-1 (sample variance)

For increasing n they converge to each other, as you correctly observed. You can specify this in the NumPy routine via ddof: https://numpy.org/doc/stable/reference/generated/numpy.var.html

And in your code sample, you are starting at i=1, so you are leaving out the first value.
(EDIT: this was actually the problem. Replacing range(1, n) with range(n) and 1/i with 1/(i+1) produces equal results.)
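For reference, a sketch of the loop with that correction applied, which should match np.mean and np.var (ddof=0) up to floating-point error:

import numpy as np

n = 10000
x = np.random.randn(n) * 3 + 1

x_var_i = 0.0
x_dash_i = x[0]
for i in range(n):  # start at the first sample, divisor is the current count i + 1
    x_dash_i_new = x_dash_i + 1 / (i + 1) * (x[i] - x_dash_i)
    x_var_i = x_var_i + 1 / (i + 1) * ((x[i] - x_dash_i) * (x[i] - x_dash_i_new) - x_var_i)
    x_dash_i = x_dash_i_new

print(np.isclose(x_dash_i, np.mean(x)), np.isclose(x_var_i, np.var(x)))  # True True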

Regarding your de-stretching question, GSTools provides routines to de-stretch and re-rotate fields given here:

def make_isotropic(dim, anis, y, z):

def unrotate_mesh(dim, angles, x, y, z):

@TobiasGlaubach
Contributor

Thanks for looking into this. Actually, starting at 1 is no problem if one sets

    x_var_i = 0
    x_dash_i = x[0]

I am currently working on estimate.pyx for the directional variogram estimation in #83. I can implement the variance and mean estimation there as well, so it can be used for other things like the mentioned anisotropy and orientation fitting.
@MuellerSeb, any reason against that?

@MuellerSeb modified the milestones: 1.3, 2.0 Jan 2, 2021
@MuellerSeb changed the title Automated bining → Automatic binning Jan 14, 2021
@MuellerSeb modified the milestones: 2.0, 1.3 Jan 14, 2021
@MuellerSeb unpinned this issue Jan 15, 2021
@MuellerSeb
Member Author

Closed by #131
