Automatic binning #55

Closed
MuellerSeb opened this issue Jan 10, 2020 · 12 comments
@MuellerSeb
Member

MuellerSeb commented Jan 10, 2020

This is just a note, so we don't forget and can talk about it:

My proposal for default binning:

  • number of bins calculated by either:

    • (a) Sturges rule:

      n_bins = 2 * np.log2(n_points) + 1
      
    • (b) Rice rule:

      n_bins = np.power(n_points, 2 / 3)
      
  • maximal bin edge as a third of the box diameter:

diam = np.sqrt((x.max() - x.min())**2 + (y.max() - y.min())**2 + (z.max() - z.min())**2)
    bin_max = diam / 3
    
  • resulting bin sizes (quadratic growth):

    bins = np.linspace(0, bin_max**(1/2), n_bins)**2
    

Example: 30 points with diam = 900 and Sturges rule:

bins = np.linspace(0, (900/3)**0.5, int(np.around(2*np.log2(30) + 1))) ** 2
array([  0.,   3.,  12.,  27.,  48.,  75., 108., 147., 192., 243., 300.])
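A minimal sketch of how these pieces could fit together, as a hypothetical helper (not existing GSTools API), assuming pos is a (dim, n_points) array of coordinates:

import numpy as np

def auto_bins(pos, rule="sturges"):
    # hypothetical helper sketching the proposed default binning
    pos = np.atleast_2d(pos)
    n_points = pos.shape[1]
    if rule == "sturges":
        n_bins = int(np.around(2 * np.log2(n_points) + 1))
    else:  # "rice"
        n_bins = int(np.around(n_points ** (2.0 / 3.0)))
    # maximal bin edge: a third of the bounding-box diameter
    diam = np.sqrt(np.sum((pos.max(axis=1) - pos.min(axis=1)) ** 2))
    bin_max = diam / 3.0
    # quadratically growing bin edges (n_bins edges, as in the example above)
    return np.linspace(0, np.sqrt(bin_max), n_bins) ** 2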
@MuellerSeb added the enhancement and help wanted labels Jan 10, 2020
@MuellerSeb added this to the 1.2 milestone Jan 10, 2020
@banesullivan
Collaborator

@MuellerSeb, what is the d in

diam = ((x.max() - x.min())**d + (y.max() - y.min())**d + (z.max() - z.min())**d)**(1/d)

@MuellerSeb
Member Author

It should be a constant 2, since it is the Euclidean norm. I just mixed it up with the d-norm.

Comment was updated.

@banesullivan
Collaborator

@MuellerSeb, I did this for the example in #36 and it gives really great results. The variogram doesn't look all that pretty, but the results were compelling.

I will make a PR with this now

@MuellerSeb
Member Author

Thanks for your PR. @LSchueler is working on this at the moment, so I have closed your PR in favor of the re-written routines.

But it is good to hear that the scheme was working for you! That means we are on the right track.

@MuellerSeb modified the milestones: 1.2, 1.3 Mar 20, 2020
@MuellerSeb
Member Author

MuellerSeb commented Mar 28, 2020

@LSchueler: It would be nice to get some information on the variance within each bin during the variogram estimation. One could use a sequential update during the estimation, so we don't consume memory:

mean: x̄_{n+1} = x̄_n + (x_{n+1} - x̄_n) / (n+1)
var:  σ²_{n+1} = σ²_n + ((x_{n+1} - x̄_n)(x_{n+1} - x̄_{n+1}) - σ²_n) / (n+1)

Where the x_n are the samples within one bin.

We could later use this variance as a target function to estimate anisotropy and rotation by minimizing it. Or we could use this as weights during the fitting of a variogram function.
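For illustration, a minimal pure-Python sketch of such a streaming update for one bin (not the actual Cython routine); count, mean and var would be tracked per bin:

import numpy as np

def update_bin(count, mean, var, value):
    # one streaming update step for a bin's count, mean and (population) variance
    count += 1
    delta = value - mean
    mean_new = mean + delta / count
    var += (delta * (value - mean_new) - var) / count
    return count, mean_new, var

# example: feed a stream of values into one bin
count, mean, var = 0, 0.0, 0.0
for v in np.random.rand(1000):
    count, mean, var = update_bin(count, mean, var, v)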

@MuellerSeb
Member Author

Then we could get the following values for each bin:

  • bin center
  • estimated variogram value at the bin center
  • count of values within the bin
  • mean of values within the bin
  • variance of values within the bin

@TobiasGlaubach
Contributor

TobiasGlaubach commented Apr 28, 2020

Hi everyone,
first of all thanks for pointing me here @MuellerSeb.
Did you try the sequential update scheme somewhere before, @MuellerSeb? I tried it for a real-time sensor-data processing pipeline a few years ago and found that the numerical error accumulates much more quickly than with the standard approach. I am not sure, however, about the data types and data I used back then. Just saying that this is something to test when implementing.


EDIT:
I checked my old notes and found that the above statement only applies to the sequential calculation of floating (windowed) statistics with iterations in the millions.

Sorry about that!

@TobiasGlaubach
Contributor

Regarding your idea here: I think it's a good one, and I tried something similar before (the rotation and stretching part, not the variance-minimizing part). What I failed to accomplish (in my quick tests, without investing much time) was to properly de-stretch the data, since one would need to stretch based on variance. My problem was that very anisotropic and very sparse data (well drilling data) led to bad estimation results. I did not invest a lot of time in it, though.

Regarding the rotation:
SciPy has a fast and efficient implementation for this.

Generating a test model:

from scipy.spatial.transform import Rotation as R
import numpy as np
import gstools as gs
import matplotlib.pyplot as plt
x = y = np.arange(100)
model = gs.Exponential(dim=2, var=1, len_scale=[12.0, 3.0], angles=np.pi / 8)
srf = gs.SRF(model, seed=20170519)
srf.structured([x, y])
srf.plot()
val = srf.field.copy()
(gridx, gridy) = srf.pos

[figure: plot of the generated anisotropic SRF field]

Selecting a random unstructured subset of the data:

X, Y = np.meshgrid(gridx, gridy)
idx = np.random.choice(X.size, 5000)
x = X.flatten()[idx]
y = Y.flatten()[idx]
v = val.flatten()[idx]

And rotating it:

r = R.from_euler('z', 180-67, degrees=True)
xyz = np.stack((x, y, np.zeros_like(x))).T

xyzr = np.apply_along_axis(r.apply, 1, xyz)
f, axes = plt.subplots(1,2, figsize=(10,5))
axes[0].scatter(xyz[:,0], xyz[:,1], c=v, marker='.')
axes[1].scatter(xyzr[:,0], xyzr[:,1], c=v, marker='.')

[figure: scatter plots of the random subset before and after rotation]

I found that for my examples the rotation operation was more than fast enough.
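Side note: Rotation.apply also accepts an (N, 3) array of vectors directly, so the per-row apply_along_axis call could be replaced with a single vectorized call, which should give the same result:

xyzr = r.apply(xyz)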

@TobiasGlaubach
Contributor

TobiasGlaubach commented Apr 28, 2020

@LSchueler: It would be nice to get some information on the variance within each bin during the variogram estimation. One could use a sequential update during the estimation, so we don't consume memory:

mean: x̄_{n+1} = x̄_n + (x_{n+1} - x̄_n) / (n+1)
var:  σ²_{n+1} = σ²_n + ((x_{n+1} - x̄_n)(x_{n+1} - x̄_{n+1}) - σ²_n) / (n+1)

Where the x_n are the samples within one bin.

We could later use this variance as a target function to estimate anisotropy and rotation by minimizing it. Or we could use this as weights during the fitting of a variogram function.

I just did a quick test and found that there is a slight discrepancy between the formulas you proposed and the mean and var implementations of NumPy. The discrepancy seems to get smaller for higher n. Just putting this out as a note:

import numpy as np

for n in [100, 1000, 10000, 100000]:
    
    x = np.random.randn(n,1) * 3 + 1
    ninv = 1./float(n)
    print((n, ninv, np.mean(x)))
    
    x_var_i = 0
    x_dash_i = x[0]
    for i in range(1, n):
        x_dash_i_new = x_dash_i + 1/i * (x[i] - x_dash_i)
        x_var_i = x_var_i + 1/i * ( (x[i] - x_dash_i) * (x[i] - x_dash_i_new) - x_var_i )
        x_dash_i = x_dash_i_new
                                   
    print('mean: np.mean={}, x_dash_n={}, diff={}'.format(np.mean(x), x_dash_i[0], np.abs(np.mean(x) - x_dash_i)[0]))
    print(' var: np.var ={}, x_var_i ={}, diff={}'.format(np.var(x), x_var_i[0], np.abs(np.var(x) - x_var_i)[0]))
    print('')

yields:

(100, 0.01, 0.8750765908826976)
mean: np.mean=0.8750765908826976, x_dash_n=0.8851148621683292, diff=0.010038271285631617
 var: np.var =8.38949142543425, x_var_i =8.46415707402451, diff=0.07466564859026015

(1000, 0.001, 0.924959624466884)
mean: np.mean=0.924959624466884, x_dash_n=0.9251304266394927, diff=0.00017080217260878605
 var: np.var =9.688353388604213, x_var_i =9.69802226666209, diff=0.00966887805787664

(10000, 0.0001, 0.9895868039181663)
mean: np.mean=0.9895868039181663, x_dash_n=0.9891618020657509, diff=0.00042500185241545196
 var: np.var =8.964186993437021, x_var_i =8.963277236041739, diff=0.00090975739528254

(100000, 1e-05, 0.9966201846514295)
mean: np.mean=0.9966201846514295, x_dash_n=0.9965993949967373, diff=2.078965469221039e-05
 var: np.var =8.981013033043954, x_var_i =8.981059623098185, diff=4.6590054230577493e-05

@MuellerSeb
Member Author

MuellerSeb commented Apr 28, 2020

@TobiasGlaubach: Indeed, they should match. The advantage of the sequential formulation is that you can update the current variance value as new data comes in. Since the Cython routines loop over all pairs of points and then decide which bin to put each combination into, it is more convenient to use a sequential update of the variance value, so we don't have to store all values to calculate the variance afterwards (memory efficiency).

I guess the difference in the variance formulation comes from the use of population vs. sample variance: https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

where the only difference is the divisor:

  • n (population variance)
  • n-1 (sample variance)

For increasing n they converge to each other, as you correctly observed. You can specify this in the NumPy routine via ddof: https://numpy.org/doc/stable/reference/generated/numpy.var.html

And in your code sample, you are starting at i=1, so you are leaving out the first value.
(EDIT: this was actually the problem. Replacing range(1, n) with range(n) and 1/i with 1/(i+1) produces equal results.)
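For reference, a sketch of the loop with that correction applied, which should match np.mean and np.var (ddof=0) up to floating-point error:

import numpy as np

n = 10000
x = np.random.randn(n) * 3 + 1

x_var_i = 0.0
x_dash_i = x[0]
for i in range(n):  # start at the first sample, divisor is the current count i + 1
    x_dash_i_new = x_dash_i + 1 / (i + 1) * (x[i] - x_dash_i)
    x_var_i = x_var_i + 1 / (i + 1) * ((x[i] - x_dash_i) * (x[i] - x_dash_i_new) - x_var_i)
    x_dash_i = x_dash_i_new

print(np.isclose(x_dash_i, np.mean(x)), np.isclose(x_var_i, np.var(x)))  # True True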

Regarding your de-stretching question, GSTools provides routines to de-stretch and re-rotate fields given here:

def make_isotropic(dim, anis, y, z):

def unrotate_mesh(dim, angles, x, y, z):

@TobiasGlaubach
Contributor

Thanks for looking into this. Actually, starting at 1 is no problem if one sets

    x_var_i = 0
    x_dash_i = x[0]

I am currently working on estimate.pyx for the directional variogram estimation in #83. I can implement the variance and mean estimation there as well, so it can be used for other things like the mentioned anisotropy and orientation fitting.
@MuellerSeb, any reason against that?

@MuellerSeb modified the milestones: 1.3, 2.0 Jan 2, 2021
@MuellerSeb changed the title Automated bining → Automatic binning Jan 14, 2021
@MuellerSeb modified the milestones: 2.0, 1.3 Jan 14, 2021
@MuellerSeb unpinned this issue Jan 15, 2021
@MuellerSeb
Member Author

Closed by #131
