This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

Lag error on ks_test #43

Closed
astanway opened this issue Aug 18, 2013 · 8 comments · Fixed by #44

Comments

@astanway
Contributor

I occasionally get this error:

ERROR:root:Algorithm error: Traceback (most recent call last):
  File "/home/astanway/skyline/src/analyzer/algorithms.py", line 263, in run_selected_algorithm
    ensemble = [globals()[algorithm](timeseries) for algorithm in ALGORITHMS]
  File "/home/astanway/skyline/src/analyzer/algorithms.py", line 206, in ks_test
    adf = sm.tsa.stattools.adfuller(reference, 10)
  File "/usr/lib64/python2.6/site-packages/statsmodels/tsa/stattools.py", line 201, in adfuller
    xdall = lagmat(xdiff[:,None], maxlag, trim='both', original='in')
  File "/usr/lib64/python2.6/site-packages/statsmodels/tsa/tsatools.py", line 305, in lagmat
    raise ValueError("maxlag should be < nobs")
ValueError: maxlag should be < nobs

Any clues? cc @mabrek

Re: f886000

@mabrek
Contributor

mabrek commented Aug 19, 2013

The error means that there are not enough datapoints for the test. What resolution (interval between observations) do you use?
I used this test for metrics with 2-second resolution. Initially it was 1s, but I found quite high sampling jitter caused by the Linux kernel VM stats update interval (which is 1s).

@astanway
Contributor Author

I use a 10 second resolution, with lots of variation in overall sample size. Is there a hard number on the minimum datapoints needed for this statistic?

@mabrek
Contributor

mabrek commented Aug 19, 2013

Yes, there is a hard limit of 10 datapoints in the reference part (between an hour and 10 minutes ago).
adf = sm.tsa.stattools.adfuller(reference, 10)
I can guard it with an 'if' condition.
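
A minimal sketch of such a guard, assuming the check behaves as the traceback above suggests: adfuller differences the series (losing one point) and lagmat then requires maxlag < nobs on the differenced data. The names ADF_MAXLAG and enough_points_for_adf are hypothetical and not taken from the skyline code, and other statsmodels versions may impose stricter limits.

```python
# Hypothetical constant: the maxlag passed to adfuller in the call above.
ADF_MAXLAG = 10


def enough_points_for_adf(reference, maxlag=ADF_MAXLAG):
    """Return True if `reference` looks long enough for adfuller(reference, maxlag).

    Assumption drawn from the traceback: the series is differenced once
    (length drops by 1) and the lag matrix needs maxlag < nobs afterwards.
    """
    return len(reference) - 1 > maxlag


def guarded(reference, run_test):
    """Skip the test (report 'not anomalous') when there is too little data.

    `run_test` is a stand-in for the real ks_test body.
    """
    if not enough_points_for_adf(reference):
        return False
    return run_test(reference)
```

With this condition, a reference window of 11 points is skipped and one of 12 points is passed on to the test.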

@astanway
Contributor Author

Ah, I see - yeah, I think a conditional there would be safer.


@mabrek
Contributor

mabrek commented Aug 20, 2013

I've added the conditional.

As a side note, there might be some confusion in the way algorithms select the data range to check for anomalies.

Checking the last N datapoints gives different results on metrics with different resolutions. If an anomaly is detected on the last datapoint, or even the last 3 datapoints, of a metric with 2-second resolution, that anomaly might disappear in 10 seconds. If the metric has a resolution of 5 minutes, then there is quite a lot of time for a human to notice the detected anomaly.

Checking the last N minutes would not provide enough datapoints for some algorithms (like ks_test) on low-resolution metrics.

@astanway
Contributor Author

That is correct. Perhaps a new setting is needed - TAIL_AVERAGE_SIZE?


@mabrek
Contributor

mabrek commented Aug 20, 2013

Metrics with different resolutions might be present in the same environment, so a single size won't fit them all. Maybe it's better to use time to cut the tail off the sequence (TAIL_TIME)?
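
A rough sketch of what a time-based cut could look like, assuming a timeseries is a list of (timestamp, value) pairs sorted by timestamp. The function name tail_by_time and the choice to measure the window back from the newest datapoint are illustrative assumptions, not existing skyline behavior.

```python
def tail_by_time(timeseries, tail_seconds):
    """Return the datapoints from the last `tail_seconds` of the series.

    The window is measured back from the newest timestamp in the series,
    not from the wall clock, so the result is reproducible on stored data.
    """
    if not timeseries:
        return []
    cutoff = timeseries[-1][0] - tail_seconds
    return [(t, v) for t, v in timeseries if t > cutoff]


# Two metrics over comparable spans, at 2-second and 5-minute resolution.
fast = [(t, 1.0) for t in range(0, 60, 2)]    # 30 points, 2s apart
slow = [(t, 1.0) for t in range(0, 1200, 300)]  # 4 points, 300s apart

fast_tail = tail_by_time(fast, 10)
slow_tail = tail_by_time(slow, 10)
```

Here the same 10-second window keeps 5 points of the 2-second metric but only 1 point of the 5-minute metric, which is the shortage for low-resolution metrics mentioned earlier in the thread. A time-based setting would probably still need a minimum-datapoints check alongside it.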

@astanway
Contributor Author

astanway commented Sep 7, 2013

I'm going to close this out, but can you please raise another issue with a case for TAIL_TIME and pragmatic resolution checking?
