[Feature] multiple methods of outlier identification #647
Code Climate has analyzed commit d79318f and detected 0 issues on this pull request.
Codecov Report
@@ Coverage Diff @@
## dev #647 +/- ##
==========================================
- Coverage 53.64% 53.55% -0.09%
==========================================
Files 269 269
Lines 12010 12032 +22
==========================================
+ Hits 6443 6444 +1
- Misses 5567 5588 +21
Thanks @danibene, sorry I was on holidays, I will review this and merge asap :)
@danibene I checked and somewhat simplified the function, as I was not sure about all the various arguments (such as the range multiplier and the difference between the percentile range and threshold). I tried to unify the arguments so that everything is controlled using |
Thanks @DominiqueMakowski ! It looks a lot cleaner now :-) One small comment about lines 92-93: since exclude is not an array/list/tuple, there should be no indexing, right?
Without the extra arguments (e.g. the range multiplier), we can only have the threshold based directly on the percentile rather than, for example, 1.5*interquartile range + the 75th percentile (as in the "quartiles" method in MATLAB's isoutlier).
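To make the difference concrete, here is a minimal sketch using only Python's standard library. Both helper names (`percentile_bounds`, `iqr_fences`) and the `multiplier` argument are made up for illustration and are not part of this PR or of the library's API; the quartile interpolation used by `statistics.quantiles` may also differ slightly from MATLAB's.

```python
from statistics import quantiles


def percentile_bounds(values, lower_pct=25, upper_pct=75):
    """Bounds taken directly from two percentiles (hypothetical helper).

    lower_pct and upper_pct must be integers between 1 and 99.
    """
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def iqr_fences(values, multiplier=1.5):
    """Fences at Q1 - k * IQR and Q3 + k * IQR (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Percentile-only thresholding: everything outside the two percentiles
p_low, p_high = percentile_bounds(data)
outside_percentiles = [x for x in data if x < p_low or x > p_high]

# Range-multiplier thresholding: everything outside the widened fences
f_low, f_high = iqr_fences(data)
outside_fences = [x for x in data if x < f_low or x > f_high]
```

With the 25th and 75th percentiles used directly as bounds, six of the ten values fall outside (3.25, 7.75); with the fences at (-3.5, 14.5), only 50 does.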
good catch!
Hmm, I'd say for now it's better to keep the function fairly generic; this method seems rather MATLAB-esque and oddly specific. If users want to mimic this behaviour they can probably do it on their side, as it's not too hard to do. If that's okay with you I'll merge :)
to remove references to functions that do not exist anymore (_find_outliers_standardize and _find_outliers_percentile)
I have seen the 1.5*IQR rule outside of MATLAB (e.g. https://online.stat.psu.edu/stat200/lesson/3/3.2), but keeping things simple for now makes sense to me.
Yes please do! Thank you for your help with this :-)
Thanks, I didn't know that. Well, if in the future we see that there's demand for this method, we can always think about adding it in :) Merging now, thanks again!
Description
Allow for different methods of outlier identification based on standardization or percentiles.
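As a rough illustration of the two families of methods, the sketch below dispatches on a `method` argument. It is not the code from this PR: the function name `flag_outliers` and the parameters `z_threshold`, `tail_pct` and `robust` are assumptions chosen for the example, and the robust branch uses the median and the scaled median absolute deviation (MAD), which is one common choice rather than necessarily what `standardize()` does.

```python
from statistics import mean, median, quantiles, stdev


def flag_outliers(values, method="standardize", z_threshold=3.0,
                  tail_pct=5, robust=False):
    """Return one boolean per value; True marks an outlier (illustrative only)."""
    if method == "standardize":
        if robust:
            center = median(values)
            # 1.4826 makes the MAD comparable to the SD for normal data
            scale = 1.4826 * median(abs(x - center) for x in values)
        else:
            center, scale = mean(values), stdev(values)
        if scale == 0:
            # No spread at all: nothing can be flagged by this criterion
            return [False] * len(values)
        return [abs(x - center) / scale > z_threshold for x in values]
    if method == "percentile":
        # tail_pct must be an integer between 1 and 49; values below the
        # tail_pct-th or above the (100 - tail_pct)-th percentile are flagged
        cuts = quantiles(values, n=100, method="inclusive")
        low, high = cuts[tail_pct - 1], cuts[100 - tail_pct - 1]
        return [x < low or x > high for x in values]
    raise ValueError(f"unknown method: {method!r}")


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
plain = flag_outliers(data)                      # mean/SD z-scores
robust = flag_outliers(data, robust=True)        # median/MAD z-scores
by_pct = flag_outliers(data, method="percentile")
```

On this sample the mean/SD version flags nothing at a threshold of 3, because the extreme value inflates the standard deviation, whereas the median/MAD and percentile versions both flag only the last value. That masking effect is one reason a robust option can be useful.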
Proposed Changes
_find_outliers_percentile(), defaults to identifying observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers, e.g. as in this paper
method in find_outliers(), calling _find_outliers_standardize() or _find_outliers_percentile()
standardize(), e.g. robust=True
Checklist
from ..stats import standardize
)