[Feature] multiple methods of outlier identification #647

danibene · 2022-05-27T02:09:55Z

Description

Allow for different methods of outlier identification based on standardization or percentiles.

Proposed Changes

Add new outlier identification method based on percentiles: _find_outliers_percentile(), defaults to identifying observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers, e.g. as in this paper
Add argument method in find_outliers() calling _find_outliers_standardize() or _find_outliers_percentile()
Pass keyword arguments to standardize() e.g. robust=True
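To make the two proposed methods concrete, here is a minimal NumPy sketch of the idea, not the code in this PR. The function name `find_outliers_sketch` and the `multiplier` and `threshold` parameters are hypothetical, and `robust=True` is assumed to mean centring on the median and scaling by the MAD.

```python
import numpy as np


def find_outliers_sketch(data, method="percentile", multiplier=1.5, threshold=3.0, robust=False):
    """Return a boolean mask marking outliers (illustrative only, not NeuroKit2's code)."""
    data = np.asarray(data, dtype=float)

    if method == "percentile":
        # Fences at Q1 - multiplier * IQR and Q3 + multiplier * IQR.
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        return (data < q1 - multiplier * iqr) | (data > q3 + multiplier * iqr)

    if method == "standardize":
        if robust:
            # Assumption: robust=True means median / MAD instead of mean / SD.
            center = np.median(data)
            scale = np.median(np.abs(data - center)) * 1.4826
        else:
            center = np.mean(data)
            scale = np.std(data, ddof=1)
        # Flag observations whose absolute z-score exceeds the threshold.
        return np.abs((data - center) / scale) > threshold

    raise ValueError(f"Unknown method: {method!r}")


values = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 14.0, 15.0, 40.0])
mask_iqr = find_outliers_sketch(values, method="percentile")
mask_robust = find_outliers_sketch(values, method="standardize", robust=True)
mask_plain = find_outliers_sketch(values, method="standardize")
```

On this toy array, the IQR fences and the robust z-score both flag only the value 40. The plain z-score flags nothing, because the extreme value inflates the standard deviation. That is one reason passing `robust=True` through to the standardization step can be useful.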

Checklist

I have read the CONTRIBUTING file.
My PR is targeted at the dev branch (and not towards the master branch).
I ran the CODE CHECKS on the files I added or modified and fixed the errors.
- Just in case someone else runs into this: I had an import error with black at first, but it’s all good now
- Also I decided to ignore the following pylint output since it concerned a line that I did not add (from ..stats import standardize)
```
************* Module find_outliers
find_outliers.py:4:0: E0402: Attempted relative import beyond top-level package (relative-beyond-top-level)
```
I have added the newly added features to News.rst (if applicable)
- Though I am confused about which version the changes should correspond to: for the previous PR, I listed the features under 0.1.6 in the News file, but I saw that they were part of release 0.2.0. Does that mean I should move the previous changes to 0.2.0 and this PR to 0.2.1?

codeclimate · 2022-05-27T02:10:37Z

Code Climate has analyzed commit d79318f and detected 0 issues on this pull request.

codecov-commenter · 2022-05-27T02:15:14Z

Codecov Report

Merging #647 (3074bca) into dev (b123c7b) will decrease coverage by 0.08%.
The diff coverage is 3.44%.

@@            Coverage Diff             @@
##              dev     #647      +/-   ##
==========================================
- Coverage   53.64%   53.55%   -0.09%     
==========================================
  Files         269      269              
  Lines       12010    12032      +22     
==========================================
+ Hits         6443     6444       +1     
- Misses       5567     5588      +21

| Impacted Files | Coverage Δ | |
|---|---|---|
| neurokit2/misc/find_outliers.py | `11.42% <3.44%> (-19.35%)` | ⬇️ |
| neurokit2/eda/eda_eventrelated.py | `100.00% <0.00%> (+2.04%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

DominiqueMakowski · 2022-05-31T11:08:50Z

Thanks @danibene, sorry I was on holidays, I will review this and merge asap :)

DominiqueMakowski · 2022-06-01T02:25:23Z

@danibene I checked and somewhat simplified the function, as I was not sure about all the various arguments (such as the range multiplier and the difference between the percentile range and threshold).

I tried to unify the arguments so that everything is controlled using exclude, and added examples. Can you check and let me know if there is any particular behaviour that is impossible to achieve using the current implementation? thanks

danibene · 2022-06-01T03:34:07Z

Thanks @DominiqueMakowski ! It looks a lot cleaner now :-)

One small comment about lines 92-93: since exclude is not an array/list/tuple, there should be no indexing, right?

            right = np.percentile(data, (1 - (exclude[1] / 2)) * 100)
            left = np.percentile(data, (exclude[0] / 2) * 100)
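For illustration, here is a sketch of how the two thresholds might look with `exclude` treated as a single proportion split evenly between both tails. The helper name `percentile_bounds` and the default of 0.05 are assumptions, not part of the PR.

```python
import numpy as np


def percentile_bounds(data, exclude=0.05):
    """Hypothetical helper: two-sided cut-offs that exclude `exclude` of the data in total."""
    data = np.asarray(data, dtype=float)
    # exclude is a scalar, so it is used directly (no exclude[0] / exclude[1]).
    left = np.percentile(data, (exclude / 2) * 100)
    right = np.percentile(data, (1 - (exclude / 2)) * 100)
    return left, right


data = np.arange(101)  # 0, 1, ..., 100
left, right = percentile_bounds(data, exclude=0.10)
# Observations outside [left, right] would be treated as outliers.
outliers = (data < left) | (data > right)
```

With `exclude=0.10`, the cut-offs sit at roughly the 5th and 95th percentiles, so about 10% of the observations fall outside them in total.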

Without the extra arguments (e.g. the range multiplier), we can only have the threshold based directly on the percentile rather than, for example, 1.5*interquartile range + the 75th percentile (as in the "quartiles" method in MATLAB's isoutlier).

DominiqueMakowski · 2022-06-01T04:11:05Z

> there should be no indexing right?

good catch!

> rather than, for example, 1.5*interquartile range + the 75th percentile (as in the "quartiles" method in MATLAB's isoutlier).

mmh I'd say for now it's better to keep the function fairly generic; this method seems to be somewhat very matlab-esque and oddly specific. If users want to mimic this behaviour they can probably do it on their side, as it's not too hard to do.

If that's okay with you I'll merge :)

danibene · 2022-06-01T12:02:33Z

> this method seems to be somewhat very matlab-esque and oddly specific. If users want to mimic this behaviour they can probably do it on their side, as it's not too hard to do.

I have seen the 1.5*IQR rule outside of MATLAB (e.g. https://online.stat.psu.edu/stat200/lesson/3/3.2), but keeping things simple for now makes sense to me.

> If that's okay with you I'll merge :)

Yes please do! Thank you for your help with this :-)

DominiqueMakowski · 2022-06-01T13:56:09Z

> I have seen the 1.5*IQR rule outside of MATLAB

Thanks, I didn't know that. Well if in the future we see that there's demand for this method we can always think about adding it in :)

Merging now, thanks again!

danibene and others added 5 commits May 26, 2022 11:00

- add outlier identification method based on percentiles (1255664)
- pass keyword arguments to standardize function (d62e082)
- cosmetic changes to find_outliers.py (3c1dc8f)
- update NEWS.rst (bca1b5d)
- hopefully update NEWS.rst in the right place now (d79318f)

pull-request-size bot added the size/L label May 27, 2022

DominiqueMakowski changed the base branch from master to dev June 1, 2022 00:55

DominiqueMakowski added 2 commits June 1, 2022 08:57

- Merge branch 'dev' into pr/647 (8647526)
- rework (b749408)
- fix (3074bca)
- change docstring for keyword arguments (451dec1): to remove references to functions that do not exist anymore (_find_outliers_standardize and _find_outliers_percentile)

DominiqueMakowski merged commit cf3535d into neuropsychology:dev Jun 1, 2022

danibene mentioned this pull request Jun 6, 2022

Adapt signal_fixpeaks to deal with larger gaps in data #650

Closed

danibene deleted the feature/expand_outlier_identification branch August 18, 2022 03:53
