Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] multiple methods of outlier identification #647

Conversation

danibene
Copy link
Collaborator

Description

Allow for different methods of outlier identification based on standardization or percentiles.

Proposed Changes

  • Add new outlier identification method based on percentiles: _find_outliers_percentile(), defaults to identifying observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers, e.g. as in this paper
  • Add argument method in find_outliers() calling _find_outliers_standardize() or _find_outliers_percentile()
  • Pass keyword arguments to standardize() e.g. robust=True

Checklist

  • I have read the CONTRIBUTING file.
  • My PR is targetted at the dev branch (and not towards the master branch).
  • I ran the CODE CHECKS on the files I added or modified and fixed the errors.
    • Just in case someone else runs into this: I had an import error with black at first, but it’s all good now
    • Also I decided to ignore the following pylint output since it concerned a line that I did not add (from ..stats import standardize)
      ************* Module find_outliers
      find_outliers.py:4:0: E0402: Attempted relative import beyond top-level package (relative-beyond-top-level)
      
  • I have added the newly added features to News.rst (if applicable)
    • Though I am confused about which version the changes should correspond to: for the previous PR, I listed the features under 0.1.6 in the News file, but I saw that they were part of release 0.2.0. Does that mean I should move the previous changes to 0.2.0 and this PR to 0.2.1?

@codeclimate
Copy link

codeclimate bot commented May 27, 2022

Code Climate has analyzed commit d79318f and detected 0 issues on this pull request.

View more on Code Climate.

@codecov-commenter
Copy link

codecov-commenter commented May 27, 2022

Codecov Report

Merging #647 (3074bca) into dev (b123c7b) will decrease coverage by 0.08%.
The diff coverage is 3.44%.

@@            Coverage Diff             @@
##              dev     #647      +/-   ##
==========================================
- Coverage   53.64%   53.55%   -0.09%     
==========================================
  Files         269      269              
  Lines       12010    12032      +22     
==========================================
+ Hits         6443     6444       +1     
- Misses       5567     5588      +21     
Impacted Files Coverage Δ
neurokit2/misc/find_outliers.py 11.42% <3.44%> (-19.35%) ⬇️
neurokit2/eda/eda_eventrelated.py 100.00% <0.00%> (+2.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b123c7b...3074bca. Read the comment docs.

@DominiqueMakowski
Copy link
Member

Thanks @danibene, sorry I was on holidays, I will review this and merge asap :)

@DominiqueMakowski DominiqueMakowski changed the base branch from master to dev June 1, 2022 00:55
@DominiqueMakowski
Copy link
Member

@danibene I checked and somewhat simplified the function, as I was not sure about all the various arguments (such as the range multiplier and the difference between the percentile range and threshold).

I tried to unify the arguments so that everything is controlled using exclude, and added examples. Can you check and let me know if there is any particular behaviour that is impossible to achieve using the current implementation? thanks

@danibene
Copy link
Collaborator Author

danibene commented Jun 1, 2022

Thanks @DominiqueMakowski ! It looks a lot cleaner now :-)

One small comment about lines 92-93, since exclude is not an array/list/tuple, there should be no indexing right?

            right = np.percentile(data, (1 - (exclude[1] / 2)) * 100)
            left = np.percentile(data, (exclude[0] / 2) * 100)

Without the extra arguments (e.g. the range multiplier), we can only have the threshold based directly on the percentile rather than, for example, 1.5*interquartile range + the 75th percentile (as in the "quartiles" method in MATLAB's isoutlier).

@DominiqueMakowski
Copy link
Member

there should be no indexing right?

good catch!

rather than, for example, 1.5*interquartile range + the 75th percentile (as in the "quartiles" method in MATLAB's isoutlier).

mmh I'd say for now it's better to keep the function fairly generic; this method seems to be somewhat very matlab-esque and oddly specific. If users want to mimic this behaviour they can probably do it on their side, as it's not too hard to do.

If that's okay with you I'll merge :)

to remove references to functions that do not exist anymore (_find_outliers_standardize and _find_outliers_percentile)
@danibene
Copy link
Collaborator Author

danibene commented Jun 1, 2022

this method seems to be somewhat very matlab-esque and oddly specific. If users want to mimic this behaviour they can probably do it on their side, as it's not too hard to do.

I have seen the 1.5*IQR rule outside of MATLAB (e.g. https://online.stat.psu.edu/stat200/lesson/3/3.2), but keeping things simple for now makes sense to me.

If that's okay with you I'll merge :)

Yes please do! Thank you for your help with this :-)

@DominiqueMakowski
Copy link
Member

I have seen the 1.5*IQR rule outside of MATLAB

Thanks, I didn't know that. Well if in the future we see that there's demand for this method we can always think about adding it in :)

Merging now, thanks again!

@DominiqueMakowski DominiqueMakowski merged commit cf3535d into neuropsychology:dev Jun 1, 2022
@danibene danibene deleted the feature/expand_outlier_identification branch August 18, 2022 03:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants