Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kNN Imputation #54

Merged
merged 1 commit into from
Mar 19, 2020
Merged

kNN Imputation #54

merged 1 commit into from
Mar 19, 2020

Conversation

appleparan
Copy link
Contributor

@appleparan appleparan commented Feb 18, 2020

Inspired by SVD imputation (#16), I implemented KNN imputation which closes #4

Reference

  • Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

@codecov
Copy link

codecov bot commented Feb 18, 2020

Codecov Report

Merging #54 into master will decrease coverage by 0.71%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   96.69%   95.97%   -0.72%     
==========================================
  Files          12       13       +1     
  Lines         272      298      +26     
==========================================
+ Hits          263      286      +23     
- Misses          9       12       +3
Impacted Files Coverage Δ
src/Impute.jl 57.14% <0%> (-6.02%) ⬇️
src/imputors.jl 100% <100%> (ø) ⬆️
src/imputors/knn.jl 95.83% <95.83%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4546b9a...86d0597. Read the comment docs.

@appleparan appleparan changed the title KNN Imputation kNN Imputation Feb 18, 2020
@appleparan
Copy link
Contributor Author

appleparan commented Feb 18, 2020

I have no idea why build fails in CI on 1.0. Does anyone helps me? Updating Manifest solved this problem.

@appleparan
Copy link
Contributor Author

In my opinion, broken tests should be solved in another PR.

Copy link
Member

@rofinn rofinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing! I've made a few comments that should be addressed before I can merge this.

Manifest.toml Outdated Show resolved Hide resolved
Project.toml Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
@rofinn
Copy link
Member

rofinn commented Feb 20, 2020

Since you seem to be comparing Impute.jl against R anyways, would you mind using RCall (or PyCall) to compare your code against an existing implementation (e.g., https://www.rdocumentation.org/packages/bnstruct/versions/1.0.6/topics/knn.impute, https://github.com/iskandr/fancyimpute)?

src/imputors/knn.jl Outdated Show resolved Hide resolved
@appleparan
Copy link
Contributor Author

I will add RCall Later because this could be applied not only kNN, but SVD. We need to merge SVD and kNN implementation first, then making tests more structural, and adding RCall should be the last one.

@rofinn
Copy link
Member

rofinn commented Feb 24, 2020

Okay, I've merged the svd implementation, if you want to rebase. I'm guessing most of the conflicts will get automerged.

@appleparan
Copy link
Contributor Author

Okay, I merged with master. I will rebase it. Thank you so much for merging other branches.

@appleparan appleparan force-pushed the ap/knn branch 2 times, most recently from 3389ac4 to 1173008 Compare February 25, 2020 14:39
@appleparan
Copy link
Contributor Author

I rebased it into single commit. After passing CI, you can merge it.

@appleparan
Copy link
Contributor Author

appleparan commented Mar 12, 2020

I rewrited code, however, I found some problems.

  1. Tests are unreliable. Sometimes it failed. Therefore, I ran tests multiple times (x100) and then averaged results.
  2. Added tests based on iris dataset.
  • Ref : P. Schimitt, et. al, A comparison of six methods for missing data imputation
  1. I changed order of original data and imputed data on nrmsd to make dividend for normalising to be always same.
  2. Test with random variables I'm not sure we need this. This doesn't guarantee knn imputation always be better than mean imputation due to randomness. What about to use other datasets such as some of in RDatasets?
  3. Mean imputation doesn't fill all missing values sometimes. It happens iris dataset usually. I still not figured out what's the problem (my test code or mean(Fill) imputation itself)

@appleparan appleparan force-pushed the ap/knn branch 2 times, most recently from c67e2fd to fe14cb7 Compare March 12, 2020 23:45
Project.toml Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
src/imputors/knn.jl Outdated Show resolved Hide resolved
@rofinn
Copy link
Member

rofinn commented Mar 16, 2020

Mean imputation doesn't fill all missing values sometimes. It happens iris dataset usually. I still not figured out what's the problem (my test code or mean(Fill) imputation itself)

Can you provide a reproducible example for this?

Test with random variables I'm not sure we need this. This doesn't guarantee knn imputation always be better than mean imputation due to randomness. What about to use other datasets such as some of in RDatasets?

That's correct, the random variable tests from svd was to confirm that poor low rank approximations wouldn't cause the method to fail. In fact, the test confirms that poor performance is expected when there are just a couple of uncorrelated variables (e.g., @test knn_nrmsd > mean_nrmsd * 0.9 the 0.9 was just in case of runtime variability).

@appleparan
Copy link
Contributor Author

appleparan commented Mar 16, 2020

Mean imputation doesn't fill all missing values sometimes. It happens iris dataset usually. I still not figured out what's the problem (my test code or mean(Fill) imputation itself)

I figured out why. Mean imputation to transposed array is wrong. First, it gives wrong result. Second, if doing imputation on small number of columns like iris dataset and all columns are missing, mean value also be missing.

I fixed most of minor things. As you pointed out, float.(collect()) is bad part of my commit. I thought this would be needed for KDTree, but your suggestion is right. I don't need that. I will fix them and commit again.

Copy link
Member

@rofinn rofinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, just a couple more style / cleanup changes and then I think this is good as a first pass.

src/imputors/knn.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
@appleparan
Copy link
Contributor Author

Travis was down. Would you retrigger?

If you feel okay, please let me know. I will rebase it.

@appleparan
Copy link
Contributor Author

Test fails due to #61

@rofinn
Copy link
Member

rofinn commented Mar 19, 2020

Yeah, we should also drop appveyor anyways.

@rofinn rofinn self-requested a review March 19, 2020 16:22
Copy link
Member

@rofinn rofinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for bearing with the review process. If there are any remaining issues they can be fixed in a future release. Might be good in the future to do a comparison table of the different methods on different datasets w/ missings.

@rofinn rofinn merged commit 26e50c7 into invenia:master Mar 19, 2020
@appleparan appleparan deleted the ap/knn branch March 20, 2020 05:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nearest Neighbour implementation
2 participants