SVD imputation #16

rofinn · 2019-01-02T23:51:37Z

An implementation of SVD imputation that uses an EM-based algorithm.

Steps:

1. Missing values are initially imputed with mean imputation.
2. The SVD is computed for the initialized dataset and a low-rank approximation is generated.
3. The original missing values are replaced with the corresponding values in the approximation.
4. The previous 2 steps are repeated until convergence (new approximations change little from the existing values) or the max iterations are reached.
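
For readers who want to see the loop spelled out, here is a minimal, dependency-free sketch of the steps above. It is not the code in src/imputors/svd.jl: it is written in Python rather than Julia, it uses a fixed rank instead of the gradually increasing rank this PR uses, and it gets the low-rank approximation from power iteration with deflation instead of a library SVD. All function and parameter names (`top_component`, `low_rank_approx`, `svd_impute`, `rank`, `tol`, `max_iter`) are made up for illustration, missing values are assumed to be marked with `None`, and every column is assumed to have at least one observed value.

```python
import math
import random


def top_component(a, iters=100, seed=0):
    """Return (A v, v) for the leading right singular vector v of `a`,
    found by power iteration on A^T A. The rank-1 term is then (A v) v^T."""
    n_rows, n_cols = len(a), len(a[0])
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(iters):
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to approximate
            return [0.0] * n_rows, [0.0] * n_cols
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return av, v


def low_rank_approx(a, rank):
    """Sum of the `rank` leading singular components, peeled off one at a time."""
    n_rows, n_cols = len(a), len(a[0])
    residual = [row[:] for row in a]
    approx = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(rank):
        av, v = top_component(residual)
        for i in range(n_rows):
            for j in range(n_cols):
                term = av[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term
    return approx


def svd_impute(data, rank=1, tol=1e-10, max_iter=200):
    """Return a copy of `data` with None entries filled by iterated low-rank fits."""
    n_rows, n_cols = len(data), len(data[0])
    missing = [(i, j) for i in range(n_rows) for j in range(n_cols)
               if data[i][j] is None]
    filled = [row[:] for row in data]
    # Step 1: initialize each missing value with its column mean.
    for j in range(n_cols):
        observed = [data[i][j] for i in range(n_rows) if data[i][j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(n_rows):
            if data[i][j] is None:
                filled[i][j] = col_mean
    for _ in range(max_iter):
        # Step 2: low-rank approximation of the current filled matrix.
        approx = low_rank_approx(filled, rank)
        # Step 3: overwrite only the originally missing entries.
        change = sum((approx[i][j] - filled[i][j]) ** 2 for i, j in missing)
        for i, j in missing:
            filled[i][j] = approx[i][j]
        # Step 4: stop once the imputed values barely move.
        if change < tol:
            break
    return filled
```

On an exactly rank-1 matrix with a couple of entries removed, `svd_impute(data, rank=1)` should bring the removed entries back close to their original values, while the observed entries are left untouched.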

Currently, the rank of the approximations increases gradually, but I'm open to other suggestions (references welcome). This PR also includes a couple of smoke tests that document what types of data this method works well on: for example, datasets with a large number of correlated variables where a small subset of the eigenvalues explains most of the variance.

TODO:

Add documentation
Add a reference
Support different initialization methods?
Include some checks to warn when the provided dataset doesn't fit the model (e.g., too few variables, poor low rank approximations)?
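
One possible shape for the last item, as a rough sketch only: measure how much of the centered data's variance the leading singular component captures and warn when that share is low. Nothing here comes from this PR. The names `leading_variance_share` and `warn_if_poor_fit` and the `min_share` threshold are hypothetical, the input is assumed to be a complete matrix (for example after the mean initialization), and a real check would look at more than one component.

```python
import math
import warnings


def leading_variance_share(matrix, iters=200):
    """Fraction of the column-centered variance captured by the top singular value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    total = sum(x * x for row in centered for x in row)
    if total == 0.0:  # constant columns: no variance to explain
        return 0.0
    # Power iteration on C^T C from an arbitrary non-zero start vector.
    v = [1.0 + j for j in range(n_cols)]
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in centered]
        w = [sum(centered[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # start vector happened to lie in the null space
            return 0.0
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in centered]
    return sum(x * x for x in u) / total  # sigma_1^2 / ||C||_F^2


def warn_if_poor_fit(matrix, min_share=0.5):
    """Warn when a rank-1 model explains less than `min_share` of the variance."""
    share = leading_variance_share(matrix)
    if share < min_share:
        warnings.warn(
            f"leading component explains only {share:.0%} of the variance; "
            "a low-rank model may fit this dataset poorly"
        )
    return share
```

A check along these lines would stay quiet on strongly correlated columns and warn on data such as an identity matrix, where no single direction dominates.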

Closes #7

src/imputors/svd.jl

codecov · 2019-01-03T22:41:43Z

Codecov Report

Merging #16 into master will increase coverage by 0.1%.
The diff coverage is 97.43%.

@@            Coverage Diff            @@
##           master      #16     +/-   ##
=========================================
+ Coverage   96.58%   96.69%   +0.1%     
=========================================
  Files          11       12      +1     
  Lines         234      272     +38     
=========================================
+ Hits          226      263     +37     
- Misses          8        9      +1

Impacted Files	Coverage Δ
src/imputors.jl	`100% <100%> (ø)`	⬆️
src/imputors/svd.jl	`100% <100%> (ø)`
src/Impute.jl	`63.15% <50%> (-1.55%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a5e81f3...d7c0220.

appleparan · 2020-02-14T09:44:30Z

Bump! Any update for this PR?

appleparan · 2020-02-19T08:18:57Z

May I extend your svd branch to fit the current API? This branch gave me insight into how to test multivariate imputation and how to write such methods to fit the Impute.jl structure.

Let's leave the TODO list to be completed in the future (it has gotten quite old) and just push the PR if the tests pass.
By the way, the following reference seems to be okay for the svd imputation method.

Reference

Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

rofinn · 2020-02-19T17:16:18Z

I've been reluctant to push this in because we need to do some refactoring of the API. In general, we have several competing interests for how the imputation API should work.

Iterator vs Arrays: iterative methods (e.g., SVD) only really make sense with Arrays, but that API doesn't work well for streaming data.
Mutable vs Immutable methods: Many methods can simply update the data in-place which is more efficient, but that largely depends on the methods and data types being used. For example, the Tables API makes no assertions about mutability.
Methods like multiple imputation will require a custom data structure for efficiently saving and accessing the imputed values and original dataset.
Automated tooling for testing seems like something we should support, both for our own tests and to make experimentation easier for others. Do we implement this tooling for refactoring the above API or wait to see what it looks like?
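
To make the mutable vs immutable trade-off concrete, here is a small sketch in Python. It is not Impute.jl's API: `fill_mean_inplace` and `fill_mean` are hypothetical names, `None` stands in for a missing value, and the column is assumed to have at least one observed value. The pairing loosely mirrors the Julia convention of a mutating `f!` alongside a copying `f`.

```python
from statistics import mean


def fill_mean_inplace(column):
    """Overwrite missing entries (None) in `column` with the observed mean."""
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    for i, x in enumerate(column):
        if x is None:
            column[i] = fill
    return column


def fill_mean(column):
    """Non-mutating variant: copy first, then reuse the in-place method."""
    return fill_mean_inplace(list(column))
```

The in-place version avoids an allocation but only works when the container is mutable; the copying version works on any iterable input at the cost of an extra copy, which is the tension described in the list above.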

I think you're right that we should probably just update this PR to get it working well enough to tag a release. It'll just mean that the refactoring will require more work.

appleparan · 2020-02-19T17:51:19Z

What you mentioned needs too much work. This PR has been open for a year, and if we wait for that list to be implemented it will take forever.

Some methods would not be applicable to all types, so let's restrict the supported types to Arrays and extend later if possible.

rofinn · 2020-02-19T22:39:08Z

> This PR has been open for a year, and if we wait for that list to be implemented it will take forever.

That seems like a bit of an exaggeration, but sure.

> Some methods would not be applicable to all types, so let's restrict the supported types to Arrays and extend later if possible.

That's largely what we've been trying to do, but folks have been confused about which permutations are supported in the past, so simply leaving it as a method error doesn't seem ideal.

This PR was largely an experiment on my part and I tended to have mixed results in terms of performance. Is there an application where you're wanting to use this SVD method or is it just that you'd like to have more methods generally available in this package?

appleparan · 2020-02-19T23:22:24Z

My answer is both.

I want to impute some time series while preprocessing my research data. I use Julia as my main project language. I could use R or Python just for preprocessing, but I prefer an integrated workflow from preprocessing to postprocessing. That's why I want to use this package: it seems to be the only usable package for imputation.

Then I found this PR. I like this project because it is so simple to use. However, it only offers simple methods (univariate imputation). I'd like this package to grow into a bigger project like mice in R, so I thought it would be great if several imputation methods such as SVD, kNN, and bPCA were implemented in this repo.

rofinn · 2020-02-21T21:58:26Z

Alright, most of the TODO items are resolved. I'm fine to merge this as is, but I'll likely need to revisit this during some upcoming refactoring.

ararslan reviewed Jan 2, 2019

src/imputors/svd.jl (outdated, resolved)

rofinn force-pushed the rf/svd branch from 515b6a0 to 6b1869d Compare January 3, 2019 23:00

appleparan added a commit to appleparan/Impute.jl that referenced this pull request Feb 18, 2020

Simple implementation of KNN imputation

2d09bb9

* inspired by SVD imputation (invenia#16)

appleparan mentioned this pull request Feb 18, 2020

kNN Imputation #54

Merged

appleparan added a commit to appleparan/Impute.jl that referenced this pull request Feb 18, 2020

Simple implementation of KNN imputation

6132261

* inspired by SVD imputation (invenia#16)

appleparan added a commit to appleparan/Impute.jl that referenced this pull request Feb 18, 2020

Simple implementation of KNN imputation

905d431

* inspired by SVD imputation (invenia#16)

rofinn added 6 commits February 21, 2020 14:50

Basic implementation of SVD imputation.

b3b6a22

SVD rank should always be less than number of variables.

af5abb1

Minor svd fixes and better smoke tests.

101cfc4

Use abs2 for sum of squares.

0bad4bf

Due to performance variability across systems and julia versions we'll Loosen tests a little.

eec8630

Update svd docstrings.

d7c0220

rofinn force-pushed the rf/svd branch from 6b1869d to d7c0220 Compare February 21, 2020 21:45

rofinn merged commit 4546b9a into master Feb 24, 2020
