SVD imputation #16

Merged
merged 6 commits into from
Feb 24, 2020

rofinn
Member

@rofinn rofinn commented Jan 2, 2019

An implementation of SVD imputation that uses an EM-based algorithm.

Steps:

  1. Missing values are initially filled in using mean imputation.
  2. The SVD is computed for the initialized dataset and a low-rank approximation is generated.
  3. The original missing values are replaced with the corresponding values in the approximation.
  4. The previous two steps are repeated until convergence (new approximations change little from the existing values) or the maximum number of iterations is reached.
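
The steps above can be sketched roughly as follows. This is a NumPy illustration only, not the Julia code in this PR: the function name `svd_impute`, its parameters, the fixed `rank`, and the max-absolute-change stopping rule are all assumptions made for the example. It also assumes every column has at least one observed value.

```python
import numpy as np

def svd_impute(X, rank=1, max_iter=100, tol=1e-8):
    """Fill NaN entries of a 2-D array via iterated low-rank SVD approximations.

    Hypothetical helper for illustration; assumes each column has at
    least one observed (non-NaN) value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initialise the missing entries with their column means.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])

    for _ in range(max_iter):
        # Step 2: rank-`rank` approximation from the truncated SVD.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

        # Step 3: overwrite only the originally missing entries.
        previous = filled[missing]  # boolean indexing returns a copy
        filled[missing] = approx[missing]

        # Step 4: stop once the imputed values barely change.
        if previous.size == 0 or np.max(np.abs(filled[missing] - previous)) < tol:
            break

    return filled
```

Calling `svd_impute(data, rank=1)` on an array whose gaps are marked with `NaN` returns a copy in which the observed entries are left untouched and only the gaps are filled.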

Currently, the rank of the approximations increases gradually, but I'm open to other suggestions (and references). This PR also includes a couple of smoke tests to self-document what types of data this method works well on, for example datasets with a large number of correlated variables where a small subset of the eigenvalues explains most of the variance.

TODO:

  • Add documentation
  • Add a reference
  • Support different initialization methods?
  • Include some checks to warn when the provided dataset doesn't fit the model (e.g., too few variables, poor low rank approximations)?
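
One possible shape for the last TODO item is to check how much of the variance the leading singular values explain before trusting a low-rank approximation. The sketch below is a NumPy illustration under assumptions, not part of this PR: the helper names `explained_variance_ratio` and `low_rank_fit_is_poor` and the `threshold` default are hypothetical, and the input is assumed to contain no missing values (for instance, the mean-initialized matrix).

```python
import numpy as np

def explained_variance_ratio(X, rank):
    """Fraction of the variance of column-centred X captured by the top `rank` singular values."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)  # singular values, descending
    total = np.sum(s ** 2)
    if total == 0.0:
        return 1.0  # constant columns: nothing left to explain
    return float(np.sum(s[:rank] ** 2) / total)

def low_rank_fit_is_poor(X, rank, threshold=0.9):
    """Hypothetical check: True when a rank-`rank` model explains less than `threshold`."""
    return explained_variance_ratio(X, rank) < threshold
```

A caller could emit a warning when `low_rank_fit_is_poor` returns `True`; the right threshold would depend on the data and is left open here.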

Closes #7

@codecov

codecov bot commented Jan 3, 2019

Codecov Report

Merging #16 into master will increase coverage by 0.1%.
The diff coverage is 97.43%.


@@            Coverage Diff            @@
##           master      #16     +/-   ##
=========================================
+ Coverage   96.58%   96.69%   +0.1%     
=========================================
  Files          11       12      +1     
  Lines         234      272     +38     
=========================================
+ Hits          226      263     +37     
- Misses          8        9      +1

Impacted Files         Coverage Δ
src/imputors.jl        100% <100%> (ø) ⬆️
src/imputors/svd.jl    100% <100%> (ø)
src/Impute.jl          63.15% <50%> (-1.55%) ⬇️

Last update a5e81f3...d7c0220.

@appleparan
Contributor

Bump! Any update for this PR?

appleparan added a commit to appleparan/Impute.jl that referenced this pull request Feb 18, 2020
* inspired by SVD imputation (invenia#16)
@appleparan appleparan mentioned this pull request Feb 18, 2020
@appleparan
Contributor

May I extend your svd branch to match the current API? This branch showed me how to test multivariate imputation and how to write such methods so that they fit the Impute.jl structure.

Let's leave the remaining TODO items for later (the list is quite old by now) and merge the PR once the tests pass.
By the way, the following reference seems suitable for the SVD imputation method.

Reference

  • Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

@rofinn
Member Author

rofinn commented Feb 19, 2020

I've been reluctant to push this in because we need to do some refactoring of the API. In general, we have several competing interests for how the imputation API should work.

  • Iterator vs Arrays: iterative methods (e.g., SVD) only really make sense with Arrays, but that API doesn't work well for streaming data.
  • Mutable vs Immutable methods: Many methods can simply update the data in-place which is more efficient, but that largely depends on the methods and data types being used. For example, the Tables API makes no assertions about mutability.
  • Methods like multiple imputation will require a custom data structure for efficiently saving and accessing the imputed values and original dataset.
  • Automated tooling for testing seems like something we should support, both for our own tests and to make experimentation easier for others. Do we implement this tooling for refactoring the above API or wait to see what it looks like?

I think you're probably right that we should just update this PR to get it working well enough to tag a release. It'll just mean that the refactoring will require more work.

@appleparan
Contributor

What you mentioned needs too much work. This PR has been open for a year, and if we wait for that list to be implemented it will take forever.

Some methods will not apply to all types, so let's restrict support to Arrays for now and extend it later if possible.

@rofinn
Member Author

rofinn commented Feb 19, 2020

> This PR has been open for a year, and if we wait for that list to be implemented it will take forever.

That seems like a bit of an exaggeration, but sure.

> Some methods will not apply to all types, so let's restrict support to Arrays for now and extend it later if possible.

That's largely what we've been trying to do, but folks have been confused about which permutations are supported in the past, so simply leaving it as a method error doesn't seem ideal.

This PR was largely an experiment on my part and I tended to have mixed results in terms of performance. Is there an application where you're wanting to use this SVD method or is it just that you'd like to have more methods generally available in this package?

@appleparan
Contributor

My answer is both.

I want to impute some time series while preprocessing my research data. Julia is my main project language. I could use R or Python just for preprocessing, but I prefer a single integrated workflow from preprocessing to postprocessing. That's why I want to use this package; it seems to be the only usable package for imputation.

Then I found this project, and I liked it because it is so simple to use. However, it only offers simple methods (univariate imputation). I would like this package to grow into a bigger project like mice in R, so I thought it would be great if several imputation methods such as SVD, kNN, and bPCA were implemented in this repo.

@rofinn
Member Author

rofinn commented Feb 21, 2020

Alright, most of the TODO items are resolved. I'm fine to merge this as is, but I'll likely need to revisit it during some upcoming refactoring.

@rofinn rofinn merged commit 4546b9a into master Feb 24, 2020