Multiple imputation methods: Performance #782

Open
eroell opened this issue Aug 1, 2024 · 6 comments
Labels
performance Performance

Comments

@eroell
Collaborator

eroell commented Aug 1, 2024

Question

Within ehrapy, we have

  • MissForestImputation
  • MiceForestImputation

as multiple imputation (MI) methods so far. MI methods are typically computationally expensive, but many benchmarks have shown them to give the best imputation performance. However, they are too slow for our large datasets on CPU, and we don't want to force users to use a GPU.

We should profile these two methods and check:

  • Are runtime and memory consumption comparable when the methods are used directly versus when they are called from ehrapy?
  • Are there bottlenecks caused by non-optimized code?
  • (Is it possible to perform this lazily by using dask?)
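A minimal sketch of how the first check could be set up, using only the standard library. `plain_impute` and `wrapped_impute` are hypothetical stand-ins (not ehrapy or library functions); they would be replaced by the direct library call and the ehrapy wrapper. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory allocated inside native extensions may be missed.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


# Hypothetical stand-in for calling the imputation library directly.
def plain_impute(rows):
    return [[0.0 if v is None else v for v in row] for row in rows]


# Hypothetical stand-in for a wrapper that makes an extra copy of the data.
def wrapped_impute(rows):
    copied = [list(row) for row in rows]
    return plain_impute(copied)


# Toy data with some missing entries, represented as None.
data = [
    [None if (i + j) % 7 == 0 else float(i * j) for j in range(50)]
    for i in range(2000)
]

for name, func in [("plain", plain_impute), ("wrapped", wrapped_impute)]:
    _, seconds, peak = profile_call(func, data)
    print(f"{name}: {seconds:.4f} s, peak {peak / 1e6:.2f} MB")
```

Running both variants on the same input makes any overhead added by the wrapper visible as a difference in time or peak memory.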

eroell added the performance label Aug 1, 2024
@Zethson
Member

Zethson commented Aug 1, 2024

I wonder whether we can figure out new implementations of this where we don't impute every single value so expensively, but instead handle a few closely related ones together? Like calculating KNN first and then imputing a group of values? I know that this would be a new imputation method, but oh well. Maybe autoencoder imputation is also of interest? It is probably faster to train and use. We would need to look at benchmarks.
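One possible reading of the "KNN first, then impute a group of values" idea, as a rough pure-Python sketch: run a single neighbour search per row and reuse it to fill all of that row's missing entries at once. The function name `knn_impute` and the `None`-for-missing convention are assumptions made for illustration; this is not an existing ehrapy method.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries in each row with the mean over its k nearest donors.

    Distances use only the columns observed in both rows. All missing
    entries of a row are filled from the same neighbour ranking.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlapping columns: rank this row last
        return math.sqrt(sum(shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        # One neighbour search per row, reused for every missing column in it.
        neighbours = sorted(
            (idx for idx in range(len(rows)) if idx != i),
            key=lambda idx: distance(row, rows[idx]),
        )
        for j in missing:
            # Take the k closest rows that actually observe column j.
            donors = [rows[idx][j] for idx in neighbours if rows[idx][j] is not None][:k]
            if donors:
                imputed[i][j] = sum(donors) / len(donors)
    return imputed


rows = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [10.0, 20.0, 30.0],
]
print(knn_impute(rows, k=2))
```

The neighbour search is quadratic in the number of rows here, so this only illustrates the grouping idea and says nothing about whether it would be faster than the MI methods in practice.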

@eroell
Collaborator Author

eroell commented Aug 1, 2024

Yes, this can absolutely be extended to coming up with and adding more (well-performing) imputation strategies!

@Zethson
Member

Zethson commented Aug 1, 2024

@eroell
Collaborator Author

eroell commented Aug 1, 2024

Or even preparing larger synthetic datasets, or ones that are well known in the imputation literature, and comparing different methods (including new ones) on imputation performance, runtime, memory requirements, failure modes...

This would be not just an interesting notebook, but also a fast and convenient benchmark for others.

Like the imputation part of the bias notebook, but at a larger scale and focused on imputation.
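A minimal sketch of what such a benchmark loop could look like, using only the standard library: generate a complete synthetic dataset, hide a fraction of the values, impute, and score the imputer on the hidden entries together with its runtime. All function names here are hypothetical, and `mean_impute` is only a placeholder baseline where the real methods would be plugged in.

```python
import math
import random
import time


def make_dataset(n_rows, n_cols, seed=0):
    """Synthetic data with correlated columns (shared latent value plus noise)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_rows):
        latent = rng.gauss(0.0, 1.0)
        data.append([latent * (j + 1) + rng.gauss(0.0, 0.1) for j in range(n_cols)])
    return data


def mask_values(data, fraction, seed=0):
    """Return a copy with roughly `fraction` of the cells set to None, plus their positions."""
    rng = random.Random(seed)
    masked = [list(row) for row in data]
    hidden = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < fraction:
                row[j] = None
                hidden.append((i, j))
    return masked, hidden


def mean_impute(rows):
    """Placeholder baseline: replace None with the column mean of observed values."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def benchmark(imputer, truth, masked, hidden):
    """Return (RMSE on the hidden cells, runtime in seconds) for one imputer."""
    start = time.perf_counter()
    imputed = imputer(masked)
    seconds = time.perf_counter() - start
    if not hidden:
        return 0.0, seconds
    squared = [(imputed[i][j] - truth[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared)), seconds


truth = make_dataset(n_rows=500, n_cols=5)
masked, hidden = mask_values(truth, fraction=0.2)
rmse, seconds = benchmark(mean_impute, truth, masked, hidden)
print(f"mean imputation: RMSE={rmse:.3f}, {seconds:.4f} s")
```

Values are hidden completely at random here; other missingness mechanisms and memory measurements would need to be added for a fuller comparison.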

@Zethson
Member

Zethson commented Aug 1, 2024

MissForest with Extremely Randomized Trees can maybe be parallelized better
