Multiple imputation methods: Performance #782

eroell · 2024-08-01T12:43:21Z

Question

Within ehrapy, we have

MissForestImputation
MiceForestImputation

as multiple imputation (MI) methods so far. MI methods are typically computationally expensive, but many benchmarks have shown them to have the best imputation performance. However, they are simply too slow on CPU for our big datasets, and we don't want to force users to use a GPU.

We should profile these two methods and check:

Are runtime and memory consumption comparable when using the methods plainly versus when they are called from ehrapy?
Are there bottlenecks caused by non-optimized code?
(Is it possible to perform this lazily by using Dask?)
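
A minimal sketch of how the first check could be set up, using only the standard library. `profile_call` is a hypothetical helper, and `mean_impute` / `wrapped_mean_impute` are toy placeholders standing in for "the plain method" and "the method called through a wrapper"; none of them are ehrapy functions. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside native libraries may not show up.

```python
import time
import tracemalloc
from typing import Any, Callable


def profile_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> dict:
    """Run fn once; report wall-clock seconds and peak traced memory in bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak}


def mean_impute(rows):
    """Placeholder imputer (hypothetical): fill None with the column mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def wrapped_mean_impute(rows):
    """Placeholder for a wrapper layer that copies its input before delegating."""
    return mean_impute([list(r) for r in rows])


data = [[1.0, None, 3.0], [None, 2.0, 5.0], [3.0, 4.0, None]]
plain = profile_call(mean_impute, data)
wrapped = profile_call(wrapped_mean_impute, data)

for name, stats in (("plain", plain), ("wrapped", wrapped)):
    print(f"{name}: {stats['seconds']:.6f} s, peak {stats['peak_bytes']} bytes")
```

For a real comparison, the two callables would be the upstream imputer and the corresponding ehrapy call on the same dataset, ideally repeated several times to smooth out timing noise.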

Zethson · 2024-08-01T12:45:54Z

I wonder whether we can figure out new implementations of this where we don't impute every single value so crazily, but maybe a few closely related ones together? Like calculating KNN first and then imputing a group of values? I know that this would be a new imputation method, but oh well. Maybe autoencoder imputation is also of interest? It is probably faster to train and use. We would need to look at benchmarks.
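
One possible reading of the "KNN first, then impute a group of values" idea, as a rough pure-Python sketch: the neighbours of a row are ranked once, and all missing entries of that row are then filled from the same neighbour ranking. The function names, the distance definition, and the choice of `k` are assumptions for illustration, not an existing ehrapy method.

```python
import math


def distance(a, b):
    """Root-mean-square difference over features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no overlap: treat as maximally distant
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_group_impute(rows, k=2):
    """Fill all missing values of a row from one shared neighbour ranking."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        # Rank the other rows once per incomplete row, not once per missing value.
        neighbours = sorted(
            (idx for idx in range(len(rows)) if idx != i),
            key=lambda idx: distance(row, rows[idx]),
        )
        for j in missing:
            # Take the k closest neighbours that actually observed feature j.
            donors = [rows[n][j] for n in neighbours if rows[n][j] is not None][:k]
            if donors:
                imputed[i][j] = sum(donors) / len(donors)
    return imputed


rows = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, None],
    [5.0, 5.0, 50.0],
    [0.9, 1.1, 12.0],
]
filled = knn_group_impute(rows, k=2)
print(filled[1])
```

This is essentially KNN imputation with the neighbour search amortized per row; whether it is competitive with the MI methods would still need the benchmarks mentioned above.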

eroell · 2024-08-01T12:49:21Z

This can absolutely be stretched to coming up with and adding more (well-performing) imputation strategies, yes!

Zethson · 2024-08-01T12:51:48Z

https://arxiv.org/abs/1705.02737

eroell · 2024-08-01T12:52:25Z

Or even preparing larger synthetic datasets, or ones that are well known in the imputation literature, and comparing different methods (and new ones) for performance, runtime, memory requirements, failure modes...

Not just an interesting notebook, but also a fast and convenient benchmarking option for others.

Like the imputation part of the bias notebook, but bigger and focused on imputation.
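
A rough sketch of what the core of such a benchmark could look like: generate synthetic data, hide a fraction of known values, impute, and score the imputed cells against the hidden truth. Everything here is an assumption for illustration: the data generator, the missing-completely-at-random masking, and the two toy imputers, which are placeholders and not ehrapy functions.

```python
import math
import random


def make_synthetic(n_rows, n_cols, seed=0):
    """Correlated columns: each row shares a latent value plus small noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        base = rng.gauss(0.0, 1.0)
        rows.append([base + rng.gauss(0.0, 0.1) for _ in range(n_cols)])
    return rows


def mask_at_random(rows, fraction, seed=0):
    """Hide roughly `fraction` of the cells; return masked copy and hidden positions."""
    rng = random.Random(seed)
    masked = [list(r) for r in rows]
    hidden = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < fraction:
                hidden.append((i, j))
                row[j] = None
    return masked, hidden


def column_mean_impute(rows):
    """Toy baseline: fill with the column mean (0.0 if the column is empty)."""
    means = []
    for j in range(len(rows[0])):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs) if obs else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def row_mean_impute(rows):
    """Toy baseline: fill with the row mean (0.0 if the row is empty)."""
    out = []
    for r in rows:
        obs = [v for v in r if v is not None]
        fill = sum(obs) / len(obs) if obs else 0.0
        out.append([fill if v is None else v for v in r])
    return out


def rmse(truth, imputed, hidden):
    """Root-mean-square error restricted to the cells that were hidden."""
    return math.sqrt(
        sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden) / len(hidden)
    )


truth = make_synthetic(n_rows=200, n_cols=5, seed=0)
masked, hidden = mask_at_random(truth, fraction=0.2, seed=1)
scores = {
    "column_mean": rmse(truth, column_mean_impute(masked), hidden),
    "row_mean": rmse(truth, row_mean_impute(masked), hidden),
}
for name, score in scores.items():
    print(f"{name}: RMSE = {score:.3f}")
```

Runtime and memory measurements could be wrapped around each imputer call in the same loop, and other missingness mechanisms would need their own masking functions.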

Zethson · 2024-08-01T12:53:34Z

MissForest with Extremely Randomized Trees can maybe be parallelized better

Zethson · 2024-08-08T13:53:41Z

https://github.com/AnotherSamWilson/miceforest?tab=readme-ov-file#how-to-make-the-process-faster

eroell added the performance label Aug 1, 2024

eroell assigned nicolassidoux Aug 23, 2024
