Impractical GPU memory requirements #43

Open
SebastienTs opened this issue Jul 19, 2022 · 4 comments

Comments

@SebastienTs

SebastienTs commented Jul 19, 2022

While the library is indeed extremely fast, its GPU memory requirement is impractical on my setup: about 8 GB for a 1024x1024x19 image (16-bit) and a tiny 32x32x16 PSF. For images slightly larger than 1024x1024 (same number of Z slices), I can only run the code on an RTX 3090 (24 GB)!

The problem seems to stem from the FFT CUDA kernel. The error reported is:

tensorflow/stream_executor/cuda/cuda_fft.cc:253] failed to allocate work area.
tensorflow/stream_executor/cuda/cuda_fft.cc:430] Initialize Params: rank: 3 elem_count: 32 input_embed: 32 input_stride: 1 input_distance: 536870912 output_embed: 32 output_stride: 1 output_distance: 536870912 batch_count: 1
tensorflow/stream_executor/cuda/cuda_fft.cc:439] failed to initialize batched cufft plan with customized allocator:

Something is probably not right in the code. Does anybody know of a workaround?

@SebastienTs SebastienTs changed the title Inpractical GPU memory usage Impractical GPU memory usage Jul 20, 2022
@SebastienTs SebastienTs changed the title Impractical GPU memory usage Impractical GPU memory requirements Jul 20, 2022
@eric-czech
Member

Hey @SebastienTs, there are a number of reasons the memory usage is often way more than you might expect:

  1. The PSF is padded to the size of the image, so a smaller PSF does not reduce memory use
  2. TensorFlow FFT operations don't support sub-32-bit types (or at least they didn't when this was written)
  3. The image array is copied in intermediate states
  4. Other TensorFlow memory overhead (e.g. as observed in Estimation of required memory #32)
  5. Often most importantly, the default padding mode pushes all dimensions up to the next power of 2 (so in your case 1024x1024x19 would become 1024x1024x32 for both the image and the PSF).
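
As a rough back-of-the-envelope check of points 1, 2 and 5, the sketch below estimates the size of a single padded array. It assumes float32 real buffers, complex64 FFT buffers and plain power-of-2 padding; the helper names are hypothetical and are not part of this library.

```python
from math import prod


def next_pow2(n):
    """Smallest power of 2 that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def array_mib(shape, bytes_per_element):
    """Size of a dense array of the given shape, in MiB."""
    return prod(shape) * bytes_per_element / 2**20


image_shape = (1024, 1024, 19)
padded_shape = tuple(next_pow2(d) for d in image_shape)

raw_mib = array_mib(image_shape, 2)       # original 16-bit image
real_mib = array_mib(padded_shape, 4)     # padded, float32 (assumed)
complex_mib = array_mib(padded_shape, 8)  # padded, complex64 (assumed)

print(padded_shape)                    # (1024, 1024, 32)
print(raw_mib, real_mib, complex_mib)  # 38.0 128.0 256.0
```

Under these assumptions a single padded complex buffer is about 256 MiB, and the image, the padded PSF and each intermediate copy need a buffer of their own.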

I would suggest you try pad_mode='2357', a padding method added in #18 that is more memory-efficient but sometimes less computationally efficient.
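
To illustrate why this can help, here is a minimal sketch of what a '2357'-style padding computes, assuming it rounds each dimension up to the next size whose only prime factors are 2, 3, 5 and 7. The function name is hypothetical and this is not the library's actual implementation.

```python
def next_2357_size(n):
    """Smallest integer >= n whose prime factors are all in {2, 3, 5, 7}."""
    candidate = max(n, 1)
    while True:
        remainder = candidate
        # Divide out every allowed prime factor; 1 left over means success.
        for p in (2, 3, 5, 7):
            while remainder % p == 0:
                remainder //= p
        if remainder == 1:
            return candidate
        candidate += 1


image_shape = (1024, 1024, 19)
padded_shape = tuple(next_2357_size(d) for d in image_shape)
print(padded_shape)  # (1024, 1024, 20)
```

Under this assumption the Z axis grows from 19 to 20 rather than to 32, so the saving over power-of-2 padding is at most about 1.6x for this image. Any additional minimum padding the library may apply is not modeled here.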

Apart from that, the only other practical option is to chunk the arrays as in Tile-by-tile deconvolution using dask.ipynb.
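
The chunking idea can be sketched without any GPU code: split each axis into tiles that overlap by a margin, deconvolve each tile separately, then keep only the interior of each result. The sketch below only computes the tile boundaries along one axis; the function name, tile size and overlap are illustrative assumptions and are not taken from the linked notebook.

```python
def tile_bounds(length, tile, overlap):
    """Return (read_start, read_stop, keep_start, keep_stop) per tile on one axis.

    read_* is the padded range handed to the deconvolution;
    keep_* is the non-overlapping range written to the output.
    """
    bounds = []
    for keep_start in range(0, length, tile):
        keep_stop = min(keep_start + tile, length)
        read_start = max(keep_start - overlap, 0)
        read_stop = min(keep_stop + overlap, length)
        bounds.append((read_start, read_stop, keep_start, keep_stop))
    return bounds


for b in tile_bounds(2048, tile=512, overlap=16):
    print(b)
```

The overlap of 16 is only a placeholder; a margin on the order of the PSF extent is a common choice to limit artifacts at tile edges.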

@SebastienTs
Author

SebastienTs commented Jul 21, 2022

Thanks a lot for your reply! I had 1, 2 and 5 in mind, but even then, do you really believe that 3 and 4 could explain the remaining 30x memory overhead (from 270 MB to 8 GB)?

If that is the case I can sleep peacefully, but it sounds like an awful lot to me, and I want to make sure that nothing is misconfigured or extremely suboptimal for the TensorFlow version I am using...

I have not seen any noticeable reduction in memory usage by using pad_mode='2357' when invoking fd_restoration.RichardsonLucyDeconvolver.

I would happily consider the recommended cucim alternative, but unfortunately my code needs to run on a Windows box.

@eric-czech
Member

Hm, well, 10x wouldn't surprise me too much, but 30x does seem extreme. When it comes to potential TF issues I really have no idea.

You should take a look at this too if you haven't seen it: #42 (comment). Some of those alternatives to this library may be Windows friendly.

@joaomamede

joaomamede commented Jul 26, 2022

Have a look at my repo:

https://github.com/joaomamede/mamedelab_scripts/blob/main/notebooks/Google2021_Deconvolve_Live_gui.ipynb

I basically use dask to divide the images and assemble them again when the GPU mem is not enough.

This is the bioformats version (older, might need some tweaks):
https://github.com/joaomamede/mamedelab_scripts/blob/main/notebooks/2021deconvolve_live_bioformats.ipynb

They should be able to run on Google Colaboratory if you'd like to tweak them.

You also need the libraries at:
https://github.com/joaomamede/mamedelab_scripts/blob/main/notebooks/libraries/deco_libraries.py

Hope it helps.

I can do 2048x1024 times two on my 6 GB laptop.
2048x2048 usually needs 12 GB of VRAM.

The other option is to add the "RAM option", which shares RAM and VRAM and is still a lot faster than using only normal RAM.
