Not enough GPU ram ... make a "slow" option #31
Also tested whether … Coincidentally, I just noticed this: … So I guess I don't need to test with pycudadeconv either.
This is the code I need to look at:
fall back to Bluestein for arrays that are too large.
Turns out that was easy. The mode that gets passed into optimize_dims can be set when initializing the deconvolver class in flowdec using the named argument pad_mode. In the specific example mentioned above this allows deconvolution on the GPU. I haven't benchmarked it, but it is not much slower. However, I appear to obtain more artefacts at the image boundary.
Will add this as a command-line option.
You should always pad by at least a PSF extent. The FFT essentially “wraps” the edges together: if you don’t pad, what is at the top edge will bleed over and appear at the bottom. The best is “mirror” padding; zero padding will create ringing artifacts.
…-Dan
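Dan's point about the FFT "wrapping" the edges together can be seen in a tiny 1D NumPy sketch (illustrative only, not flowdec code): a bright feature at the last sample bleeds into the first samples when convolution is done via an unpadded FFT, while reflect ("mirror") padding keeps the blur local.

```python
import numpy as np

n = 16
sig = np.zeros(n)
sig[-1] = 1.0                      # bright feature at the "top" edge
kern = np.zeros(n)
kern[:3] = 1.0 / 3                 # small blur kernel anchored at index 0

# FFT-based convolution is circular: the spike at index 15 wraps
# around and bleeds into indices 0 and 1 at the opposite edge.
wrapped = np.real(np.fft.ifft(np.fft.fft(sig) * np.fft.fft(kern)))

# Reflect-padding by the kernel extent before the FFT keeps the
# blur local; cropping afterwards removes the pad again.
pad = 3
padded = np.pad(sig, pad, mode="reflect")
kern_p = np.zeros(n + 2 * pad)
kern_p[:3] = 1.0 / 3
blurred = np.real(
    np.fft.ifft(np.fft.fft(padded) * np.fft.fft(kern_p))
)[pad:-pad]
# wrapped[0] is nonzero (edge bleed); blurred[0] stays zero
```

In the unpadded case the energy from index 15 shows up at indices 0 and 1; with reflect padding it spreads into the pad region instead and is cropped away.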
@dmilkie I will need to check a few more things in the flowdec source code. I believe that if I provide images that already have optimal dimensions for the FFT, it will not perform any padding by default (but I might be wrong). So far, most of the stacks I put through it had dimensions that would have been very generously rounded up and padded (I am not sure about the default fill strategy; I will also have to check the source code for that), so the output always looked quite artefact-free.
Nice. I believe there is a “mirror” pad option in there.
Regarding VRAM bloat:
Another thing to check is the precision of the calculations. If they are double, that’s too much.
A smaller gain would be to check the FFTs: if they are complex-to-complex (instead of complex-to-real and real-to-complex), that’ll cost you. Also, I just implemented a shared work area in the cudaDecon code so the two FFT plans use the same space (since they execute serially), which saved roughly a single copy of the data (out of the 7 or so it needs). A small gain, but it’s something.
…-Dan
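The two memory levers Dan mentions, precision and complex-to-complex vs real-to-complex transforms, are easy to quantify with NumPy. This is a back-of-the-envelope sketch for the stack size from this issue, not flowdec's actual allocation behaviour:

```python
import numpy as np

shape = (151, 800, 600)
n_vox = int(np.prod(shape))

# Precision: complex128 doubles the footprint of every spectrum held in VRAM.
mb = lambda nbytes: nbytes / 2**20
spec64_mb = mb(n_vox * np.dtype(np.complex64).itemsize)    # ~553 MB
spec128_mb = mb(n_vox * np.dtype(np.complex128).itemsize)  # ~1106 MB

# Transform type: a real-to-complex FFT stores only n//2 + 1 samples
# along the last axis, roughly halving the spectrum size vs fftn.
vol = np.zeros((16, 16, 16), dtype=np.float32)
r2c_shape = np.fft.rfftn(vol).shape   # (16, 16, 9) vs (16, 16, 16) for fftn
```

Each full complex128 spectrum of this volume costs roughly a gigabyte; with complex64 and real-to-complex transforms the same spectrum fits in roughly a quarter of that.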
FWIW, the default behavior is to pad to the next highest power of 2 using the "reflect" mode shown here, or to do nothing if the length along any one axis is already a power of 2. I think you probably saw this, but you can also pass pad_mode='none' to just turn off any of the automatic padding/cropping if you wanted to take care of that externally.
I am impressed by the nice tweaks that are implemented in cudaDecon. I had already looked at whether it is possible to use lower precision (such as float16/complex32) to save VRAM. However, tensorflow does not provide that level of control: it is either complex64 or complex128 (see https://www.tensorflow.org/api_docs/python/tf/signal/fft3d). Also, tensorflow doesn't give much control over FFT plan creation/storage.
I was wondering whether tensorflow could automatically fall back to storing arrays in CPU memory when there is not enough VRAM. From what I understand this happens for operations which are not supported on the GPU if the option allow_soft_placement is given, but apparently tensorflow does not do this based on memory considerations.
Yes, I did see that, and setting pad_mode='none' allowed me to deconvolve a volume that otherwise gave error messages due to lack of VRAM. To ensure optimum quality whenever possible, one would probably have to have a hierarchy like in this pseudocode:
    increase input size (origsize) by at least half the width of the PSF along each dimension -> (newsize)
    increase newsize to the nearest size optimal for speed -> (optimalnewsize)
    try allocating graph for optimalnewsize:
    catch vram_exception:
        try allocating graph for newsize:
        catch vram_exception:
            try allocating graph for origsize (warn about potential artefacts)
            finally:
                fall back to storing in main memory rather than VRAM
This gets complicated rather quickly, but I think I will implement the first part of it to ensure consistent quality regardless of the input size.
Yeah, I’ve thought about decon performance quite a bit. Another thing to check is whether these other methods are using the accelerated RL (Biggs and Andrews) as we do (and MATLAB does). The acceleration does take extra memory to compute, but the performance is worth it IMHO. That Biggs paper is pretty old and there may be a higher-performance flavor, or something that uses less memory... or maybe the Biggs and Andrews workspace arrays might be able to share memory. I should look into that.
…-Dan
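A runnable illustration of such a size-fallback hierarchy (the helper names, the `fits` probe, and the power-of-two rounding are hypothetical stand-ins for illustration, not flowdec's API):

```python
import numpy as np

def next_pow2(n):
    """Smallest power of two >= n (stand-in for an FFT-speed-optimal size)."""
    return 1 << (int(n) - 1).bit_length()

def plan_shape(orig, psf, fits):
    """Pick the best processing shape that still fits in VRAM.

    Candidates, best quality first:
      1. padded by half the PSF extent on each side, rounded to fast FFT sizes
      2. padded by half the PSF extent on each side only
      3. the original shape (may show boundary artefacts)
    `fits(shape)` is a hypothetical probe that returns False when graph
    allocation for that shape would exhaust VRAM.
    """
    padded = tuple(o + 2 * (p // 2) for o, p in zip(orig, psf))
    optimal = tuple(next_pow2(d) for d in padded)
    for cand in (optimal, padded, orig):
        if fits(cand):
            return cand
    # final fallback: leave the GPU and use main memory instead
    raise MemoryError("fall back to main-memory execution")

# Example: a probe that admits up to ~130 Mvoxel working arrays.
# For a (151, 800, 600) stack and a 64^3 PSF, the power-of-two shape
# (256, 1024, 1024) is rejected and the PSF-padded (215, 864, 664) is chosen.
fits = lambda s: np.prod(s) <= 130e6
```

This keeps the PSF padding (and hence boundary quality) for as long as possible and only sacrifices FFT speed, then padding, then the GPU itself.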
The Biggs thesis from the U of Auckland (https://researchspace.auckland.ac.nz/handle/2292/1760): I skimmed it but haven't found the time to read it. However, now that your cudaDeconv code has been open-sourced it becomes a bit more difficult to justify putting the effort in (it is fun, though, and the lessons learnt will definitely be useful in other projects). I think I need to lock down the feature set I want for now and create a usable and easily installable version.
Back to VRAM utilization: the approach I am using of deskewing on the raw data (by resampling and skewing the PSF) can potentially save considerable memory; I could also try that with cudaDeconv.
a bit more difficult to justify putting the effort in
Cool! I'm happy to hear it. :)
The Biggs Thesis
Give his paper a shot. It's pretty readable: https://doi.org/10.1364/AO.36.001766
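For reference, the vector-acceleration idea from the Biggs and Andrews paper can be sketched in a 1D toy. This is my own simplified reading of the scheme, not cudaDecon's or MATLAB's implementation: each iteration extrapolates along the previous update direction before applying a standard Richardson-Lucy step, at the cost of keeping a couple of extra work arrays (the memory overhead Dan mentions above).

```python
import numpy as np

def cconv(a, k):
    # circular convolution via FFT (toy only; real code pads first)
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(k)))

def rl_biggs(blurred, psf, n_iter=30):
    """Richardson-Lucy with (simplified) Biggs-Andrews acceleration."""
    psf_adj = np.roll(psf[::-1], 1)          # adjoint of circular convolution
    x = x_prev = np.maximum(blurred, 1e-6)
    g_prev, alpha = None, 0.0
    for _ in range(n_iter):
        y = np.maximum(x + alpha * (x - x_prev), 0)   # extrapolated point
        ratio = blurred / np.maximum(cconv(y, psf), 1e-12)
        x_new = y * cconv(ratio, psf_adj)             # standard RL step
        g = x_new - y                                 # current update direction
        if g_prev is not None:
            # acceleration factor from the correlation of successive updates
            alpha = np.sum(g * g_prev) / max(np.sum(g_prev * g_prev), 1e-12)
            alpha = min(max(alpha, 0.0), 0.999)       # keep extrapolation stable
        g_prev, x_prev, x = g, x, x_new
    return x
```

The two retained arrays (`x_prev` and `g_prev`) are exactly the kind of workspace that might be candidates for sharing memory, as discussed above.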
The approach I am using of deskewing on the raw data (by resampling and skewing the PSF) can potentially save considerable memory, I could also try that with cudaDeconv.
Right. I think you could initially give this a go by just first running cudaDeconv with skewed data and a skewed PSF. (Be aware that somewhere, maybe in OTFgen.exe or cudaDeconv.exe, there is some rotational averaging, i.e. it assumes that the PSF has some symmetry; there might be a command-line switch to turn this on/off.) Then run cudaDeconv again with iterations=0 to just deskew the data (or use whatever transform tool you already have).
…-Dan
Thanks for the link to the paper and the suggestions. I will close this issue as I've addressed some of the discussed issues via this branch.
Both gputools (which uses reikna for the FFT) and flowdec run out of GPU memory when trying to process a stack of size (151, 800, 600).
Depending on what exactly I am trying to do, the error message in tensorflow shows up either when initializing the batch cuFFT plan or later on when allocating space for a tensor.
One of the error messages I saw was that it was trying to allocate a tensor of size (256, 1024, 1024). When I crop the volume by 23 pixels in Z (this corresponds to 128 z slices), everything works fine. When I only crop 22 pixels, it fails.
flowdec rounds up the sizes to the next size where the fastest FFT can be performed. This appears to be very generous rounding. It would be nice to be able to trade some speed for the ability to process such volumes. I should look into adding an option to round up to the next size for which an FFT can be performed at all, even if it is not optimal in terms of speed.
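To make the "generous rounding" concrete, here is a sketch comparing power-of-two rounding with rounding to the next 2·3·5·7-smooth size (my own helper for illustration, not flowdec's actual optimize_dims) for the (151, 800, 600) stack from above. FFT libraries are typically still fast for smooth composite sizes, which round up far less aggressively than powers of two.

```python
import numpy as np

def next_smooth(n, primes=(2, 3, 5, 7)):
    """Smallest integer >= n whose prime factors all lie in `primes`."""
    m = n
    while True:
        k = m
        for p in primes:
            while k % p == 0:
                k //= p
        if k == 1:
            return m
        m += 1

def next_pow2(n):
    return 1 << (n - 1).bit_length()

shape = (151, 800, 600)
pow2 = tuple(next_pow2(d) for d in shape)      # (256, 1024, 1024) -> ~268 Mvox
smooth = tuple(next_smooth(d) for d in shape)  # (160, 800, 600)   -> ~77 Mvox
```

The power-of-two target reproduces the (256, 1024, 1024) tensor from the error message and needs roughly 3.5x the memory of the smooth-size target, which is exactly the speed-for-VRAM trade-off proposed here.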