Header only C++ implementation of a fast gaussian blur approximation in linear time. It is based on a blog post by Ivan Kutskir: blog. Which refers to a presentation by Wojciech Jarosz: slides. Which itself describes an algorithm from the paper Fast Almost-Gaussian Filtering by Peter Kovesi: site, paper. The code uses STB_IMAGE and STB_IMAGE_WRITE by stb for image manipulation: stb github.
The fast Gaussian blur algorithm in linear time is performed with several box blur passes over an image. Applying several times box blur converges towards a true Gaussian blur thanks to the theorem central limit (cf. next image). Three passes are sufficient for good quality results but the exposed implementation supports arbitrary number of box blur passes.
A 2D box blur is a separable convolution, hence it is most of the time performed using first an horizontal 1D box blur pass and then a vertical 1D box blur pass. Usually the process of N box blur passes should alternate between these horizontal and vertical passes. However thanks to box blur properties the horizontal and vertical passes can be performed in any order without changing the result. Hence for performance purposes I came up with the following algorithm:
- apply N times horizontal box blur (horizontal passes)
- flip the image buffer (transposition)
- apply N times horizontal box blur (vertical passes)
- flip the image buffer (transposition)
Steps 1. and 3. are performed with the horizontal_blur
function, which is a fast 1D box blur pass with a sliding accumulator.
Steps 2. and 4. are performed with the flip_block
function, which is a fast image buffer transposition, processed per block such that it better preserves cache coherency.
Note 1: The fast gaussian blur algorithm is not accurate on image boundaries. It performs a diffusion of the signal with several independant passes, each pass depending of the preceding one. Some of the diffused signal is lost near borders and results in a slight loss of accuracy for next pass. This problem can be solved by increasing the image support of half the box kernel extent at each pass of the algorithm. The added padding would in this case capture the diffusion and make the next pass accurate. On contrary true Gaussian blur does not suffer this problem since the whole diffusion process is performed in one pass only. The extra padding is not performed in this implementation, however we provide several border policies resulting in dfferent approximations and accuracies.
Note 2: The fast gaussian blur algorithm does not reproduce accurately a true desired Gaussian standard deviation (sigma). The approximate sigma oscillate around the true sigma and the error will be less noticeable as sigma increases. In fact, this method is designed to resolve medium or high values of sigma super fast, and are not well suited for small sigmas (<=2), since a simple separable Gaussian blur implementation could be equally fast and of better quality.
For further details please refer to:
- http://blog.ivank.net/fastest-gaussian-blur.html
- https://www.peterkovesi.com/papers/FastGaussianSmoothing.pdf
The implementation is defined in the fast_gaussian_blur_template.h
header that contains the fastest templated cache coherent version I could make.
The main exposed function is defined as:
template<typename T>
void fast_gaussian_blur(T *& in, T *& out, const int w, const int h, const int c, const float sigma, const unsigned int n);
where the arguments are:
in
is a reference to the source buffer ptr,out
is a reference to the target buffer ptr,w
is the image width,h
is the image height,c
is the image number of channels,sigma
is the desired Gaussian standard deviation,n
is the number of box blur passes to perform.
Note that the buffer values (input and output) and their pointers are modified during the process hence they can not be constant.
This version blurs 2000k pixels in ~7ms on all cores of a Ryzen 7 2700X CPU with OpenMP. Hence it may be used for real-time applications with reasonable image resolutions. A SIMD vectorized or a GPU version of this algorithm could be significantly faster (but may be painful for the developper for arbitrary channels number / data sizes).
Note that I have tried to beat the template version with an ISPC compiled version, but still can not match the performance. If one manage to improve this version I would be pleased to discuss how :)
In a Unix or WSL term you can use the provided makefile; use make
to build the target fastblur
example (main.cpp) without dependencies.
Run the program with the following command:
./fastblur <input_filename> <output_filename> <sigma> <passes = 3>
- input_image_filename should be any of [.jpg, .png, .bmp, .tga, .psd, .gif, .hdr, .pic, .pnm].
- output_image_filename should be any of [.png, .jpg, .bmp] (unknown extensions will be saved as .png by default).
- sigma is the desired Gaussian blur standard deviation (should be positive).
- passes is an optional argument that controls the number of box blur passes (should be positive). Default is 3 and current implementation supports up to 10 passes, but one can easily add more in the code.
The fast Gaussian blur is linear in time regarding the size of the input image, but independent of sigma hence sigma = 5 is equally fast as sigma = 50.
Original | sigma = 2 | sigma = 5 | sigma = 10 | sigma = 30 | sigma = 50 |
---|---|---|---|---|---|
The graph below shows the average exectution time of blur algorithm w.r.t pixel number on Ryzen 7 2700X. Compared to true Gaussian blur the approximate fast Gaussian blur results in a huge speedup. The original algorithm is described with 3 horizontal blur passes and 3 vertical blur passes (brown and red curves). But the column major traversal of large image buffer may result in cache incohenrency in the original vertical blur pass (see at the right of the dashed blue line). Instead, we can perform an image buffer transpositions and only cache coherent row major traversals (horizontal blur) to mitigate the problem (purple and green curves). Again, the transposition step is not a cache friendly operation thus on large image buffer we observe the slope w/ transpose slowly increasing w.r.t image size (more visible on the green curve). Performing the image transposition with fixed squared blocks per thread helps preserving the cache coherency and results in the fastest version of the algortihm (orange and blue curves). Further improved in the templated version exposed in the header (blue curve).
Special thanks to Jean-Claude Iehl (@jciehl) for our insightful discussions and his passion for making code simple and fast.
You may use, distribute and modify this code under the terms of the MIT license. For further details please refer to : https://mit-license.org/