[INVESTIGATION] ImageBuffer iterators based on Space Filling Curves #4554

mugulmd · 2024-12-02T18:56:46Z

Recently I've been looking into space filling curves, in particular the Morton curve (see https://en.wikipedia.org/wiki/Z-order_curve) and the Hilbert curve (see https://en.wikipedia.org/wiki/Hilbert_curve). I've read that these curves can be used to iterate through the pixels of an image in a "cache-friendly" way as they preserve spatial locality fairly well, and thus could improve the performances of some image processing algorithms.

I've started doing some investigation work, however the results I got so far do not suggest any performance improvements when using the Morton curve to iterate: I'm applying a simple convolution with a Gaussian kernel on a 4096x4096 image (on a single thread), and iterating with ImageBuffer::Iterator vs with my custom Morton curve implementation both take on average the exact same time.

If someone has some knowledge around space filling curves, performance engineering, or simply wants to continue the investigation on their side and provide more insight on this, your help is definitely welcome!

Here is the code I wrote for this initial investigation:

#include <OpenImageIO/imageio.h>
#include <OpenImageIO/imagebuf.h>
#include <OpenImageIO/imagebufalgo.h>

#include <iostream>
#include <string>
#include <cmath>
#include <chrono>


using namespace OIIO;


void process_ref(ImageBuf & dst, const ImageBuf & src, const ImageBuf & kernel)
{
    ROI roi    = src.roi();
    ROI kroi   = kernel.roi();

    ImageBuf::Iterator<float> dstit(dst);
    ImageBuf::ConstIterator<float> srcit(src);

    for (; !dstit.done(); ++dstit) {
        float sum = 0.f;

        const float* k = (const float*)kernel.localpixels();
        srcit.rerange(dstit.x() + kroi.xbegin, dstit.x() + kroi.xend,
                      dstit.y() + kroi.ybegin, dstit.y() + kroi.yend,
                      0                      , 1                    ,
                      ImageBuf::WrapClamp);
        for (; !srcit.done(); ++srcit, k += 1) {
            sum += k[0] * srcit[0];
        }

        dstit[0] = sum;
    }
}


void process_morton(ImageBuf & dst, const ImageBuf & src, const ImageBuf & kernel, int nbits)
{
    float* pixels = static_cast<float*>(dst.localpixels());

    const ImageSpec & spec = dst.spec();
    const uint64_t nmax    = (1 << (2 * nbits));

    ROI roi    = src.roi();
    ROI kroi   = kernel.roi();

    ImageBuf::ConstIterator<float> srcit(src);

    for (uint64_t i = 0; i < nmax; ++i) {
        uint32_t xmorton = 0;
        uint32_t ymorton = 0;
        for (int j = 0; j < nbits; ++j) {
            xmorton |= (i & (1 << (2 * j))) >> j;
            ymorton |= (i & (1 << (2 * j + 1))) >> (j + 1);
        }

        float sum = 0.f;

        const float* k = (const float*)kernel.localpixels();
        srcit.rerange(xmorton + kroi.xbegin, xmorton + kroi.xend,
                      ymorton + kroi.ybegin, ymorton + kroi.yend,
                      0                    , 1                  ,
                      ImageBuf::WrapClamp);
        for (; !srcit.done(); ++srcit, k += 1) {
            sum += k[0] * srcit[0];
        }

        pixels[ymorton * spec.width + xmorton] = sum;
    }
}


int main(int argc, char *argv[])
{
    // Command-line arguments
    const std::string mode = argv[1];
    const int nbits = std::stoi(argv[2]);
    const float kernel_size = std::stof(argv[3]);
    const std::string filename = argv[4];

    // Specs for image buffer
    // We only use dimensions that are powers of 2,
    // with an exponent specified by the user
    const int xres = (1 << nbits);
    const int yres = (1 << nbits);
    const int nchannels = 1;
    ImageSpec spec(xres, yres, nchannels, TypeDesc::FLOAT);

    // Destination buffer
    ImageBuf dst(spec);

    // Source buffer
    // Use a simple checker
    ImageBuf src(spec);
    float dark[3]  = { 0.1, 0.1, 0.1 };
    float light[3] = { 0.4, 0.4, 0.4 };
    ImageBufAlgo::checker(src, 64, 64, 1, cspan<float>(dark), cspan<float>(light));

    // Kernel buffer
    // Use a normalized Gaussian kernel
    ImageBuf kernel = ImageBufAlgo::make_kernel("gaussian", kernel_size, kernel_size);

    // Apply convolution to compute the final image
    // Measure elapsed time to compare iterator-based approach
    // and Morton-curve-based approach
    auto t_start = std::chrono::high_resolution_clock::now();
    if (mode == "ref") process_ref(dst, src, kernel);
    else if (mode == "morton") process_morton(dst, src, kernel, nbits);
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Elapsed time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count()
              << " microseconds" << std::endl;

    // Store final image on disk
    auto out = ImageOutput::create(filename);
    out->open(filename, spec);
    dst.write(out.get());
    out->close();

    return EXIT_SUCCESS;
}

The text was updated successfully, but these errors were encountered:

ThiagoIze · 2024-12-03T00:27:08Z

My first suggestion would be to double check what the access pattern is in the current code. For instance, if the data is stored in scanline order and you're reading in scanline order, then you are already optimal. On the other hand, if your algorithm needs to randomly read a 3x3 window, then the space filling curve is likely to help. If you have a 3x3 moving window, then you're reading from 3 scanlines in a predictable fashion and each time you move the window to the right by one pixel, the 3 new pixels are all likely to already be in your L2 and often in 3 L1 cachelines.

mugulmd · 2024-12-03T11:29:07Z

Hey @ThiagoIze, thanks for the note that definitely makes sense !
Do you have anything in mind for an algorithm that "randomly reads a 3x3 window" ? Ideally something that is already implemented in OIIO and that we could improve with space filling curves.

ThiagoIze · 2024-12-03T23:13:42Z

Sorry, I don't have anything in mind that could benefit from space filling curves. I imagine most image processing algorithms already operate on all the pixels in relatively cache friendly ways. Having said that, there's large amounts of OIIO I am not familiar with, so maybe there's something that could benefit from it?

lgritz · 2024-12-03T23:59:26Z

Maybe if you were operating on huge, ImageCache-backed, tiled images that would exceed tile cache size if read in a suboptimal order, you could reduce I/O significantly if you had a pixel traversal order that was tile-size-aware and made sure it never had to revisit a tile after moving to another?

For in-core images (or cached images where the cache size is more than the total of the images you need in memory -- in which case you're still reading each tile exactly once), I'm not sure that you could do measurably better than we do currently in scanline order.

(Detail: most IBA function will use the thread pool, so actually it's breaking the image into horizontal strips of adjacent scanlines, and working on those simultaneously. So I'm less sure about how that complicates predictions of how either the RAM cache or IC cache performance would be affected by pixel traversal order.)

Maybe before investing much work, one thing you could try is to experiment with doing it as badly as possible -- like if you made an alternate IB::Iterator that visits every pixel within the region, but as incoherently as possible (it's the blue noise of iterators!) and benchmark a bunch of IBA functions against to the current boring scanline traversal iterator, then surely that tells you something about whether this is likely to be helpful. I mean, if you can barely measure the perf difference between "as incoherent as possible" and "scanline", you are not going to get much benefit from "slightly better space-filling order". But if there is a huge difference in this experiment, maybe it's worth the next step of the investigation?

lgritz · 2024-12-04T00:03:31Z

Another thought:

If this did seem promising in benchmarks, I still think it's likely that there is no best order, but that different orders might be better in different situations.

I definitely wouldn't want to make additional conditional evaluations in the IB::Iterator traversal, like you don't want every ++ to be doing "if traversal order A then... else if traversal order B then...". It seems self-defeating to add expense to the traversal itself.

A design I might propose is to change IB::Iterator, which is currently templated on the data type of the buffer and the data type that you'd like the values to APPEAR to be (usually float), and add a third template parameter for "IteratorTraits", which among other possible thing can be the home of the code that is currently IB::Iterator::operator++, which is implicitly determines the traversal order. So the default one can be as it currently is, but if you defined

class MyIteratorTraits { operator++() { implement weird traversal order here... } }

then you could have

ImageBuf::Iterator<float,float,MyIteratorTraits> myiterator;

and it would do the alternate traversal order.

Anyway, something like that could allow you to experiment with different traversal patterns without adding any overhead or complexity to the default behavior.

mugulmd · 2024-12-04T12:14:02Z

Maybe before investing much work, one thing you could try is to experiment with doing it as badly as possible -- like if you made an alternate IB::Iterator that visits every pixel within the region, but as incoherently as possible (it's the blue noise of iterators!) and benchmark a bunch of IBA functions against to the current boring scanline traversal iterator, then surely that tells you something about whether this is likely to be helpful. I mean, if you can barely measure the perf difference between "as incoherent as possible" and "scanline", you are not going to get much benefit from "slightly better space-filling order". But if there is a huge difference in this experiment, maybe it's worth the next step of the investigation?

Definitely a good idea, I'll start with that 😅

mugulmd · 2024-12-04T12:17:00Z

A design I might propose is to change IB::Iterator, which is currently templated on the data type of the buffer and the data type that you'd like the values to APPEAR to be (usually float), and add a third template parameter for "IteratorTraits", which among other possible thing can be the home of the code that is currently IB::Iterator::operator++, which is implicitly determines the traversal order.

Sounds like a good way to go !

mugulmd · 2024-12-04T12:20:48Z

I just want to emphasize something: I absolutely do not want to "force push" a new feature in OIIO, if the investigation does not display any real gain in implementing space filling curves, or if it's only in some particuliar situation that no one is ever going to run into, then let's just not do it, that would create additional code complexity for nothing and I tend to hate that 😅

lgritz mentioned this issue Dec 5, 2024

fix: broken pgm having memory access error #4559

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[INVESTIGATION] ImageBuffer iterators based on Space Filling Curves #4554

[INVESTIGATION] ImageBuffer iterators based on Space Filling Curves #4554

mugulmd commented Dec 2, 2024

ThiagoIze commented Dec 3, 2024

mugulmd commented Dec 3, 2024 •

edited

Loading

ThiagoIze commented Dec 3, 2024

lgritz commented Dec 3, 2024

lgritz commented Dec 4, 2024

mugulmd commented Dec 4, 2024

mugulmd commented Dec 4, 2024

mugulmd commented Dec 4, 2024

[INVESTIGATION] ImageBuffer iterators based on Space Filling Curves #4554

[INVESTIGATION] ImageBuffer iterators based on Space Filling Curves #4554

Comments

mugulmd commented Dec 2, 2024

ThiagoIze commented Dec 3, 2024

mugulmd commented Dec 3, 2024 • edited Loading

ThiagoIze commented Dec 3, 2024

lgritz commented Dec 3, 2024

lgritz commented Dec 4, 2024

mugulmd commented Dec 4, 2024

mugulmd commented Dec 4, 2024

mugulmd commented Dec 4, 2024

mugulmd commented Dec 3, 2024 •

edited

Loading