Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[INVESTIGATION] ImageBuffer iterators based on Space Filling Curves #4554

Open
mugulmd opened this issue Dec 2, 2024 · 8 comments
Open

[INVESTIGATION] ImageBuffer iterators based on Space Filling Curves #4554

mugulmd opened this issue Dec 2, 2024 · 8 comments

Comments

@mugulmd
Copy link
Contributor

mugulmd commented Dec 2, 2024

Recently I've been looking into space filling curves, in particular the Morton curve (see https://en.wikipedia.org/wiki/Z-order_curve) and the Hilbert curve (see https://en.wikipedia.org/wiki/Hilbert_curve). I've read that these curves can be used to iterate through the pixels of an image in a "cache-friendly" way as they preserve spatial locality fairly well, and thus could improve the performances of some image processing algorithms.

I've started doing some investigation work, however the results I got so far do not suggest any performance improvements when using the Morton curve to iterate: I'm applying a simple convolution with a Gaussian kernel on a 4096x4096 image (on a single thread), and iterating with ImageBuffer::Iterator vs with my custom Morton curve implementation both take on average the exact same time.

If someone has some knowledge around space filling curves, performance engineering, or simply wants to continue the investigation on their side and provide more insight on this, your help is definitely welcome!

Here is the code I wrote for this initial investigation:

#include <OpenImageIO/imageio.h>
#include <OpenImageIO/imagebuf.h>
#include <OpenImageIO/imagebufalgo.h>

#include <iostream>
#include <string>
#include <cmath>
#include <chrono>


using namespace OIIO;


void process_ref(ImageBuf & dst, const ImageBuf & src, const ImageBuf & kernel)
{
    ROI roi    = src.roi();
    ROI kroi   = kernel.roi();

    ImageBuf::Iterator<float> dstit(dst);
    ImageBuf::ConstIterator<float> srcit(src);

    for (; !dstit.done(); ++dstit) {
        float sum = 0.f;

        const float* k = (const float*)kernel.localpixels();
        srcit.rerange(dstit.x() + kroi.xbegin, dstit.x() + kroi.xend,
                      dstit.y() + kroi.ybegin, dstit.y() + kroi.yend,
                      0                      , 1                    ,
                      ImageBuf::WrapClamp);
        for (; !srcit.done(); ++srcit, k += 1) {
            sum += k[0] * srcit[0];
        }

        dstit[0] = sum;
    }
}


void process_morton(ImageBuf & dst, const ImageBuf & src, const ImageBuf & kernel, int nbits)
{
    float* pixels = static_cast<float*>(dst.localpixels());

    const ImageSpec & spec = dst.spec();
    const uint64_t nmax    = (1 << (2 * nbits));

    ROI roi    = src.roi();
    ROI kroi   = kernel.roi();

    ImageBuf::ConstIterator<float> srcit(src);

    for (uint64_t i = 0; i < nmax; ++i) {
        uint32_t xmorton = 0;
        uint32_t ymorton = 0;
        for (int j = 0; j < nbits; ++j) {
            xmorton |= (i & (1 << (2 * j))) >> j;
            ymorton |= (i & (1 << (2 * j + 1))) >> (j + 1);
        }

        float sum = 0.f;

        const float* k = (const float*)kernel.localpixels();
        srcit.rerange(xmorton + kroi.xbegin, xmorton + kroi.xend,
                      ymorton + kroi.ybegin, ymorton + kroi.yend,
                      0                    , 1                  ,
                      ImageBuf::WrapClamp);
        for (; !srcit.done(); ++srcit, k += 1) {
            sum += k[0] * srcit[0];
        }

        pixels[ymorton * spec.width + xmorton] = sum;
    }
}


int main(int argc, char *argv[])
{
    // Command-line arguments
    const std::string mode = argv[1];
    const int nbits = std::stoi(argv[2]);
    const float kernel_size = std::stof(argv[3]);
    const std::string filename = argv[4];

    // Specs for image buffer
    // We only use dimensions that are powers of 2,
    // with an exponent specified by the user
    const int xres = (1 << nbits);
    const int yres = (1 << nbits);
    const int nchannels = 1;
    ImageSpec spec(xres, yres, nchannels, TypeDesc::FLOAT);

    // Destination buffer
    ImageBuf dst(spec);

    // Source buffer
    // Use a simple checker
    ImageBuf src(spec);
    float dark[3]  = { 0.1, 0.1, 0.1 };
    float light[3] = { 0.4, 0.4, 0.4 };
    ImageBufAlgo::checker(src, 64, 64, 1, cspan<float>(dark), cspan<float>(light));

    // Kernel buffer
    // Use a normalized Gaussian kernel
    ImageBuf kernel = ImageBufAlgo::make_kernel("gaussian", kernel_size, kernel_size);

    // Apply convolution to compute the final image
    // Measure elapsed time to compare iterator-based approach
    // and Morton-curve-based approach
    auto t_start = std::chrono::high_resolution_clock::now();
    if (mode == "ref") process_ref(dst, src, kernel);
    else if (mode == "morton") process_morton(dst, src, kernel, nbits);
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Elapsed time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count()
              << " microseconds" << std::endl;

    // Store final image on disk
    auto out = ImageOutput::create(filename);
    out->open(filename, spec);
    dst.write(out.get());
    out->close();

    return EXIT_SUCCESS;
}
@ThiagoIze
Copy link
Collaborator

My first suggestion would be to double check what the access pattern is in the current code. For instance, if the data is stored in scanline order and you're reading in scanline order, then you are already optimal. On the other hand, if your algorithm needs to randomly read a 3x3 window, then the space filling curve is likely to help. If you have a 3x3 moving window, then you're reading from 3 scanlines in a predictable fashion and each time you move the window to the right by one pixel, the 3 new pixels are all likely to already be in your L2 and often in 3 L1 cachelines.

@mugulmd
Copy link
Contributor Author

mugulmd commented Dec 3, 2024

Hey @ThiagoIze, thanks for the note that definitely makes sense !
Do you have anything in mind for an algorithm that "randomly reads a 3x3 window" ? Ideally something that is already implemented in OIIO and that we could improve with space filling curves.

@ThiagoIze
Copy link
Collaborator

Sorry, I don't have anything in mind that could benefit from space filling curves. I imagine most image processing algorithms already operate on all the pixels in relatively cache friendly ways. Having said that, there's large amounts of OIIO I am not familiar with, so maybe there's something that could benefit from it?

@lgritz
Copy link
Collaborator

lgritz commented Dec 3, 2024

Maybe if you were operating on huge, ImageCache-backed, tiled images that would exceed tile cache size if read in a suboptimal order, you could reduce I/O significantly if you had a pixel traversal order that was tile-size-aware and made sure it never had to revisit a tile after moving to another?

For in-core images (or cached images where the cache size is more than the total of the images you need in memory -- in which case you're still reading each tile exactly once), I'm not sure that you could do measurably better than we do currently in scanline order.

(Detail: most IBA function will use the thread pool, so actually it's breaking the image into horizontal strips of adjacent scanlines, and working on those simultaneously. So I'm less sure about how that complicates predictions of how either the RAM cache or IC cache performance would be affected by pixel traversal order.)

Maybe before investing much work, one thing you could try is to experiment with doing it as badly as possible -- like if you made an alternate IB::Iterator that visits every pixel within the region, but as incoherently as possible (it's the blue noise of iterators!) and benchmark a bunch of IBA functions against to the current boring scanline traversal iterator, then surely that tells you something about whether this is likely to be helpful. I mean, if you can barely measure the perf difference between "as incoherent as possible" and "scanline", you are not going to get much benefit from "slightly better space-filling order". But if there is a huge difference in this experiment, maybe it's worth the next step of the investigation?

@lgritz
Copy link
Collaborator

lgritz commented Dec 4, 2024

Another thought:

If this did seem promising in benchmarks, I still think it's likely that there is no best order, but that different orders might be better in different situations.

I definitely wouldn't want to make additional conditional evaluations in the IB::Iterator traversal, like you don't want every ++ to be doing "if traversal order A then... else if traversal order B then...". It seems self-defeating to add expense to the traversal itself.

A design I might propose is to change IB::Iterator, which is currently templated on the data type of the buffer and the data type that you'd like the values to APPEAR to be (usually float), and add a third template parameter for "IteratorTraits", which among other possible thing can be the home of the code that is currently IB::Iterator::operator++, which is implicitly determines the traversal order. So the default one can be as it currently is, but if you defined

class MyIteratorTraits { operator++() { implement weird traversal order here... } }

then you could have

ImageBuf::Iterator<float,float,MyIteratorTraits> myiterator;

and it would do the alternate traversal order.

Anyway, something like that could allow you to experiment with different traversal patterns without adding any overhead or complexity to the default behavior.

@mugulmd
Copy link
Contributor Author

mugulmd commented Dec 4, 2024

Maybe before investing much work, one thing you could try is to experiment with doing it as badly as possible -- like if you made an alternate IB::Iterator that visits every pixel within the region, but as incoherently as possible (it's the blue noise of iterators!) and benchmark a bunch of IBA functions against to the current boring scanline traversal iterator, then surely that tells you something about whether this is likely to be helpful. I mean, if you can barely measure the perf difference between "as incoherent as possible" and "scanline", you are not going to get much benefit from "slightly better space-filling order". But if there is a huge difference in this experiment, maybe it's worth the next step of the investigation?

Definitely a good idea, I'll start with that 😅

@mugulmd
Copy link
Contributor Author

mugulmd commented Dec 4, 2024

A design I might propose is to change IB::Iterator, which is currently templated on the data type of the buffer and the data type that you'd like the values to APPEAR to be (usually float), and add a third template parameter for "IteratorTraits", which among other possible thing can be the home of the code that is currently IB::Iterator::operator++, which is implicitly determines the traversal order.

Sounds like a good way to go !

@mugulmd
Copy link
Contributor Author

mugulmd commented Dec 4, 2024

I just want to emphasize something: I absolutely do not want to "force push" a new feature in OIIO, if the investigation does not display any real gain in implementing space filling curves, or if it's only in some particuliar situation that no one is ever going to run into, then let's just not do it, that would create additional code complexity for nothing and I tend to hate that 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants