
Port of stb_image optimized paeth unfiltering #539

Merged
merged 16 commits on Dec 4, 2024

Conversation

Shnatsel
Contributor

@Shnatsel Shnatsel commented Dec 1, 2024

A 10% end-to-end performance gain on a Paeth-filtered image I concocted with convert -quality 49 in.png out.png. That's a comparatively big gain, and Paeth-filtered images are rather common.

Godbolt shows much shorter assembly that gets autovectorized even in the bpp=3 case: https://godbolt.org/z/fq3EjvT4b
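For context, stb_image's optimized Paeth predictor is a branchless reformulation of the predictor in the PNG specification. Below is a minimal sketch of the two side by side, assuming this port follows the same formulation as stb_image; the function names are illustrative, not the crate's actual identifiers:

```rust
/// Paeth predictor as written in the PNG specification.
fn paeth_reference(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic; // initial estimate
    let pa = (p - ia).abs();
    let pb = (p - ib).abs();
    let pc = (p - ic).abs();
    // Return the neighbor closest to the estimate, breaking ties a, b, c.
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Branchless reformulation in the style of stb_image: no data-dependent
/// branches, so the compiler can lower it to min/max/select instructions
/// and autovectorize the surrounding loop.
fn paeth_stb(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let thresh = ic * 3 - (ia + ib);
    let lo = ia.min(ib);
    let hi = ia.max(ib);
    let t0 = if hi <= thresh { lo } else { ic };
    let t1 = if thresh <= lo { hi } else { t0 };
    t1 as u8
}

fn main() {
    // Exhaustively check equivalence over all 256^3 inputs
    // (the "test against the reference" TODO item below).
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            for c in 0..=255u8 {
                assert_eq!(paeth_reference(a, b, c), paeth_stb(a, b, c));
            }
        }
    }
    println!("formulations are equivalent");
}
```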

TODO:

  • Have someone else confirm this performance boost exists and isn't isolated to my machine
  • Exhaustively test this implementation against the reference
  • Do something about the nightly variant of Paeth unfiltering

Also done:

  • Fix unstable feature dramatically regressing performance on ARM
  • Add CI job to make sure unstable feature compiles
  • Add CI job to run tests on ARM

@Shnatsel Shnatsel requested a review from fintelia December 1, 2024 08:04
@HeroicKatora
Member

I believe I can confirm the performance change on my machine.

decode/Transparency.png time:   [61.886 µs 61.898 µs 61.916 µs]
                        thrpt:  [5.4150 GiB/s 5.4166 GiB/s 5.4176 GiB/s]
                 change:
                        time:   [-5.6629% -5.5677% -5.4790%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7966% +5.8960% +6.0028%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
decode/kodim02.png      time:   [2.1276 ms 2.1296 ms 2.1312 ms]
                        thrpt:  [527.87 MiB/s 528.27 MiB/s 528.75 MiB/s]
                 change:
                        time:   [-5.2926% -5.2361% -5.1695%] (p = 0.00 < 0.05)
                        thrpt:  [+5.4513% +5.5254% +5.5883%]
                        Performance has improved.
decode/kodim07.png      time:   [2.5823 ms 2.5825 ms 2.5827 ms]
                        thrpt:  [435.59 MiB/s 435.63 MiB/s 435.66 MiB/s]
                 change:
                        time:   [-12.692% -12.406% -12.139%] (p = 0.00 < 0.05)
                        thrpt:  [+13.816% +14.163% +14.537%]
                        Performance has improved.
decode/kodim17.png      time:   [2.1296 ms 2.1311 ms 2.1319 ms]
                        thrpt:  [527.69 MiB/s 527.89 MiB/s 528.28 MiB/s]
                 change:
                        time:   [-0.6423% -0.5476% -0.4601%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4622% +0.5506% +0.6465%]
                        Change within noise threshold.
decode/kodim23.png      time:   [1.9203 ms 1.9214 ms 1.9226 ms]
                        thrpt:  [585.15 MiB/s 585.52 MiB/s 585.84 MiB/s]
                 change:
                        time:   [-9.0051% -8.8174% -8.6456%] (p = 0.00 < 0.05)
                        thrpt:  [+9.4638% +9.6700% +9.8963%]
                        Performance has improved.

Here kodim17 indeed isn't using any Paeth-filtered lines 🎉

@Shnatsel
Contributor Author

Shnatsel commented Dec 1, 2024

I've changed the nightly version to just use the autovectorized stable version. I'm not seeing a big boost from enabling the unstable feature for RGBA after that; the 1% difference might just be noise. We might be able to simply remove all that SIMD code.
RGB is still faster on nightly due to the clever while loop.

@Shnatsel Shnatsel marked this pull request as ready for review December 1, 2024 20:49
@Shnatsel
Contributor Author

Shnatsel commented Dec 1, 2024

I have the removal of nightly SIMD for RGBA ready to go in a branch, but I'm not including it in this PR to keep this one as simple and uncontroversial as possible, and hopefully get it merged quickly.

@kornelski
Contributor

In a microbenchmark on Apple Silicon, the speed of this filter varies depending on pixel size, and it's half the speed for some sizes. On x86 it's mostly faster, except 11% slower for 6-byte pixels.

#540

It's odd that there's such a big variance.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

On my Zen machine this gets us +5 MP/s on corpus-bench with a corpus of random PNG images scraped off the web.

According to Kornel's benchmarks this only regresses the bpp=6 case on x86, and even that regression is not dramatic (11% on the filter alone). But on Apple ARM this significantly regresses the crucial bpp=3 case, and I'm not willing to accept that.

How about we land this only for x86, and stick to the previous impl on other platforms? I'm not a fan of the duplication, but the code is trivial enough that it's only adding 10 lines or so.
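A hypothetical sketch of what that per-architecture split could look like using compile-time dispatch; the function bodies here are illustrative stand-ins, not the crate's actual code:

```rust
// Sketch: select the Paeth implementation at compile time, keeping the
// new branchless formulation x86_64-only and the previous one elsewhere.

#[cfg(target_arch = "x86_64")]
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // Branchless formulation that autovectorizes well on x86.
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let thresh = ic * 3 - (ia + ib);
    let lo = ia.min(ib);
    let hi = ia.max(ib);
    let t0 = if hi <= thresh { lo } else { ic };
    let t1 = if thresh <= lo { hi } else { t0 };
    t1 as u8
}

#[cfg(not(target_arch = "x86_64"))]
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // Previous behavior: the reference predictor from the PNG spec.
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic;
    let (pa, pb, pc) = ((p - ia).abs(), (p - ib).abs(), (p - ic).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

fn main() {
    // The two formulations compute the same result, so behavior
    // stays arch-independent even though the code paths differ.
    assert_eq!(filter_paeth(100, 200, 50), 200);
    assert_eq!(filter_paeth(50, 100, 200), 50);
}
```

Since only one `cfg` branch is compiled per target, the duplication costs nothing at runtime; the maintenance cost is the handful of extra lines.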

@fintelia
Contributor

fintelia commented Dec 2, 2024

Other decoders have different (manual SIMD) implementations for different architectures. It would be nice if we could find a single autovectorized/portable-simd implementation that got ideal performance on all architectures, but I'm not shocked that this isn't the case. Using different implementations on x86 and ARM seems fine to me.

Also, it probably shouldn't be in this PR, but filter.rs has gotten large enough that we should probably think about spinning it off into multiple files.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

@kornelski could you benchmark this PR on ARM with the unstable feature enabled, and compare that against master?

That would be cargo +nightly bench --features=unstable -Fbenchmarks Paeth --bench unfilter -- --save-baseline=…

I am particularly curious about bpp=4 and 8. These benchmarks will tell me whether I need to bring back the portable SIMD version, or if autovectorization does a good job for those cases already.

I can't do it myself because I don't have any Apple hardware, renting it in the cloud for a short time is impossible due to Apple's licensing shenanigans, and ARM cloud instances have different performance characteristics.

Edit: wait, never mind, I hadn't pushed the changes I need benchmarked yet. I'm losing my mind.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

The code isn't pretty but it's ready for benchmarking on ARM with --features=unstable

@kornelski
Contributor

Apple M3 Max:

unstable pr vs stable main

unfilter/filter=Paeth/bpp=1
                        time:   [7.4984 µs 7.5002 µs 7.5025 µs]
                        thrpt:  [520.66 MiB/s 520.82 MiB/s 520.95 MiB/s]
                 change:
                        time:   [-4.2620% -4.0539% -3.8443%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9980% +4.2252% +4.4517%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=2
                        time:   [8.6324 µs 8.6350 µs 8.6375 µs]
                        thrpt:  [904.48 MiB/s 904.75 MiB/s 905.02 MiB/s]
                 change:
                        time:   [-0.2792% -0.2214% -0.1713%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1716% +0.2219% +0.2800%]
                        Change within noise threshold.

unfilter/filter=Paeth/bpp=3
                        time:   [15.242 µs 15.266 µs 15.297 µs]
                        thrpt:  [766.08 MiB/s 767.65 MiB/s 768.84 MiB/s]
                 change:
                        time:   [+52.791% +53.175% +53.641%] (p = 0.00 < 0.05)
                        thrpt:  [-34.913% -34.715% -34.551%]
                        Performance has regressed.

unfilter/filter=Paeth/bpp=4
                        time:   [17.279 µs 17.283 µs 17.289 µs]
                        thrpt:  [903.76 MiB/s 904.05 MiB/s 904.27 MiB/s]
                 change:
                        time:   [-0.0658% +0.1367% +0.3469%] (p = 0.20 > 0.05)
                        thrpt:  [-0.3457% -0.1365% +0.0659%]
                        No change in performance detected.

unfilter/filter=Paeth/bpp=6
                        time:   [17.291 µs 17.300 µs 17.309 µs]
                        thrpt:  [1.3223 GiB/s 1.3230 GiB/s 1.3237 GiB/s]
                 change:
                        time:   [-6.1936% -6.0063% -5.8107%] (p = 0.00 < 0.05)
                        thrpt:  [+6.1692% +6.3901% +6.6026%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=8
                        time:   [15.184 µs 15.192 µs 15.207 µs]
                        thrpt:  [2.0069 GiB/s 2.0088 GiB/s 2.0099 GiB/s]
                 change:
                        time:   [-0.4410% -0.3503% -0.2358%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2364% +0.3516% +0.4429%]
                        Change within noise threshold.

unstable pr vs unstable main

unfilter/filter=Paeth/bpp=1
                        time:   [7.5009 µs 7.5058 µs 7.5131 µs]
                        thrpt:  [519.92 MiB/s 520.43 MiB/s 520.77 MiB/s]
                 change:
                        time:   [-0.2384% -0.1265% -0.0037%] (p = 0.02 < 0.05)
                        thrpt:  [+0.0037% +0.1267% +0.2390%]
                        Change within noise threshold.

unfilter/filter=Paeth/bpp=2
                        time:   [8.6217 µs 8.6257 µs 8.6292 µs]
                        thrpt:  [905.35 MiB/s 905.72 MiB/s 906.14 MiB/s]
                 change:
                        time:   [+1.9294% +2.2062% +2.4550%] (p = 0.00 < 0.05)
                        thrpt:  [-2.3962% -2.1586% -1.8929%]
                        Performance has regressed.

unfilter/filter=Paeth/bpp=3
                        time:   [15.239 µs 15.252 µs 15.274 µs]
                        thrpt:  [767.24 MiB/s 768.34 MiB/s 768.98 MiB/s]
                 change:
                        time:   [-1.0307% -0.6365% -0.2955%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2964% +0.6406% +1.0414%]
                        Change within noise threshold.

unfilter/filter=Paeth/bpp=4
                        time:   [17.280 µs 17.295 µs 17.324 µs]
                        thrpt:  [901.93 MiB/s 903.43 MiB/s 904.24 MiB/s]
                 change:
                        time:   [-34.943% -34.832% -34.735%] (p = 0.00 < 0.05)
                        thrpt:  [+53.220% +53.448% +53.710%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=6
                        time:   [17.286 µs 17.292 µs 17.303 µs]
                        thrpt:  [1.3228 GiB/s 1.3236 GiB/s 1.3241 GiB/s]
                 change:
                        time:   [-1.6563% -1.3522% -1.0617%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0731% +1.3707% +1.6842%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=8
                        time:   [15.190 µs 15.205 µs 15.224 µs]
                        thrpt:  [2.0046 GiB/s 2.0071 GiB/s 2.0091 GiB/s]
                 change:
                        time:   [-17.951% -17.610% -17.295%] (p = 0.00 < 0.05)
                        thrpt:  [+20.912% +21.374% +21.879%]
                        Performance has improved.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

Those benchmarks tell me two important things:

  1. Ditching portable SIMD intrinsics for autovectorization was the right call, and massively helped ARM in the bpp=4 case.
  2. The unstable SIMD loop for bpp=3 is (and always was) a massive regression on ARM. We'll need to disable it there.

Thank you!

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

I believe this is now ready to go, and performance is as good as I'm going to get it in this PR.

On top of the x86-specific optimization for Paeth, this also fixes the issue of the unstable feature tanking performance on ARM.

Now that we've added architecture-specific code, I've also made tests run on Apple M1 (ARM) on CI in addition to x86_64 Linux so that non-x86 bugs would get caught on CI.

I've removed filter_paeth_decode_i16 because it's dead code and I didn't like having dead code around, but that made me put #[cfg(not(target_arch = "x86_64"))] on a bunch of functions in the simd module. I can bring it back if you prefer 10 lines of dead code to this much conditional compilation. I just gated the whole SIMD module on x86 because it barely helps ARM anyway.

@anforowicz
Contributor

Ditching portable SIMD intrinsics for autovectorization was the right call
...
I just gated the whole SIMD module on x86 because it barely helps ARM anyway.

Yeah, I am no longer sure if std::simd is a net benefit. In recent measurements, turning off the unstable feature resulted in a ~1.6% improvement (on my machine, with my corpus of 1.6k images from the top 500 websites).

@Shnatsel
Contributor Author

Shnatsel commented Dec 3, 2024

After this PR, enabling the unstable feature significantly helps the bpp=6 and bpp=8 cases, but doesn't do much for anything else, at least on Zen:

unfilter/filter=Paeth/bpp=1
                        time:   [11.444 µs 11.448 µs 11.453 µs]
                        thrpt:  [341.08 MiB/s 341.21 MiB/s 341.32 MiB/s]
                 change:
                        time:   [-0.1272% -0.0140% +0.0751%] (p = 0.80 > 0.05)
                        thrpt:  [-0.0751% +0.0140% +0.1273%]
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) high mild
  11 (11.00%) high severe

unfilter/filter=Paeth/bpp=2
                        time:   [15.047 µs 15.069 µs 15.095 µs]
                        thrpt:  [517.56 MiB/s 518.45 MiB/s 519.20 MiB/s]
                 change:
                        time:   [+0.6312% +0.7745% +0.9771%] (p = 0.00 < 0.05)
                        thrpt:  [-0.9676% -0.7686% -0.6272%]
                        Change within noise threshold.
Found 23 outliers among 100 measurements (23.00%)
  13 (13.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

unfilter/filter=Paeth/bpp=3
                        time:   [14.746 µs 14.757 µs 14.772 µs]
                        thrpt:  [793.31 MiB/s 794.12 MiB/s 794.72 MiB/s]
                 change:
                        time:   [-3.1331% -2.8258% -2.3192%] (p = 0.00 < 0.05)
                        thrpt:  [+2.3743% +2.9080% +3.2344%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

unfilter/filter=Paeth/bpp=4
                        time:   [13.696 µs 13.707 µs 13.725 µs]
                        thrpt:  [1.1118 GiB/s 1.1132 GiB/s 1.1141 GiB/s]
                 change:
                        time:   [-0.4634% -0.3745% -0.2910%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2918% +0.3759% +0.4656%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

unfilter/filter=Paeth/bpp=6
                        time:   [14.812 µs 14.830 µs 14.850 µs]
                        thrpt:  [1.5413 GiB/s 1.5433 GiB/s 1.5452 GiB/s]
                 change:
                        time:   [-44.147% -44.095% -44.044%] (p = 0.00 < 0.05)
                        thrpt:  [+78.712% +78.873% +79.040%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

unfilter/filter=Paeth/bpp=8
                        time:   [13.751 µs 13.752 µs 13.753 µs]
                        thrpt:  [2.2189 GiB/s 2.2192 GiB/s 2.2194 GiB/s]
                 change:
                        time:   [-32.143% -32.076% -31.992%] (p = 0.00 < 0.05)
                        thrpt:  [+47.041% +47.223% +47.369%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

These cases are rather uncommon and also already fast, so we might want to consider removing them to reduce the maintenance burden.

@Shnatsel
Contributor Author

Shnatsel commented Dec 3, 2024

@anforowicz thanks for the info!

I see that the Chromium merge window is open from today until the 18th of December. When do we need to publish new releases of png and fdeflate so you could get them into the Chromium tree before the merge window closes?

@anforowicz
Contributor

I see that the Chromium merge window is open from today until the 18th of December. When do we need to publish new releases of png and fdeflate so you could get them into the Chromium tree before the merge window closes?

The Dec 3rd to Dec 18th window is for starting and/or reconfiguring field trials, but we should have more time for making code changes (such as absorbing new png versions). If Canary/Dev trials work well in December, then we can try shipping SkPngRustCodec / SkPngRustEncoderImpl in Chrome/M133, which (according to https://chromiumdash.appspot.com/schedule) will branch on Jan 6, 2025. So I think that if a new png version is available before the end of December, there should be no problem getting it into Chromium in time for M133.

@HeroicKatora
Member

LGTM, if you're satisfied with the test coverage across architectures then feel free to merge. It's an interesting data point against std::simd optimality indeed.

@Shnatsel Shnatsel merged commit 7e4a5a4 into image-rs:master Dec 4, 2024
24 checks passed
@Shnatsel Shnatsel deleted the stb-paeth branch December 4, 2024 10:30