
Port of stb_image optimized paeth unfiltering #539

Merged
merged 16 commits on Dec 4, 2024

Conversation

Shnatsel
Contributor

@Shnatsel Shnatsel commented Dec 1, 2024

A 10% end-to-end performance gain on a Paeth-filtered image I concocted with convert -quality 49 in.png out.png. That's a comparatively big gain, and Paeth-filtered images are rather common.

Godbolt shows much shorter assembly that gets autovectorized even in the bpp=3 case: https://godbolt.org/z/fq3EjvT4b
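For context, stb_image's optimized Paeth predictor is a branchless reformulation of the predictor in the PNG specification. Below is a minimal sketch of the two side by side, assuming this port follows the same formulation as stb_image; the function names are illustrative, not the crate's actual identifiers:

```rust
/// Paeth predictor as written in the PNG specification.
fn paeth_reference(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic; // initial estimate
    let pa = (p - ia).abs();
    let pb = (p - ib).abs();
    let pc = (p - ic).abs();
    // Return the neighbor closest to the estimate, breaking ties a, b, c.
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Branchless reformulation in the style of stb_image: no data-dependent
/// branches, so the compiler can lower it to min/max/select instructions
/// and autovectorize the surrounding loop.
fn paeth_stb(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let thresh = ic * 3 - (ia + ib);
    let lo = ia.min(ib);
    let hi = ia.max(ib);
    let t0 = if hi <= thresh { lo } else { ic };
    let t1 = if thresh <= lo { hi } else { t0 };
    t1 as u8
}

fn main() {
    // Exhaustively check equivalence over all 256^3 inputs
    // (the "test against the reference" TODO item below).
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            for c in 0..=255u8 {
                assert_eq!(paeth_reference(a, b, c), paeth_stb(a, b, c));
            }
        }
    }
    println!("formulations are equivalent");
}
```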

TODO:

  • Have someone else confirm this performance boost exists and isn't isolated to my machine
  • Exhaustively test this implementation against the reference
  • Do something about the nightly variant of Paeth unfiltering

Also done:

  • Fix unstable feature dramatically regressing performance on ARM
  • Add CI job to make sure unstable feature compiles
  • Add CI job to run tests on ARM

@Shnatsel Shnatsel requested a review from fintelia December 1, 2024 08:04
@HeroicKatora
Member

I believe I can confirm the performance change on my machine.

decode/Transparency.png time:   [61.886 µs 61.898 µs 61.916 µs]
                        thrpt:  [5.4150 GiB/s 5.4166 GiB/s 5.4176 GiB/s]
                 change:
                        time:   [-5.6629% -5.5677% -5.4790%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7966% +5.8960% +6.0028%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
decode/kodim02.png      time:   [2.1276 ms 2.1296 ms 2.1312 ms]
                        thrpt:  [527.87 MiB/s 528.27 MiB/s 528.75 MiB/s]
                 change:
                        time:   [-5.2926% -5.2361% -5.1695%] (p = 0.00 < 0.05)
                        thrpt:  [+5.4513% +5.5254% +5.5883%]
                        Performance has improved.
decode/kodim07.png      time:   [2.5823 ms 2.5825 ms 2.5827 ms]
                        thrpt:  [435.59 MiB/s 435.63 MiB/s 435.66 MiB/s]
                 change:
                        time:   [-12.692% -12.406% -12.139%] (p = 0.00 < 0.05)
                        thrpt:  [+13.816% +14.163% +14.537%]
                        Performance has improved.
decode/kodim17.png      time:   [2.1296 ms 2.1311 ms 2.1319 ms]
                        thrpt:  [527.69 MiB/s 527.89 MiB/s 528.28 MiB/s]
                 change:
                        time:   [-0.6423% -0.5476% -0.4601%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4622% +0.5506% +0.6465%]
                        Change within noise threshold.
decode/kodim23.png      time:   [1.9203 ms 1.9214 ms 1.9226 ms]
                        thrpt:  [585.15 MiB/s 585.52 MiB/s 585.84 MiB/s]
                 change:
                        time:   [-9.0051% -8.8174% -8.6456%] (p = 0.00 < 0.05)
                        thrpt:  [+9.4638% +9.6700% +9.8963%]
                        Performance has improved.

Here kodim17 indeed isn't using any Paeth-filtered lines 🎉

@Shnatsel
Contributor Author

Shnatsel commented Dec 1, 2024

I've changed the nightly version to just use the autovectorized stable version. I'm not seeing a big boost from enabling the unstable feature for RGBA after that; the 1% difference might just be noise. We might be able to simply remove all that SIMD code.
RGB is still faster on nightly due to the clever while loop.

@Shnatsel Shnatsel marked this pull request as ready for review December 1, 2024 20:49
@Shnatsel
Contributor Author

Shnatsel commented Dec 1, 2024

I have the removal of nightly SIMD for RGBA ready to go in a branch, but I'm not including it in this PR to keep this one as simple and uncontroversial as possible, and hopefully get it merged quickly.

@kornelski
Contributor

In a microbenchmark on Apple Silicon, the speed of this filter varies depending on pixel size, and it's half the speed for some sizes. On x86 it's mostly faster, except 11% slower for 6-byte pixels.

#540

It's odd that there's such a big variance.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

On my Zen machine this gets us +5 MP/s on corpus-bench with a corpus of random PNG images scraped off the web.

According to Kornel's benchmarks this only regresses the bpp=6 case on x86, and even that regression is not dramatic (11% on the filter alone). But on Apple ARM this significantly regresses the crucial bpp=3 case, and I'm not willing to accept that.

How about we land this only for x86, and stick to the previous impl on other platforms? I'm not a fan of the duplication, but the code is trivial enough that it's only adding 10 lines or so.
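A hypothetical sketch of what that per-architecture split could look like using compile-time dispatch; the function bodies here are illustrative stand-ins, not the crate's actual code:

```rust
// Sketch: select the Paeth implementation at compile time, keeping the
// new branchless formulation x86_64-only and the previous one elsewhere.

#[cfg(target_arch = "x86_64")]
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // Branchless formulation that autovectorizes well on x86.
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let thresh = ic * 3 - (ia + ib);
    let lo = ia.min(ib);
    let hi = ia.max(ib);
    let t0 = if hi <= thresh { lo } else { ic };
    let t1 = if thresh <= lo { hi } else { t0 };
    t1 as u8
}

#[cfg(not(target_arch = "x86_64"))]
fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // Previous behavior: the reference predictor from the PNG spec.
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic;
    let (pa, pb, pc) = ((p - ia).abs(), (p - ib).abs(), (p - ic).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

fn main() {
    // The two formulations compute the same result, so behavior
    // stays arch-independent even though the code paths differ.
    assert_eq!(filter_paeth(100, 200, 50), 200);
    assert_eq!(filter_paeth(50, 100, 200), 50);
}
```

Since only one `cfg` branch is compiled per target, the duplication costs nothing at runtime; the maintenance cost is the handful of extra lines.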

@fintelia
Contributor

fintelia commented Dec 2, 2024

Other decoders have different (manual SIMD) implementations for different architectures. It would be nice if we could find a single autovectorized/portable-simd implementation that got ideal performance on all architectures, but I'm not shocked that this isn't the case. Using different implementations on x86 and ARM seems fine to me.

Also, it probably shouldn't be in this PR, but filter.rs has gotten large enough that we should probably think about spinning it off into multiple files.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

@kornelski could you benchmark this PR on ARM with the unstable feature enabled, and compare that against master?

That would be cargo +nightly bench --features=unstable -Fbenchmarks Paeth --bench unfilter -- --save-baseline=…

I am particularly curious about bpp=4 and 8. These benchmarks will tell me whether I need to bring back the portable SIMD version, or if autovectorization does a good job for those cases already.

I can't do it myself because I don't have any Apple hardware, renting it in the cloud for a short time is impossible due to Apple's licensing shenanigans, and ARM cloud instances have different performance characteristics.

Edit: wait, never mind, I hadn't pushed the changes I need benchmarked yet. I'm losing my mind.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

The code isn't pretty but it's ready for benchmarking on ARM with --features=unstable

@kornelski
Contributor

Apple M3 Max:

unstable pr vs stable main

unfilter/filter=Paeth/bpp=1
                        time:   [7.4984 µs 7.5002 µs 7.5025 µs]
                        thrpt:  [520.66 MiB/s 520.82 MiB/s 520.95 MiB/s]
                 change:
                        time:   [-4.2620% -4.0539% -3.8443%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9980% +4.2252% +4.4517%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=2
                        time:   [8.6324 µs 8.6350 µs 8.6375 µs]
                        thrpt:  [904.48 MiB/s 904.75 MiB/s 905.02 MiB/s]
                 change:
                        time:   [-0.2792% -0.2214% -0.1713%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1716% +0.2219% +0.2800%]
                        Change within noise threshold.

unfilter/filter=Paeth/bpp=3
                        time:   [15.242 µs 15.266 µs 15.297 µs]
                        thrpt:  [766.08 MiB/s 767.65 MiB/s 768.84 MiB/s]
                 change:
                        time:   [+52.791% +53.175% +53.641%] (p = 0.00 < 0.05)
                        thrpt:  [-34.913% -34.715% -34.551%]
                        Performance has regressed.

unfilter/filter=Paeth/bpp=4
                        time:   [17.279 µs 17.283 µs 17.289 µs]
                        thrpt:  [903.76 MiB/s 904.05 MiB/s 904.27 MiB/s]
                 change:
                        time:   [-0.0658% +0.1367% +0.3469%] (p = 0.20 > 0.05)
                        thrpt:  [-0.3457% -0.1365% +0.0659%]
                        No change in performance detected.

unfilter/filter=Paeth/bpp=6
                        time:   [17.291 µs 17.300 µs 17.309 µs]
                        thrpt:  [1.3223 GiB/s 1.3230 GiB/s 1.3237 GiB/s]
                 change:
                        time:   [-6.1936% -6.0063% -5.8107%] (p = 0.00 < 0.05)
                        thrpt:  [+6.1692% +6.3901% +6.6026%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=8
                        time:   [15.184 µs 15.192 µs 15.207 µs]
                        thrpt:  [2.0069 GiB/s 2.0088 GiB/s 2.0099 GiB/s]
                 change:
                        time:   [-0.4410% -0.3503% -0.2358%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2364% +0.3516% +0.4429%]
                        Change within noise threshold.

unstable pr vs unstable main

unfilter/filter=Paeth/bpp=1
                        time:   [7.5009 µs 7.5058 µs 7.5131 µs]
                        thrpt:  [519.92 MiB/s 520.43 MiB/s 520.77 MiB/s]
                 change:
                        time:   [-0.2384% -0.1265% -0.0037%] (p = 0.02 < 0.05)
                        thrpt:  [+0.0037% +0.1267% +0.2390%]
                        Change within noise threshold.

unfilter/filter=Paeth/bpp=2
                        time:   [8.6217 µs 8.6257 µs 8.6292 µs]
                        thrpt:  [905.35 MiB/s 905.72 MiB/s 906.14 MiB/s]
                 change:
                        time:   [+1.9294% +2.2062% +2.4550%] (p = 0.00 < 0.05)
                        thrpt:  [-2.3962% -2.1586% -1.8929%]
                        Performance has regressed.

unfilter/filter=Paeth/bpp=3
                        time:   [15.239 µs 15.252 µs 15.274 µs]
                        thrpt:  [767.24 MiB/s 768.34 MiB/s 768.98 MiB/s]
                 change:
                        time:   [-1.0307% -0.6365% -0.2955%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2964% +0.6406% +1.0414%]
                        Change within noise threshold.

unfilter/filter=Paeth/bpp=4
                        time:   [17.280 µs 17.295 µs 17.324 µs]
                        thrpt:  [901.93 MiB/s 903.43 MiB/s 904.24 MiB/s]
                 change:
                        time:   [-34.943% -34.832% -34.735%] (p = 0.00 < 0.05)
                        thrpt:  [+53.220% +53.448% +53.710%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=6
                        time:   [17.286 µs 17.292 µs 17.303 µs]
                        thrpt:  [1.3228 GiB/s 1.3236 GiB/s 1.3241 GiB/s]
                 change:
                        time:   [-1.6563% -1.3522% -1.0617%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0731% +1.3707% +1.6842%]
                        Performance has improved.

unfilter/filter=Paeth/bpp=8
                        time:   [15.190 µs 15.205 µs 15.224 µs]
                        thrpt:  [2.0046 GiB/s 2.0071 GiB/s 2.0091 GiB/s]
                 change:
                        time:   [-17.951% -17.610% -17.295%] (p = 0.00 < 0.05)
                        thrpt:  [+20.912% +21.374% +21.879%]
                        Performance has improved.

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

Those benchmarks tell me two important things:

  1. Ditching portable SIMD intrinsics for autovectorization was the right call, and massively helped ARM in the bpp=4 case.
  2. The unstable SIMD loop for bpp=3 is (and always was) a massive regression on ARM. We'll need to disable it there.

Thank you!

@Shnatsel
Contributor Author

Shnatsel commented Dec 2, 2024

I believe this is now ready to go, and performance is as good as I'm going to get it in this PR.

On top of the x86-specific optimization for Paeth, this also fixes the issue of the unstable feature tanking performance on ARM.

Now that we've added architecture-specific code, I've also made tests run on Apple M1 (ARM) on CI in addition to x86_64 Linux so that non-x86 bugs would get caught on CI.

I've removed filter_paeth_decode_i16 because it's dead code and I didn't like having dead code around, but that made me put #[cfg(not(target_arch = "x86_64"))] on a bunch of functions in the simd module. I can bring it back if you prefer 10 lines of dead code to this much conditional compilation. I just gated the whole SIMD module on x86 because it barely helps ARM anyway.

@anforowicz
Contributor

Ditching portable SIMD intrinsics for autovectorization was the right call
...
I just gated the whole SIMD module on x86 because it barely helps ARM anyway.

Yeah, I am no longer sure if std::simd is a net benefit. In recent measurements, turning off the unstable feature resulted in a ~1.6% improvement (on my machine, with my corpus of 1.6k images from the top 500 websites).

@Shnatsel
Contributor Author

Shnatsel commented Dec 3, 2024

After this PR, enabling the unstable feature significantly helps the bpp=6 and bpp=8 cases, but doesn't do much for anything else, at least on Zen:

unfilter/filter=Paeth/bpp=1
                        time:   [11.444 µs 11.448 µs 11.453 µs]
                        thrpt:  [341.08 MiB/s 341.21 MiB/s 341.32 MiB/s]
                 change:
                        time:   [-0.1272% -0.0140% +0.0751%] (p = 0.80 > 0.05)
                        thrpt:  [-0.0751% +0.0140% +0.1273%]
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) high mild
  11 (11.00%) high severe

unfilter/filter=Paeth/bpp=2
                        time:   [15.047 µs 15.069 µs 15.095 µs]
                        thrpt:  [517.56 MiB/s 518.45 MiB/s 519.20 MiB/s]
                 change:
                        time:   [+0.6312% +0.7745% +0.9771%] (p = 0.00 < 0.05)
                        thrpt:  [-0.9676% -0.7686% -0.6272%]
                        Change within noise threshold.
Found 23 outliers among 100 measurements (23.00%)
  13 (13.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

unfilter/filter=Paeth/bpp=3
                        time:   [14.746 µs 14.757 µs 14.772 µs]
                        thrpt:  [793.31 MiB/s 794.12 MiB/s 794.72 MiB/s]
                 change:
                        time:   [-3.1331% -2.8258% -2.3192%] (p = 0.00 < 0.05)
                        thrpt:  [+2.3743% +2.9080% +3.2344%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

unfilter/filter=Paeth/bpp=4
                        time:   [13.696 µs 13.707 µs 13.725 µs]
                        thrpt:  [1.1118 GiB/s 1.1132 GiB/s 1.1141 GiB/s]
                 change:
                        time:   [-0.4634% -0.3745% -0.2910%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2918% +0.3759% +0.4656%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

unfilter/filter=Paeth/bpp=6
                        time:   [14.812 µs 14.830 µs 14.850 µs]
                        thrpt:  [1.5413 GiB/s 1.5433 GiB/s 1.5452 GiB/s]
                 change:
                        time:   [-44.147% -44.095% -44.044%] (p = 0.00 < 0.05)
                        thrpt:  [+78.712% +78.873% +79.040%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

unfilter/filter=Paeth/bpp=8
                        time:   [13.751 µs 13.752 µs 13.753 µs]
                        thrpt:  [2.2189 GiB/s 2.2192 GiB/s 2.2194 GiB/s]
                 change:
                        time:   [-32.143% -32.076% -31.992%] (p = 0.00 < 0.05)
                        thrpt:  [+47.041% +47.223% +47.369%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

These cases are rather uncommon and also already fast, so we might want to consider removing them to reduce the maintenance burden.

@Shnatsel
Contributor Author

Shnatsel commented Dec 3, 2024

@anforowicz thanks for the info!

I see that the Chromium merge window is open from today until the 18th of December. When do we need to publish new releases of png and fdeflate so you could get them into the Chromium tree before the merge window closes?

@anforowicz
Contributor

I see that the Chromium merge window is open from today until the 18th of December. When do we need to publish new releases of png and fdeflate so you could get them into the Chromium tree before the merge window closes?

The Dec 3rd to Dec 18th window is for starting and/or reconfiguring field trials, but we should have more time for making code changes (such as absorbing new png versions). If Canary/Dev trials work well in December, then we can try shipping SkPngRustCodec / SkPngRustEncoderImpl in Chrome/M133, which (according to https://chromiumdash.appspot.com/schedule) will branch on Jan 6, 2025. So I think that if a new png version is available before the end of December, there should be no problem getting it into Chromium in time for M133.

@HeroicKatora
Member

LGTM, if you're satisfied with the test coverage across architectures then feel free to merge. It's an interesting data point against std::simd optimality indeed.

@Shnatsel Shnatsel merged commit 7e4a5a4 into image-rs:master Dec 4, 2024
24 checks passed
@Shnatsel Shnatsel deleted the stb-paeth branch December 4, 2024 10:30