Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462

anforowicz · 2024-02-01T00:45:52Z

No description provided.

This commit moves `expand_paletted_into_rgb8` and `expand_paletted_into_rgba8` (and their unit tests) into a separate `transform/palette.rs` module. This prepares room for encapsulating extra complexity in this module in follow-up commits, where we will start to precompute and memoize some data when creating a `TransformFn`. This commit just moves the code around - it should have no impact on correctness or performance.

The `PLTE` chunk's size should be a multiple of 3 (since it contains RGB entries - 3 bytes per entry). Additionally, taking 10000 samples in the `bench_create_fn` benchmarks is a bit excessive after memoization.

This commit changes the `TransformFn` type alias from `fn(...)` into `Box<dyn Fn(...)>`. This allows the `TransformFn` to have store some precomputer, memoized state that we plan to add in follow-up commits. In theory this commit may have negative performance impact, but in the grand scheme of things it disappears into the measurement noise. In particular, when there is no state, then `Box` shouldn't allocate.

Before this commit `expand_paletted_into_rgba8` would: * Perform 2 lookups - `palette.get(i)` and `trns.get(i)` * Check via `unwrap_or` if `i` was within the bounds of `palette`/`trns` This commit introduces `create_rgba_palette` which combines `palette` and `trns` into a fixed-size `[[u8;4]; 256]` look-up table (called `rgba_palette` in the code). After this commit `expand_paletted_into_rgba8` only needs to perform a single look-up and doesn't need to check the bounds. This helps to improve the expansion time by 60+%: - expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461: [-60.208% -60.057% -59.899%] (p = 0.00 < 0.05) - expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461: [-77.520% -77.407% -77.301%] (p = 0.00 < 0.05) `expand_paletted_into_rgb8` performs only a single lookup before and after this commit, but avoiding bounds checks still helps to improve the expansion time by ~12%: - expand_paletted(exec)/trns=no/src_bits=4/src_size=5461: [-12.357% -12.005% -11.664%] (p = 0.00 < 0.05) - expand_paletted(exec)/trns=no/src_bits=8/src_size=5461: [-13.135% -12.584% -12.092%] (p = 0.00 < 0.05) Understandably, this commit regresses the time of `create_transform_fn`. Future commits will reduce this regression 2-4 times: - expand_paletted(ctor)/plte=256/trns=256: [+3757.2% +3763.8% +3770.5%] (p = 0.00 < 0.05) - expand_paletted(ctor)/plte=224/trns=32: [+3807.3% +3816.2% +3824.6%] (p = 0.00 < 0.05) - expand_paletted(ctor)/plte=16/trns=1: [+1672.0% +1675.0% +1678.1%] (p = 0.00 < 0.05)

Before this commit `expand_into_rgb8` would copy 3 bytes at a time into the output. After this commit it copies 4 bytes at a time (possibly cloberring pixels that will be populated during the next iteration - this is ok). This improved the performance as follows: expand_paletted(exec)/trns=no/src_bits=8/src_size=5461 time: [-23.852% -23.593% -23.319%] (p = 0.00 < 0.05)

This improves the performance as follows: - expand_paletted(ctor)/plte=256/trns=256 [-40.581% -40.396% -40.211%] (p = 0.00 < 0.05) - expand_paletted(ctor)/plte=224/trns=32 [-24.070% -23.840% -23.592%] (p = 0.00 < 0.05) Small palettes are mostly unaffected: - expand_paletted(ctor)/plte=16/trns=1 [-0.2525% +0.0338% +0.3239%] (p = 0.81 > 0.05)

anforowicz · 2024-02-01T00:49:42Z

@fintelia, can you PTAL?

/cc @Shnatsel (thanks again for pointing out my earlier benchmarking mistake; I read your article on eliminating bounds checks, but wasn’t able to successfully apply it - more details below)

As before, I think it may be best to avoid squeezing the commits when merging - I think it is desirable to preserve individual commits
The end-to-end speed-up looks like this: decode/paletted-zune.png: time: [-12.282% -11.986% -11.625%] (p = 0.00 < 0.05). (This is the only image in tests/benches for which identify -verbose reports png:IHDR.color_type: 3 (Indexed).)
When working through my old list of planned changes I made some adjustments:
- I wasn’t able to achieve speed-ups via std::simd (see below for more details)
- I decided that falling back to non-memoized processing isn’t worth it (see below for more details)
- I decided to compare memoized-vs-non-memoized results only in palette-specific tests (rather than in #[cfg(test)] parts of the product code itself)
- I’ve also realized that some palette benchmarks were using a PLTE chunk where the length wasn’t a multiple of 3… ooops… this is fixed in the very first commit of this PR
- And I’ve also realized that in my previous PR I missed an opportunity to make most of the transform.rs functions private to that module

SIMD

Memoization

My attempt to SIMD-ify create_rgba_palette can be found in c71f944. It seems to actually make execution slower by ~5%.

Short link to the non-SIMD-ified code can be found here: https://godbolt.org/z/KeKrfWcs5. There is no auto-vectorization AFAICT.

Execution

My attempt to SIMD-ify expand_8bit_into_rgb8 can be found in 63d4595. It seems to actually make execution slower by ~60%. Using an unsafe gather_select_unchecked doesn’t make a difference AFAICT.

The non-SIMD-ified code can be found here: https://godbolt.org/z/T9zv891dc. There is no auto-vectorization AFAICT - this makes me a bit surprised that this is faster than the SIMD-ified version.

The SIMD-ified version can be found here: https://godbolt.org/z/P19TPP33G. AFAICT it does in fact compile into SIMD instructions.

There are some inefficiencies that (I think) may be addressed by (somehow) hoisting bounds checks outside of the loop (both in the non-SIMD-ified and the SIMD-ified version):

each loop not only maintains pointers into input and output, but also keeps updating the slice length
each loop iteration does some bounds checks (e.g. jumping to core::slice::index::slice_end_index_len_fail in the SIMD-ified version or to core::panicking::panic_bounds_check in the non-SIMD-ified version).

I don’t know how to address the inefficiencies / how to do the hoisting. (I’ve tried some assert!s about the relative output.len() and input.len() but it didn’t help.)

Break-even point

Memoization is a trade-off:

On one hand, memoization requires spending X ns before starting to call
expand_paletted_... functions.
On the other hand, memoization improves the throughput of the
expand_paletted_... functions - they take Y ns less to process each byte

If we process B bytes, then:

With memoization expansion costs X + BBaselineThroughput - BY
Without memoization expansion costs B*BaselineThroughput

The breakeven point can be calculated with B=X/Y.

Note that X and Y are typically measure via microbenchmarks which are somewhat unrealistic
(i.e. the benchmarks don't compete with other code for the L1 cache). Therefore the breakeven
point calculated this way may be a bit too optimistic - for example it won't take into account
the negative impact of additional ~1kB of cache/memory pressure from the memoized
rgba_palette.

Measurements

Without memoization we get:

Summary:
- Construction average: 21.029 ns
- Expansion average: 0.4643 ns per output byte
- RGB8 expansion: 0.4181 ns per output byte
- RGBA8 expansion: 0.5245 ns per output byte
Details:
- expand_paletted(exec)/trns=no/src_bits=4/src_size=5461: 1.6935 GiB/s
- expand_paletted(exec)/trns=no/src_bits=8/src_size=5461: 2.7610 GiB/s
- expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461: 1.4933 GiB/s
- expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461: 2.0579 GiB/s
- expand_paletted(ctor)/plte=256/trns=256: 21.012 ns
- expand_paletted(ctor)/plte=224/trns=32: 20.990 ns
- expand_paletted(ctor)/plte=16/trns=1: 21.088 ns

With memoization we get:

Summary:
- Construction average: 177.065 ns (+156.036 ns)
- Expansion average: 0.1981 ns (-0.2661 ns) per output byte
- RGB8 expansion: 0.1845 ns (-0.3259 ns) per output byte
- RGBA8 expansion: 0.1534 ns (-0.3711 ns) per output byte
Details:
- expand_paletted(exec)/trns=no/src_bits=4/src_size=5461: 2.2342 GiB/s: +32.042%
- expand_paletted(exec)/trns=no/src_bits=8/src_size=5461: 4.4242 GiB/s: +60.373%
- expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461: 3.8193 GiB/s: +152.83%
- expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461: 8.3201 GiB/s: +300.30%
- expand_paletted(ctor)/plte=256/trns=256: 231.59 ns: +996.76%
- expand_paletted(ctor)/plte=224/trns=32: 222.06 ns: +945.54%
- expand_paletted(ctor)/plte=16/trns=1: 77.547 ns: +266.00%

Results

The measurements above give the following thresholds:

Overall: 586 output bytes
RGB8: 478 output bytes = 159 pixels = ~13x13 pixels
RGBA8: 420 output bytes = 105 pixels = ~10x10 pixels

Such small images seem relatively rare. IMHO it is unjustified (in terms of extra code complexity, binary size, etc.) to implement falling back to non-memoized implementation for these kinds of images.

Shnatsel · 2024-02-01T01:21:39Z

I am surprised the crate isn't doing this already! Turning the chunks into a lookup table seems like a no-brainer.

I'll take a look at the bounds checks.

Shnatsel · 2024-02-01T01:28:41Z

What machine did you run those benchmarks on? I am getting dramatically higher numbers on a Zen 4 CPU.

Benchmark results on Zen 4

``` expand_paletted(exec)/trns=no/src_bits=4/src_size=5461 time: [7.0350 µs 7.0351 µs 7.0352 µs] thrpt: [4.3376 GiB/s 4.3376 GiB/s 4.3377 GiB/s] Found 73 outliers among 500 measurements (14.60%) 7 (1.40%) low severe 35 (7.00%) low mild 13 (2.60%) high mild 18 (3.60%) high severe

expand_paletted(exec)/trns=no/src_bits=8/src_size=5461
time: [1.3668 µs 1.3669 µs 1.3670 µs]
thrpt: [11.161 GiB/s 11.162 GiB/s 11.163 GiB/s]
Found 53 outliers among 500 measurements (10.60%)
14 (2.80%) low severe
12 (2.40%) low mild
18 (3.60%) high mild
9 (1.80%) high severe

expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461
time: [5.7850 µs 5.7859 µs 5.7867 µs]
thrpt: [7.0312 GiB/s 7.0322 GiB/s 7.0333 GiB/s]
Found 92 outliers among 500 measurements (18.40%)
3 (0.60%) low severe
36 (7.20%) low mild
32 (6.40%) high mild
21 (4.20%) high severe

expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461
time: [1.3811 µs 1.3811 µs 1.3812 µs]
thrpt: [14.729 GiB/s 14.730 GiB/s 14.731 GiB/s]
Found 12 outliers among 500 measurements (2.40%)
1 (0.20%) low severe
1 (0.20%) low mild
6 (1.20%) high mild
4 (0.80%) high severe

expand_paletted(ctor)/plte=256/trns=256
time: [139.95 ns 140.01 ns 140.08 ns]
Found 111 outliers among 1000 measurements (11.10%)
80 (8.00%) low mild
21 (2.10%) high mild
10 (1.00%) high severe

expand_paletted(ctor)/plte=224/trns=32
time: [138.72 ns 138.87 ns 139.02 ns]
Found 9 outliers among 1000 measurements (0.90%)
9 (0.90%) high mild

expand_paletted(ctor)/plte=16/trns=1
time: [49.913 ns 49.963 ns 50.013 ns]
Found 22 outliers among 1000 measurements (2.20%)
1 (0.10%) low severe
11 (1.10%) low mild
10 (1.00%) high mild

</details>

Shnatsel · 2024-02-01T01:56:41Z

Just so you know, with -C target-cpu=znver4 your expand_8bit_into_rgb8 from this PR gets autovectorized into AVX-512 instructions: https://godbolt.org/z/3bTG8oYz3
And it absolutely tanks performance on the micro-benchmarks, with some of them taking more than 2x more time. It also regresses the end-to-end decoding benchmark for the zune-paletted file by 23%.

I suspect the expansion function ends up getting called a lot on very small chunks, so the vector code is wasted and only gets in the way. I'll try to confirm this hunch.

Shnatsel · 2024-02-01T02:34:25Z

Hmm, that function seems to be called for every row. I am surprised that autovectorization actually hurts it even on rather large images.

fintelia · 2024-02-01T02:56:49Z

Zen4 has a somewhat weird AVX-512 implementation, so I wonder if LLVM just isn't realizing that the autovectorized version will be slower.

As another datapoint, I tested on a Zen3 CPU and don't see any huge performance differences compiling with -C target-cpu=znver3

Shnatsel · 2024-02-01T03:07:22Z

I couldn't convince the compiler to remove the bounds checks from the loop. But it seems to speculate right past them without too much trouble.

A naive iterator version that removes bounds checks but copies 3 bytes instead of 4 is 25% slower than this code.

okaneco · 2024-02-01T05:01:48Z

src/decoder/transform/palette.rs

+    {
+        let mut palette_iter = palette;
+        let mut rgba_iter = &mut rgba_palette[..];
+        while palette_iter.len() >= 4 {


Suggested change

while palette_iter.len() >= 4 {

while palette_iter.len() >= 4 && !rgba_iter.is_empty() {

Modifying https://godbolt.org/z/KeKrfWcs5, I haven't benchmarked or tested it however.
Removes the panic but doesn't unroll the loop like the original https://rust.godbolt.org/z/53MaeEano

This results in a 25% regression.

The compiler is aware that the branches leading to the panic are cold, and so decides to unroll the loop. The CPU should speculate right past the branch leading to a panic. So it is best to keep it as a clearly-cold panic rather than a regular branch.

Shnatsel · 2024-02-01T05:35:38Z

src/decoder/transform/palette.rs

+        input = &input[1..];
+        output = &mut output[3..];


Replacing these with get_unchecked() showed exactly zero performance difference on the benchmarks. It seems eliminating bounds checks here is not worthwhile.

anforowicz · 2024-02-01T15:00:01Z

What machine did you run those benchmarks on?

AMD EPYC 7B12.

`lscpu` output

$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Vendor ID: AuthenticAMD Model name: AMD EPYC 7B12 CPU family: 23 Model: 49 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 2 Stepping: 0 BogoMIPS: 4499.99 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc c puid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3d nowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nr ip_save umip rdpid Virtualization features: Hypervisor vendor: KVM Virtualization type: full Caches (sum of all): L1d: 2 MiB (64 instances) L1i: 2 MiB (64 instances) L2: 32 MiB (64 instances) L3: 256 MiB (16 instances) NUMA: NUMA node(s): 2 NUMA node0 CPU(s): 0-31,64-95 NUMA node1 CPU(s): 32-63,96-127 Vulnerabilities: Gather data sampling: Not affected Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Not affected Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Spec rstack overflow: Vulnerable: Safe RET, no microcode Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Srbds: Not affected Tsx async abort: Not affected

FWIW, I also didn't use -C target-cpu=native or something like that, because AFAIK Chromium builds and ships a single binary for a given target platform/architecture. AFAIU this still in general allows for some (lowest-common denominator) auto-vectorization (e.g. see #414 (comment)).

Shnatsel · 2024-02-01T15:10:19Z

Ah, that runs at half the frequency of Zen4 desktop parts, so those numbers make sense. I did not use -C target-cpu=native either.

I suppose we could use multiversioning if wider SIMD helped, but at least in palette expansion it doesn't.

Before this CL, `png::Transformations::EXPAND` was used unconditionally, which meant that the `png` crate was responsible for 1) expanding to at least 8 bits, 2) expanding `ColorType::Rgb` to `Rgba` (and `Grayscale` to `GrayscaleAlpha`) based on `tRNS` chunk, and 3) expanding `ColorType::Indexed` based on `PLTE` and `tRNS` chunks. After this CL, expanding `ColorType::Indexed` images is done on Skia side instead. This has some performance benefits: * Performance measurements show that the total runtime of decoding PNG images gathered from top 500 websites goes down - the runtime after this CL is ~94% of the runtime before this CL. For comparison with other potential interventions, please see the document here: https://docs.google.com/document/d/16P0I_4AglenbkU1IBM1Qfe1zAQrMXa5NoEazZqOTHZE/edit?usp=sharing * Arm-chair analysis: - There are less transformation "hops" after this CL: - Before this CL: indexed data ==png-crate==> rgb or rgba ==SkSwizzler-and/or-skcms_Transform==> final data - After this CL: indexed data ==SkSwizzler==> final data - The indexed=>rgb/rgba transformation is still happening, but it is happening within SkSwizzler, so there is (maybe?) less loop overhead and less memory pressure (although one image row is not _that_ much) - The skcms_Transform may still be needed, but it can be applied to the palette (at most 256 rgba pixels) rather than to all the image pixels. (This may also apply for some stages handles by SkSwizzler like alpha-premultiplication.) - I note that the indexed=>rgb/rgba transformation depends on auto-vectorization both before and after this CL. So far the `png` crate didn't even see improvements from using `std::simd` (see the notes in image-rs/image-png#462 (comment)). Similarily, `SkSwizzler` doesn't provide SIMD-based routines for this transformation (see `swizzle_small_index_to_n32` and `swizzle_index_to_n32`). There may be additional performance improvement opportunities here, because `libpng` expansion does use SIMD intrinsics - see the notes in https://crbug.com/40512957. Fixed: chromium:356882657 Change-Id: I4bf63bd126140401ae1f38f82e613ffe7d9d13e8 Reviewed-on: https://skia-review.googlesource.com/c/skia/+/911438 Reviewed-by: Daniel Dilan <[email protected]> Commit-Queue: Łukasz Anforowicz <[email protected]>

anforowicz added 6 commits January 29, 2024 19:42

Fix constants used in palette benchmarks.

89153e6

The `PLTE` chunk's size should be a multiple of 3 (since it contains RGB entries - 3 bytes per entry). Additionally, taking 10000 samples in the `bench_create_fn` benchmarks is a bit excessive after memoization.

fintelia approved these changes Feb 1, 2024

View reviewed changes

okaneco reviewed Feb 1, 2024

View reviewed changes

Shnatsel reviewed Feb 1, 2024

View reviewed changes

fintelia merged commit b13388f into image-rs:master Feb 2, 2024
19 checks passed

anforowicz deleted the plte--memoization branch February 2, 2024 16:38

Shnatsel mentioned this pull request Feb 18, 2024

Very slow decoding of paletted PNG images #393

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462

Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462

anforowicz commented Feb 1, 2024

anforowicz commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

fintelia commented Feb 1, 2024 •

edited

Loading

Shnatsel commented Feb 1, 2024

okaneco Feb 1, 2024

Shnatsel Feb 1, 2024

Shnatsel Feb 1, 2024

anforowicz commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

	while palette_iter.len() >= 4 {
	while palette_iter.len() >= 4 && !rgba_iter.is_empty() {

Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462

Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462

Conversation

anforowicz commented Feb 1, 2024

anforowicz commented Feb 1, 2024

SIMD

Memoization

Execution

Break-even point

Measurements

Results

Shnatsel commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

fintelia commented Feb 1, 2024 • edited Loading

Shnatsel commented Feb 1, 2024

okaneco Feb 1, 2024

Choose a reason for hiding this comment

Shnatsel Feb 1, 2024

Choose a reason for hiding this comment

Shnatsel Feb 1, 2024

Choose a reason for hiding this comment

anforowicz commented Feb 1, 2024

Shnatsel commented Feb 1, 2024

fintelia commented Feb 1, 2024 •

edited

Loading