Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid 32kB decompression lag + compact less often. #447

Merged
merged 1 commit into from
Jan 6, 2024

Conversation

anforowicz
Copy link
Contributor

@anforowicz anforowicz commented Jan 5, 2024

Avoiding 32kB decompression lag

Before this commit, decompressed data would be accumulated in ZlibStream::out_buffer and returned via image_data with 32kB lag corresponding to CHUNCK_BUFFER_SIZE:

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE);
    image_data.extend(self.out_buffer.drain(..safe));
    ...
```

32kB is a typical size of L1 cache, so the lag would mean that the data passed to image_data.extend(...) would be already cold and evicted from the L1 cache.

This commit avoids the lag by always returning into image_data all the data from out_buffer (i.e. data up to out_pos):

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    let transferred = &self.out_buffer[self.read_pos..self.out_pos];
    image_data.extend_from_slice(transferred);
    self.read_pos = self.out_pos;
    ...
```

Compacting less often

The changes above mean that Vec::drain no longer compacts out_buffer. Therefore this commit also refactors how this compaction works.

Before this commit, not-yet-returned data would be shifted to the beginning of out_buffer every time transfer_finished_data is called. This could potentially mean that for 1 returned byte, N bytes have to be copied during compaction.

After this commit, compaction is only done when the compaction cost if offset by many read bytes - for 3 returned bytes 1 byte has to be copied during compaction.

Performance impact

The commit has a positive impact on performance, except for:

  • decode/Transparency.png - regression between 15% and 20% is reported in 3-out-of-3 measurements.
  • decode/kodim17.png - a regression of 2.1% has been reported in 1-out-of-3 measurements (an improvement of 0.6% - 1.13% has been reported in the other 2-out-of-3 measurements).
  • generated-noncompressed-64k-idat/128x128.png - a regression of 25% has been reported in 1-out-of-3 measurements (an improvement of 21% - 29% has been reported in the other 2-out-of-3 measurements).

The results below have been gathered by running the decoder benchmark. First a baseline was saved before this commit, and then a comparison was done after the commit. This (the baseline + the comparison) was repeated a total of 3 times. All results below are for the relative impact on the runtime. All results are with p = 0.00 < 0.05.

  • decode/kodim23.png:
    * [-2.9560% -2.7112% -2.4009%]
    * [-3.4876% -3.3406% -3.1928%]
    * [-3.0559% -2.9208% -2.7787%]

  • decode/kodim07.png:
    * [-1.2527% -1.0110% -0.6780%]
    * [-1.7851% -1.6558% -1.5164%]
    * [-1.6576% -1.5216% -1.3856%]

  • decode/kodim02.png:
    * [-0.5108% -0.2806% -0.0112%]
    * [-1.0885% -0.9493% -0.8118%]
    * [-0.5563% -0.4239% -0.2874%]

  • decode/kodim17.png:
    * [+1.8649% +2.1138% +2.4169%] (regression)
    * [-1.2891% -1.1322% -0.9736%]
    * [-0.7753% -0.6276% -0.4866%]

  • decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
    * [-1.7165% -1.4968% -1.2650%]
    * [-1.7051% -1.4473% -1.2229%]
    * [-1.2544% -1.0457% -0.8375%]

  • decode/Transparency.png:
    * [+19.329% +19.789% +20.199%] (regression)
    * [+15.337% +15.798% +16.294%] (regression)
    * [+18.694% +19.106% +19.518%] (regression)

  • generated-noncompressed-4k-idat/8x8.png:
    * [-2.3295% -1.9940% -1.5912%]
    * [-6.1285% -5.8872% -5.6091%]
    * [-2.8814% -2.6787% -2.4820%]

  • generated-noncompressed-4k-idat/128x128.png:
    * [-59.793% -59.599% -59.417%]
    * [-63.930% -63.846% -63.756%]
    * [-62.377% -62.248% -62.104%]

  • generated-noncompressed-4k-idat/2048x2048.png:
    * [-67.678% -67.579% -67.480%]
    * [-65.616% -65.519% -65.429%]
    * [-65.824% -65.647% -65.413%]

  • generated-noncompressed-4k-idat/12288x12288.png:
    * [-60.932% -60.774% -60.528%]
    * [-62.088% -62.016% -61.940%]
    * [-61.663% -61.604% -61.546%]

  • generated-noncompressed-64k-idat/128x128.png:
    * [-22.237% -21.975% -21.701%]
    * [-29.656% -29.480% -29.311%]
    * [+24.812% +25.190% +25.571%] (regression)

  • generated-noncompressed-64k-idat/2048x2048.png:
    * [-21.826% -21.499% -21.087%]
    * [-54.279% -54.049% -53.715%]
    * [-11.174% -10.828% -10.482%]

  • generated-noncompressed-64k-idat/12288x12288.png:
    * [-40.421% -40.311% -40.180%]
    * [-39.496% -39.183% -38.871%]
    * [-41.443% -41.367% -41.295%]

  • generated-noncompressed-2g-idat/2048x2048.png:
    * [-40.136% -40.010% -39.865%]
    * [-58.507% -58.333% -58.060%]
    * [-35.822% -35.457% -35.038%]

  • generated-noncompressed-2g-idat/12288x12288.png:
    * [-37.196% -37.107% -37.014%]
    * [-36.125% -36.049% -35.970%]
    * [-35.636% -35.477% -35.350%]

@anforowicz
Copy link
Contributor Author

I think this may be worth landing - the performance improvements can be seen everywhere except the Transparency testcase.

I don't really understand what happens in the Transparency testcase. I can recover some of the regression with an additional commit in master...anforowicz:image-png:cap-incremental-decompression-size. Maybe we should also land this other commit (as a separate PR?). OTOH after this other commit the performance gains elsewhere are still there, but slightly smaller. I don't know how to decide between 1) landing just this commit/PR vs 2) landing this commit + the other commit. I think I am leaning toward just landing this commit/PR for now.

FWIW, I've also tested this commit on some small images from the top-500 benchmark described in #416 (comment). I've measured 10% - 15% performance improvements for those small images (Transparency.jpg is also small, but apparently being small is not the deciding factor).

Comment on lines 222 to 224
// We can't reuse the rest of `out_buffer` because it hasn't been zeroed-out (as
// expected by `fdeflate`. Because of this, we `truncate` the `out_buffer`.
self.out_buffer.truncate(preserved_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I eliminated the requirement in fdeflate about zeroing the buffer, so this shouldn't be needed

Copy link
Contributor Author

@anforowicz anforowicz Jan 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - removed these 3 lines.

This changes how much will be decompressed in subsequent calls to fn decompress though - this probably means that I should redo the performance measurements to double-check the performance impact is still the same. I'll try to do that next week.

@fintelia
Copy link
Contributor

fintelia commented Jan 5, 2024

The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.

While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens fdeflate will set is_done and stop consuming additional input, and the ZlibStream then no-longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.

@anforowicz
Copy link
Contributor Author

Thanks for looking!

The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.

Ack.

I won't have time for it this week, but I plan to eventually try to understand master...anforowicz:image-png:cap-incremental-decompression-size better:

  • try to come up with a hypothesis why it improves Transparency (I think the other commit also helps to keep things in the L1 cache, but I don't understand why the current-PR-under-review makes this necessary; and I want to have higher confidence in the other commit because I suspect that it will also help to avoid Read/BufRead regressions and I don't want to advocate blindly for the other commit without understanding why it helps on its own)
  • try to understand why 16k helps but 8k doesn't (+ check what happens with 32k even though this feels wrong because it would fill the whole L1 cache)
  • measure performance comparison against 1) baseline and 2) this PR (i.e. current-PR-under-review)
  • maybe measure stalled backend cycles (and L1 misses) before and after the other commit

While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens fdeflate will set is_done and stop consuming additional input, and the ZlibStream then no-longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.

Ack.

Question: is this something independent from the PR-under-review? Or is this a problem that you think the PR-under-review may be introducing or making worse?

I can help with adding a test for this and checking what happens. I think the new test can mostly mimic test_idat_bigger_than_image_size_from_ihdr (but instead of writing an IDAT chunk with a bigger image, it can just write the right-size-image and append some extraneous bytes).


BTW, I forgot to say that I hope that the PR-under-review doesn't get in the way for your idea from #429 (comment) of "changing ZlibStream to directly write into a caller provided buffer instead of using its own out_buffer". Avoiding the 32kB lag seems orthogonal/independent from this idea. Changing how out_buffer compaction works is somewhat related, but hopefully still doesn't get in the way too much.

FWIW, I mostly got motivated to look here because of how the Read/BufRead PR happens to interfere with L1 cache, hardware prefetches, etc. (#427)

Avoiding 32kB decompression lag
===============================

Before this commit, decompressed data would be accumulated in
`ZlibStream::out_buffer` and returned via `image_data` with 32kB lag
corresponding to `CHUNCK_BUFFER_SIZE`:

    ```
    fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
        let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE);
        image_data.extend(self.out_buffer.drain(..safe));
        ...
    ```

32kB is a typical size of L1 cache, so the lag would mean that the data
passed to `image_data.extend(...)` would be already cold and evicted
from the L1 cache.

This commit avoids the lag by always returning into `image_data` all the
data from `out_buffer` (i.e. data up to `out_pos`):

    ```
    fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
        let transferred = &self.out_buffer[self.read_pos..self.out_pos];
        image_data.extend_from_slice(transferred);
        self.read_pos = self.out_pos;
        ...
    ```

Compacting less often
=====================

The changes above mean that `Vec::drain` no longer compacts `out_buffer`.
Therefore this commit also refactors how this compaction works.

Before this commit, not-yet-returned data would be shifted to the
beginning of `out_buffer` every time `transfer_finished_data` is called.
This could potentially mean that for 1 returned byte, N bytes have to be
copied during compaction.

After this commit, compaction is only done when the compaction cost if
offset by many read bytes - for 3 returned bytes 1 byte has to be copied
during compaction.

Performance impact
==================

The commit has a positive impact on performance, except for:

* `decode/Transparency.png` - regression between 15% and 20% is reported
  in 3-out-of-3 measurements.
* `decode/kodim17.png` - a regression of 2.1% has been reported in
  1-out-of-3 measurements (an improvement of 0.6% - 1.13% has been
  reported in the other 2-out-of-3 measurements).
* `generated-noncompressed-64k-idat/128x128.png` - a regression of 25%
  has been reported in 1-out-of-3 measurements (an improvement of 21% -
  29% has been reported in the other 2-out-of-3 measurements).

The results below have been gathered by running the `decoder` benchmark.
First a baseline was saved before this commit, and then a comparison
was done after the commit.  This (the baseline + the comparison) was
repeated a total of 3 times.  All results below are for the relative
impact on the runtime.  All results are with p = 0.00 < 0.05.

* decode/kodim23.png:
    * [-2.9560% -2.7112% -2.4009%]
    * [-3.4876% -3.3406% -3.1928%]
    * [-3.0559% -2.9208% -2.7787%]

* decode/kodim07.png:
    * [-1.2527% -1.0110% -0.6780%]
    * [-1.7851% -1.6558% -1.5164%]
    * [-1.6576% -1.5216% -1.3856%]

* decode/kodim02.png:
    * [-0.5108% -0.2806% -0.0112%]
    * [-1.0885% -0.9493% -0.8118%]
    * [-0.5563% -0.4239% -0.2874%]

* decode/kodim17.png:
    * [+1.8649% +2.1138% +2.4169%] (**regression**)
    * [-1.2891% -1.1322% -0.9736%]
    * [-0.7753% -0.6276% -0.4866%]

* decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
    * [-1.7165% -1.4968% -1.2650%]
    * [-1.7051% -1.4473% -1.2229%]
    * [-1.2544% -1.0457% -0.8375%]

* decode/Transparency.png:
    * [+19.329% +19.789% +20.199%] (**regression**)
    * [+15.337% +15.798% +16.294%] (**regression**)
    * [+18.694% +19.106% +19.518%] (**regression**)

* generated-noncompressed-4k-idat/8x8.png:
    * [-2.3295% -1.9940% -1.5912%]
    * [-6.1285% -5.8872% -5.6091%]
    * [-2.8814% -2.6787% -2.4820%]

* generated-noncompressed-4k-idat/128x128.png:
    * [-59.793% -59.599% -59.417%]
    * [-63.930% -63.846% -63.756%]
    * [-62.377% -62.248% -62.104%]

* generated-noncompressed-4k-idat/2048x2048.png:
    * [-67.678% -67.579% -67.480%]
    * [-65.616% -65.519% -65.429%]
    * [-65.824% -65.647% -65.413%]

* generated-noncompressed-4k-idat/12288x12288.png:
    * [-60.932% -60.774% -60.528%]
    * [-62.088% -62.016% -61.940%]
    * [-61.663% -61.604% -61.546%]

* generated-noncompressed-64k-idat/128x128.png:
    * [-22.237% -21.975% -21.701%]
    * [-29.656% -29.480% -29.311%]
    * [+24.812% +25.190% +25.571%] (**regression**)

* generated-noncompressed-64k-idat/2048x2048.png:
    * [-21.826% -21.499% -21.087%]
    * [-54.279% -54.049% -53.715%]
    * [-11.174% -10.828% -10.482%]

* generated-noncompressed-64k-idat/12288x12288.png:
    * [-40.421% -40.311% -40.180%]
    * [-39.496% -39.183% -38.871%]
    * [-41.443% -41.367% -41.295%]

* generated-noncompressed-2g-idat/2048x2048.png:
    * [-40.136% -40.010% -39.865%]
    * [-58.507% -58.333% -58.060%]
    * [-35.822% -35.457% -35.038%]

* generated-noncompressed-2g-idat/12288x12288.png:
    * [-37.196% -37.107% -37.014%]
    * [-36.125% -36.049% -35.970%]
    * [-35.636% -35.477% -35.350%]
@anforowicz anforowicz force-pushed the no-32kB-delay-2nd-attempt branch from 7228d12 to cb0b595 Compare January 6, 2024 00:24
@fintelia fintelia merged commit 1636b55 into image-rs:master Jan 6, 2024
19 checks passed
@anforowicz anforowicz deleted the no-32kB-delay-2nd-attempt branch January 8, 2024 20:58
@anforowicz
Copy link
Contributor Author

Just a quick follow-up to #447 (comment) - I re-measured the performance and the results were mostly the same (i.e. as before addressing the comment / as with truncating and re-zero-ing out out_buffer). For the record, the results against the landed commit look as follows:

decode/kodim23.png:
[-3.1886% -3.0384% -2.8277%]
[-3.6101% -3.5022% -3.3806%]
[-3.9165% -3.7417% -3.6003%]

decode/kodim07.png:
[-1.1782% -1.0010% -0.8080%]
[-1.8537% -1.7999% -1.7421%]
[-1.5936% -1.3810% -1.2046%]

decode/kodim02.png:
[-1.9088% -1.7352% -1.5557%]
[-2.0096% -1.9522% -1.8895%]
[-2.2580% -2.0380% -1.8558%]

decode/kodim17.png:
[-0.9286% -0.7875% -0.5969%]
[-0.6255% -0.5532% -0.4817%]
[-1.5831% -1.3538% -1.1580%]

decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
[-1.9583% -1.6850% -1.3973%]
[-1.5138% -1.3103% -1.1106%]
[-1.4271% -1.1675% -0.8888%]

decode/Transparency.png:
[+14.271% +14.950% +15.689%]
[+14.740% +15.057% +15.443%]
[+13.514% +14.260% +15.206%]

generated-noncompressed-4k-idat/8x8.png:
[-1.1170% -0.8610% -0.6001%]
[-3.3542% -3.2026% -3.0460%]
[-6.8021% -6.6249% -6.4542%]

generated-noncompressed-4k-idat/128x128.png:
[-62.153% -62.038% -61.913%]
[-60.042% -59.859% -59.711%]
[-61.204% -61.130% -61.037%]

generated-noncompressed-4k-idat/2048x2048.png:
[-67.161% -67.121% -67.082%]
[-67.566% -67.470% -67.370%]
[-67.706% -67.663% -67.614%]

generated-noncompressed-4k-idat/12288x12288.png:
[-63.990% -63.956% -63.921%]
[-63.580% -63.504% -63.427%]
[-63.302% -63.266% -63.234%]

generated-noncompressed-64k-idat/128x128.png:
[+28.550% +28.972% +29.330%]
[+45.749% +46.366% +46.873%]
[-13.531% -13.292% -13.085%]

generated-noncompressed-64k-idat/2048x2048.png:
[-13.441% -12.212% -11.403%]
[-22.578% -22.234% -21.879%]
[-20.424% -20.222% -19.999%

generated-noncompressed-64k-idat/12288x12288.png:
[-45.895% -45.826% -45.749%]
[-41.836% -41.724% -41.617%]
[-41.325% -41.252% -41.165%]

generated-noncompressed-2g-idat/2048x2048.png:
[-40.284% -39.676% -39.277%]
[-38.249% -37.947% -37.701%]
[-42.957% -42.806% -42.651%]

generated-noncompressed-2g-idat/12288x12288.png:
[-38.550% -38.469% -38.388%]
[-37.827% -37.691% -37.509%]
[-37.335% -37.269% -37.207%]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants