Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SipHasher takes up lots of time in incremental builds #51054

Closed
nnethercote opened this issue May 25, 2018 · 8 comments
Closed

SipHasher takes up lots of time in incremental builds #51054

nnethercote opened this issue May 25, 2018 · 8 comments
Labels
A-incr-comp Area: Incremental compilation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance

Comments

@nnethercote
Copy link
Contributor

Here's Cachegrind data for clap-rs from rustc-benchmarks on a "base incremental" "check" build (i.e. the first incremental build):

--------------------------------------------------------------------------------
            Ir
--------------------------------------------------------------------------------
39,792,587,805  PROGRAM TOTALS

6,008,282,218  librustc_data_structures/sip128.rs:rustc_data_structures::sip128::SipHasher128::short_write
1,655,965,059  libcore/num/mod.rs:rustc_data_structures::sip128::SipHasher128::short_write
  319,188,771  libcore/cmp.rs:rustc_data_structures::sip128::SipHasher128::short_write

That's over 20% of the instructions under short_write.

Here are the annotations for the hot pieces of code in librustc_data_structures/sip128.rs.

          .  macro_rules! compress {
          .      ($state:expr) => ({
967,427,173          compress!($state.v0, $state.v1, $state.v2, $state.v3)
          .      });
          .      ($v0:expr, $v1:expr, $v2:expr, $v3:expr) =>
          .      ({
          .          $v0 = $v0.wrapping_add($v1); $v1 = $v1.rotate_left(13); $v1 ^= $v0;
          .          $v0 = $v0.rotate_left(32);
          .          $v2 = $v2.wrapping_add($v3); $v3 = $v3.rotate_left(16); $v3 ^= $v2;
          .          $v0 = $v0.wrapping_add($v3); $v3 = $v3.rotate_left(21); $v3 ^= $v0;
          .          $v2 = $v2.wrapping_add($v1); $v1 = $v1.rotate_left(17); $v1 ^= $v2;
 79,051,764          $v2 = $v2.rotate_left(32);
          .      });
          .  }
          .
          .  /// Load an integer of the desired type from a byte stream, in LE order. Uses
          .  /// `copy_nonoverlapping` to let the compiler generate the most efficient way
          .  /// to load it from a possibly unaligned address.
          .  ///
          .  /// Unsafe because: unchecked indexing at i..i+size_of(int_ty)
          .  macro_rules! load_int_le {
          .      ($buf:expr, $i:expr, $int_ty:ident) =>
          .      ({
          .         debug_assert!($i + mem::size_of::<$int_ty>() <= $buf.len());
          .         let mut data = 0 as $int_ty;
 13,166,995         ptr::copy_nonoverlapping($buf.get_unchecked($i),
          .                                  &mut data as *mut _ as *mut u8,
          .                                  mem::size_of::<$int_ty>());
          .         data.to_le()
          .      });
          .  }
          .
          .  /// Load an u64 using up to 7 bytes of a byte slice.
          .  ///
          .  /// Unsafe because: unchecked indexing at start..start+len
          .  #[inline]
          .  unsafe fn u8to64_le(buf: &[u8], start: usize, len: usize) -> u64 {
          .      debug_assert!(len < 8);
          .      let mut i = 0; // current byte index (from LSB) in the output u64
          .      let mut out = 0;
345,951,546      if i + 3 < len {
 80,328,075          out = load_int_le!(buf, start + i, u32) as u64;
          .          i += 4;
          .      }
747,722,073      if i + 1 < len {
480,400,389          out |= (load_int_le!(buf, start + i, u16) as u64) << (i * 8);
 87,344,805          i += 2
          .      }
345,951,546      if i < len {
211,374,738          out |= (*buf.get_unchecked(start + i) as u64) << (i * 8);
          .          i += 1;
          .      }
          .      debug_assert_eq!(i, len);
          .      out
          .  }

and

          .      #[inline]
          .      fn short_write(&mut self, msg: &[u8]) {
          .          debug_assert!(msg.len() <= 8);
          .          let length = msg.len();
212,792,515          self.length += length;
          .
319,188,774          let needed = 8 - self.ntail;
          .          let fill = cmp::min(length, needed);
212,792,515          if fill == 8 {
 38,954,670              self.tail = unsafe { load_int_le!(msg, 0, u64) };
          .          } else {
560,468,202              self.tail |= unsafe { u8to64_le(msg, 0, fill) } << (8 * self.ntail);
186,822,734              if length < needed {
 55,081,556                  self.ntail += length;
          .                  return;
          .              }
          .          }
 78,855,480          self.state.v3 ^= self.tail;
          .          Sip24Rounds::c_rounds(&mut self.state);
157,710,960          self.state.v0 ^= self.tail;
          .
          .          // Buffered tail is now flushed, process new input.
157,710,958          self.ntail = length - needed;
 78,855,480          self.tail = unsafe { u8to64_le(msg, needed, self.ntail) };
212,792,514      }

And from libcore/num/mod.rs:

            .          #[inline]
            .          pub fn rotate_left(self, n: u32) -> Self {
            .              // Protect against undefined behaviour for over-long bit shifts
            .              let n = n % $BITS;
1,187,178,937              (self << n) | (self >> (($BITS - n) % $BITS))
            .          }

I stared at this for a while but wasn't able to come up with any notable improvements.

Hashing is the hottest part of most incremental check builds. If we can't speed up this code, perhaps we could use a different hasher, or find a way to hash less data.

CC @michaelwoerister

@nnethercote
Copy link
Contributor Author

or find a way to hash less data

A small example: the impls of HashStable for [T], str, and IndexVec all hash the length and then the elements. Is the length needed? I guess it's necessary because you might have a structure with two Vecs in a row, and (vec![1,2,3], vec![4,5]) must hash differently to (vec![1,2], vec![3,4,5])?

I guess I could retry the trick from #37427 of leb128-encoding integers. I did some measurements while compiling the compiler and 66% of the values hashed were u64, and 27% were u8, so there's potential there.

@sfackler
Copy link
Member

I guess it's necessary because you might have a structure with two Vecs in a row, and (vec![1,2,3], vec![4,5]) must hash differently to (vec![1,2], vec![3,4,5])?

Yeah, that's the reason.

@michaelwoerister
Copy link
Member

You could give leb128 encoding a try, especially for usize which usually represents small length values. Leb128 encoding has been optimized a bit since it was last used during hashing (#46919).

The most effective optimizations we've done here is caching of hashes, like in 45bd091.

Using a different hasher is an option if it supports 128 bit output and we can be sure that it has a similarly low collision probability as SipHash 2-4. However, I'd only like to change it if the results are compelling since with iterated on this quite a bit already (see #41215).

@Mark-Simulacrum Mark-Simulacrum added I-compiletime Issue: Problems and improvements with respect to compile times. A-incr-comp Area: Incremental compilation WG-compiler-performance Working group: Compiler Performance labels May 28, 2018
@michaelwoerister
Copy link
Member

The highwayhash authors claim to have a faster implementation of SipHash which still produces the same outputs: https://github.com/google/highwayhash.

@nnethercote
Copy link
Contributor Author

One thing that's not clear to me is what needs to be hashed and what doesn't. For example, ScopeTrees are hashed, and for clap-rs that hashing takes up 12% of the time of a "baseline incremental-check" build, i.e. a lot. But why are ScopeTrees hashed? It seems redundant, because they are derived from HIR, and HIR is hashed.

Here is a lightly edited transcript of an IRC discussion about this, between me and @Zoxc.

<njn> so I am wondering why `parent_map` (and the entire scope tree) is hashed at all,
      because it seems to me to be "derived data", in the sense that it'll only change if
      the HIR above it also changes
<Zoxc> I'm not sure why it's hashed, it seems redundant to me if it is treated as an input
       along with the HIR
<njn> Right. So I'm wondering if there's a clear notion of what should be hashed and what
      shouldn't
<Zoxc> We probably want to introduce a query kind which does not recheck the result for
       freshness for these cases. That would avoid the hashing
<njn> what is the definition of "these cases" -- derived data?
<Zoxc> No. It would be queries D, where query D uses queries A, and the users of D also
       use A.
<Zoxc> So if such a query D is executed, comparing its result to the previous session and
       marking it up to date will have no affect on its users since they use the same
       queries A and will execute anyway.
<njn> how does one identify these cases? Does it require domain expertise on a
      case-by-case basis?
<Zoxc> It probably does
<njn> :(
<njn> I guess if a handful of structures account for most of the hashing cost, that'll make
      it easier
<Zoxc> You could also just change one query at a time and benchmark things, starting
       with those with the largest query results

@michaelwoerister: what do you think?

@michaelwoerister
Copy link
Member

Copying my comments from IRC:

njn, Zoxc: every query result needs to be hashed, otherwise incremental compilation could not recover from false positive. We started out with just hashing the HIR but re-use was very poor
njn, Zoxc: I think there's a bug report for everything that we forgot to hash properly :)  
njn, Zoxc: https://github.com/rust-lang/rust/issues/47389 contains some background on why spans are hashed and how things could be improved a little.

It might be possible optimize specific cases like this:

      A <---+
      ^     |
      |     |
      |     B
      |     ^
      |     |
      +--C--+

That is, if you have a query C that depends A and B and B is derived from A. Then hashing B is indeed redundant. However, if there was one more query D that depends only on B, then not hashing B might lead to redundantly executing D if A changed but B was not affected by the change.

      A <---+
      ^     |
      |     |
      |     B <---- D
      |     ^
      |     |
      +--C--+

Also note that we don't re-hash query result where we already know the hash from a previous compilation session.

@XAMPPRocky XAMPPRocky added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 2, 2018
@Kobzol
Copy link
Contributor

Kobzol commented Oct 4, 2021

FWIW, I tried replacing SipHash128 with HighwayHash. With the out-of-the-box implementation, it resulted in ~10 % instruction count regression and mixed cycle count results across the board locally on my PC.

I also tried experimenting with different buffer sizes for SipHash. It seems that a larger buffer size (16, 32) has a positive impact on cycle counts and walltimes on my PC, but this will probably differ a lot on various architectures.

@nnethercote
Copy link
Contributor Author

I tried seahash and got instruction count increases of up to 29%, even though its docs claim it is more than 10x faster than SipHash :(

It may at least partly be explained by the fact that I managed to speed up SipHasher128 (the one used in incremental compilation) quite a lot last year in #68914. In fact, based on that, this issue doesn't need to be open any more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-incr-comp Area: Incremental compilation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance
Projects
None yet
Development

No branches or pull requests

6 participants