Use a faster safe bitwise rotate implementation. #13

nmathewson · 2016-03-05T18:52:52Z

This branch replaces the rotation operation with one that compiles down to a single branchless rotate instruction (on my gcc and clang compilers). The magic trick here is the fact that :

  (v >> (c&mask)) | (v << (-c&mask))

works even when c==0, is conformant c, doesn't require a branch, and many compilers can recognize it and replace it with a rotate instruction.

I have tested this, but I haven't benchmarked it yet or tested with super-old or super-weird compilers. I would assume that eliminating these branch instructions is always a good idea, though, and it doesn't make the timeout.c code uglier.

This is in preparation for using faster versions.

On modern compilers, under my tests, this turns all of the rotate functions into a single rotl/rotr instruction. (For some links, see BLAKE2/BLAKE2#14 . The cryptopp library uses similar tricks.) The previous definitions, on the other hand, include a conditional branch instruction, which was probably slowing things down a bit. Also add a comment to explain why we don't just do "(v>>c)|(v<<(bits-c))" (because c can be 0). Also add a comment to assert that we don't need to actually handle c >= bits. Removing all of these branch instructions ought to improve performance a bit.

nmathewson · 2016-03-06T14:35:06Z

Hm, we'd better be careful here. I just ran the benchmarks over master and all 3 of these branches, and to my surprise they don't show a very large improvement. (I wasn't very careful in my setup, though, so maybe better benchmark run would show results better.)

wahern · 2016-03-11T22:18:03Z

Yikes. That's alot of code for something simple. I'd rather hold off on this. I was thinking it was a small diff that reduced the overall SLoC.

Regarding branching, I recently stumbled upon the very enlightening paper, Branch Prediction and the Performance of Interpreters - Don't Trust Folklore. The takeaway is that predictors have been getting much better across the board. Haswell is so good at it that often times it's wasted effort to even think about branch optimization.

Regardless of performance, I do prefer removing conditionals as removing them them tends to make code easier to follow and manage overall, even if the details of why a condition is unneeded isn't obvious. But this patch adds more (preprocessor) conditionals and other code.

Is it okay if I cherry pick (e.g. moving rotr and rotl to timeout-bitops) so I can merge the Makefile and regression test patches?

nmathewson · 2016-03-11T23:43:56Z

Of course it's okay to cherry-pick. :)

nmathewson added 3 commits March 5, 2016 13:36

Move rotr,rotl into timeout-bitops

052a609

This is in preparation for using faster versions.

Add MSVC versions of the bitwise rotation intrinsics.

b51a3aa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use a faster safe bitwise rotate implementation. #13

Use a faster safe bitwise rotate implementation. #13

nmathewson commented Mar 5, 2016

nmathewson commented Mar 6, 2016

wahern commented Mar 11, 2016

nmathewson commented Mar 11, 2016

Use a faster safe bitwise rotate implementation. #13

Are you sure you want to change the base?

Use a faster safe bitwise rotate implementation. #13

Conversation

nmathewson commented Mar 5, 2016

nmathewson commented Mar 6, 2016

wahern commented Mar 11, 2016

nmathewson commented Mar 11, 2016