Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a faster safe bitwise rotate implementation. #13

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nmathewson
Copy link

This branch replaces the rotation operation with one that compiles down to a single branchless rotate instruction (on my gcc and clang compilers). The magic trick here is the fact that :

  (v >> (c&mask)) | (v << (-c&mask))

works even when c==0, is conformant c, doesn't require a branch, and many compilers can recognize it and replace it with a rotate instruction.

I have tested this, but I haven't benchmarked it yet or tested with super-old or super-weird compilers. I would assume that eliminating these branch instructions is always a good idea, though, and it doesn't make the timeout.c code uglier.

This is in preparation for using faster versions.
On modern compilers, under my tests, this turns all of the rotate
functions into a single rotl/rotr instruction.  (For some links, see
BLAKE2/BLAKE2#14 . The cryptopp library
uses similar tricks.)

The previous definitions, on the other hand, include a conditional
branch instruction, which was probably slowing things down a bit.

Also add a comment to explain why we don't just do
"(v>>c)|(v<<(bits-c))" (because c can be 0).

Also add a comment to assert that we don't need to actually handle
c >= bits.

Removing all of these branch instructions ought to improve
performance a bit.
@nmathewson
Copy link
Author

Hm, we'd better be careful here. I just ran the benchmarks over master and all 3 of these branches, and to my surprise they don't show a very large improvement. (I wasn't very careful in my setup, though, so maybe better benchmark run would show results better.)

@wahern
Copy link
Owner

wahern commented Mar 11, 2016

Yikes. That's alot of code for something simple. I'd rather hold off on this. I was thinking it was a small diff that reduced the overall SLoC.

Regarding branching, I recently stumbled upon the very enlightening paper, Branch Prediction and the Performance of Interpreters - Don't Trust Folklore. The takeaway is that predictors have been getting much better across the board. Haswell is so good at it that often times it's wasted effort to even think about branch optimization.

Regardless of performance, I do prefer removing conditionals as removing them them tends to make code easier to follow and manage overall, even if the details of why a condition is unneeded isn't obvious. But this patch adds more (preprocessor) conditionals and other code.

Is it okay if I cherry pick (e.g. moving rotr and rotl to timeout-bitops) so I can merge the Makefile and regression test patches?

@nmathewson
Copy link
Author

Of course it's okay to cherry-pick. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants