use popcount + PDEP for uniform random slot selection on x86 for Haswell and later #26
Comments
I don't have much context about your overall problem, but you can do select reasonably efficiently without PDEP too. See folly for example (https://github.com/facebook/folly/blob/master/folly/experimental/Select64.h).
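For readers unfamiliar with "select": it returns the position of the n-th set bit in a word. A minimal portable sketch is below. It is not folly's algorithm; it is a plain binary search over popcounts, and the name `select64_portable` is hypothetical. It assumes GCC or Clang for `__builtin_popcountll`.

```c
#include <assert.h>
#include <stdint.h>

// Return the index of the n-th (0-based) set bit of x.
// Precondition: n < popcount(x).
static unsigned select64_portable(uint64_t x, unsigned n) {
    assert(n < (unsigned)__builtin_popcountll(x));
    unsigned pos = 0;
    // Halve the window each step: 32, 16, 8, 4, 2, 1 bits.
    for (unsigned width = 32; width > 0; width >>= 1) {
        uint64_t low = x & ((UINT64_C(1) << width) - 1);
        unsigned count = (unsigned)__builtin_popcountll(low);
        if (n >= count) {
            // The target lies above the low window: skip it.
            n -= count;
            x >>= width;
            pos += width;
        }
        // Otherwise the target is inside the low window; the next
        // iteration masks off everything above it.
    }
    return pos;
}
```

This takes six popcount steps per lookup regardless of input, which is slower than a single PDEP on hardware where PDEP is fast but does not depend on BMI2.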
The current code is here: https://github.com/GrapheneOS/hardened_malloc/blob/master/h_malloc.c#L353-L372
The non-random implementation is already quite fast, and the only reason we don't manually unroll the loop is that the compiler already generates code that's as good or better. I do have an unrolled implementation around somewhere. The random implementation could be faster, and ideally it would make a uniform random choice rather than simply starting the search at a random location as we currently do. I don't want to pay any additional performance cost to make it better, though.
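To illustrate why a random starting location is not the same as a uniform choice, here is a rough sketch of that strategy. It is not a copy of the linked code: the function name, the single 64-bit bitmap (set bit = used slot), and the wrap-around behaviour are assumptions made for the example, and `__builtin_ctzll` assumes GCC or Clang.

```c
#include <stdint.h>

// Pick the first free slot at or after `start`, wrapping to the lowest
// free slot if none is found. Requires start < 64. Returns -1 when full.
static int first_free_from(uint64_t bitmap, unsigned slots, unsigned start) {
    uint64_t valid = slots < 64 ? (UINT64_C(1) << slots) - 1 : ~UINT64_C(0);
    uint64_t free_mask = ~bitmap & valid;
    if (free_mask == 0) {
        return -1;
    }
    // Free slots at index >= start.
    uint64_t upper = free_mask & (~UINT64_C(0) << start);
    uint64_t pick = upper ? upper : free_mask;
    return __builtin_ctzll(pick);
}
```

With 8 slots where only slots 1 and 7 are free (`bitmap == 0x7D`), start indices 0 and 1 yield slot 1 while 2 through 7 all yield slot 7, so a uniformly random start picks slot 7 with probability 6/8 rather than 1/2.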
Ideally, we would have an implementation that is faster and does uniform random choice everywhere, not only on modern x86_64.
As explained on Matrix by @thestinger:
Just noting that PDEP is terrible on AMD processors before Zen 3. It is essentially emulated in microcode and gets slower (8 uops) for every bit set in the PDEP mask operand. More discussion/info about it here:
There are similar optimisations for Arm NEON, as used by isoalloc.
PDEP allows selecting the nth unset bit efficiently (a couple of cycles), so it's a fantastic way of implementing this. There's no clear way to do it efficiently elsewhere, which is why the current portable implementation only randomizes the search start index and then uses the ffs intrinsic.
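A minimal sketch of the popcount + PDEP idea follows, under some assumptions: a single 64-bit bitmap where a set bit means a used slot, a caller-supplied random value, and hypothetical function names. The PDEP path is compiled only when BMI2 is enabled (for example with `-mbmi2`); otherwise a simple loop stands in so the example builds anywhere with GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

// Index of the n-th (0-based) set bit of x. Precondition: n < popcount(x).
static unsigned select_nth_set_bit(uint64_t x, unsigned n) {
    assert(n < (unsigned)__builtin_popcountll(x));
#if defined(__BMI2__) && defined(__x86_64__)
    // PDEP deposits bit n of the source at the n-th set bit of the mask.
    return (unsigned)__builtin_ctzll(_pdep_u64(UINT64_C(1) << n, x));
#else
    // Fallback: clear the lowest set bit n times.
    while (n--) {
        x &= x - 1;
    }
    return (unsigned)__builtin_ctzll(x);
#endif
}

// Choose among the free (zero) bits of `bitmap` with equal weight.
// `random` is assumed to be a uniformly random 64-bit value.
// Returns -1 if every slot is in use.
static int random_free_slot(uint64_t bitmap, unsigned slots, uint64_t random) {
    uint64_t valid = slots < 64 ? (UINT64_C(1) << slots) - 1 : ~UINT64_C(0);
    uint64_t free_mask = ~bitmap & valid;
    unsigned free_count = (unsigned)__builtin_popcountll(free_mask);
    if (free_count == 0) {
        return -1;
    }
    // Modulo of a 64-bit value by at most 64 leaves a negligible bias
    // (none at all when free_count is a power of two).
    unsigned n = (unsigned)(random % free_count);
    return (int)select_nth_set_bit(free_mask, n);
}
```

Inverting the bitmap turns "nth unset bit" into "nth set bit", which is the form PDEP handles directly. Given the note above about AMD processors before Zen 3, a real implementation would probably want to decide per target whether the PDEP path is worth enabling.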