Improve __clzsi2 performance #366
Conversation
What is up with
Thanks! Would you be up to adding a microbenchmark and also expanding the test suite for this intrinsic at the same time?
Force-pushed from f995d43 to 5f3c4f7
I don't see any benchmarks in
ones = !0 >> r0;
let r1: u32 = bit_indexing_mask & random::<u32>();
let mask = ones.rotate_left(r1);
match (random(), random()) {
Tests should be deterministic. Otherwise they could fail one time and succeed another time.
Could I use a seedable RNG like rand_xoshiro::Xoroshiro128StarStar?
I am also using this fuzzing technique in PR #332.
I guess you could.
Randomized testing is fine in my opinion, but this should also print out the input on failure, so that if it happens in CI it can be reproduced.
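For illustration, here is a minimal sketch of what a seeded, reproducible fuzz test could look like; it assumes __clzsi2 is in scope and uses rand_core plus rand_xoshiro, and the seed, iteration count, and zero-skip are arbitrary choices rather than the PR's actual test code.

use rand_core::{RngCore, SeedableRng};
use rand_xoshiro::Xoroshiro128StarStar;

#[test]
fn clzsi2_seeded_fuzz() {
    // Fixed seed: the same inputs are generated on every run, so a failure
    // seen in CI can be reproduced locally.
    let mut rng = Xoroshiro128StarStar::seed_from_u64(0x1234_5678_9abc_def0);
    for _ in 0..10_000 {
        let x = rng.next_u64() as usize;
        if x == 0 {
            // clz of 0 is traditionally unspecified for this builtin; skip it.
            continue;
        }
        // The assertion message prints the offending input on failure.
        assert_eq!(
            __clzsi2(x) as u32,
            x.leading_zeros(),
            "__clzsi2 mismatch for input {:#x}",
            x
        );
    }
}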
I wonder if using something like https://docs.rs/quickcheck/0.9.2/quickcheck/ is something we can consider for tests?
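As a rough sketch of what that could look like (assuming __clzsi2 is in scope; this is illustrative, not code from this PR), quickcheck's quickcheck! macro turns a boolean property into a test:

use quickcheck::quickcheck;

quickcheck! {
    fn clzsi2_matches_leading_zeros(x: usize) -> bool {
        // clz of 0 is traditionally unspecified for this builtin, so treat
        // that case as vacuously true.
        x == 0 || __clzsi2(x) as u32 == x.leading_zeros()
    }
}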
for _ in 0..63 {
    assert_eq!(__clzsi2(i) as u32, i.leading_zeros());
    i <<= 2;
    i += 1;
Can these tests be left in because there's presumably no harm in deterministically trying to catch edge cases?
There are also examples of benchmarks on #365, but the general idea is that if this is done for performance reasons, it would be good to confirm that the performance did indeed improve.
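For reference, a microbenchmark along these lines could be sketched with the nightly test harness; the names, iteration counts, and input pattern below are illustrative assumptions, not the benches actually added in this PR.

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Assumes __clzsi2 is imported from the crate under test.
#[bench]
fn clzsi2_bench(b: &mut Bencher) {
    // A spread of inputs so the benchmark isn't dominated by one code path.
    let inputs: Vec<usize> = (1..1000usize).map(|i| i.wrapping_mul(0x9E37_79B9)).collect();
    b.iter(|| {
        let mut acc = 0usize;
        for &x in &inputs {
            acc = acc.wrapping_add(black_box(__clzsi2(black_box(x))));
        }
        acc
    });
}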
I'm confused. When I look at this commit, I can see benches being added under a
The implementation looks good to me.
IMO looking at something like
I did some analysis with llvm-mca.

(lower Block RThroughput is better and is probably the best indication of real-world performance you can expect)

With the caveat that this implementation is not relevant on superscalar CPUs (which are, apparently, the only kind llvm-mca supports), the old version is better in 2 out of 3 cases. A cursory look suggests that both the new and the old implementation are pretty good about their register dependencies and don’t spend much time overall waiting on anything; LLVM just ends up generating fewer instructions on targets that happen to have a "conditional set/move" kind of instruction. For the tested targets the assembly also ends up non-branchy with either the new or the old implementation, which makes sense since the generated code uses conditional moves. For example, here’s the assembly for MIPS:

Old

srl $1, $4, 16
addiu $2, $zero, 16
addiu $3, $zero, 32
movz $2, $3, $1
addiu $3, $2, -8
movz $1, $4, $1
srl $4, $1, 8
movz $3, $2, $4
addiu $2, $3, -4
movz $4, $1, $4
srl $1, $4, 4
movz $2, $3, $1
addiu $3, $zero, -2
addiu $5, $2, -2
movz $1, $4, $1
srl $4, $1, 2
movz $5, $2, $4
movz $4, $1, $4
negu $1, $4
sltiu $2, $4, 2
movz $1, $3, $2
jr $ra
addu $2, $1, $5

New

ori $1, $zero, 65535
sltu $1, $1, $4
sll $1, $1, 4
srlv $2, $4, $1
addiu $3, $zero, 255
sltu $3, $3, $2
sll $3, $3, 3
srlv $2, $2, $3
or $1, $1, $3
addiu $3, $zero, 32
addiu $4, $zero, 1
addiu $5, $zero, 3
addiu $6, $zero, 15
sltu $6, $6, $2
sll $6, $6, 2
srlv $2, $2, $6
sltu $5, $5, $2
sll $5, $5, 1
srlv $2, $2, $5
sltu $4, $4, $2
srlv $2, $2, $4
or $1, $1, $6
or $1, $1, $5
or $1, $1, $4
addu $1, $1, $2
jr $ra
subu $2, $3, $1

So the overall conclusion, at least from me, is that if there's any improvement with the new implementation, it's probably going to be limited to RISC-V... (though also remember that targets like x86_64 and aarch64 have no use for this builtin, as they have a native instruction)
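To make the shape of the "New" listing above more concrete, here is a rough Rust sketch of that branchless, binary-search style of counting leading zeros. It is a generic illustration of the technique, not necessarily the exact code in this PR, and it assumes a 64-bit usize for simplicity.

pub fn usize_leading_zeros_sketch(x: usize) -> usize {
    // Each step compares the remaining value against a threshold, turns the
    // comparison into a shift amount, and applies it; no branches are needed.
    let mut x = x as u64;
    let mut n: u64 = 64;
    let mut s;
    s = ((x > 0xFFFF_FFFF) as u64) << 5; x >>= s; n -= s;
    s = ((x > 0xFFFF) as u64) << 4;      x >>= s; n -= s;
    s = ((x > 0xFF) as u64) << 3;        x >>= s; n -= s;
    s = ((x > 0xF) as u64) << 2;         x >>= s; n -= s;
    s = ((x > 0x3) as u64) << 1;         x >>= s; n -= s;
    s = (x > 0x1) as u64;                x >>= s; n -= s;
    // At this point x is 0 or 1, so subtracting it gives the exact count.
    (n - x) as usize
}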
Thank you very much for the spelunking. I looked at RISC-V specifically because it was giving me a hard time optimizing division functions. The new version gave such an assembly improvement on RISC-V and only slightly degraded local benchmarks on an out-of-order machine, so I assumed it would be the most performant for in-order CPUs in general and especially RISC-V. I thought I remembered testing the assembly for ARM as well, but looking at it again, the old version is significantly better there too. I think we should have both implementations, but enable the new one only on RISC-V. What do you think?
Yeah, I think it's fine to conditionally switch between implementations in compiler-builtins. It's a library that's used everywhere, after all.
If we do have multiple implementations, however, I'd like to ensure that they're implemented in such a way that they're both tested thoroughly, because otherwise it would be very easy to end up with a RISC-V-only bug.
I think that the code is large enough to move the

My current plan for testing both variations of counting leading zeros thoroughly is to have two functions,

Would a better solution be having a CI test for RISC-V? I read this post that says "the full Rust test suite passes for a RISC-V cross-compilation toolchain". I see a wide variety of CI tests being performed currently; I wonder why RISC-V isn't included.

edit: RISC-V GNU/Linux is being added as a host platform soon.
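A tiny sketch of that "two functions, both always tested" idea; the names usize_leading_zeros_default and usize_leading_zeros_riscv are placeholders for the two exposed variants, since the actual identifiers were not captured above.

#[test]
fn both_leading_zeros_variants_agree() {
    // Exercises both helpers against the standard library on a few
    // hand-picked edge cases.
    for &x in &[1usize, 2, 3, 0xFF, 0x100, 0xFFFF, 0x1_0000, usize::MAX] {
        assert_eq!(usize_leading_zeros_default(x) as u32, x.leading_zeros(), "default impl, x = {:#x}", x);
        assert_eq!(usize_leading_zeros_riscv(x) as u32, x.leading_zeros(), "riscv impl, x = {:#x}", x);
    }
}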
I'm not sure what's up with modules; nothing's really that fancy in this crate, and it should in general act similarly to the rest of Rust's module system. Are you sure that you reexported the intrinsics in

For testing, adding RISC-V should be fine.
This crate is old. RISC-V targets didn't exist when this crate was created. No one's added it since then.
Force-pushed from 3a22529 to 7f00717
What do you think about my testing setup? I don't know if there is a better solution. Does
Looks reasonable to me, thanks! I've got a few minor comments, but otherwise this seems good to go.
Can you also add comments explaining why the RISC-V version isn't used on other platforms?
}

#[cfg(any(target_arch = "riscv32", target_arch = "riscv64"))]
intrinsics! {
To avoid the duplication here, could this perhaps use if cfg!(...) internally?
Yes
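A hedged sketch of that single-definition shape, using the crate's intrinsics! macro as in the hunk above; the helper names are the same placeholders as before, not the PR's actual identifiers.

intrinsics! {
    pub extern "C" fn __clzsi2(x: usize) -> usize {
        // cfg! is evaluated at compile time, so the unused branch is
        // optimized away and only one algorithm ends up in the object file.
        if cfg!(any(target_arch = "riscv32", target_arch = "riscv64")) {
            usize_leading_zeros_riscv(x)
        } else {
            usize_leading_zeros_default(x)
        }
    }
}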
src/int/mod.rs (Outdated)

mod leading_zeros;

#[cfg(feature = "expose-alt-impls")]
pub mod leading_zeros;
I think it'd be fine to just always export this module; there shouldn't be a need to make it #[cfg] pub.
Making it pub always will expose the usize_leading_zeros... functions by default, but the only reason they should ever be public is for testing purposes.
I think exporting that is fine, there are many exported symbols from this library other than the intrinsics due to traits and such, so we don't need to be super strict about what exactly is exported.
I have made it always public and removed the unneeded configuration.
I found out that
is huge (much larger than it should be even before this PR) but changing the last line to be
fixes it. I think there is a similar problem somewhere.
Force-pushed from f2dcdcf to 4a976e0
Force-pushed from 4a976e0 to db013ec
This was something I originally made for PR 332, but I realized I should PR it by itself so I can remove it from my specialized-div-rem crate. As an example, the assembly generation for RISC-V used to be ... but is now: ...

This is completely branchless. In fact, I could make it const. A previous comment mentioned making __clzsi2 const, but intrinsics! doesn't seem to support const.