Improve __clzsi2 performance #366
Conversation
What is up with
Thanks! Would you be up to adding a microbenchmark and also expanding the test suite for this intrinsic at the same time?
Force-pushed from f995d43 to 5f3c4f7
I don't see any benchmarks in
ones = !0 >> r0;
let r1: u32 = bit_indexing_mask & random::<u32>();
let mask = ones.rotate_left(r1);
match (random(), random()) {
Tests should be deterministic. Otherwise they could fail one time and succeed another time.
Could I use a seedable RNG like rand_xoshiro::Xoroshiro128StarStar?
I am also using this fuzzing technique in PR #332.
I guess you could.
Randomized testing is fine in my opinion, but this should also print out the input on failure, so that if it happens in CI it can be reproduced.
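For illustration, here is a minimal sketch of what a seeded, reproducible fuzz test could look like; it assumes __clzsi2 is in scope and uses rand_core plus rand_xoshiro, and the seed, iteration count, and zero-skip are arbitrary choices rather than the PR's actual test code.

use rand_core::{RngCore, SeedableRng};
use rand_xoshiro::Xoroshiro128StarStar;

#[test]
fn clzsi2_seeded_fuzz() {
    // Fixed seed: the same inputs are generated on every run, so a failure
    // seen in CI can be reproduced locally.
    let mut rng = Xoroshiro128StarStar::seed_from_u64(0x1234_5678_9abc_def0);
    for _ in 0..10_000 {
        let x = rng.next_u64() as usize;
        if x == 0 {
            // clz of 0 is traditionally unspecified for this builtin; skip it.
            continue;
        }
        // The assertion message prints the offending input on failure.
        assert_eq!(
            __clzsi2(x) as u32,
            x.leading_zeros(),
            "__clzsi2 mismatch for input {:#x}",
            x
        );
    }
}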
I wonder if using something like https://docs.rs/quickcheck/0.9.2/quickcheck/ is something we can consider for tests?
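As a rough sketch of what that could look like (assuming __clzsi2 is in scope; this is illustrative, not code from this PR), quickcheck's quickcheck! macro turns a boolean property into a test:

use quickcheck::quickcheck;

quickcheck! {
    fn clzsi2_matches_leading_zeros(x: usize) -> bool {
        // clz of 0 is traditionally unspecified for this builtin, so treat
        // that case as vacuously true.
        x == 0 || __clzsi2(x) as u32 == x.leading_zeros()
    }
}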
for _ in 0..63 {
    assert_eq!(__clzsi2(i) as u32, i.leading_zeros());
    i <<= 2;
    i += 1;
Can these tests be left in because there's presumably no harm in deterministically trying to catch edge cases?
There are also examples of benchmarks on #365, but the general idea is that if this is done for performance reasons, it would be good to confirm that the performance did indeed improve.
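For reference, a microbenchmark along these lines could be sketched with the nightly test harness; the names, iteration counts, and input pattern below are illustrative assumptions, not the benches actually added in this PR.

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Assumes __clzsi2 is imported from the crate under test.
#[bench]
fn clzsi2_bench(b: &mut Bencher) {
    // A spread of inputs so the benchmark isn't dominated by one code path.
    let inputs: Vec<usize> = (1..1000usize).map(|i| i.wrapping_mul(0x9E37_79B9)).collect();
    b.iter(|| {
        let mut acc = 0usize;
        for &x in &inputs {
            acc = acc.wrapping_add(black_box(__clzsi2(black_box(x))));
        }
        acc
    });
}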
I'm confused. When I look at this commit, I can see benches being added under a
The implementation looks good to me.
IMO looking at something like
I did some analysis with llvm-mca.

(lower Block RThroughput is better and is probably the best indication of real-world performance you can expect)

With the caveat that this implementation is not relevant on superscalar CPUs (which are, apparently, the only kind llvm-mca supports), the old version is better in 2 out of 3 cases. A cursory look suggests that both the new and the old implementation are pretty good about their register dependencies and don’t spend much time overall waiting on anything; LLVM just ends up generating fewer instructions on targets that happen to have a "conditional set/move" kind of instruction. For the tested targets the assembly also ends up non-branchy with either the new or the old implementation, which makes sense since the generated code uses conditional moves. For example, here’s the assembly for MIPS:

Old

srl $1, $4, 16
addiu $2, $zero, 16
addiu $3, $zero, 32
movz $2, $3, $1
addiu $3, $2, -8
movz $1, $4, $1
srl $4, $1, 8
movz $3, $2, $4
addiu $2, $3, -4
movz $4, $1, $4
srl $1, $4, 4
movz $2, $3, $1
addiu $3, $zero, -2
addiu $5, $2, -2
movz $1, $4, $1
srl $4, $1, 2
movz $5, $2, $4
movz $4, $1, $4
negu $1, $4
sltiu $2, $4, 2
movz $1, $3, $2
jr $ra
addu $2, $1, $5

New

ori $1, $zero, 65535
sltu $1, $1, $4
sll $1, $1, 4
srlv $2, $4, $1
addiu $3, $zero, 255
sltu $3, $3, $2
sll $3, $3, 3
srlv $2, $2, $3
or $1, $1, $3
addiu $3, $zero, 32
addiu $4, $zero, 1
addiu $5, $zero, 3
addiu $6, $zero, 15
sltu $6, $6, $2
sll $6, $6, 2
srlv $2, $2, $6
sltu $5, $5, $2
sll $5, $5, 1
srlv $2, $2, $5
sltu $4, $4, $2
srlv $2, $2, $4
or $1, $1, $6
or $1, $1, $5
or $1, $1, $4
addu $1, $1, $2
jr $ra
subu $2, $3, $1

So the overall conclusion, at least from me, is that if there's any improvement with the new implementation, it's probably going to be limited to RISC-V... (though also remember that targets like x86_64 and aarch64 have no use for this builtin, as they have a native instruction)
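To make the shape of the "New" listing above more concrete, here is a rough Rust sketch of that branchless, binary-search style of counting leading zeros. It is a generic illustration of the technique, not necessarily the exact code in this PR, and it assumes a 64-bit usize for simplicity.

pub fn usize_leading_zeros_sketch(x: usize) -> usize {
    // Each step compares the remaining value against a threshold, turns the
    // comparison into a shift amount, and applies it; no branches are needed.
    let mut x = x as u64;
    let mut n: u64 = 64;
    let mut s;
    s = ((x > 0xFFFF_FFFF) as u64) << 5; x >>= s; n -= s;
    s = ((x > 0xFFFF) as u64) << 4;      x >>= s; n -= s;
    s = ((x > 0xFF) as u64) << 3;        x >>= s; n -= s;
    s = ((x > 0xF) as u64) << 2;         x >>= s; n -= s;
    s = ((x > 0x3) as u64) << 1;         x >>= s; n -= s;
    s = (x > 0x1) as u64;                x >>= s; n -= s;
    // At this point x is 0 or 1, so subtracting it gives the exact count.
    (n - x) as usize
}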
Thank you very much for the spelunking. I looked at RISC-V specifically because it was giving me a hard time optimizing division functions. The new version gave such an assembly improvement on RISC-V and only slightly degraded local benchmarks on an out-of-order machine, so I assumed it would be the most performant for in-order CPUs in general and especially RISC-V. I thought I remembered testing the assembly for ARM as well, but looking at it again, the old version is significantly better there too. I think we should have both implementations, but enable the new one only on RISC-V. What do you think?
Yeah, I think it's fine to conditionally switch between implementations in compiler-builtins. It's a library that's used everywhere, after all.
If we do have multiple implementations, however, I'd like to ensure that they're implemented in such a way that they're both tested thoroughly, because otherwise it would be very easy to end up with a RISC-V-only bug.
I think that the code is large enough to move the

My current plan for testing both variations of counting leading zeros thoroughly is to have two functions,

Would a better solution be having a CI test for RISC-V? I read this post that says "the full Rust test suite passes for a RISC-V cross-compilation toolchain". I see a wide variety of CI tests being performed currently; I wonder why RISC-V isn't included.

edit: RISC-V GNU/Linux is being added as a host platform soon.
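A tiny sketch of that "two functions, both always tested" idea; the names usize_leading_zeros_default and usize_leading_zeros_riscv are placeholders for the two exposed variants, since the actual identifiers were not captured above.

#[test]
fn both_leading_zeros_variants_agree() {
    // Exercises both helpers against the standard library on a few
    // hand-picked edge cases.
    for &x in &[1usize, 2, 3, 0xFF, 0x100, 0xFFFF, 0x1_0000, usize::MAX] {
        assert_eq!(usize_leading_zeros_default(x) as u32, x.leading_zeros(), "default impl, x = {:#x}", x);
        assert_eq!(usize_leading_zeros_riscv(x) as u32, x.leading_zeros(), "riscv impl, x = {:#x}", x);
    }
}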
I'm not sure what's up with modules; nothing's really that fancy in this crate, and it should in general act similarly to the rest of Rust's module system. Are you sure that you reexported the intrinsics in

For testing, adding RISC-V should be fine.
This crate is old. RISC-V targets didn't exist when this crate was created. No one's added it since then.
Force-pushed from 3a22529 to 7f00717
What do you think about my testing setup? I don't know if there is a better solution. Does
Looks reasonable to me, thanks! I've got a few minor comments, but otherwise this seems good to go.
Can you also add comments explaining why the RISC-V version isn't used on other platforms?
}

#[cfg(any(target_arch = "riscv32", target_arch = "riscv64"))]
intrinsics! {
To avoid the duplication here, could this perhaps use if cfg!(...) internally?
Yes
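A hedged sketch of that single-definition shape, using the crate's intrinsics! macro as in the hunk above; the helper names are the same placeholders as before, not the PR's actual identifiers.

intrinsics! {
    pub extern "C" fn __clzsi2(x: usize) -> usize {
        // cfg! is evaluated at compile time, so the unused branch is
        // optimized away and only one algorithm ends up in the object file.
        if cfg!(any(target_arch = "riscv32", target_arch = "riscv64")) {
            usize_leading_zeros_riscv(x)
        } else {
            usize_leading_zeros_default(x)
        }
    }
}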
src/int/mod.rs (Outdated)

mod leading_zeros;

#[cfg(feature = "expose-alt-impls")]
pub mod leading_zeros;
I think it'd be fine to just always export this module; there shouldn't be a need to make it #[cfg] pub.
Making it pub always will expose the usize_leading_zeros... functions by default, but the only reason they should ever be public is for testing purposes.
I think exporting that is fine, there are many exported symbols from this library other than the intrinsics due to traits and such, so we don't need to be super strict about what exactly is exported.
I have made it always public and removed the unneeded configuration.
I found out that
is huge (much larger than it should be even before this PR) but changing the last line to be
fixes it. I think there is a similar problem somewhere.
Force-pushed from f2dcdcf to 4a976e0
Force-pushed from 4a976e0 to db013ec
This was something I originally made for PR 332, but I realized I should PR it by itself so I can remove it from my specialized-div-rem crate. As an example, the assembly generation for RISC-V used to be ... but is now: ...

This is completely branchless. In fact, I could make it const. A previous comment mentioned making __clzsi2 const, but intrinsics! doesn't seem to support const.