Implementation of Ord for integers is suboptimal #63758

nagisa · 2019-08-20T20:55:11Z

The current implementation of Ord::cmp for integral types results in less than optimal code. Currently the implementation looks something like this:

fn cmp(&self, other: &$t) -> Ordering {
    if *self == *other { Equal }
    else if *self < *other { Less }
    else { Greater }
}

This results in the following IR:

  %2 = icmp eq i64 %0, %1
  %3 = icmp ult i64 %0, %1
  %..i = select i1 %3, i8 -1, i8 1
  %_0.0.i = select i1 %2, i8 0, i8 %..i
  ret i8 %_0.0.i

which, on x86, becomes

	cmpq	%rsi, %rdi
	setae	%al
	cmpq	%rsi, %rdi
	je	.LBB0_1
	addb	%al, %al
	addb	$-1, %al
        ; exit (retq)
    .LBB0_1:
	xorl	%eax, %eax
        ; exit (retq)

where the critical path looks like this (courtesy of llvm-mca -mcpu=broadwell):

              Instruction                                 Dependency Information
 +----< 0.    cmpq	%rsi, %rdi
 +----> 1.    setae	%al                               ## REGISTER dependency:  %flags
 |      2.    cmpq	%rsi, %rdi
 |      3.    je	.LBB0_1
 +----> 4.    addb	%al, %al                          ## REGISTER dependency:  %al
 +----> 5.    addb	$-1, %al                          ## REGISTER dependency:  %al
 |      6.    ; exit
 |      7.    xorl	%eax, %eax
 |      8.    ; exit
 |
 |    < loop carried > 
 |
 +----> 2.    cmpq	%rsi, %rdi                        ## RESOURCE interference:  BWPort1 [ probability: 25% ]

if the implementation instead was:

    if a < b {
        std::cmp::Ordering::Less
    } else if a > b {
        std::cmp::Ordering::Greater
    } else {
        std::cmp::Ordering::Equal
    }

the IR would become

  %0 = icmp ult i64 %a, %b
  %1 = icmp ugt i64 %a, %b
  %. = zext i1 %1 to i8
  %_0.0 = select i1 %0, i8 -1, i8 %.
  ret i8 %_0.0

which in turn would compile down to

	movb	$-1, %al
	cmpq	%rsi, %rdi
	seta	%cl
	jb	.LBB1_2
	movl	%ecx, %eax
    .LBB1_2:
        ; exit (retq)

for which the critical path (as expected) only becomes visible on a 2nd iteration:

              Instruction                                 Dependency Information
 +----< 2.    seta	%cl
 |
 |    < loop carried > 
 |
 |      0.    movb	$-1, %al
 +----> 1.    cmpq	%rsi, %rdi                        ## RESOURCE interference:  BWPort6 [ probability: 97% ]
 +----> 2.    seta	%cl                               ## REGISTER dependency:  %flags
 |      3.    jb	.LBB1_2
 +----> 4.    movl	%ecx, %eax                        ## REGISTER dependency:  %cl

llvm-mca reports that the reciprocal throughput for the improved version is 1.5 or lower (getting as low as 1.3 on znver1; lower is better for reciprocal throughput) for various x86 architectures whereas the old code always exceeds 2.0, reaching 3.0 on core2.

The text was updated successfully, but these errors were encountered:

nagisa added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 20, 2019

jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Aug 20, 2019

tesuji mentioned this issue Aug 21, 2019

Use more optimal Ord implementation for integers #63767

Merged

bors closed this as completed in 30fd79c Aug 22, 2019

scottmcm mentioned this issue Dec 5, 2023

Add opt patterns for 3-way comparison (Ord::cmp or <=>) bytecodealliance/wasmtime#7636

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation of Ord for integers is suboptimal #63758

Implementation of Ord for integers is suboptimal #63758

nagisa commented Aug 20, 2019 •

edited

Loading

Implementation of Ord for integers is suboptimal #63758

Implementation of Ord for integers is suboptimal #63758

Comments

nagisa commented Aug 20, 2019 • edited Loading

nagisa commented Aug 20, 2019 •

edited

Loading