Replace jump branch with skip branch optimization #17

nt314p · 2024-09-19T15:57:38Z

One other optimization is to eliminate some expensive jumps when setting the data register. This does require some inline assembly, but only to prevent the compiler from optimizing the pattern away. The empty assembly statement forces the compiler to generate an alternative (a skip branch instruction), which is actually faster.

uint8_t d = *localDataOutRegister;
d |= outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & 0x01) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d;
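
For anyone who wants to try the pattern off-target first, here is a minimal host-side sketch. Everything not in the snippet above is an assumption for illustration: `fakePort` stands in for the real data-out register, the function names are hypothetical, and `outmask2` is assumed to be the bitwise inverse of `outmask1`. It only checks that the "always set, conditionally clear" form writes the same value as a plain if/else. Whether the compiler really emits a skip instruction (such as `sbrs`/`sbrc` on AVR) depends on target and compiler version, so that still needs to be checked in the generated assembly.

```cpp
#include <cassert>   // for assert() in the usage line below
#include <cstdint>

// Hypothetical stand-in for the memory-mapped data-out register.
static volatile uint8_t fakePort = 0;

// Plain if/else form: set or clear the data bit depending on `bit`.
uint8_t writeBitBranch(uint8_t initial, uint8_t outmask1, bool bit)
{
  const uint8_t outmask2 = static_cast<uint8_t>(~outmask1);  // assumption: inverse mask
  fakePort = initial;
  if (bit) fakePort = fakePort | outmask1;
  else     fakePort = fakePort & outmask2;
  return fakePort;
}

// Form from the issue: always set the bit, then clear it only if needed.
uint8_t writeBitSkip(uint8_t initial, uint8_t outmask1, bool bit)
{
  const uint8_t outmask2 = static_cast<uint8_t>(~outmask1);  // assumption: inverse mask
  fakePort = initial;
  uint8_t d = fakePort;
  d |= outmask1;                 // always set bit
  asm volatile("" : "+r"(d));    // empty asm, acts only as a barrier on d (GCC/Clang syntax)
  if (!bit) d &= outmask2;       // only reset if needed
  fakePort = d;
  return fakePort;
}

// Compares both forms for every register content, every single-bit mask, both bit values.
bool formsAgree()
{
  for (int initial = 0; initial < 256; ++initial)
    for (int bitPos = 0; bitPos < 8; ++bitPos)
      for (int bit = 0; bit < 2; ++bit)
      {
        const uint8_t init = static_cast<uint8_t>(initial);
        const uint8_t mask = static_cast<uint8_t>(1u << bitPos);
        if (writeBitBranch(init, mask, bit != 0) != writeBitSkip(init, mask, bit != 0))
          return false;
      }
  return true;
}
```

Usage would be a single `assert(formsAgree());` in a test. This says nothing about timing; it is only a functional check.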

RobTillaart · 2024-09-19T18:09:01Z

Reference = 0.4.0 version, no loop unrolling.

Performance - time in us
        write: 15.34
        write: 29.43
        Delta: 14.10

Proposed optimization, only added to the non-unrolled version as a quick test.

Performance - time in us
        write: 15.08
        write: 28.92
        Delta: 13.84

Shorter code is equally fast:

    //  process one bit
    uint8_t d = *localDataOutRegister | outmask1; // always set bit
    asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
    if ((value & m) == 0) d &= outmask2; // only reset if needed
    *localDataOutRegister = d;

RobTillaart · 2024-09-19T18:35:23Z

@nt314p

(Still the non-unrolled loop, as that is easiest to patch.)

Assuming that dataOutRegister does not change during the transfer (the clock bit is set back to what it was and there are no interrupts), both possible output values can be cached up front:

  uint8_t d0 = *localDataOutRegister & outmask2;  //  cache 0
  uint8_t d1 = d0 | outmask1;                     //  cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    //  process one bit
    uint8_t d = d1;               //  always set bit
    asm volatile("" : "+r" (d));  //  prevent compiler optimization. generates no assembly
    if ((value & m) == 0) d = d0; //  only reset if needed
    *localDataOutRegister = d;
    // if ((value & m) == 0) *localDataOutRegister &= outmask2;
    // else                  *localDataOutRegister |= outmask1;
    uint8_t r = *localClockRegister;
    *localClockRegister = r | cbmask1;  //  set one bit
    *localClockRegister = r;            //  reset it
  }

==>

Performance - time in us
        write: 14.33
        write: 27.42
        Delta: 13.09

Looks like an improvement. Can you shoot a hole in it, or is it stable?
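
One way to probe the stated assumption is a host-side simulation. The sketch below is hypothetical: `fakeDataPort`, `simulatedInterrupt` and both function names are made up for illustration, `outmask2` is assumed to be the inverse of `outmask1`, and the clock pulse and asm barrier are left out to keep it short. It compares a read-modify-write loop with the cached d0/d1 loop, optionally letting "something else" set another pin on the same port halfway through the byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the data-out register.
static volatile uint8_t fakeDataPort = 0;

// Simulates other code (e.g. an interrupt handler) setting another pin on the same port.
static void simulatedInterrupt(uint8_t otherPinMask)
{
  fakeDataPort = fakeDataPort | otherPinMask;
}

// Read-modify-write per bit: every write starts from the current register content.
uint8_t shiftOutReadModifyWrite(uint8_t initial, uint8_t value, uint8_t outmask1, uint8_t interruptMask)
{
  const uint8_t outmask2 = static_cast<uint8_t>(~outmask1);  // assumption: inverse mask
  fakeDataPort = initial;
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    if ((value & m) == 0) fakeDataPort = fakeDataPort & outmask2;
    else                  fakeDataPort = fakeDataPort | outmask1;
    if (m == 0x08 && interruptMask != 0) simulatedInterrupt(interruptMask);
  }
  return fakeDataPort;
}

// Cached variant: both register values are computed once, before the loop.
uint8_t shiftOutCached(uint8_t initial, uint8_t value, uint8_t outmask1, uint8_t interruptMask)
{
  const uint8_t outmask2 = static_cast<uint8_t>(~outmask1);  // assumption: inverse mask
  fakeDataPort = initial;
  const uint8_t d0 = fakeDataPort & outmask2;  // cache 0
  const uint8_t d1 = d0 | outmask1;            // cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    fakeDataPort = ((value & m) == 0) ? d0 : d1;
    if (m == 0x08 && interruptMask != 0) simulatedInterrupt(interruptMask);
  }
  return fakeDataPort;
}
```

With `interruptMask == 0` both functions leave the port in the same state. With a non-zero mask the read-modify-write loop keeps the externally set pin, while the cached loop overwrites it on the next data write. So in this simulation the cached version is only equivalent as long as nothing else touches the port during the transfer, which matches the assumption stated above.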

RobTillaart · 2024-09-19T18:38:42Z

@nt314p

A step back, no asm required.

  uint8_t d0 = *localDataOutRegister & outmask2;  //  cache 0
  uint8_t d1 = d0 | outmask1;                     //  cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    //  process one bit
    if ((value & m) == 0) *localDataOutRegister = d0;
    else *localDataOutRegister = d1;
    // if ((value & m) == 0) *localDataOutRegister &= outmask2;
    // else                  *localDataOutRegister |= outmask1;
    uint8_t r = *localClockRegister;
    *localClockRegister = r | cbmask1;  //  set one bit
    *localClockRegister = r;            //  reset it
  }

==>

Performance - time in us
        write: 14.08
        write: 26.91
        Delta: 12.83

Going from 14.10 => 12.83 (about 1.27 us saved) means the unrolled loop could go from 11.51 => 10.2 something.

Opinion?

nt314p · 2024-09-19T20:51:49Z

Oh yep, shortening the code like that works too. I can rework my assembly to take your proposed optimization into account, as I believe it currently clashes with it.

One thing would be to initialize the clock to 0, so that if the data and clock lines are on the same register, the cached data also has the clock low.

RobTillaart · 2024-09-20T04:56:29Z

The clock is set low in the constructor.
It is changed to high and back to low after data is set.
So the start condition of the loop for the relevant pins is restored correctly. (Unless I missed something.)

If time permits I will check with scope again.
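
The clock discussion above can be checked in a small host-side simulation. This is a sketch under assumptions, not the library code: data and clock are assumed to share one port, `Trace` and `shiftOutSharedPort` are hypothetical names, and `outmask2` is assumed to be the inverse of `outmask1`. It runs the cached loop and records how many rising clock edges occur and whether the clock is low at every data write.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record of what the simulated port did during one byte.
struct Trace
{
  int     risingEdges;               // number of low-to-high clock transitions
  bool    clockLowAtEveryDataWrite;  // true if the clock bit was 0 right after each data write
  uint8_t finalPort;                 // port content after the loop
};

// Cached d0/d1 loop with data and clock on the same simulated port.
Trace shiftOutSharedPort(uint8_t initialPort, uint8_t value, uint8_t outmask1, uint8_t cbmask1)
{
  volatile uint8_t port = initialPort;                       // stand-in for the shared register
  const uint8_t outmask2 = static_cast<uint8_t>(~outmask1);  // assumption: inverse mask
  const uint8_t d0 = port & outmask2;                        // cache 0 (carries the clock bit state)
  const uint8_t d1 = d0 | outmask1;                          // cache 1
  Trace t{0, true, 0};
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    port = ((value & m) == 0) ? d0 : d1;                     // data write
    if ((port & cbmask1) != 0) t.clockLowAtEveryDataWrite = false;
    const uint8_t r = port;
    const bool wasLow = (r & cbmask1) == 0;
    port = r | cbmask1;                                      // set clock bit
    if (wasLow) t.risingEdges++;
    port = r;                                                // restore clock bit
  }
  t.finalPort = port;
  return t;
}
```

Starting with the clock low, the simulation produces eight rising edges and the clock is low at every data write, so the cached values stay valid across iterations. Starting with the clock high, it produces no rising edges at all, which is why the clock has to be low before the loop starts (the constructor takes care of that, per the comment above).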

RobTillaart · 2024-09-20T07:48:17Z

FYI
Had a look with the scope this morning and the signals looked similar to those in #15.

RobTillaart · 2024-10-31T14:00:28Z

@nt314p

Hi Nick,

Found some time to create a develop branch + PR for the last optimization.
If you have time, please have a look at the develop branch.

Thanks!

RobTillaart self-assigned this Sep 19, 2024

RobTillaart added the enhancement New feature or request label Sep 19, 2024

RobTillaart added a commit that referenced this issue Oct 31, 2024

fix #17, more optimizations

17740cd

RobTillaart added a commit that referenced this issue Oct 31, 2024

fix #17, more optimizations

1344127

RobTillaart closed this as completed in 5bc8571 Nov 1, 2024
