Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace jump branch with skip branch optimization #17

Closed
nt314p opened this issue Sep 19, 2024 · 7 comments
Closed

Replace jump branch with skip branch optimization #17

nt314p opened this issue Sep 19, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@nt314p
Copy link

nt314p commented Sep 19, 2024

One other optimization you can do is eliminating some expensive jumps when setting the data register. This does require some inline assembly, but only to prevent the compiler from optimizing it out. The assembly forces the compiler to generate an alternative optimization (a skip branch instruction), which is actually faster.

uint8_t d = *localDataOutRegister;
d |= outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & 0x01) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d;
@RobTillaart RobTillaart self-assigned this Sep 19, 2024
@RobTillaart RobTillaart added the enhancement New feature or request label Sep 19, 2024
@RobTillaart
Copy link
Owner

reference = 0.4.0 version no loop unroll

Performance - time in us
        write: 15.34
        write: 29.43
        Delta: 14.10

Proposed optimization - only added in the not unrolled version for quick test.

Performance - time in us
        write: 15.08
        write: 28.92
        Delta: 13.84

shorter code is equally fast.

    //  process one bit
    uint8_t d = *localDataOutRegister | outmask1; // always set bit
    asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
    if ((value & m) == 0) d &= outmask2; // only reset if needed
    *localDataOutRegister = d;

@RobTillaart
Copy link
Owner

@nt314p

(still not unrolled loop as that is easiest to patch)

Assuming that dataOutRegister does not change

  • clock bit is set back to what it was,
  • there are no interrupts.
  uint8_t d0 = *localDataOutRegister & outmask2;  //  cache 0
  uint8_t d1 = d0 | outmask1;                     //  cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    //  process one bit
    uint8_t d = d1;               //  always set bit
    asm volatile("" : "+r" (d));  //  prevent compiler optimization. generates no assembly
    if ((value & m) == 0) d = d0; //  only reset if needed
    *localDataOutRegister = d;
    // if ((value & m) == 0) *localDataOutRegister &= outmask2;
    // else                  *localDataOutRegister |= outmask1;
    uint8_t r = *localClockRegister;
    *localClockRegister = r | cbmask1;  //  set one bit
    *localClockRegister = r;            //  reset it
  }

==>

Performance - time in us
        write: 14.33
        write: 27.42
        Delta: 13.09

looks like an improvement, can you shoot a hole in it or is it stable?

@RobTillaart
Copy link
Owner

RobTillaart commented Sep 19, 2024

@nt314p

A step back, no asm required.

  uint8_t d0 = *localDataOutRegister & outmask2;  //  cache 0
  uint8_t d1 = d0 | outmask1;                     //  cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    //  process one bit
    if ((value & m) == 0) *localDataOutRegister = d0;
    else *localDataOutRegister = d1;
    // if ((value & m) == 0) *localDataOutRegister &= outmask2;
    // else                  *localDataOutRegister |= outmask1;
    uint8_t r = *localClockRegister;
    *localClockRegister = r | cbmask1;  //  set one bit
    *localClockRegister = r;            //  reset it
  }

==>

Performance - time in us
        write: 14.08
        write: 26.91
        Delta: 12.83

From 14.10 => 12.83 means the unrolled loop could go from 11.51 => 10.2 something.

Opinion?

@nt314p
Copy link
Author

nt314p commented Sep 19, 2024

Oh yep shortening the code like that works too. I can rework my assembly to take into account your proposed optimization, as it is currently clashing with it I believe.

One thing would be to initialize the clock to 0. This is so that if the data and clock lines are on the same register, when you cache the data you also have clock low.

@RobTillaart
Copy link
Owner

The clock is set low in the constructor.
It is changed to high and back to low after data is set.
So the start condition of the loop for relevant pins is restored correctly.(Unless I missed something)

If time permits I will check with scope again.

@RobTillaart
Copy link
Owner

FYI
Had a look on the scope this morning and the signals looked similar to those in #15.

RobTillaart added a commit that referenced this issue Oct 31, 2024
RobTillaart added a commit that referenced this issue Oct 31, 2024
@RobTillaart
Copy link
Owner

@nt314p

Hi Nick,

Found some time to create a develop branch + PR for the last optimization.
If you have time, please have a look at the develop branch.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants