Replace jump branch with skip branch optimization #17
Reference = version 0.4.0, no loop unrolling.
Proposed optimization, only added to the non-unrolled version as a quick test.
Shorter code is equally fast.
// process one bit
uint8_t d = *localDataOutRegister | outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization; generates no assembly
if ((value & m) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d;
(Still the non-unrolled loop, as that is easiest to patch.) Assuming that dataOutRegister does not change during the loop:
uint8_t d0 = *localDataOutRegister & outmask2; // cache 0
uint8_t d1 = d0 | outmask1; // cache 1
for (uint8_t m = 1; m > 0; m <<= 1)
{
  // process one bit
  uint8_t d = d1; // always set bit
  asm volatile("" : "+r" (d)); // prevent compiler optimization; generates no assembly
  if ((value & m) == 0) d = d0; // only reset if needed
  *localDataOutRegister = d;
  // if ((value & m) == 0) *localDataOutRegister &= outmask2;
  // else *localDataOutRegister |= outmask1;
  uint8_t r = *localClockRegister;
  *localClockRegister = r | cbmask1; // set one bit
  *localClockRegister = r; // reset it
}
==>
Looks like an improvement. Can you shoot a hole in it, or is it stable?
A step back: no asm required.
uint8_t d0 = *localDataOutRegister & outmask2; // cache 0
uint8_t d1 = d0 | outmask1; // cache 1
for (uint8_t m = 1; m > 0; m <<= 1)
{
  // process one bit
  if ((value & m) == 0) *localDataOutRegister = d0;
  else *localDataOutRegister = d1;
  // if ((value & m) == 0) *localDataOutRegister &= outmask2;
  // else *localDataOutRegister |= outmask1;
  uint8_t r = *localClockRegister;
  *localClockRegister = r | cbmask1; // set one bit
  *localClockRegister = r; // reset it
}
==>
Going from 14.10 => 12.83 is a saving of 1.27. If the same saving carries over, the unrolled loop could go from 11.51 => 10.2 something. Opinion?
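To sanity-check that the cached variant presents the same data values as the original read-modify-write variant, here is a minimal host-side sketch. It is not the library's code: the port is modelled as a plain byte, the data bit position (`0x08`) is an arbitrary assumption, and `shiftOutRMW`, `shiftOutCached` and `variantsAgree` are hypothetical names used only for this illustration. The clock pulse is replaced by recording the port value.

```cpp
#include <cstdint>
#include <vector>

// Assumption for this sketch: the data pin is bit 3 of the data port.
static const uint8_t outmask1 = 0x08;                             // OR mask, sets the data bit
static const uint8_t outmask2 = static_cast<uint8_t>(~outmask1);  // AND mask, clears it

// Original style: read-modify-write of the port for every bit.
// Returns the port value present at each (simulated) clock pulse.
std::vector<uint8_t> shiftOutRMW(uint8_t port, uint8_t value)
{
  std::vector<uint8_t> seen;
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    if ((value & m) == 0) port &= outmask2;
    else                  port |= outmask1;
    seen.push_back(port);  // stands in for the clock pulse
  }
  return seen;
}

// Cached style: compute both possible port values once, then only write.
std::vector<uint8_t> shiftOutCached(uint8_t port, uint8_t value)
{
  std::vector<uint8_t> seen;
  const uint8_t d0 = port & outmask2;  // cache 0
  const uint8_t d1 = d0 | outmask1;    // cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    port = ((value & m) == 0) ? d0 : d1;
    seen.push_back(port);  // stands in for the clock pulse
  }
  return seen;
}

// True if both variants produce identical port values for every byte.
bool variantsAgree(uint8_t initialPort)
{
  for (int v = 0; v < 256; ++v)
  {
    const uint8_t value = static_cast<uint8_t>(v);
    if (shiftOutRMW(initialPort, value) != shiftOutCached(initialPort, value)) return false;
  }
  return true;
}
```

The two variants only agree as long as nothing else modifies the port between the caching step and the last write. If, for example, an interrupt handler toggled another pin on the same port during the loop, the cached write would overwrite that change, whereas the read-modify-write version would keep it.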
Oh yep, shortening the code like that works too. I can rework my assembly to take your proposed optimization into account, as I believe it currently clashes with it. One thing would be to initialize the clock to 0, so that if the data and clock lines are on the same register, the cached data values also have the clock low.
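The shared-register concern can be illustrated with a small host-side model. This is a sketch under assumptions, not the library's behaviour: data and clock are placed on one simulated port, the bit positions (`0x08` for data, `0x10` for clock) are arbitrary, and `countRisingEdges` is a hypothetical helper. It counts how many rising clock edges a receiver would see while one byte is shifted out with the cached `d0` / `d1` values.

```cpp
#include <cstdint>

// Assumptions for this sketch: data and clock share one port.
static const uint8_t outmask1 = 0x08;                             // data bit (OR mask)
static const uint8_t outmask2 = static_cast<uint8_t>(~outmask1);  // data bit (AND mask)
static const uint8_t cbmask1  = 0x10;                             // clock bit

// Shifts out one byte on a simulated shared port and returns the number of
// rising clock edges. If clearClockInCache is true, the clock bit is forced
// low in the cached values.
int countRisingEdges(uint8_t port, uint8_t value, bool clearClockInCache)
{
  uint8_t d0 = port & outmask2;                                   // cache 0
  if (clearClockInCache) d0 &= static_cast<uint8_t>(~cbmask1);    // clock low in cache
  const uint8_t d1 = d0 | outmask1;                               // cache 1

  int edges = 0;
  // Every write goes through this helper so that all edges are counted.
  auto write = [&](uint8_t v) {
    if ((port & cbmask1) == 0 && (v & cbmask1) != 0) ++edges;
    port = v;
  };

  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    write(((value & m) == 0) ? d0 : d1);  // data write also rewrites the clock bit
    const uint8_t r = port;
    write(r | cbmask1);                   // set clock bit
    write(r);                             // restore previous value
  }
  return edges;
}
```

In this model, caching while the clock bit is high produces no rising edges at all, because every data write puts the clock high again before the pulse. Caching with the clock low, or masking the clock bit out of the cached values, gives the expected eight edges.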
The clock is set low in the constructor. If time permits I will check with the scope again.
FYI
Hi Nick, I found some time to create a develop branch + PR for the last optimization. Thanks!
One other optimization you can do is eliminating some expensive jumps when setting the data register. This does require some inline assembly, but only to prevent the compiler from optimizing it out. The assembly forces the compiler to generate an alternative optimization (a skip branch instruction), which is actually faster.
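As a standalone illustration of the idea, the sketch below wraps the per-bit selection in a hypothetical helper, `pickDataByte` (not part of the library). The empty `asm` statement with a `"+r"` operand is a GCC/Clang extension that emits no instructions but makes the compiler treat `d` as possibly modified, so it cannot merge the two assignments into another form. Whether the result is actually a skip instruction (on AVR, for example `sbrs` / `sbrc`) depends on the compiler version and flags and should be checked in the generated assembly; on a host machine the sketch only shows that the barrier leaves the value untouched.

```cpp
#include <cstdint>

// Picks the byte to write to the data port for one bit of `value`.
// d0 / d1 are the precomputed port values for a 0 bit and a 1 bit.
uint8_t pickDataByte(uint8_t value, uint8_t m, uint8_t d0, uint8_t d1)
{
  uint8_t d = d1;                  // always start from the "bit set" case
#if defined(__GNUC__)
  // Empty asm with a read-write register operand: generates no assembly,
  // but the optimizer must assume d may have changed here.
  asm volatile("" : "+r"(d));
#endif
  if ((value & m) == 0) d = d0;    // only overwrite for a 0 bit
  return d;
}
```

A caller would use it once per bit, for example `*localDataOutRegister = pickDataByte(value, m, d0, d1);` inside the shift loop.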