You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a case where uiCA predictons for SKL seem to be pretty far off. Pretty much all tools I know of get this one wrong, despite it only using reg-reg operations.
uiCA predicts 4c/iteration throughput, actual observed throughput on a Skylake laptop (i7-6560U) is 6c/iteration. If you take out one instruction on the non-PSADBW critical path (say comment out the paddd xmm2, xmm3), this does run at 4c/iteration on real HW, and uiCA agrees.
The actual computation here is nonsense, I was just trying to come up with a small repro.
The case this is setting up into is two vector instructions with different latencies on the same port (p5 in this case) that would have to finish in the same cycle. They can't - the vector RF and bypass network can accept one result per port per cycle, no more, as far as I know. I do not know what the exact criteria are, nor why the penalty here is two cycles and not one. I do not know how often this occurs in practice but I do know that I have hit cases in the past where this seems to be a factor.
The text was updated successfully, but these errors were encountered:
This is a case where uiCA predictons for SKL seem to be pretty far off. Pretty much all tools I know of get this one wrong, despite it only using reg-reg operations.
Test case: https://bit.ly/3jlvOOJ
uiCA predicts 4c/iteration throughput, actual observed throughput on a Skylake laptop (i7-6560U) is 6c/iteration. If you take out one instruction on the non-PSADBW critical path (say comment out the
paddd xmm2, xmm3
), this does run at 4c/iteration on real HW, and uiCA agrees.The actual computation here is nonsense, I was just trying to come up with a small repro.
The case this is setting up into is two vector instructions with different latencies on the same port (p5 in this case) that would have to finish in the same cycle. They can't - the vector RF and bypass network can accept one result per port per cycle, no more, as far as I know. I do not know what the exact criteria are, nor why the penalty here is two cycles and not one. I do not know how often this occurs in practice but I do know that I have hit cases in the past where this seems to be a factor.
The text was updated successfully, but these errors were encountered: