IR Interpreter optimization ideas #19143

Open
hrydgard opened this issue May 14, 2024 · 0 comments
Labels
IRInterpreter Occurs with IR Interpreter but not with another CPU backend.
hrydgard commented May 14, 2024

The IR Interpreter (code link) is more important now that iOS has joined the primary supported platforms.

Mainly, this is about reducing the number of instructions interpreted (to cut interpreter overhead) and adding simpler versions of instructions that are faster to run.

Ideas

Blocks should not be allocated individually; instead, use one large bump allocator. This will slightly improve cache locality, and the pseudo-instructions can then hold offsets into this buffer, which avoids a lookup per block.
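
A minimal sketch of the bump allocator idea, assuming a fixed-capacity arena. `Inst`, `InstArena` and their members are hypothetical names for illustration, not the existing PPSSPP types. Every block is appended to one buffer, and the returned offset is what a pseudo-instruction could store in place of a block number:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an IR instruction; the real layout may differ.
struct Inst {
    uint8_t op;
    uint8_t dest;
    uint8_t src1;
    uint8_t src2;
    uint32_t constant;
};

// Bump allocator: all blocks live back to back in one buffer.
class InstArena {
public:
    explicit InstArena(size_t capacity) : capacity_(capacity) {
        // Reserve up front so the buffer never moves while blocks are live.
        storage_.reserve(capacity);
    }

    // Copies `count` instructions into the arena and returns the offset
    // (in instructions) of the first one, or -1 if the arena is full.
    int32_t Append(const Inst *insts, size_t count) {
        if (storage_.size() + count > capacity_)
            return -1;  // Assumption: the caller clears the cache and retries.
        int32_t offset = (int32_t)storage_.size();
        storage_.insert(storage_.end(), insts, insts + count);
        return offset;
    }

    // Turns a stored offset back into a pointer, with no per-block lookup.
    const Inst *At(int32_t offset) const { return storage_.data() + offset; }
    size_t Used() const { return storage_.size(); }
    void Clear() { storage_.clear(); }

private:
    size_t capacity_;
    std::vector<Inst> storage_;
};
```

Offsets rather than raw pointers keep the stored value at 32 bits, so it fits in an instruction's constant field on 64-bit hosts as well.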

  • Make Downcount metadata on the block, when there's only one Downcount instruction and it's before any branches or branch targets (the latter conditions may be needed for some future optimizations). Or just move them to the top of the block, assuming that they're there.
  • Multi-load/multi-store instructions for consecutive registers (to replace long sequences of, say, Load32 s0, sp, 0x14; Load32 s1, sp, 0x18; Load32 s2, sp, 0x1C; etc). Additionally, consecutive sequences of zero stores are common. This will need a sorting pass first. For floats, it's common to see Store32 f20, sp, 4; Store32 f22, sp, 8;. That register stride is probably an artifact of a MIPS compiler for a CPU that had double precision support (f20:f21 would be one register then).
  • Vec4Scale + Vec4Add can often be merged into a new Vec4ScaleAdd without adding more operands
  • Super-specialized instructions, for example:
    • AddConst sp, sp, 0x30 is very common, it might be very slightly beneficial to add Increment sp, 0x30
    • sw zero, sp, 0x10 is quite common, can save a register file access
    • Vec4Shuffle: specialize the most common shuffle patterns
    • Vec4Blend: specialize the most common blend patterns
    • Matrix multiplication should be done in one IR instruction (even if broken apart for the other backends to share logic)
    • FCmp + FPCondToReg in one instruction
  • Write the IRInterpreter in assembler (hopefully not needed, but would love to get rid of the range check that the switch emits, at least on x86)
  • Avoid breaking apart some instructions that require a lot of tiny instructions as output, like VDot. We do want to break these apart for JitIR compilation, but not for interpretation, so we might need to compile into IR differently depending on context.
  • Syscall could merge with RestoreRoundingMode/ApplyRoundingMode
  • lv.s should not become a complicated Shuffle/Blend thing
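
As one concrete case from the list above, the multi-load idea could be sketched roughly like this. The `Inst` layout and the `LoadMulti32` op are assumptions made up for illustration, and the sketch expects the loads to be sorted already (the sorting pass mentioned above is not shown):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical IR types, not the real PPSSPP definitions.
enum class Op : uint8_t { Load32, LoadMulti32, Other };

struct Inst {
    Op op;
    uint8_t dest;       // first destination register
    uint8_t src1;       // base address register
    uint8_t count;      // number of registers, used by LoadMulti32 only
    uint32_t constant;  // byte offset from the base register
};

// Collapses runs of Load32 that use the same base register, consecutive
// destination registers and offsets 4 bytes apart into one LoadMulti32.
std::vector<Inst> MergeConsecutiveLoads(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    size_t i = 0;
    while (i < in.size()) {
        const Inst &first = in[i];
        size_t run = 1;
        // A load that overwrites its own base register is never merged,
        // since the loads after it would see a different address.
        if (first.op == Op::Load32 && first.dest != first.src1) {
            while (i + run < in.size()) {
                const Inst &next = in[i + run];
                bool extends = next.op == Op::Load32 &&
                    next.src1 == first.src1 &&
                    next.dest != next.src1 &&
                    next.dest == first.dest + run &&
                    next.constant == (uint32_t)(first.constant + 4 * run);
                if (!extends)
                    break;
                run++;
            }
        }
        if (run >= 2)
            out.push_back({Op::LoadMulti32, first.dest, first.src1, (uint8_t)run, first.constant});
        else
            out.push_back(first);
        i += run;
    }
    return out;
}
```

With s0..s2 as registers 16..18 and sp as 29, the three loads from the example in the list would come out as a single LoadMulti32 with a count of 3.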

There's a bit of tension between keeping the IR easy to run passes on, and making it fast to interpret. Additionally, if we make too many instructions they'll put higher load on the instruction cache...

Maybe we should translate further to an even more specialized interpreter IR, or, only ever use the super-specialized instructions in the very final optimization pass so only the interpreter needs to care about them. In that case they should be marked clearly.

Additionally, things like inlining of small blocks (that would also apply to the JIT) may help.

Pass optimization:

  • Imm tracking for floats could enable replacing long streaks of SetFloatConst f12, 0; StoreFloat f14, a0, offset; StoreFloat f14, a0, offset+4; etc

Missed peephole optimizations:

From GTA: this simply converts fixed-point 16-bit 4-vectors to float, but generates a lot of bloat:
vs2i.p C100, C200
vi2f.q C100, C100, 23

AddConst a2, sp, 70
AddConst a2, a2, 8
> Should be AddConst a2, sp, 78
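
A rough sketch of how a peephole pass could fold this pair. The `Inst` struct here is a hypothetical simplification, and the fold only fires when the second AddConst both reads and writes the register the first one wrote, so no liveness analysis is needed:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical IR types, not the real PPSSPP definitions.
enum class Op : uint8_t { AddConst, Other };

struct Inst {
    Op op;
    uint8_t dest;
    uint8_t src1;
    uint32_t constant;
};

// Folds "AddConst d, s, c1; AddConst d, d, c2" into "AddConst d, s, c1 + c2".
std::vector<Inst> FoldAddConstChains(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    for (const Inst &inst : in) {
        if (!out.empty()) {
            Inst &prev = out.back();
            bool chained = prev.op == Op::AddConst && inst.op == Op::AddConst &&
                inst.src1 == prev.dest && inst.dest == prev.dest;
            if (chained) {
                // Unsigned addition wraps modulo 2^32, like the 32-bit add itself.
                prev.constant += inst.constant;
                continue;
            }
        }
        out.push_back(inst);
    }
    return out;
}
```

Because each instruction is compared against the last emitted one, chains longer than two collapse into a single AddConst as well.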

FMovFromGPR f12, a1
FCvtSW f12, f12
> Should be combined into a move-and-convert instruction. Common in Syphon Filter.

sll t0, t0, 0x18
sra t0, t0, 0x18
> this is just a sign extension byte->word
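
Sketched as a peephole, assuming hypothetical op names (`ShlImm`, `SarImm`, `Ext8to32`, `Ext16to32`) and a simplified `Inst` struct. A left shift followed by an arithmetic right shift by the same amount on the same register becomes one sign-extend when the amount is 24 (byte) or 16 (halfword):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical IR types, not the real PPSSPP definitions.
enum class Op : uint8_t { ShlImm, SarImm, Ext8to32, Ext16to32, Other };

struct Inst {
    Op op;
    uint8_t dest;
    uint8_t src1;
    uint8_t imm;  // shift amount, unused by the Ext ops
};

// Replaces "ShlImm d, s, n; SarImm d, d, n" with a sign extension for n = 24 or 16.
std::vector<Inst> FoldShiftPairs(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    for (size_t i = 0; i < in.size(); i++) {
        const Inst &a = in[i];
        if (i + 1 < in.size()) {
            const Inst &b = in[i + 1];
            bool pair = a.op == Op::ShlImm && b.op == Op::SarImm &&
                b.src1 == a.dest && b.dest == a.dest && a.imm == b.imm;
            if (pair && (a.imm == 24 || a.imm == 16)) {
                out.push_back({a.imm == 24 ? Op::Ext8to32 : Op::Ext16to32, a.dest, a.src1, 0});
                i++;  // The SarImm has been consumed too.
                continue;
            }
        }
        out.push_back(a);
    }
    return out;
}
```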

Common when writing floats to display list:

FMovToGPR a2, f12
ShrImm a2, 08
Or a2, a2, t3
Store32 a2, a1, 0000000

This one takes four inputs though.

In the Wipeout games, for some reason (can just omit the load):
sv.q C400, 0x90(sp)
lv.q C400, 0x90(sp)
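
A possible peephole for this pattern, as a sketch under a few assumptions: the op names and `Inst` struct are hypothetical, the two instructions are adjacent, and the address is ordinary memory (such as the stack) where reading back a value that was just written has no side effects:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical IR types, not the real PPSSPP definitions.
enum class Op : uint8_t { StoreVec4, LoadVec4, Other };

struct Inst {
    Op op;
    uint8_t reg;      // vector register being stored or loaded
    uint8_t base;     // base address register
    uint32_t offset;  // byte offset from the base register
};

// Drops a LoadVec4 that reloads exactly what the preceding StoreVec4 wrote
// from the same register to the same address.
std::vector<Inst> DropRedundantReloads(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    for (const Inst &inst : in) {
        if (inst.op == Op::LoadVec4 && !out.empty()) {
            const Inst &prev = out.back();
            bool sameSlot = prev.op == Op::StoreVec4 && prev.reg == inst.reg &&
                prev.base == inst.base && prev.offset == inst.offset;
            if (sameSlot)
                continue;  // The register already holds this value.
        }
        out.push_back(inst);
    }
    return out;
}
```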

Test cases

Curiously heavy on the CPU:

  • Crash of the Titans - seems to be both spending a lot of time in a horrendously complex idle loop and underutilizing the VFPU. Lots of inefficient FPU code that spends much of its time storing and loading regs to/from the stack.
@hrydgard hrydgard added this to the v1.18.0 milestone May 14, 2024
@hrydgard hrydgard added the IRInterpreter Occurs with IR Interpreter but not with another CPU backend. label May 14, 2024
@hrydgard hrydgard modified the milestones: v1.18.0, v1.19.0 Sep 25, 2024