The IR Interpreter (code link) is more important now that iOS has joined the primary supported platforms.
Mainly, this will be about reducing the number of instructions interpreted (to reduce interpreter overhead) and making simpler versions of instructions that are faster to run.
Ideas
We should also not allocate blocks individually; instead, use a large bump allocator. This will slightly help cache locality, and we can use offsets into this buffer in the pseudo-instructions to avoid a lookup per block.
Turn Downcount into metadata on the block when there's only one Downcount instruction and it comes before any branches or branch targets (the latter conditions may be needed for some future optimizations). Or just move them to the top of the block, so the interpreter can assume that they're there.
Multi-load/multi-store instructions for consecutive registers (to replace long sequences of, say, Load32 s0, sp, 0x14; Load32 s1, sp, 0x18; Load32 s2, sp, 0x1C; etc.). Additionally, consecutive sequences of zero stores are common. This will need a sorting pass first. For floats, it's common to see Store32 f20, sp, 4; Store32 f22, sp, 8;. That register stride is probably an artifact of a MIPS compiler for a CPU that had double precision support (f20:f21 would be one register then).
Vec4Scale + Vec4Add can often be merged into a new Vec4ScaleAdd without adding more operands
Super-specialized instructions, for example:
AddConst sp, sp, 0x30 is very common; it might be very slightly beneficial to add Increment sp, 0x30
sw zero, sp, 0x10 is quite common, can save a register file access
Vec4Shuffle: specialize the most common shuffle patterns
Vec4Blend: specialize the most common blend patterns
Matrix multiplication should be done in one IR instruction (even if broken apart for the other backends to share logic)
FCmp + FPCondToReg in one instruction
Write the IRInterpreter in assembler (hopefully not needed, but would love to get rid of the range check that the switch emits, at least on x86)
Avoid breaking apart some instructions that require a lot of tiny instructions as output, like VDot. We do want to break these apart for JitIR compilation, but not for interpretation, so we might need to compile into IR differently depending on context.
Syscall could merge with RestoreRoundingMode/ApplyRoundingMode
lv.s should not become a complicated Shuffle/Blend thing
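As a rough illustration of the bump-allocator idea from the list above, here is a minimal sketch in which every block lives in one contiguous buffer and is referred to by its offset. The `Inst` layout and the `BlockArena` class are hypothetical names made up for this example, not the existing interpreter's types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical instruction layout; the real IR instruction struct may differ.
struct Inst {
    uint8_t op;
    uint8_t dest, src1, src2;
    uint32_t constant;
};

// All blocks share one contiguous buffer. A block is identified by the
// offset of its first instruction rather than by a separately allocated object.
class BlockArena {
public:
    explicit BlockArena(size_t reserveInsts) { insts_.reserve(reserveInsts); }

    // Copies a block to the end of the buffer and returns its offset,
    // measured in instructions.
    uint32_t Append(const std::vector<Inst> &block) {
        uint32_t offset = static_cast<uint32_t>(insts_.size());
        insts_.insert(insts_.end(), block.begin(), block.end());
        return offset;
    }

    // Offsets stay valid if the buffer grows and reallocates;
    // raw pointers stored in instructions would not.
    const Inst *At(uint32_t offset) const { return insts_.data() + offset; }

    size_t Size() const { return insts_.size(); }

private:
    std::vector<Inst> insts_;
};
```

A pseudo-instruction that jumps to another block could then carry the target's offset in its constant field, so dispatch becomes a single add on the buffer base instead of a per-block lookup.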
There's a bit of tension between keeping the IR easy to run passes on, and making it fast to interpret. Additionally, if we make too many instructions they'll put higher load on the instruction cache...
Maybe we should translate further to an even more specialized interpreter IR, or, only ever use the super-specialized instructions in the very final optimization pass so only the interpreter needs to care about them. In that case they should be marked clearly.
Additionally, things like inlining of small blocks (that would also apply to the JIT) may help.
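One way the "only in the very final pass, clearly marked" approach might look is sketched below. The opcode names, the `InterpOnlyFirst` marker, and the assumption that a store keeps its value register in the `dest` field are all made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical opcodes. Values at or above InterpOnlyFirst are the
// clearly marked interpreter-only specializations.
enum class Op : uint8_t {
    AddConst,
    Store32,
    InterpOnlyFirst = 0x80,
    Increment = InterpOnlyFirst,  // dest += constant
    Store32Zero,                  // mem[src1 + constant] = 0
};

struct Inst {
    Op op;
    uint8_t dest;  // assumption: for Store32, the register holding the value
    uint8_t src1;
    uint32_t constant;
};

constexpr uint8_t REG_ZERO = 0;  // MIPS $zero

// Lets other backends or passes assert that they never see a special op.
inline bool IsInterpOnly(Op op) {
    return static_cast<uint8_t>(op) >= static_cast<uint8_t>(Op::InterpOnlyFirst);
}

// Meant to run last, after all generic passes, so those passes only
// ever deal with the plain instruction set.
inline void SpecializeForInterpreter(std::vector<Inst> &block) {
    for (Inst &inst : block) {
        if (inst.op == Op::AddConst && inst.dest == inst.src1) {
            inst.op = Op::Increment;    // AddConst sp, sp, imm
        } else if (inst.op == Op::Store32 && inst.dest == REG_ZERO) {
            inst.op = Op::Store32Zero;  // sw zero, base, imm
        }
    }
}
```

Keeping the specialized opcodes in their own numeric range makes the "marked clearly" part cheap to check, at the cost of a larger switch in the interpreter.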
Pass optimization:
Imm tracking for floats could enable replacing long streaks of SetFloatConst f12, 0; StoreFloat f14, a0, offset; StoreFloat f14, a0, offset+4; etc.
Missed peephole optimizations:
From GTA: this simply converts fixed-point 16-bit 4-vectors to float, but generates a lot of bloat:
vs2i.p C100, C200
vi2f.q C100, C100, 23
AddConst a2, sp, 70
AddConst a2, a2, 8
> Should be AddConst a2, sp, 78
FMovFromGPR f12, a1
FCvtSW f12, f12
> Should be combined into a move-and-convert instruction. Common in Syphon Filter.
sll t0, t0, 0x18
sra t0, t0, 0x18
> This is just a byte->word sign extension.
Common when writing floats to display list:
FMovToGPR a2, f12
ShrImm a2, 08
Or a2, a2, t3
Store32 a2, a1, 0000000
This one takes four inputs though.
In the Wipeout games, for some reason (can just omit the load):
sv.q C400, 0x90(sp)
lv.q C400, 0x90(sp)
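Two of the patterns above (the AddConst chain and the sll/sra pair) could be caught by a simple two-instruction window. The following is only a sketch under assumptions: the opcode names, including `Ext8to32` for the byte-to-word sign extension, and the `Inst` layout are hypothetical, and a real pass would also have to respect branch targets inside the block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcodes; Ext8to32 stands for "sign-extend low byte to word".
enum class Op : uint8_t { AddConst, ShlImm, SarImm, Ext8to32, Other };

struct Inst {
    Op op;
    uint8_t dest;
    uint8_t src1;
    uint32_t constant;
};

// Scans adjacent instruction pairs and merges the two known patterns.
inline std::vector<Inst> Peephole(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    for (size_t i = 0; i < in.size(); i++) {
        const Inst &a = in[i];
        if (i + 1 < in.size()) {
            const Inst &b = in[i + 1];
            // AddConst x, y, c1; AddConst x, x, c2  ->  AddConst x, y, c1 + c2
            if (a.op == Op::AddConst && b.op == Op::AddConst &&
                b.dest == a.dest && b.src1 == a.dest) {
                out.push_back({Op::AddConst, a.dest, a.src1, a.constant + b.constant});
                i++;  // skip the second instruction of the pair
                continue;
            }
            // ShlImm x, y, 24; SarImm x, x, 24  ->  Ext8to32 x, y
            if (a.op == Op::ShlImm && b.op == Op::SarImm &&
                a.constant == 24 && b.constant == 24 &&
                b.dest == a.dest && b.src1 == a.dest) {
                out.push_back({Op::Ext8to32, a.dest, a.src1, 0});
                i++;
                continue;
            }
        }
        out.push_back(a);
    }
    return out;
}
```

A single call only merges pairs, so a chain of three AddConst instructions would need the pass to be run again (or until nothing changes).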
Test cases
Curiously heavy on the CPU:
Crash of the Titans - seems to be both spending a lot of time in a horrendously complex idle loop, and underutilizing the VFPU. Lots of inefficient FPU code, spending lots of time storing and loading regs to/from the stack.