ARM-Mali_PC.md


References

Notes

ALU - arithmetic unit in the execution engine; contains the FMA, CVT, SFU, and MSG pipes.
Binning phase - includes vertex position shading, culling, and primitive binning.
Back-end - queue placed after the execution core; only the fragment queue has a back-end.
CSF - Command Stream Front-end.
CS0 - ?
CEU - ?
DVFS - Dynamic Voltage and Frequency Scaling.
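Diverged instructions - instructions issued while some threads of the warp are masked off because they took a different branch (e.g. the 'false' side of a condition).
E/L ZS - early/late depth and stencil (ZS) test.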
EC - Execution Core.
FPK - Forward Pixel Kill.
Front-end - queue placed before the execution core; can be NonFragFrontend or FragFrontend.
Fragment Task - region of 32x32 pixels (64x64 for Valhall 5th gen).
IRQ - Interrupt Request.
Iterator - ?
Job - GPU command executed through the Job Manager, which tracks inter-job dependencies, distributes jobs across shader cores, and splits jobs into per-core tasks.
Load/store unit (LS, LSU) - handles all non-texture (general-purpose) memory accesses: vertex attribute access, buffer access, work group shared memory access, and stack access. It also implements imageLoad/Store and atomic operations.
MCU - Microcontroller Unit / Multi-level Memory Cache Unit ?
MMU - Memory management unit.
Main phase - includes any deferred vertex processing and all fragment shading.
PE - Processing Engine.
Quad - 2x2 block of pixels.
RTU - Ray Tracing Unit.
Task - part of a job, executed on a single shader core.
Tiler - responsible for coordinating geometry processing and providing the fixed-function tiling needed for the tile-based rendering pipeline. It can run in parallel to vertex shading and fragment shading.
TLB - Translation Lookaside Buffer.
Varying unit (V) - The varying pipeline is a dedicated pipeline which implements the varying interpolator.
Vertex position shading - part of vertex shader which outputs only vertex position.
Warp - group of 4-16 threads.
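
As a rough illustration of the size definitions above (Fragment Task, Quad), the sketch below estimates how many fragment-task regions and 2x2 quads are needed to cover a render target. The function name `coverage_counts` and the 1920x1080 example are assumptions made for illustration; only the 32x32 / 64x64 region sizes and the 2x2 quad size come from these notes.

```python
import math

def coverage_counts(width, height, task_size=32):
    """Return (fragment-task regions, 2x2 quads) needed to cover the target.

    task_size is the edge of a fragment-task region in pixels
    (32 per the notes, 64 for the newer generation mentioned there).
    """
    # Partial regions at the right/bottom edges still count as one region.
    tasks = math.ceil(width / task_size) * math.ceil(height / task_size)
    # A quad is a 2x2 block of pixels.
    quads = math.ceil(width / 2) * math.ceil(height / 2)
    return tasks, quads

if __name__ == "__main__":
    print(coverage_counts(1920, 1080))      # 32x32 regions
    print(coverage_counts(1920, 1080, 64))  # 64x64 regions
```

This counts full-screen coverage only; real fragment task and quad counters depend on the number of render passes and on what is actually rasterized.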

Execution core / Processing unit / ALU:

  • FMA pipe - Arithmetic fused multiply accumulate unit (FMA). The FMA pipelines are the main arithmetic pipelines, implementing the floating-point multipliers that are widely used in shader code. Each FMA pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle. Most programs that are arithmetic-limited are limited by the performance of the FMA pipeline.
  • CVT pipe - Arithmetic convert unit (CVT). The CVT pipelines implement simple operations, such as format conversion and integer addition. Each CVT pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.
  • SFU pipe - Arithmetic special functions unit (SFU). The SFU pipelines implement a special functions unit for computation of complex functions such as reciprocals and transcendental functions. Each SFU pipeline implements a 4-wide issue path, executing a 16-wide warp over 4 clock cycles.
  • MSG pipe - memory access ?
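
The issue rates described above can be turned into a small back-of-the-envelope calculation. The sketch below is only an illustration: `ops_per_clock` is a hypothetical helper, it covers a single pipeline, and it says nothing about how many pipelines a given core has, since these notes do not state that.

```python
WARP_WIDTH = 16  # threads per warp, as stated for the FMA, CVT, and SFU pipes

def ops_per_clock(pipe, bits=32):
    """Peak operations per clock for ONE pipeline, per the descriptions above."""
    if pipe in ("FMA", "CVT"):
        # One 32-bit or two 16-bit operations per thread per clock.
        per_thread = 2 if bits == 16 else 1
        return WARP_WIDTH * per_thread
    if pipe == "SFU":
        # 4-wide issue path: a 16-wide warp takes 4 clocks to execute.
        return WARP_WIDTH // 4
    raise ValueError(f"unknown pipe: {pipe}")

if __name__ == "__main__":
    for pipe, bits in (("FMA", 32), ("FMA", 16), ("CVT", 32), ("SFU", 32)):
        print(pipe, bits, ops_per_clock(pipe, bits))
```

The lower SFU rate is one reason the per-pipe utilization counters (FMA, CVT, SFU) are worth reading separately rather than only as a combined arithmetic figure.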

Mali G57

Average cycles per pixel (76)
Pixels (75)
Tile unit write bytes (233)
Load/store unit bytes written to L2 per access cycle (232)
Load/store unit write bytes (231)
Load/store unit write beats to L2 memory system (230)
Texture unit bytes read from external memory per texture cycle (229)
Texture unit read bytes from external memory (228)
Texture unit bytes read from L2 per texture cycle (227)
Texture unit read bytes from L2 cache (226)
Load/store unit bytes read from external memory per access cycle (225)
Load/store unit read bytes from external memory (224)
Load/store unit bytes read from L2 per access cycle (223)
Load/store unit read bytes from L2 cache (222)
Front-end unit read bytes from external memory (221)
Front-end unit read bytes from L2 cache (220)
Varying unit utilization (219)
Varying unit issue cycles (218)
16-bit interpolation active cycles (217)
32-bit interpolation active cycles (216)
Load/store unit utilization (74)
Load/store unit issue cycles (30)
Load/store unit write issues (97)
Load/store unit read issues (96)
Texture full speed filtering percentage (271)
Texture output bus utilization (270)
Texture input bus utilization (269)
Texture unit issue cycles (73)
Texture quads (128)
Texture unit utilization (72)
Texture filtering cycles per instruction (71)
Texture samples (70)
Arithmetic unit utilization (69)
Shader blend percentage (268)
Warp divergence percentage (211)
Arithmetic instruction issue cycles (267)
SFU pipe utilization (266)
CVT pipe utilization (265)
FMA pipe utilization (264)
Full quad warp rate (237)
All registers warp rate (236)
Fragment threads (17)
Non-fragment threads (27)
Execution core utilization (68)
Unchanged tile kill rate (67)
Fragments per pixel (66)
Partial coverage rate (263)
Fragment FPK buffer utilization (101)
Tiler utilization (51)
Output external outstanding writes 75-100% (196)
Output external outstanding reads 75-100% (195)
Output external read latency 384+ cycles (194)
Output external write stall rate (50)
Output external read stall rate (49)
Output external write bytes (48)
Output external read bytes (47)
L2 cache write miss rate (193)
L2 cache read miss rate (192)
Non-fragment queue utilization (46)
Fragment queue utilization (45)
Interrupt pending utilization (44)
Tiler varying shading stall cycles (115)
Tiler varying shading requests (114)
Varying cache misses (113)
Varying cache hits (112)
Position cache miss requests (100)
Position cache hit requests (99)
Tiler position FIFO full cycles (111)
Tiler position shading stall cycles (110)
Tiler position shading requests (109)
Output internal read beats (107)
Sample test culled primitives (106)
Z plane culled primitives (42)
Facing or XY plane test culled primitives (41)
Visible primitives (40)
Visible back-facing primitives (39)
Point primitives (36)
Triangle primitives (35)
Tiler active cycles (43)
Load/store unit write-back write beats (151)
Tile unit write beats to L2 memory system (152)
Load/store unit other write beats (153)
Miscellaneous read beats from L2 cache (150)
Texture unit read beats from external memory (149)
Texture unit read beats from L2 cache (148)
Load/store unit read beats from external memory (147)
Load/store unit read beats from L2 cache (146)
Fragment front-end read beats from external memory (145)
Fragment front-end read beats from L2 cache (144)
Attribute instructions (143)
16-bit interpolation slots (142)
32-bit interpolation slots (141)
Varying unit instructions (140)
Load/store unit atomic issues (98)
Load/store unit partial write issues (139)
Load/store unit full write issues (138)
Load/store unit partial read issues (137)
Load/store unit full read issues (136)
Late ZS killed thread percentage (65)
Texture message write beats (262)
Late ZS tested thread percentage (64)
Texture messages (261)
Early ZS killed quad percentage (63)
Texture filtering cycles using full trilinear (260)
Early ZS tested quad percentage (62)
Texture filtering cycles using full bilinear (259)
Texture filtering cycles (32)
Texture filtering stall cycles (258)
Average cycles per fragment thread (60)
Texture fetch stall cycles (257)
Fragment utilization (59)
Texture descriptor stall cycles (256)
Average cycles per non-fragment thread (58)
Texture message read beats (255)
Non-fragment utilization (57)
Blend shader instructions (254)
Execution engine starvation cycles (127)
Z plane test cull rate (56)
Instruction cache misses (253)
Diverged instructions (126)
Facing or XY plane test cull rate (55)
Arithmetic SFU instructions (252)
Culled primitives (54)
Arithmetic CVT instructions (251)
Visible primitives rate (53)
Arithmetic FMA instructions (250)
Execution core active cycles (28)
Non-fragment warps (124)
Non-fragment core tasks (123)
Non-fragment active cycles (26)
Visible front-facing primitives (38)
Full quad warps (235)
Occluding quads (122)
Killed unchanged tiles (25)
Line primitives (37)
Warps using more than 32 registers (234)
Tiles (24)
Late ZS killed quads (121)
Late ZS tested quads (120)
Early ZS killed quads (21)
Early ZS updated quads (119)
Total input primitives (52)
Partial rasterized fine quads (249)
Early ZS tested quads (20)
Fragment warps (117)
Forward pixel kill buffer active cycles (95)
Rasterized fine quads (19)
Rasterized primitives (116)
Fragment primitives loaded (16)
Fragment active cycles (15)
Output external write stall cycles (14)
Early ZS updated quad percentage (208)
Output external write beats (11)
FPK killed quad percentage (210)
Output external read stall cycles (13)
FPK killed quads (209)
Output external read beats (12)
Output external ReadUnique transactions (173)
Output external ReadNoSnoop transactions (172)
Output external read transactions (171)
Input external snoop lookup requests (170)
Input external snoop stall cycles (191)
Write lookup requests (94)
Input external snoop transactions (190)
Read lookup requests (93)
Output external outstanding writes 50-75% (189)
Any lookup requests (92)
Output internal write requests (169)
Output internal read stall cycles (168)
Output internal read requests (167)
Input internal snoop stall cycles (166)
Input internal snoop requests (165)
Input internal write stall cycles (164)
Input internal write requests (163)
Input internal read stall cycles (162)
Input internal read requests (161)
MMU stage 2 L2 lookup TLB hits (160)
MMU stage 2 L3 lookup TLB hits (159)
MMU stage 2 L2 lookup requests (158)
MMU stage 2 L3 lookup requests (157)
MMU stage 2 lookup requests (156)
Output external WriteSnoopPartial transactions (186)
MMU L2 lookup TLB hits (89)
MMU L3 lookup TLB hits (155)
Output external outstanding writes 0-25% (187)
MMU L2 table read requests (90)
MMU L3 table read requests (154)
Output external outstanding writes 25-50% (188)
MMU lookup requests (91)
Non-occluding quads (205)
Reserved queue jobs (8)
L2 cache flush requests (105)
Output external WriteSnoopFull transactions (185)
Reserved queue job finish wait cycles (88)
Output external WriteNoSnoopPartial transactions (184)
Reserved queue job dependency wait cycles (87)
Output external WriteNoSnoopFull transactions (183)
Reserved queue job issue wait cycles (86)
Output external write transactions (182)
Reserved queue job descriptor read wait cycles (85)
Output external read latency 320-383 cycles (181)
Non-fragment queue job finish wait cycles (84)
Output external read latency 256-319 cycles (180)
Non-fragment queue job dependency wait cycles (83)
Output external read latency 192-255 cycles (179)
Non-fragment queue job issue wait cycles (82)
Reserved active cycles (10)
Occluding quad percentage (204)
Non-fragment queue active cycles (7)
Reserved queue cache flush wait cycles (104)
Output external read latency 128-191 cycles (178)
Non-fragment queue job descriptor read wait cycles (81)
Non-fragment queue cache flush wait cycles (103)
Varying cache hit rate (203)
Non-fragment tasks (6)
Fragment queue cache flush wait cycles (102)
Varying threads per input primitive (202)
Non-fragment jobs (5)
Varying shader thread invocations (201)
Fragment queue active cycles (4)
Shaded coarse quads (206)
Reserved queue tasks (9)
Output external read latency 0-127 cycles (177)
Fragment queue job finish wait cycles (80)
Output external outstanding reads 50-75% (176)
Fragment queue job dependency wait cycles (79)
Position cache hit rate (200)
Fragment tasks (3)
Position threads per input primitive (199)
Fragment jobs (2)
Output external outstanding reads 25-50% (175)
Fragment queue job issue wait cycles (78)
Position shader thread invocations (198)
GPU interrupt pending cycles (1)
Output external outstanding reads 0-25% (174)
Fragment queue job descriptor read wait cycles (77)
Sample test cull rate (197)
GPU active cycles (0)
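
Several entries in this list (rates, percentages, averages) look like values derived from the raw counters next to them. The sketch below shows how such values could be computed from a capture. The formulas are assumptions inferred from the counter names, not Arm's documented definitions, and the helper names and sample numbers are made up for illustration.

```python
def ratio(numerator, denominator):
    """Divide safely; return 0.0 when the denominator is zero (e.g. idle capture)."""
    return numerator / denominator if denominator else 0.0

def derive(raw):
    """Compute a few derived values from raw counters keyed by counter name.

    ASSUMPTION: each formula is guessed from the counter names only.
    """
    hits = raw["Varying cache hits"]
    misses = raw["Varying cache misses"]
    return {
        "Varying cache hit rate": ratio(hits, hits + misses),
        "Early ZS killed quad percentage": 100.0 * ratio(
            raw["Early ZS killed quads"], raw["Rasterized fine quads"]
        ),
        "Average cycles per pixel": ratio(raw["GPU active cycles"], raw["Pixels"]),
    }

if __name__ == "__main__":
    # Made-up sample values, not measurements.
    sample = {
        "Varying cache hits": 900,
        "Varying cache misses": 100,
        "Early ZS killed quads": 250,
        "Rasterized fine quads": 1000,
        "GPU active cycles": 2_000_000,
        "Pixels": 1_000_000,
    }
    for name, value in derive(sample).items():
        print(f"{name}: {value}")
```

Keying the input by counter name rather than by the numeric IDs in parentheses keeps the sketch readable; a real tool would map those IDs to names first.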