Revamp `-tritonintelgpu-optimize-reduction-locality` #2752

victor-eds · 2024-11-19T10:28:22Z

`-tritonintelgpu-optimize-reduction-locality` is currently incorrect: register reordering may lead to incorrect results. It can also be greatly improved so that optimal layouts are propagated instead of suboptimal sliced ones.

A DPAS layout that "covers" the tensor in dimension 0 can be represented as a 7D layout:

#triton_gpu.blocked<{
    sizePerThread = [1, repeat_count, rep_cluster[1], rep_cluster[0], 1, 
                     shape[1]/(execution_size*rep_cluster[1]*warps_per_cta[1]), 1], 
    threadsPerWarp = [16, 1, 1, 1, 1, 1, 1], 
    warpsPerCTA = [1, 1, 1, 1, warps_per_cta[1], 1, warps_per_cta[0]], 
    order = [0, 1, 2, 3, 4, 5, 6]}>

Here, dimensions 0, 2, 4 and 5 correspond to dimension 1 of the original layout; the remaining dimensions (1, 3 and 6) correspond to dimension 0.
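As a sanity check on the dimension mapping, the following minimal Python sketch builds the three parameter lists and multiplies the extents of the dimensions assigned to each original axis. The names `dpas_as_7d_blocked` and `covered_extent` are hypothetical helpers and not part of Triton. It also assumes `execution_size` is 16 (matching the hard-coded `threadsPerWarp` entry above) and that dimensions 1, 3 and 6 map to the original dimension 0.

```python
from math import prod


def dpas_as_7d_blocked(shape, repeat_count, rep_cluster, warps_per_cta,
                       execution_size=16):
    """Return (sizePerThread, threadsPerWarp, warpsPerCTA) for the 7D view.

    Hypothetical helper: it only mirrors the formula in the issue text.
    """
    tile_cols = execution_size * rep_cluster[1] * warps_per_cta[1]
    if shape[1] % tile_cols != 0:
        raise ValueError("shape[1] must be a multiple of "
                         "execution_size * rep_cluster[1] * warps_per_cta[1]")
    size_per_thread = [1, repeat_count, rep_cluster[1], rep_cluster[0], 1,
                       shape[1] // tile_cols, 1]
    threads_per_warp = [execution_size, 1, 1, 1, 1, 1, 1]
    warps = [1, 1, 1, 1, warps_per_cta[1], 1, warps_per_cta[0]]
    return size_per_thread, threads_per_warp, warps


def covered_extent(layout, dims):
    """Number of elements covered along the given 7D dimensions."""
    size_per_thread, threads_per_warp, warps = layout
    return prod(size_per_thread[d] * threads_per_warp[d] * warps[d]
                for d in dims)


# Illustrative parameters (assumed, not taken from the issue).
layout = dpas_as_7d_blocked(shape=(32, 64), repeat_count=8,
                            rep_cluster=(2, 1), warps_per_cta=(2, 2))
AXIS1_DIMS = (0, 2, 4, 5)  # dimensions stated to map to original dimension 1
AXIS0_DIMS = (1, 3, 6)     # assumed to map to original dimension 0
```

With these illustrative parameters, `covered_extent(layout, AXIS1_DIMS)` recovers `shape[1]` and `covered_extent(layout, AXIS0_DIMS)` recovers `shape[0]`, which is what "covering" the tensor in dimension 0 implies.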

A reduction along axis 1 of the original DPAS layout (the fastest-changing axis) can be expressed in this new layout as follows:

1. Reduction on axes 2 and 5: elementwise
2. Sub-group transpose so the original axis 0 has 16 threads per warp and the original axis 1 has 16 elements per thread
3. Reduction on the remaining dimensions representing the original axis 1
4. Go back to the original type
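The staged reduction can be modelled on a single row with a small pure-Python sketch. It is not the pass implementation: `staged_row_reduce` is a hypothetical name, and the column linearization (dimension 0 fastest, then 2, 4 and 5) is an assumption derived from the `order` attribute above. The sub-group transpose only moves data between threads, so it appears here as a comment rather than a computation. Integer inputs keep the comparison with a direct reduction exact.

```python
from functools import reduce
from itertools import product
from operator import add


def staged_row_reduce(row, execution_size, rep_cluster_1, warps_1,
                      combine=add):
    """Reduce one row of the original axis 1 in the stages listed above."""
    group = execution_size * rep_cluster_1 * warps_1
    reps, rem = divmod(len(row), group)
    if rem or reps == 0:
        raise ValueError("row length must be a positive multiple of "
                         "execution_size * rep_cluster_1 * warps_1")

    def at(d0, d2, d4, d5):
        # Assumed linearization: dim 0 fastest, then dims 2, 4 and 5.
        return row[d0 + execution_size
                   * (d2 + rep_cluster_1 * (d4 + warps_1 * d5))]

    # Step 1: elementwise reduction over dims 2 and 5, which live in the
    # registers of a single thread.
    partial = {}
    for d0, d4 in product(range(execution_size), range(warps_1)):
        values = [at(d0, d2, d4, d5)
                  for d2, d5 in product(range(rep_cluster_1), range(reps))]
        partial[(d0, d4)] = reduce(combine, values)

    # Step 2: the sub-group transpose changes which thread holds each
    # value but not the values themselves, so it is a no-op in this model.

    # Step 3: reduce over the remaining dims (0 and 4) of the original axis 1.
    return reduce(combine, partial.values())
```

For example, `staged_row_reduce(list(range(128)), 16, 2, 2)` gives the same value as `sum(range(128))`, and passing `combine=max` gives the row maximum.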

This last step is crucial to get right. It should be split exactly as:

1. `reshape` to the original shape
2. `convert_layout` to the original layout

Because the original layout is suboptimal and reshape operations propagate layouts, swapping the reshape and the layout conversion would propagate the suboptimal layout. This is what the current pass gets wrong, in addition to the semantics change caused by register reordering.

victor-eds added the bug, performance, and codegen: attention labels Nov 19, 2024

victor-eds self-assigned this Nov 19, 2024

victor-eds mentioned this issue Nov 19, 2024

Enable -tritonintelgpu-optimize-reduction-locality by default #2748

Open

vlad-penkin added this to the 4.0 [Performance] Core milestone Nov 19, 2024

victor-eds linked a pull request Nov 22, 2024 that will close this issue

[XPU][OptRed] Revamp -tritonintelgpu-optimize-reduction-locality #2800

Open
