Revamp -tritonintelgpu-optimize-reduction-locality #2752

Open
victor-eds opened this issue Nov 19, 2024 · 0 comments · May be fixed by #2800

-tritonintelgpu-optimize-reduction-locality is currently incorrect: its register reordering may lead to incorrect results. It can also be greatly improved so that optimal layouts are propagated instead of suboptimal sliced ones.

A DPAS layout that "covers" the tensor in dimension 0 can be represented as a 7D layout:

#triton_gpu.blocked<{
    sizePerThread = [1, repeat_count, rep_cluster[1], rep_cluster[0], 1, 
                     shape[1]/(execution_size*rep_cluster[1]*warps_per_cta[1]), 1], 
    threadsPerWarp = [16, 1, 1, 1, 1, 1, 1], 
    warpsPerCTA = [1, 1, 1, 1, warps_per_cta[1], 1, warps_per_cta[0]], 
    order = [0, 1, 2, 3, 4, 5, 6]}>

Here, dimensions 0, 2, 4 and 5 correspond to dimension 1 of the original layout; the remaining dimensions (1, 3 and 6) correspond to dimension 0.
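
As a sanity check on the mapping above, here is a minimal Python sketch that builds the 7D parameters from the DPAS ones and verifies that the dimensions attributed to each original axis multiply back to the tensor shape. The helper names (dpas_as_blocked_7d, extent) and the example values are hypothetical, and execution_size is assumed to be 16 to match threadsPerWarp in the layout above.

```python
from math import prod


def dpas_as_blocked_7d(shape, repeat_count, rep_cluster, warps_per_cta,
                       execution_size=16):
    """Return the 7D blocked-layout parameters from the issue as a dict.

    Hypothetical helper for illustration only; not part of the pass.
    """
    cols_per_pass = execution_size * rep_cluster[1] * warps_per_cta[1]
    if shape[1] % cols_per_pass != 0:
        raise ValueError("shape[1] must be a multiple of "
                         "execution_size * rep_cluster[1] * warps_per_cta[1]")
    return {
        "sizePerThread": [1, repeat_count, rep_cluster[1], rep_cluster[0], 1,
                          shape[1] // cols_per_pass, 1],
        "threadsPerWarp": [execution_size, 1, 1, 1, 1, 1, 1],
        "warpsPerCTA": [1, 1, 1, 1, warps_per_cta[1], 1, warps_per_cta[0]],
        "order": [0, 1, 2, 3, 4, 5, 6],
    }


def extent(layout, dims):
    """Number of elements covered along the given 7D dimensions."""
    return prod(layout["sizePerThread"][d]
                * layout["threadsPerWarp"][d]
                * layout["warpsPerCTA"][d] for d in dims)


# Example values (assumptions): a 32x128 tensor covered in dimension 0.
layout = dpas_as_blocked_7d(shape=(32, 128), repeat_count=8,
                            rep_cluster=(2, 2), warps_per_cta=(2, 2))

# Dimensions 0, 2, 4 and 5 together span the original dimension 1.
assert extent(layout, (0, 2, 4, 5)) == 128
# Dimensions 1, 3 and 6 together span the original dimension 0.
assert extent(layout, (1, 3, 6)) == 32
```

The second assertion only holds because the example layout covers the tensor in dimension 0, i.e. shape[0] equals repeat_count * rep_cluster[0] * warps_per_cta[0].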

A reduction on the original DPAS layout axis 1 (fast changing axis) can be represented as follows in this new layout:

  • Reduction on axes 2 and 5: elementwise, as both live entirely in sizePerThread
  • Sub-group transpose so the original axis 0 has 16 threads per warp and the original axis 1 has 16 elements per thread
  • Reduction on the remaining dimensions representing the original axis 1
  • Convert back to the original type

This last step is crucial to get right. It should be split, in exactly this order, into:

  • reshape to original shape
  • convert_layout to original layout

As the original layout is suboptimal and reshape operations propagate layouts, swapping the reshape and the layout conversion would propagate the suboptimal layout. This is what the current pass gets wrong, in addition to changing semantics through register reordering.
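
The arithmetic of the staged reduction described above can be sketched in plain Python for a single row of the original tensor. This only models the order in which partial results are combined, not the actual MLIR ops or register placement. The extents and the index decomposition in column() are assumptions chosen for illustration, and integer data is used so the comparison is exact.

```python
from itertools import product

# Example extents of 7D dimensions 0, 2, 4 and 5 (assumed values).
LANES, RC1, W1, REPS = 16, 2, 2, 2


def column(lane, rc1, w1, rep):
    """Hypothetical decomposition of an original axis-1 index.

    The lane index varies fastest; the real mapping may differ.
    """
    return ((rep * W1 + w1) * RC1 + rc1) * LANES + lane


def staged_row_sum(row):
    # Step 1: axes 2 and 5 are per-thread, so each (lane, warp) pair
    # simply adds up the values it already holds.
    partial = {
        (lane, w1): sum(row[column(lane, rc1, w1, rep)]
                        for rc1, rep in product(range(RC1), range(REPS)))
        for lane, w1 in product(range(LANES), range(W1))
    }
    # Steps 2-3: after the sub-group transpose the 16 lane values sit in
    # one thread, so they are reduced elementwise too; the warp axis
    # (7D dimension 4) is reduced last.
    per_warp = [sum(partial[(lane, w1)] for lane in range(LANES))
                for w1 in range(W1)]
    return sum(per_warp)


row = list(range(LANES * RC1 * W1 * REPS))
assert staged_row_sum(row) == sum(row)
```

The final reshape and convert_layout steps are not modeled here, since they change the type of the result rather than its values.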
