Skip to content

Commit

Permalink
[AMDGPU] Simplify the exclusive scan used for optimized atomics
Browse files Browse the repository at this point in the history
Summary:
Change the scan algorithm to use only power-of-two shifts (1, 2, 4, 8,
16, 32) instead of starting off shifting by 1, 2 and 3 and then doing
a 3-way ADD, because:

1. It simplifies the compiler a little.
2. It minimizes vgpr pressure because each instruction is now of the
   form vn = vn + vn << c.
3. It is more friendly to the DPP combiner, which currently can't
   combine into an ADD3 instruction.

Because of #2 and ROCm#3 the end result is improved from this:

  v_add_u32_dpp v4, v3, v3  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  v_mov_b32_dpp v5, v3  row_shr:2 row_mask:0xf bank_mask:0xf
  v_mov_b32_dpp v1, v3  row_shr:3 row_mask:0xf bank_mask:0xf
  v_add3_u32 v1, v4, v5, v1
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:4 row_mask:0xf bank_mask:0xe
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:8 row_mask:0xf bank_mask:0xc
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:15 row_mask:0xa bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:31 row_mask:0xc bank_mask:0xf

To this:

  v_add_u32_dpp v1, v1, v1  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:4 row_mask:0xf bank_mask:0xe
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_shr:8 row_mask:0xf bank_mask:0xc
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:15 row_mask:0xa bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v1, v1, v1  row_bcast:31 row_mask:0xc bank_mask:0xf

I.e. two fewer computational instructions, one extra nop where we could
schedule something else.

Reviewers: arsenm, sheredom, critson, rampitec, vpykhtin

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64411

Change-Id: I79f792d30210974acbd67ae0c5eaff3094263281
  • Loading branch information
foad authored and jayfoad committed Aug 7, 2019
1 parent c518b71 commit 00adc81
Show file tree
Hide file tree
Showing 2 changed files with 8 additions and 12 deletions.
18 changes: 8 additions & 10 deletions lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -383,26 +383,24 @@ void AMDGPUAtomicOptimizer::optimizeAtomic(Instruction &I,
CallInst *const SetInactive =
B.CreateIntrinsic(Intrinsic::amdgcn_set_inactive, Ty, {V, Identity});

CallInst *const FirstDPP =
ExclScan =
B.CreateIntrinsic(Intrinsic::amdgcn_update_dpp, Ty,
{Identity, SetInactive, B.getInt32(DPP_WF_SR1),
B.getInt32(0xf), B.getInt32(0xf), B.getFalse()});
ExclScan = FirstDPP;

const unsigned Iters = 7;
const unsigned DPPCtrl[Iters] = {
DPP_ROW_SR1, DPP_ROW_SR2, DPP_ROW_SR3, DPP_ROW_SR4,
DPP_ROW_SR8, DPP_ROW_BCAST15, DPP_ROW_BCAST31};
const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xf, 0xa, 0xc};
const unsigned BankMask[Iters] = {0xf, 0xf, 0xf, 0xe, 0xc, 0xf, 0xf};
const unsigned Iters = 6;
const unsigned DPPCtrl[Iters] = {DPP_ROW_SR1, DPP_ROW_SR2,
DPP_ROW_SR4, DPP_ROW_SR8,
DPP_ROW_BCAST15, DPP_ROW_BCAST31};
const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xa, 0xc};
const unsigned BankMask[Iters] = {0xf, 0xf, 0xe, 0xc, 0xf, 0xf};

// This loop performs an exclusive scan across the wavefront, with all lanes
// active (by using the WWM intrinsic).
for (unsigned Idx = 0; Idx < Iters; Idx++) {
Value *const UpdateValue = Idx < 3 ? FirstDPP : ExclScan;
CallInst *const DPP = B.CreateIntrinsic(
Intrinsic::amdgcn_update_dpp, Ty,
{Identity, UpdateValue, B.getInt32(DPPCtrl[Idx]),
{Identity, ExclScan, B.getInt32(DPPCtrl[Idx]),
B.getInt32(RowMask[Idx]), B.getInt32(BankMask[Idx]), B.getFalse()});

ExclScan = buildNonAtomicBinOp(B, Op, ExclScan, DPP);
Expand Down
2 changes: 0 additions & 2 deletions test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,6 @@ entry:
; GFX8MORE: v_mov_b32_dpp v[[wave_shr1:[0-9]+]], v{{[0-9]+}} wave_shr:1 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:1 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:2 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:3 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:4 row_mask:0xf bank_mask:0xe
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:8 row_mask:0xf bank_mask:0xc
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:15 row_mask:0xa bank_mask:0xf
Expand Down Expand Up @@ -115,7 +114,6 @@ entry:
; GFX8MORE: v_mov_b32_dpp v[[wave_shr1:[0-9]+]], v{{[0-9]+}} wave_shr:1 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:1 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:2 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:3 row_mask:0xf bank_mask:0xf
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:4 row_mask:0xf bank_mask:0xe
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:8 row_mask:0xf bank_mask:0xc
; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:15 row_mask:0xa bank_mask:0xf
Expand Down

0 comments on commit 00adc81

Please sign in to comment.