
[CUDA] dense_tensorcore/batch_matmul_tensorcore support int8/int4 #8402

Merged: 13 commits into apache:main, Jul 9, 2021

Conversation

@wyc-ruiker (Contributor) commented Jul 5, 2021

Let dense_tensorcore and batch_matmul_tensorcore support int8/int4.
Before this PR, the vision transformer (ViT) latency (#7814) on a Tesla T4 was:
vit int4: 4.71 ms
vit int8: 3.48 ms
After this PR:
vit int4: 2.93 ms
vit int8: 2.97 ms

@jcf94 @jwfromm @huochaitiantang could you help review this PR?
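For context on what these schedules compute: an int8 tensor-core dense is an int8 × int8 matmul accumulated in int32. A minimal NumPy reference sketch, assuming nn.dense's (out_dim, in_dim) weight layout; the function name and shapes are illustrative, not the TOPI API:

```python
import numpy as np

def dense_int8_ref(data, weight):
    """Reference semantics for int8 dense: out[i, j] = sum_k data[i, k] * weight[j, k].

    Inputs are int8; the accumulation (and output) dtype is int32, matching
    the tensor-core contract. `weight` is (out_dim, in_dim), i.e. pre-transposed.
    """
    assert data.dtype == np.int8 and weight.dtype == np.int8
    return data.astype(np.int32) @ weight.astype(np.int32).T

rng = np.random.default_rng(0)
data = rng.integers(-127, 128, size=(50, 64), dtype=np.int8)    # (batch, in_dim)
weight = rng.integers(-127, 128, size=(32, 64), dtype=np.int8)  # (out_dim, in_dim)
out = dense_int8_ref(data, weight)
assert out.dtype == np.int32 and out.shape == (50, 32)
```

The int4 variant follows the same contract with a narrower input range; NumPy has no int4 dtype, so only the int8 case is sketched here.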

@wyc-ruiker wyc-ruiker changed the title [CUDA] add int8/int4 tensorcore for dense/batch_matmul [CUDA] dense_tensorcore/batch_matmul_tensorcore support int8/int4 Jul 5, 2021
@jcf94 (Contributor) commented Jul 5, 2021

Thanks for your continued contribution to the tensor core schedules! @wyc-ruiker I'll help review when I have time.

p.s. Recently I added a new op, nn.matmul, which extends nn.dense to allow the data and weight tensors to be in either transposed or non-transposed format. For a model from a framework like TensorFlow, TVM inserts an extra transpose for nn.dense, while nn.matmul can avoid it.
I'm not sure, but it may be beneficial to use it if your model suffers a performance hit from the inserted transpose.
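The transpose-elimination point can be illustrated without TVM: nn.dense fixes the weight layout to (out_dim, in_dim), so a TensorFlow-style (in_dim, out_dim) weight must be materialized through an explicit transpose, while a matmul with transpose flags simply reinterprets the operand. A NumPy sketch of the two semantics (function names are illustrative stand-ins for the Relay ops):

```python
import numpy as np

def dense(data, weight):
    # nn.dense-style semantics: weight is (out_dim, in_dim), i.e. pre-transposed
    return data @ weight.T

def matmul(a, b, transpose_a=False, transpose_b=False):
    # nn.matmul-style semantics: either operand may be flagged as transposed
    if transpose_a:
        a = a.T
    if transpose_b:
        b = b.T
    return a @ b

data = np.arange(6.0).reshape(2, 3)    # (batch, in_dim)
w_tf = np.arange(12.0).reshape(3, 4)   # TF-style (in_dim, out_dim)

# With dense, the importer must insert an explicit transpose of the weight:
out_dense = dense(data, w_tf.T)
# With matmul, the layout is carried by a flag and no transpose is materialized:
out_matmul = matmul(data, w_tf, transpose_b=False)
assert np.array_equal(out_dense, out_matmul)
```

In Relay the flag would be a compile-time attribute, so the schedule can read the weight in its stored layout instead of paying for a separate transpose kernel.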

@wyc-ruiker (Contributor, Author) replied:

> Thanks for your continued contribution to the tensor core schedules! @wyc-ruiker I'll help review when I have time.
>
> p.s. Recently I added a new op, nn.matmul, which extends nn.dense to allow the data and weight tensors to be in either transposed or non-transposed format. For a model from a framework like TensorFlow, TVM inserts an extra transpose for nn.dense, while nn.matmul can avoid it.
> I'm not sure, but it may be beneficial to use it if your model suffers a performance hit from the inserted transpose.

  %1552 = reshape(%1551, newshape=[-1, 64, 50]) /* ty=Tensor[(12, 64, 50), float32] */;
  %1553 = transpose(%1552, axes=[0, 2, 1]) /* ty=Tensor[(12, 50, 64), float32] */;
  %1554 = multiply(%1553, 16f /* ty=float32 */) /* ty=Tensor[(12, 50, 64), float32] */;
  %1555 = round(%1554) /* ty=Tensor[(12, 50, 64), float32] */;
  %1556 = clip(%1555, a_min=-127f, a_max=127f) /* ty=Tensor[(12, 50, 64), float32] */;
  %1557 = cast(%1549, dtype="int8") /* ty=Tensor[(12, 50, 64), int8] */;
  %1558 = cast(%1556, dtype="int8") /* ty=Tensor[(12, 50, 64), int8] */;
  %1559 = nn.batch_matmul(%1557, %1558, meta[relay.attrs.BatchMatmulAttrs][61]) /* ty=Tensor[(12, 50, 50), int32] */;

Thanks! But in our ViT network it looks like the performance issues come before nn.batch_matmul. Looking forward to full transpose support for nn.batch_matmul!
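For readers following the IR snippet above: the pattern is a simple symmetric int8 quantization (scale by 16, round, clip to [-127, 127], cast to int8) feeding an int8 batch_matmul that accumulates in int32. A NumPy sketch of the same arithmetic (the scale of 16 is taken from the printed IR; function names are illustrative):

```python
import numpy as np

def quantize_int8(x, scale=16.0):
    # Mirrors the Relay chain above: multiply -> round -> clip -> cast
    return np.clip(np.round(x * scale), -127, 127).astype(np.int8)

def batch_matmul_int8(a, b):
    # nn.batch_matmul-style semantics: b is (batch, N, K), multiplied as a @ b^T,
    # with int32 accumulation as in the int8 tensor-core schedule
    return a.astype(np.int32) @ b.astype(np.int32).transpose(0, 2, 1)

x = np.random.default_rng(0).standard_normal((12, 50, 64)).astype(np.float32)
y = np.random.default_rng(1).standard_normal((12, 50, 64)).astype(np.float32)
out = batch_matmul_int8(quantize_int8(x), quantize_int8(y))
assert out.dtype == np.int32 and out.shape == (12, 50, 50)
```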

@jcf94 jcf94 self-assigned this Jul 7, 2021
@jcf94 (Contributor) left a comment

Thanks! @wyc-ruiker Overall this looks great to me.

Just a few nit-picks.

python/tvm/topi/cuda/batch_matmul_tensorcore.py (outdated, resolved)
python/tvm/topi/cuda/dense_tensorcore.py (outdated, resolved)
python/tvm/topi/cuda/tensorcore_alter_op.py (3 outdated threads, resolved)
@wyc-ruiker wyc-ruiker requested a review from jcf94 July 7, 2021 12:33
@jcf94 (Contributor) left a comment

Thanks! @wyc-ruiker

@jcf94 (Contributor) commented Jul 9, 2021

Push again to re-trigger the CI? @wyc-ruiker

@jcf94 jcf94 merged commit 0fa4396 into apache:main Jul 9, 2021
@wyc-ruiker wyc-ruiker deleted the vit branch July 9, 2021 11:33
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
…ache#8402)

* add int8/int tensorcore for dense/batch_matmul

* fix bug

* fix lint

* Apply suggestions from code review

Co-authored-by: Chenfan <[email protected]>

* fix for reviewer

* fix lint

Co-authored-by: Chenfan <[email protected]>
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022