Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Warp-specialized FP8 rowwise GEMM kernel (#3532)
Summary:
X-link: facebookresearch/FBGEMM#614

Pull Request resolved: #3532

Adding a warp-specialized GEMM kernel. A couple of highlights:
- `tl.async_task` warp partition. Rewrote the well-known flattened 1D-loop persistent GEMM kernel as a more intuitive warp-specialized 2D-loop persistent kernel with a producer/consumer cooperative model.
- One kernel for both modes. Autotuning runs across both WS and non-WS configurations of the same kernel, since WS may not always be beneficial. We have also updated the tutorial to support multiple modes, including non-WS.
- Use regular loads instead of TMA for scale loading within each consumer.
- Compiler-automated omission of accumulator initialization. The first iteration of the matmul K-loop does not need an input accumulator: it only produces the output that is fed to the next iteration as the accumulator. Peeling the first iteration out of the K-loop therefore avoids zero-initializing the accumulator. `tl.assume` is used to deal with the case where the loop doesn't run at all.

Reviewed By: jianyuh

Differential Revision: D67676051

fbshipit-source-id: 4c552e37c358dc48d19b26ea3019c7afcb1ef18a
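The two loop transformations described in the highlights can be illustrated outside of Triton. What follows is a minimal pure-Python sketch, not the kernel from this commit: all function names (`flat_schedule`, `nested_schedule`, `k_block_product`, `matmul_zero_init`, `matmul_peeled`) are hypothetical, and the `assert k > 0` only stands in for the guarantee that the commit message attributes to `tl.assume`. The first pair of functions shows that a flattened 1D persistent loop and a nested 2D (tile, K-step) loop visit the same work items in the same order. The second pair shows that peeling the first K-iteration gives the same result as starting from a zeroed accumulator.

```python
# Illustrative sketch only; names and structure are assumptions, not the commit's code.

def flat_schedule(num_tiles, k_tiles, worker, num_workers):
    """Flattened 1D persistent loop: one counter covers every (tile, k-step) pair."""
    tiles_for_worker = len(range(worker, num_tiles, num_workers))
    visited = []
    for step in range(tiles_for_worker * k_tiles):
        tile = worker + (step // k_tiles) * num_workers  # recover the tile index
        ki = step % k_tiles                              # recover the K-step index
        visited.append((tile, ki))
    return visited


def nested_schedule(num_tiles, k_tiles, worker, num_workers):
    """2D persistent loop: outer loop over output tiles, inner loop over K-steps."""
    visited = []
    for tile in range(worker, num_tiles, num_workers):
        for ki in range(k_tiles):
            visited.append((tile, ki))
    return visited


def k_block_product(a, b, k0, k1):
    """Partial product of a (M x K) and b (K x N) restricted to K indices k0..k1-1."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_zero_init(a, b, block_k):
    """Baseline: zero the accumulator, then add every K-block."""
    k = len(a[0])
    acc = [[0.0] * len(b[0]) for _ in range(len(a))]
    for k0 in range(0, k, block_k):
        part = k_block_product(a, b, k0, min(k0 + block_k, k))
        acc = [[x + y for x, y in zip(ra, rp)] for ra, rp in zip(acc, part)]
    return acc


def matmul_peeled(a, b, block_k):
    """Peeled variant: the first K-block writes the accumulator directly."""
    k = len(a[0])
    # Stand-in for the "loop runs at least once" guarantee; without it the
    # peeled first iteration would be incorrect for an empty K dimension.
    assert k > 0
    acc = k_block_product(a, b, 0, min(block_k, k))  # no zero initialization
    for k0 in range(block_k, k, block_k):
        part = k_block_product(a, b, k0, min(k0 + block_k, k))
        acc = [[x + y for x, y in zip(ra, rp)] for ra, rp in zip(acc, part)]
    return acc
```

In this sketch the nested form keeps the per-tile prologue and epilogue visible as ordinary loop structure, which is the readability benefit the commit message points to. The peeled form trades one explicit precondition for dropping the zero fill.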
1 parent 0c93fd0, commit 921e305
Showing 1 changed file with 341 additions and 0 deletions.