Performance Optimization: Optimized TileShape Configuration for f8 #3617
- Change TileShape from 128x128x128 to 128x256x128
- Add cooperative kernel by default for f8 kernels
Thanks for the contribution, @MatrixAssembler! Do you see optimization opportunities when M <= 128 || N <= 128?
@jiawenliu64 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu
fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu
Indeed @jiawenliu64, the bandwidth of H100 and B100 is so large f8f8bf16_rowwise (M = N = K = 8,192)
We can deduce that for a configuration where 128 >= M > 64 M being the batch size, this can be a common GEMM This is the next contribution I'm preparing, I would have liked to do the same for B100, to prepare for |
Performance Issue with Current F8 TileShape Configuration
The current FBGEMM f8 kernels use a TileShape of 128x128x128,
while the optimal shape for dense f8 tensor core operations on H100 is m64n256k32.
As a result, the current configuration underuses both the
tensor cores and the available bandwidth.
Optimized TileShape (128x256x128) Implementation
This PR changes the TileShape from 128x128x128 to 128x256x128 for large GEMM
operations and runs them with a cooperative kernel, enabling optimal bandwidth and tensor core utilization.
This configuration is notably used in Flash Attention V3 for f8.
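One way to see why the wider tile helps bandwidth is to compare FLOPs per operand byte loaded for one K-step of each tile. The following is a simplified sketch, not code from the PR: the `TileShape` struct and helper names are hypothetical, f8 elements are assumed to occupy 1 byte, and only the A and B operand loads are counted (output stores, scales, and caching effects are ignored).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type (not the CUTLASS TileShape): an M x N output tile
// that consumes K elements of the reduction dimension per main-loop step.
struct TileShape {
  std::int64_t m, n, k;
};

// FLOPs for one K-step: each of the m*n outputs receives k multiply-adds.
constexpr std::int64_t tile_flops(TileShape t) { return 2 * t.m * t.n * t.k; }

// Bytes of A (m x k) and B (n x k) loaded for that step, assuming 1-byte f8.
constexpr std::int64_t tile_bytes(TileShape t) { return (t.m + t.n) * t.k; }

// FLOPs per operand byte loaded under the simplified model above.
constexpr double tile_intensity(TileShape t) {
  return static_cast<double>(tile_flops(t)) / static_cast<double>(tile_bytes(t));
}

constexpr TileShape kOldTile{128, 128, 128};  // previous configuration
constexpr TileShape kNewTile{128, 256, 128};  // configuration in this PR
```

Under this model the 128x128x128 tile performs 128 FLOPs per operand byte and the 128x256x128 tile about 170.7, so each byte fetched feeds roughly a third more tensor core work.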
Benchmark Results on H100 GPU
Benchmark configuration:
PyTorch 2.6
CUDA 12.4
CPU: AMD EPYC
GPU: NVIDIA H100
Each benchmark launches the kernel for 30 iterations,
and the reported result is the average over 25 benchmark runs.
We used the same GEMM sizes as in the Colfax benchmarks.
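The PR does not show its timing harness, so the following is only a sketch of how "30 launches per run, averaged over 25 runs" could be structured. The function name and parameters are hypothetical, and the callable is assumed to block until the kernel has finished (for a real CUDA kernel that would mean synchronizing the device before the clock is read).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical harness: returns the mean wall-clock seconds per launch.
// Assumes `launch` returns only after the work has completed.
inline double mean_seconds_per_launch(const std::function<void()>& launch,
                                      int iters_per_run = 30,
                                      int runs = 25) {
  using clock = std::chrono::steady_clock;
  double total = 0.0;
  for (int r = 0; r < runs; ++r) {
    const auto start = clock::now();
    for (int i = 0; i < iters_per_run; ++i) {
      launch();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    total += elapsed.count() / iters_per_run;  // per-launch time for this run
  }
  return total / runs;  // average of the per-run means
}
```

Averaging per-run means rather than timing a single launch smooths out launch overhead and clock jitter. A warm-up run before measurement would be a reasonable addition, but the PR does not say whether one was used.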
Benchmark
f8f8bf16_grouped (G = 4, M = 2,048, N = 8,192, K = 8,192)
f8f8bf16_rowwise (M = N = K = 8,192)
f8f8bf16_tensorwise (M=N=K = 8,192)
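To turn a measured time into a throughput for these sizes, the usual GEMM count of 2·M·N·K FLOPs can be used. The helpers below are an illustrative sketch with hypothetical names, and the grouped case assumes all G groups share the same M, N, and K.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs of a GEMM: one multiply and one add per (m, n, k) triple.
// `groups` assumes every group has the same shape (an assumption for the
// grouped benchmark, which lists a single M, N, K).
constexpr std::int64_t gemm_flops(std::int64_t m, std::int64_t n,
                                  std::int64_t k, std::int64_t groups = 1) {
  return 2 * groups * m * n * k;
}

// Throughput in TFLOPS given a FLOP count and a time in seconds.
constexpr double tflops(std::int64_t flops, double seconds) {
  return static_cast<double>(flops) / seconds / 1e12;
}
```

With these definitions, the rowwise and tensorwise cases (M = N = K = 8,192) and the grouped case (G = 4, M = 2,048, N = K = 8,192) all come to 2^40 FLOPs, about 1.1 TFLOP per call, so their timings are directly comparable.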
Technical Implementation
Modified TileShape from 128-128-128 to 128-256-128 for:
Added cooperative kernel by default for:
f8f8f16.cu was not modified because it is deprecated in favor of f8f8bf16_tensorwise.
The modifications only affect large GEMMs, where M > 128 and N > 128 and either M or N > 2,048.
The matrices are divided into tiles twice as large, but with kernels using 3
SMs instead of 2. The smaller shapes covered by the large-kernel heuristics may see
slightly reduced efficiency compared to the previous configuration.
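The size condition and the effect of doubling the tile width can be sketched as follows. This is not the FBGEMM dispatch code: the function names are hypothetical, and the predicate encodes one reading of the condition above, namely M > 128 and N > 128 and (M > 2,048 or N > 2,048).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate for the shapes affected by this PR, reading the
// stated condition as: M > 128 && N > 128 && (M > 2048 || N > 2048).
constexpr bool uses_large_tile(std::int64_t m, std::int64_t n) {
  return m > 128 && n > 128 && (m > 2048 || n > 2048);
}

// Integer division rounding up, so partial edge tiles are counted.
constexpr std::int64_t ceil_div(std::int64_t a, std::int64_t b) {
  return (a + b - 1) / b;
}

// Number of output tiles needed to cover an M x N result.
constexpr std::int64_t output_tiles(std::int64_t m, std::int64_t n,
                                    std::int64_t tile_m, std::int64_t tile_n) {
  return ceil_div(m, tile_m) * ceil_div(n, tile_n);
}
```

For M = N = 8,192, a 128x128 tile yields 4,096 output tiles and a 128x256 tile yields 2,048. Shapes just above the threshold produce few tiles, which is consistent with the note that the smallest affected shapes may lose some efficiency.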
An empirical study between F8 kernel configurations and GEMM sizes could benefit FBGEMM.
These changes modify only the minimum necessary code and follow
existing coding practices in FBGEMM.
Test Coverage
Unit Tests Results
The unit tests in fbgemm_gpu/experimental/gen_ai/test/quantize
have been verified for the modified kernels.
@jiawenliu64 @jwfromm Thank you!