Rowwise F8F8BF16 GEMMs - Auto-generate kernel library, auto-generated heuristics cache, add to FBGEMM quantize_bench #3210
Summary:
Performance Improvements
DisaggBench
CUTLASS
Prefill B=1 T=2048: Elapsed: 109.13ms FLOPs: 333.74TF/s
Prefill B=1 T=4928: Elapsed: 272.55ms FLOPs: 338.62TF/s
Prefill B=1 T=6336: Elapsed: 354.93ms FLOPs: 342.55TF/s
Prefill B=1 T=8192: Elapsed: 468.64ms FLOPs: 346.06TF/s
CUTLASS extensions
Prefill B=1 T=2048: Elapsed: 108.83ms FLOPs: 334.66TF/s
Prefill B=1 T=4928: Elapsed: 260.46ms FLOPs: 354.34TF/s
Prefill B=1 T=6336: Elapsed: 336.39ms FLOPs: 361.43TF/s
Prefill B=1 T=8192: Elapsed: 442.64ms FLOPs: 366.39TF/s
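To make the comparison easier to read, the relative speedup at each sequence length can be derived from the elapsed times listed above. The snippet below is only an illustrative sketch: the dictionary names and the `speedup` helper are hypothetical and not part of FBGEMM or DisaggBench; the numbers are copied from the two result lists.

```python
# Elapsed prefill times in ms (B=1), copied from the DisaggBench results above.
# Keys are the sequence length T. Variable names are hypothetical.
baseline_ms = {2048: 109.13, 4928: 272.55, 6336: 354.93, 8192: 468.64}
extensions_ms = {2048: 108.83, 4928: 260.46, 6336: 336.39, 8192: 442.64}


def speedup(old_ms: float, new_ms: float) -> float:
    """Return how many times faster the new run is than the old one."""
    return old_ms / new_ms


# Speedup of the extensions kernels over the baseline, per sequence length.
speedups = {t: speedup(baseline_ms[t], extensions_ms[t]) for t in baseline_ms}

for t, s in speedups.items():
    print(f"Prefill B=1 T={t}: {s:.3f}x")
```

By this calculation the gain is negligible at T=2048 and grows to roughly 5-6% at the longer sequence lengths.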
Differential Revision: D63744054