[Feature]: Benchmarking H200 #2450
Comments
Thank you @antferdom! We are very interested in the H200's performance metrics. @Ying1123 has been coordinating the relevant resources, and we greatly appreciate you bringing this matter to our attention. We are monitoring the situation closely and will provide updates on any progress or discussions in this thread. Thank you for your contribution!
The H200 shows great promise, particularly for extremely large models. For models exceeding 600B parameters, even FP8 precision cannot enable single-machine execution. In such scenarios, the H200's 141GB memory capacity becomes crucial. Furthermore, the improvements you noted, such as higher TFLOPS and increased memory bandwidth, will significantly impact current kernel configurations and performance. It's worth noting that neither FlashInfer nor the Triton backend has been specifically optimized for the H200 yet. Once we acquire the necessary hardware resources, we will promptly begin optimization efforts in this area. cc @yzh119 @ispobock
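As a rough check on the memory argument above, the following sketch compares weight footprints against aggregate node memory. It is a back-of-the-envelope illustration, not a measurement: the 8-GPU node size, the 20% headroom reserved for KV cache and activations, and the helper names `weight_gb` and `fits_on_node` are all assumptions introduced here. Only the per-GPU capacities (80GB for H100, 141GB for H200) are hardware figures.

```python
# Back-of-the-envelope weight-memory check. Illustrative only.
GB = 1e9  # decimal gigabytes, as used on vendor spec sheets

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}

# Per-GPU HBM capacity in GB.
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Memory needed for the model weights alone, in GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / GB


def fits_on_node(params_billion: float, dtype: str, gpu: str,
                 gpus_per_node: int = 8, headroom: float = 0.2) -> bool:
    """True if the weights fit on one node after reserving `headroom`
    (a fraction of total memory) for KV cache and activations.
    Both defaults are assumptions chosen for illustration."""
    usable = GPU_MEMORY_GB[gpu] * gpus_per_node * (1 - headroom)
    return weight_gb(params_billion, dtype) <= usable


if __name__ == "__main__":
    for params, dtype in [(405, "fp8"), (405, "bf16"), (600, "fp8")]:
        for gpu in ("H100", "H200"):
            verdict = "fits" if fits_on_node(params, dtype, gpu) else "does not fit"
            print(f"{params}B {dtype} on 8 x {gpu}: {verdict}")
```

Under these assumptions, a 600B-parameter model in FP8 needs about 600GB for weights alone, which leaves almost nothing of an 8 x H100 node's 640GB for KV cache, while an 8 x H200 node offers 1128GB in total.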
Thanks @zhyncs for your rapid and positive feedback! Great point about the H200's memory capacity for models with more than 600B parameters. For LLaMA 405B it allows to handle [...] Regarding kernel configuration, in particular Torch Triton codegen, I'm not aware of the latest changes that take advantage of, as you said, the additional memory bandwidth. We could run some experiments with torch/_inductor/kernel/mm.py. What immediate next steps would you recommend? I can already run whatever benchmarks we consider relevant and save the output files. Regarding resources, would a single H200 node be sufficient for you? Happy to explore this co-research.
Yes, I am eager to collaborate on researching and enhancing the H200's performance. Your help in coordinating resources would be greatly appreciated!
Hi @antferdom, the link returns a 404. Is that repo private?
Checklist
Motivation
Research Questions
Models of Interest
0.4
data parallelism attention for MLA. Focus on:
Preliminary Results
Following the SGLang benchmarks
Environment Configuration
Using the latest Docker image, lmsysorg/sglang:latest, with SGLang v0.4.
Online benchmark results
Llama 3.1 70B Instruct 4 x H200 141GB
Offline benchmark results
Llama 3.1 70B Instruct 4 x H200 141GB
Llama 3.1 70B Instruct 8 x H200 141GB
Llama 3.1 405B Instruct 8 x H200 141GB
Q: Where should we place this benchmarking information: in the existing docs, or in a new page? @merrymercy @zhyncs
Related resources
Hopper GPU HW specs comparison: H100 & H200