Benchmark results for DeepSeek-v3 in 2x8xH200 Cluster #2738

Closed
wants to merge 16 commits

Conversation

@roG0d (Contributor) commented Jan 5, 2025

Motivation

  • Establish baseline metrics on a 2x8xH200 GPU cluster for comparison with future work.
  • Explore the trade-offs between using chips with more GPU memory (H200) and increasing the parallel inference world size with H100s.
  • Measure the overhead of multi-node inference compared to single-node inference.
  • Explore the benefits of using FP8 quantization.

For output files and logs, please refer to: https://github.com/datacrunch-research/h200-benchmarks

Modifications

  • Added a new folder, benchmark_dsv3, following the structure of the other benchmark folders.
  • Added the deepseek_v3.sh script containing each benchmark performed (a rough sketch of the kind of measurement such a script automates is shown below).
  • Added a README.md containing the metrics obtained from the benchmarks.
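
For illustration only, here is a minimal, hypothetical Python sketch of the kind of single-request measurement such a script automates; it is not the actual contents of deepseek_v3.sh. It assumes an SGLang server is already running locally on the default port 30000 and exposes the /generate HTTP endpoint; the exact URL, payload, and response fields may differ across SGLang versions.

# Hypothetical micro-benchmark sketch, not the contents of deepseek_v3.sh.
# Assumes a local SGLang server on the default port 30000 with a /generate
# endpoint; adjust the URL and response handling to the installed version.
import time

import requests

URL = "http://127.0.0.1:30000/generate"  # assumed server address
payload = {
    "text": "Explain the difference between FP8 and BF16 inference in one paragraph.",
    "sampling_params": {"max_new_tokens": 256, "temperature": 0.0},
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

out = resp.json()
generated = out.get("text", "")  # response schema assumed; may vary by version
print(f"Latency: {elapsed:.2f} s")
print(f"Output characters: {len(generated)}")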

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Jan 5, 2025

Hi @roG0d, sorry for the late response. Could you try the latest version, v0.4.1.post4?

@roG0d (Contributor, Author) commented Jan 7, 2025

Sure, @zhyncs! If you'd like, we were thinking it might be useful to track the progress made in #2591 and benchmark it for single-node FP8 and BF16.

We could create another folder called benchmark_v0_4_1_post to store these benchmark results, similar to what was done for this PR.

@roG0d (Contributor, Author) commented Jan 7, 2025

We already have the results for v0.4.1.post4 up to the FusedMoE tuning for H200 here. Would you prefer us to create a new PR for v0.4.1.post4 with these results, or should we include them in the current one for main so we can continue updating it with future optimizations?

@chongli-uw commented Jan 8, 2025

@roG0d May I also suggest documenting the interconnect setup between the two nodes in the benchmarks? For example, whether the interconnect between the nodes is NVIDIA InfiniBand or Amazon EFA, the NCCL version, etc. It would make it easier for a broader audience to follow or replicate the benchmark results. Thanks so much!
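
For reference, here is a hedged Python sketch of how this environment information could be collected on each node for the benchmark docs. It assumes PyTorch is installed and that the standard nvidia-smi and ibstat tools are available; EFA-based clusters would report through different tooling.

# Hedged sketch: gather interconnect and NCCL details for the benchmark docs.
# nvidia-smi and ibstat are standard NVIDIA / InfiniBand tools but may not be
# present on every cluster (e.g., EFA setups use other utilities).
import subprocess

import torch

print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("NCCL:", ".".join(str(v) for v in torch.cuda.nccl.version()))
print("GPU:", torch.cuda.get_device_name(0))

for cmd in (["nvidia-smi", "topo", "-m"], ["ibstat"]):
    print(f"\n$ {' '.join(cmd)}")
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
    except FileNotFoundError:
        print(f"{cmd[0]} not found on this node")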

@bong-furiosa commented
Hello @roG0d!
First of all, thank you for sharing these resources for measuring the differences when running DeepSeek-V3 with the FP8 and BF16 methods.

However, if the FP8 and BF16 methods mentioned in the benchmark refer specifically to FP8 GEMM and BF16 GEMM, I would appreciate it if you could address my questions regarding the measurement methods used in this benchmark.


Question 1: the arguments passed when launching the server.

The arguments you (and some other sglang users) used to launch the server are as follows:

# BF16
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
  --enable-torch-compile --enable-dp-attention --mem-fraction-static 0.8 --disable-cuda-graph

# FP8
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention

Here, I identified the following three concerns:

  1. It seems that torch.compile is used within the CudaGraphRunner created by ModelRunner.
    However, if --disable-cuda-graph is passed, the CudaGraphRunner isn't created (link), so torch.compile is effectively ignored.
    So why were both arguments used simultaneously in the BF16 server launch? 🤔

  2. Even when running the server in BF16 or FP8 mode, it seems FP8LinearMethod is used for the linear ops (e.g., ColumnParallelLinear). Specifically, the apply function in FP8LinearMethod appears to call apply_w8a8_block_fp8_linear.
    This makes me think the BF16 version doesn't use a BF16 GEMM. Could you clarify what the BF16 version means in this benchmark? (One way to check what the checkpoint itself declares is sketched after this list.)

  3. If BF16 and FP8 use the same computation method (i.e., both use FP8 GEMM), what do you think could be the cause of the performance differences observed in the benchmark? Could the KV-cache option be a factor?
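
Regarding concern 2, one quick empirical check (a hedged sketch, not something from this PR) is to inspect whether the published checkpoint itself declares block-wise FP8 quantization in its Hugging Face config; a quantization_config on the checkpoint could steer the server toward the FP8 linear path regardless of the launch flags.

# Hedged sketch: see whether the DeepSeek-V3 checkpoint ships with an FP8
# quantization_config, which could explain FP8LinearMethod being selected even
# for the "BF16" launch. Requires `transformers` and access to the HF config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
print(getattr(cfg, "quantization_config", "no quantization_config found"))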

If these questions can be clarified, I believe we can have more confidence in the benchmark results and the FP8 performance of DeepSeek V3.

cc. @zhyncs

@merrymercy (Contributor) left a comment

Thanks for the detailed docs and instructions. Since we optimize performance very rapidly, many of the results in this PR will quickly become outdated. To reduce our maintenance overhead, it is better to share these results via blog posts, GitHub discussions, or GitHub issues instead of maintaining them inside the repo.

I will close this for now because we won't merge it, but feel free to continue the discussion in this thread.

@merrymercy closed this Jan 13, 2025

@antferdom commented:

#2450
