Benchmark results for DeepSeek-v3 in 2x8xH200 Cluster #2738

Closed
wants to merge 16 commits

Conversation

@roG0d (Contributor) commented Jan 5, 2025

Motivation

  • Establish baseline metrics on a 2x8xH200 GPU cluster for comparison with future work.
  • Explore the trade-offs between using chips with more GPU memory (H200) and increasing the parallel inference world size with H100s.
  • Measure the overhead of multi-node inference compared to single-node inference.
  • Explore the benefits of using FP8 quantization.

For output files and logs, please refer to: https://github.com/datacrunch-research/h200-benchmarks

Modifications

  • Added a new folder, benchmark_dsv3, following the structure of the other benchmark folders.
  • Added the deepseek_v3.sh script containing each benchmark performed (a rough sketch of the kind of measurement such a script automates is shown below).
  • Added a README.md containing the metrics obtained from the benchmarks.
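
For illustration only, here is a minimal, hypothetical Python sketch of the kind of single-request measurement such a script automates; it is not the actual contents of deepseek_v3.sh. It assumes an SGLang server is already running locally on the default port 30000 and exposes the /generate HTTP endpoint; the exact URL, payload, and response fields may differ across SGLang versions.

# Hypothetical micro-benchmark sketch, not the contents of deepseek_v3.sh.
# Assumes a local SGLang server on the default port 30000 with a /generate
# endpoint; adjust the URL and response handling to the installed version.
import time

import requests

URL = "http://127.0.0.1:30000/generate"  # assumed server address
payload = {
    "text": "Explain the difference between FP8 and BF16 inference in one paragraph.",
    "sampling_params": {"max_new_tokens": 256, "temperature": 0.0},
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

out = resp.json()
generated = out.get("text", "")  # response schema assumed; may vary by version
print(f"Latency: {elapsed:.2f} s")
print(f"Output characters: {len(generated)}")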

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Jan 5, 2025

Hi @roG0d, sorry for the late response. Could you try the latest version, v0.4.1.post4?

@roG0d (Contributor, Author) commented Jan 7, 2025

Sure, @zhyncs! If you'd like, we were thinking it might be useful to track the progress made in #2591 and benchmark it for single-node FP8 and BF16.

We could create another folder called benchmark_v0_4_1_post to store these benchmark results, similar to what was done for this PR.

@roG0d (Contributor, Author) commented Jan 7, 2025

We already have the results for v0.4.1.post4 up to the FusedMoE tuning for H200 here. Would you prefer us to create a new PR for v0.4.1.post4 with these results, or should we include them in the current one for main so we can continue updating it with future optimizations?

@chongli-uw commented Jan 8, 2025

@roG0d May I also suggest documenting the interconnect setup between the two nodes in the benchmarks? For example, whether the interconnect between the nodes is NVIDIA InfiniBand or Amazon EFA, the NCCL version, etc. It would make it easier for a broader audience to follow or replicate the benchmark results. Thanks so much!
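
For reference, here is a hedged Python sketch of how this environment information could be collected on each node for the benchmark docs. It assumes PyTorch is installed and that the standard nvidia-smi and ibstat tools are available; EFA-based clusters would report through different tooling.

# Hedged sketch: gather interconnect and NCCL details for the benchmark docs.
# nvidia-smi and ibstat are standard NVIDIA / InfiniBand tools but may not be
# present on every cluster (e.g., EFA setups use other utilities).
import subprocess

import torch

print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("NCCL:", ".".join(str(v) for v in torch.cuda.nccl.version()))
print("GPU:", torch.cuda.get_device_name(0))

for cmd in (["nvidia-smi", "topo", "-m"], ["ibstat"]):
    print(f"\n$ {' '.join(cmd)}")
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
    except FileNotFoundError:
        print(f"{cmd[0]} not found on this node")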

@bong-furiosa commented
Hello @roG0d!
First of all, thank you for sharing these resources for measuring the differences when running DeepSeek-V3 with the FP8 and BF16 methods.

However, if the FP8 and BF16 methods mentioned in the benchmark refer specifically to FP8 GEMM and BF16 GEMM, I would appreciate it if you could address my questions regarding the measurement methods used in this benchmark.


Question 1: the arguments passed when launching the server.

The arguments you (and some other sglang users) used to launch the server are as follows:

# BF16
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
  --enable-torch-compile --enable-dp-attention --mem-fraction-static 0.8 --disable-cuda-graph

# FP8
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention

Here, I identified the following three concerns:

  1. It seems that torch.compile is used within the CudaGraphRunner created by ModelRunner.
    However, if --disable-cuda-graph is passed, the CudaGraphRunner isn't created (link), so torch.compile is effectively ignored.
    So why were both arguments used simultaneously in the BF16 server launch? 🤔

  2. Even when running the server in BF16 or FP8 mode, it seems FP8LinearMethod is used for the linear ops (e.g., ColumnParallelLinear). Specifically, the apply function in FP8LinearMethod appears to call apply_w8a8_block_fp8_linear.
    This makes me think the BF16 version doesn't use a BF16 GEMM. Could you clarify what the BF16 version means in this benchmark? (One way to check what the checkpoint itself declares is sketched after this list.)

  3. If BF16 and FP8 use the same computation method (i.e., both use FP8 GEMM), what do you think could be the cause of the performance differences observed in the benchmark? Could the KV-cache option be a factor?
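
Regarding concern 2, one quick empirical check (a hedged sketch, not something from this PR) is to inspect whether the published checkpoint itself declares block-wise FP8 quantization in its Hugging Face config; a quantization_config on the checkpoint could steer the server toward the FP8 linear path regardless of the launch flags.

# Hedged sketch: see whether the DeepSeek-V3 checkpoint ships with an FP8
# quantization_config, which could explain FP8LinearMethod being selected even
# for the "BF16" launch. Requires `transformers` and access to the HF config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
print(getattr(cfg, "quantization_config", "no quantization_config found"))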

If these questions can be clarified, I believe we can have more confidence in the benchmark results and the FP8 performance of DeepSeek V3.

cc. @zhyncs

@merrymercy (Contributor) left a comment

Thanks for the detailed docs and instructions. Since we optimize performance very rapidly, many of the results in this PR will quickly become outdated. To reduce our maintenance overhead, it is better to share these results via blog posts, GitHub discussions, or GitHub issues instead of maintaining them inside the repo.

I will close this for now because we won't merge it, but feel free to continue the discussion in this thread.

@merrymercy closed this Jan 13, 2025

@antferdom commented:

#2450
