chore: bump v0.4.1.post2 #2643

Merged
zhyncs merged 1 commit into main from zhyncs/v0.4.1.post2 on Dec 29, 2024

Conversation

zhyncs (Member) commented Dec 29, 2024

Motivation

TL;DR

With CUDA Graph and FP8 GEMM tuning enabled:

batch 1, input 128, output 256: 30.29 token/s
batch 8, input 128, output 256: 206.32 token/s
batch 32, input 128, output 256: 760.52 token/s
ShareGPT 5k: 2673.55 token/s

In the coming days, we will continue to optimize DeepSeek V3 and release improvements quickly. Please stay tuned.
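
As a quick back-of-the-envelope check (my own arithmetic, not output from any sglang tool), the per-request decode rate implied by these batch numbers is:

# Per-request decode rate implied by the TL;DR numbers above
# (illustrative arithmetic only, not produced by sglang).
tldr = {1: 30.29, 8: 206.32, 32: 760.52}  # batch size -> decode token/s
for batch, tok_per_s in tldr.items():
    print(f"batch {batch:2d}: {tok_per_s / batch:6.2f} token/s per request")
# batch  1:  30.29 token/s per request
# batch  8:  25.79 token/s per request
# batch 32:  23.77 token/s per request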

Online

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-dp-attention

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     5000
Benchmark duration (s):                  366.11
Total input tokens:                      1146018
Total generated tokens:                  978825
Total generated tokens (retokenized):    974535
Request throughput (req/s):              13.66
Input token throughput (tok/s):          3130.21
Output token throughput (tok/s):         2673.55
Total token throughput (tok/s):          5803.76
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   196528.46
Median E2E Latency (ms):                 193378.59
---------------Time to First Token----------------
Mean TTFT (ms):                          82926.36
Median TTFT (ms):                        85147.69
P99 TTFT (ms):                           133377.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1536.74
Median TPOT (ms):                        674.43
P99 TPOT (ms):                           15539.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           585.02
Median ITL (ms):                         253.53
P99 ITL (ms):                            2111.55
==================================================
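
As a sanity check (my own arithmetic, not part of the bench_serving output), the throughput lines follow directly from the logged totals and duration, up to rounding of the duration:

# Reproducing the throughput lines above from the logged totals
# (illustrative arithmetic only).
duration_s = 366.11
input_tokens, output_tokens, requests = 1_146_018, 978_825, 5000
print(f"req/s:        {requests / duration_s:.2f}")                        # ~13.66
print(f"input tok/s:  {input_tokens / duration_s:.2f}")                    # ~3130
print(f"output tok/s: {output_tokens / duration_s:.2f}")                   # ~2674
print(f"total tok/s:  {(input_tokens + output_tokens) / duration_s:.2f}")  # ~5804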

Offline

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

Prefill. latency: 1.76146 s, throughput:     72.67 token/s
Decode.  latency: 0.03916 s, throughput:     25.53 token/s
Decode.  latency: 0.03299 s, throughput:     30.31 token/s
Decode.  latency: 0.03288 s, throughput:     30.41 token/s
Decode.  latency: 0.03292 s, throughput:     30.38 token/s
Decode.  latency: 0.03302 s, throughput:     30.29 token/s
Decode.  median latency: 0.03302 s, median throughput:     30.29 token/s
Total. latency:  1.999 s, throughput:     68.05 token/s
Benchmark ...
Prefill. latency: 0.11257 s, throughput:   1137.05 token/s
Decode.  latency: 0.03293 s, throughput:     30.37 token/s
Decode.  latency: 0.03289 s, throughput:     30.40 token/s
Decode.  latency: 0.03285 s, throughput:     30.44 token/s
Decode.  latency: 0.03295 s, throughput:     30.34 token/s
Decode.  latency: 0.03288 s, throughput:     30.42 token/s
Decode.  median latency: 0.03305 s, median throughput:     30.26 token/s
Total. latency:  8.533 s, throughput:     45.00 token/s
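
The batch-1 numbers are internally consistent: total latency is roughly the prefill latency plus 255 decode steps at the median decode latency (my own check, not tool output):

# Sanity check for the batch-1 benchmark run above
# (illustrative arithmetic only).
prefill_s, median_decode_s, output_len = 0.11257, 0.03305, 256
est_total_s = prefill_s + (output_len - 1) * median_decode_s
print(f"estimated total: {est_total_s:.3f} s")                 # ~8.540 s vs. logged 8.533 s
print(f"total throughput: {(128 + 256) / 8.533:.2f} token/s")  # 45.00, matching the log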

python3 -m sglang.bench_one_batch --batch-size 8 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

Prefill. latency: 3.83831 s, throughput:    266.78 token/s
Decode.  latency: 0.04153 s, throughput:    192.63 token/s
Decode.  latency: 0.03799 s, throughput:    210.61 token/s
Decode.  latency: 0.03806 s, throughput:    210.19 token/s
Decode.  latency: 0.03862 s, throughput:    207.15 token/s
Decode.  latency: 0.03877 s, throughput:    206.32 token/s
Decode.  median latency: 0.03877 s, median throughput:    206.32 token/s
Total. latency:  4.112 s, throughput:    264.62 token/s
Benchmark ...
Prefill. latency: 0.15405 s, throughput:   6647.37 token/s
Decode.  latency: 0.03783 s, throughput:    211.49 token/s
Decode.  latency: 0.03792 s, throughput:    210.97 token/s
Decode.  latency: 0.03804 s, throughput:    210.29 token/s
Decode.  latency: 0.03859 s, throughput:    207.31 token/s
Decode.  latency: 0.03871 s, throughput:    206.64 token/s
Decode.  median latency: 0.03953 s, median throughput:    202.38 token/s
Total. latency: 10.226 s, throughput:    300.41 token/s

python3 -m sglang.bench_one_batch --batch-size 32 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

Prefill. latency: 5.27629 s, throughput:    776.30 token/s
Decode.  latency: 0.04033 s, throughput:    793.45 token/s
Decode.  latency: 0.03985 s, throughput:    802.93 token/s
Decode.  latency: 0.04084 s, throughput:    783.50 token/s
Decode.  latency: 0.04208 s, throughput:    760.52 token/s
Decode.  latency: 0.04270 s, throughput:    749.42 token/s
Decode.  median latency: 0.04208 s, median throughput:    760.52 token/s
Total. latency:  5.570 s, throughput:    781.29 token/s
Benchmark ...
Prefill. latency: 0.32520 s, throughput:  12595.37 token/s
Decode.  latency: 0.03992 s, throughput:    801.63 token/s
Decode.  latency: 0.03984 s, throughput:    803.18 token/s
Decode.  latency: 0.04085 s, throughput:    783.32 token/s
Decode.  latency: 0.04195 s, throughput:    762.82 token/s
Decode.  latency: 0.04294 s, throughput:    745.28 token/s
Decode.  median latency: 0.04818 s, median throughput:    664.23 token/s
Total. latency: 12.552 s, throughput:    978.96 token/s
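
Comparing the three benchmark runs, the median decode step latency grows only modestly while decode throughput scales almost linearly with batch size (my own summary of the median lines above, not tool output):

# Decode step latency vs. batch size, taken from the median decode lines of
# the benchmark runs above (illustrative arithmetic only).
medians = {1: 0.03305, 8: 0.03953, 32: 0.04818}  # batch -> median step latency (s)
base = medians[1]
for batch, lat in medians.items():
    print(f"batch {batch:2d}: step latency x{lat / base:.2f}, "
          f"decode throughput ~{batch / lat:.0f} token/s")
# batch  1: step latency x1.00, decode throughput ~30 token/s
# batch  8: step latency x1.20, decode throughput ~202 token/s
# batch 32: step latency x1.46, decode throughput ~664 token/s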

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

zhyncs self-assigned this on Dec 29, 2024
zhyncs marked this pull request as draft on Dec 29, 2024 at 15:50

zhyncs (Member, Author) commented Dec 29, 2024

gsm8k eval

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-dp-attention

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

Accuracy: 0.942
Invalid: 0.000
Latency: 89.116 s
Output throughput: 1500.299 token/s
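
For readers unfamiliar with these metrics: accuracy is the fraction of questions whose final numeric answer matches the label, and invalid is the fraction of responses from which no answer could be parsed. A rough illustrative sketch of that scoring logic (hypothetical helper names; this is NOT the code from benchmark/gsm8k/bench_sglang.py):

# Illustrative sketch of gsm8k-style scoring; hypothetical helpers,
# not the actual benchmark/gsm8k/bench_sglang.py implementation.
import re

def extract_answer(text):
    # Take the last integer appearing in the model output, if any.
    matches = re.findall(r"-?\d+", text.replace(",", ""))
    return int(matches[-1]) if matches else None

def score(outputs, labels):
    preds = [extract_answer(o) for o in outputs]
    invalid = sum(p is None for p in preds) / len(preds)
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    return accuracy, invalid

print(score(["... so the answer is 42.", "no number here"], [42, 7]))  # (0.5, 0.5)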

zhyncs marked this pull request as ready for review on Dec 29, 2024 at 16:11
zhyncs merged commit 3ccf566 into main on Dec 29, 2024
16 of 17 checks passed
zhyncs deleted the zhyncs/v0.4.1.post2 branch on Dec 29, 2024 at 16:11