chore: bump v0.4.1.post2 #2643

Merged
zhyncs merged 1 commit into main from zhyncs/v0.4.1.post2 on Dec 29, 2024

Conversation

zhyncs (Member) commented Dec 29, 2024

Motivation

TL;DR

With CUDA Graph and FP8 GEMM tuning enabled:

batch 1, input 128, output 256: 30.29 token/s
batch 8, input 128, output 256: 206.32 token/s
batch 32, input 128, output 256: 760.52 token/s
ShareGPT 5k: 2673.55 token/s

In the coming days, we will continue to optimize DeepSeek V3 and release improvements quickly. Please stay tuned.
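
As a quick back-of-the-envelope check (my own arithmetic, not output from any sglang tool), the per-request decode rate implied by these batch numbers is:

# Per-request decode rate implied by the TL;DR numbers above
# (illustrative arithmetic only, not produced by sglang).
tldr = {1: 30.29, 8: 206.32, 32: 760.52}  # batch size -> decode token/s
for batch, tok_per_s in tldr.items():
    print(f"batch {batch:2d}: {tok_per_s / batch:6.2f} token/s per request")
# batch  1:  30.29 token/s per request
# batch  8:  25.79 token/s per request
# batch 32:  23.77 token/s per request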

Online

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-dp-attention

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     5000
Benchmark duration (s):                  366.11
Total input tokens:                      1146018
Total generated tokens:                  978825
Total generated tokens (retokenized):    974535
Request throughput (req/s):              13.66
Input token throughput (tok/s):          3130.21
Output token throughput (tok/s):         2673.55
Total token throughput (tok/s):          5803.76
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   196528.46
Median E2E Latency (ms):                 193378.59
---------------Time to First Token----------------
Mean TTFT (ms):                          82926.36
Median TTFT (ms):                        85147.69
P99 TTFT (ms):                           133377.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1536.74
Median TPOT (ms):                        674.43
P99 TPOT (ms):                           15539.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           585.02
Median ITL (ms):                         253.53
P99 ITL (ms):                            2111.55
==================================================
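
As a sanity check (my own arithmetic, not part of the bench_serving output), the throughput lines follow directly from the logged totals and duration, up to rounding of the duration:

# Reproducing the throughput lines above from the logged totals
# (illustrative arithmetic only).
duration_s = 366.11
input_tokens, output_tokens, requests = 1_146_018, 978_825, 5000
print(f"req/s:        {requests / duration_s:.2f}")                        # ~13.66
print(f"input tok/s:  {input_tokens / duration_s:.2f}")                    # ~3130
print(f"output tok/s: {output_tokens / duration_s:.2f}")                   # ~2674
print(f"total tok/s:  {(input_tokens + output_tokens) / duration_s:.2f}")  # ~5804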

Offline

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

Prefill. latency: 1.76146 s, throughput:     72.67 token/s
Decode.  latency: 0.03916 s, throughput:     25.53 token/s
Decode.  latency: 0.03299 s, throughput:     30.31 token/s
Decode.  latency: 0.03288 s, throughput:     30.41 token/s
Decode.  latency: 0.03292 s, throughput:     30.38 token/s
Decode.  latency: 0.03302 s, throughput:     30.29 token/s
Decode.  median latency: 0.03302 s, median throughput:     30.29 token/s
Total. latency:  1.999 s, throughput:     68.05 token/s
Benchmark ...
Prefill. latency: 0.11257 s, throughput:   1137.05 token/s
Decode.  latency: 0.03293 s, throughput:     30.37 token/s
Decode.  latency: 0.03289 s, throughput:     30.40 token/s
Decode.  latency: 0.03285 s, throughput:     30.44 token/s
Decode.  latency: 0.03295 s, throughput:     30.34 token/s
Decode.  latency: 0.03288 s, throughput:     30.42 token/s
Decode.  median latency: 0.03305 s, median throughput:     30.26 token/s
Total. latency:  8.533 s, throughput:     45.00 token/s
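
The batch-1 numbers are internally consistent: total latency is roughly the prefill latency plus 255 decode steps at the median decode latency (my own check, not tool output):

# Sanity check for the batch-1 benchmark run above
# (illustrative arithmetic only).
prefill_s, median_decode_s, output_len = 0.11257, 0.03305, 256
est_total_s = prefill_s + (output_len - 1) * median_decode_s
print(f"estimated total: {est_total_s:.3f} s")                 # ~8.540 s vs. logged 8.533 s
print(f"total throughput: {(128 + 256) / 8.533:.2f} token/s")  # 45.00, matching the log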

python3 -m sglang.bench_one_batch --batch-size 8 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

Prefill. latency: 3.83831 s, throughput:    266.78 token/s
Decode.  latency: 0.04153 s, throughput:    192.63 token/s
Decode.  latency: 0.03799 s, throughput:    210.61 token/s
Decode.  latency: 0.03806 s, throughput:    210.19 token/s
Decode.  latency: 0.03862 s, throughput:    207.15 token/s
Decode.  latency: 0.03877 s, throughput:    206.32 token/s
Decode.  median latency: 0.03877 s, median throughput:    206.32 token/s
Total. latency:  4.112 s, throughput:    264.62 token/s
Benchmark ...
Prefill. latency: 0.15405 s, throughput:   6647.37 token/s
Decode.  latency: 0.03783 s, throughput:    211.49 token/s
Decode.  latency: 0.03792 s, throughput:    210.97 token/s
Decode.  latency: 0.03804 s, throughput:    210.29 token/s
Decode.  latency: 0.03859 s, throughput:    207.31 token/s
Decode.  latency: 0.03871 s, throughput:    206.64 token/s
Decode.  median latency: 0.03953 s, median throughput:    202.38 token/s
Total. latency: 10.226 s, throughput:    300.41 token/s

python3 -m sglang.bench_one_batch --batch-size 32 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code

Prefill. latency: 5.27629 s, throughput:    776.30 token/s
Decode.  latency: 0.04033 s, throughput:    793.45 token/s
Decode.  latency: 0.03985 s, throughput:    802.93 token/s
Decode.  latency: 0.04084 s, throughput:    783.50 token/s
Decode.  latency: 0.04208 s, throughput:    760.52 token/s
Decode.  latency: 0.04270 s, throughput:    749.42 token/s
Decode.  median latency: 0.04208 s, median throughput:    760.52 token/s
Total. latency:  5.570 s, throughput:    781.29 token/s
Benchmark ...
Prefill. latency: 0.32520 s, throughput:  12595.37 token/s
Decode.  latency: 0.03992 s, throughput:    801.63 token/s
Decode.  latency: 0.03984 s, throughput:    803.18 token/s
Decode.  latency: 0.04085 s, throughput:    783.32 token/s
Decode.  latency: 0.04195 s, throughput:    762.82 token/s
Decode.  latency: 0.04294 s, throughput:    745.28 token/s
Decode.  median latency: 0.04818 s, median throughput:    664.23 token/s
Total. latency: 12.552 s, throughput:    978.96 token/s
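
Comparing the three benchmark runs, the median decode step latency grows only modestly while decode throughput scales almost linearly with batch size (my own summary of the median lines above, not tool output):

# Decode step latency vs. batch size, taken from the median decode lines of
# the benchmark runs above (illustrative arithmetic only).
medians = {1: 0.03305, 8: 0.03953, 32: 0.04818}  # batch -> median step latency (s)
base = medians[1]
for batch, lat in medians.items():
    print(f"batch {batch:2d}: step latency x{lat / base:.2f}, "
          f"decode throughput ~{batch / lat:.0f} token/s")
# batch  1: step latency x1.00, decode throughput ~30 token/s
# batch  8: step latency x1.20, decode throughput ~202 token/s
# batch 32: step latency x1.46, decode throughput ~664 token/s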

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

zhyncs self-assigned this on Dec 29, 2024
zhyncs marked this pull request as draft on Dec 29, 2024 at 15:50

zhyncs (Member, Author) commented Dec 29, 2024

gsm8k eval

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-dp-attention

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

Accuracy: 0.942
Invalid: 0.000
Latency: 89.116 s
Output throughput: 1500.299 token/s
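
For readers unfamiliar with these metrics: accuracy is the fraction of questions whose final numeric answer matches the label, and invalid is the fraction of responses from which no answer could be parsed. A rough illustrative sketch of that scoring logic (hypothetical helper names; this is NOT the code from benchmark/gsm8k/bench_sglang.py):

# Illustrative sketch of gsm8k-style scoring; hypothetical helpers,
# not the actual benchmark/gsm8k/bench_sglang.py implementation.
import re

def extract_answer(text):
    # Take the last integer appearing in the model output, if any.
    matches = re.findall(r"-?\d+", text.replace(",", ""))
    return int(matches[-1]) if matches else None

def score(outputs, labels):
    preds = [extract_answer(o) for o in outputs]
    invalid = sum(p is None for p in preds) / len(preds)
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    return accuracy, invalid

print(score(["... so the answer is 42.", "no number here"], [42, 7]))  # (0.5, 0.5)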

zhyncs marked this pull request as ready for review on Dec 29, 2024 at 16:11
zhyncs merged commit 3ccf566 into main on Dec 29, 2024
16 of 17 checks passed
zhyncs deleted the zhyncs/v0.4.1.post2 branch on Dec 29, 2024 at 16:11