
Release 0.4.1.post3 - upload the config.json to PyPI #2647

Merged: 1 commit merged into main from pr-release on Dec 29, 2024
Conversation

merrymercy (Contributor) commented on Dec 29, 2024

We need to add these lines so the *.json config files are uploaded to PyPI; otherwise, they are excluded from the package build.

[tool.setuptools.package-data]
"sglang" = ["srt/layers/moe/fused_moe_triton/configs/*.json", "srt/layers/quantization/configs/*.json"]

merrymercy merged commit 03d5fbf into main on Dec 29, 2024 with 17 checks passed, and deleted the pr-release branch on December 29, 2024 at 22:25.
zhyncs (Member) commented on Dec 30, 2024

Thanks!

zhyncs mentioned this pull request on Dec 30, 2024.
zhyncs (Member) commented on Dec 30, 2024

pip install "sglang[all]==0.4.1.post3" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
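After installing, a hedged way to confirm the configs shipped with the wheel (paths taken from the PR description above):

python -c "import sglang, os, glob; base = os.path.dirname(sglang.__file__); print(glob.glob(os.path.join(base, 'srt/layers/moe/fused_moe_triton/configs/*.json'))[:3])"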

fsygd (Contributor) commented on Dec 31, 2024

I am trying to run on 2 H800 nodes according to #2643.
I use the docker image lmsysorg/sglang:latest and launch the sglang server with:

# suppose 10.0.0.1 is the ip of node 1

# node 1
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
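Once both ranks are up, a quick sanity check against rank 0 (a sketch; assumes the default port 30000, adjust if you pass --port):

# hit the HTTP endpoints on node 1 (port 30000 is an assumption)
curl http://10.0.0.1:30000/health
curl http://10.0.0.1:30000/get_model_info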

Here are the results:

Offline

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     5000      
Benchmark duration (s):                  1641.99   
Total input tokens:                      1146018   
Total generated tokens:                  978825    
Total generated tokens (retokenized):    974472    
Request throughput (req/s):              3.05      
Input token throughput (tok/s):          697.94    
Output token throughput (tok/s):         596.12    
Total token throughput (tok/s):          1294.07   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   839240.96 
Median E2E Latency (ms):                 857887.76 
---------------Time to First Token----------------
Mean TTFT (ms):                          614201.97 
Median TTFT (ms):                        605960.78 
P99 TTFT (ms):                           1344705.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1354.36   
Median TPOT (ms):                        1257.51   
P99 TPOT (ms):                           5069.49   
---------------Inter-token Latency----------------
Mean ITL (ms):                           1158.52   
Median ITL (ms):                         1167.58   
P99 ITL (ms):                            2400.85   
==================================================

gsm8k eval

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

Accuracy: 0.935
Invalid: 0.000
Latency: 192.189 s
Output throughput: 697.094 token/s

fsygd (Contributor) commented on Jan 1, 2025

The throughput degraded slightly after tuning with the following command:

python3 benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py --model deepseek-ai/DeepSeek-V3 --tp-size 16 --dtype fp8_w8a8 --tune
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     5000      
Benchmark duration (s):                  1748.06   
Total input tokens:                      1146018   
Total generated tokens:                  978825    
Total generated tokens (retokenized):    974473    
Request throughput (req/s):              2.86      
Input token throughput (tok/s):          655.59    
Output token throughput (tok/s):         559.95    
Total token throughput (tok/s):          1215.54   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   905093.80 
Median E2E Latency (ms):                 931110.87 
---------------Time to First Token----------------
Mean TTFT (ms):                          665337.84 
Median TTFT (ms):                        651847.50 
P99 TTFT (ms):                           1450611.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1433.36   
Median TPOT (ms):                        1332.12   
P99 TPOT (ms):                           5300.05   
---------------Inter-token Latency----------------
Mean ITL (ms):                           1234.29   
Median ITL (ms):                         1234.97   
P99 ITL (ms):                            2651.80   
==================================================

I think it's due to the network between the two nodes. @zhyncs
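For anyone reproducing the tuning step: my understanding is that the script writes per-device config JSONs (named by E, N, device name, and dtype) to the working directory, and they only take effect once copied into the package config directory referenced in this PR. A sketch, with the destination path assumed from the PR description:

# copy the tuned configs next to the bundled ones (paths assumed)
cp E=*,N=*,device_name=*.json python/sglang/srt/layers/moe/fused_moe_triton/configs/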

zhyncs (Member) commented on Jan 1, 2025

What about the online latency?

fsygd (Contributor) commented on Jan 1, 2025

> What about the online latency?

I may work on it later, but the online-latency scripts from #2643 cannot be used directly in a multi-node setup.
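For reference, the offline run above sends all prompts at once (infinite request rate); an online-style run would cap the arrival rate, roughly like this on a single node (a sketch; --request-rate is the relevant bench_serving flag):

# online-style benchmark: fixed arrival rate instead of sending everything at once
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --request-rate 8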
