
Release 0.4.1.post3 - upload the config.json to PyPI #2647

Merged: 1 commit merged into main from pr-release on Dec 29, 2024
Conversation

merrymercy (Contributor) commented on Dec 29, 2024

We need to add these lines so the *.json config files are uploaded to PyPI; otherwise, they are excluded from the package build.

[tool.setuptools.package-data]
"sglang" = ["srt/layers/moe/fused_moe_triton/configs/*.json", "srt/layers/quantization/configs/*.json"]

merrymercy merged commit 03d5fbf into main on Dec 29, 2024 with 17 checks passed, and deleted the pr-release branch on December 29, 2024 at 22:25.
zhyncs (Member) commented on Dec 30, 2024

Thanks!

zhyncs mentioned this pull request on Dec 30, 2024.
zhyncs (Member) commented on Dec 30, 2024

pip install "sglang[all]==0.4.1.post3" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
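After installing, a hedged way to confirm the configs shipped with the wheel (paths taken from the PR description above):

python -c "import sglang, os, glob; base = os.path.dirname(sglang.__file__); print(glob.glob(os.path.join(base, 'srt/layers/moe/fused_moe_triton/configs/*.json'))[:3])"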

fsygd (Contributor) commented on Dec 31, 2024

I am trying to run on 2 H800 nodes according to #2643.
I use the docker image lmsysorg/sglang:latest and launch the sglang server with:

# suppose 10.0.0.1 is the ip of node 1

# node 1
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
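Once both ranks are up, a quick sanity check against rank 0 (a sketch; assumes the default port 30000, adjust if you pass --port):

# hit the HTTP endpoints on node 1 (port 30000 is an assumption)
curl http://10.0.0.1:30000/health
curl http://10.0.0.1:30000/get_model_info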

Here are the results:

Offline

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     5000      
Benchmark duration (s):                  1641.99   
Total input tokens:                      1146018   
Total generated tokens:                  978825    
Total generated tokens (retokenized):    974472    
Request throughput (req/s):              3.05      
Input token throughput (tok/s):          697.94    
Output token throughput (tok/s):         596.12    
Total token throughput (tok/s):          1294.07   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   839240.96 
Median E2E Latency (ms):                 857887.76 
---------------Time to First Token----------------
Mean TTFT (ms):                          614201.97 
Median TTFT (ms):                        605960.78 
P99 TTFT (ms):                           1344705.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1354.36   
Median TPOT (ms):                        1257.51   
P99 TPOT (ms):                           5069.49   
---------------Inter-token Latency----------------
Mean ITL (ms):                           1158.52   
Median ITL (ms):                         1167.58   
P99 ITL (ms):                            2400.85   
==================================================

gsm8k eval

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

Accuracy: 0.935
Invalid: 0.000
Latency: 192.189 s
Output throughput: 697.094 token/s

fsygd (Contributor) commented on Jan 1, 2025

The throughput degraded slightly after tuning with the following command:

python3 benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py --model deepseek-ai/DeepSeek-V3 --tp-size 16 --dtype fp8_w8a8 --tune
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     5000      
Benchmark duration (s):                  1748.06   
Total input tokens:                      1146018   
Total generated tokens:                  978825    
Total generated tokens (retokenized):    974473    
Request throughput (req/s):              2.86      
Input token throughput (tok/s):          655.59    
Output token throughput (tok/s):         559.95    
Total token throughput (tok/s):          1215.54   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   905093.80 
Median E2E Latency (ms):                 931110.87 
---------------Time to First Token----------------
Mean TTFT (ms):                          665337.84 
Median TTFT (ms):                        651847.50 
P99 TTFT (ms):                           1450611.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1433.36   
Median TPOT (ms):                        1332.12   
P99 TPOT (ms):                           5300.05   
---------------Inter-token Latency----------------
Mean ITL (ms):                           1234.29   
Median ITL (ms):                         1234.97   
P99 ITL (ms):                            2651.80   
==================================================

I think it's due to the network between the two nodes. @zhyncs
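For anyone reproducing the tuning step: my understanding is that the script writes per-device config JSONs (named by E, N, device name, and dtype) to the working directory, and they only take effect once copied into the package config directory referenced in this PR. A sketch, with the destination path assumed from the PR description:

# copy the tuned configs next to the bundled ones (paths assumed)
cp E=*,N=*,device_name=*.json python/sglang/srt/layers/moe/fused_moe_triton/configs/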

zhyncs (Member) commented on Jan 1, 2025

What about the online latency?

fsygd (Contributor) commented on Jan 1, 2025

> What about the online latency?

I may work on it later, but the online-latency scripts from #2643 cannot be used directly in a multi-node setup.
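For reference, the offline run above sends all prompts at once (infinite request rate); an online-style run would cap the arrival rate, roughly like this on a single node (a sketch; --request-rate is the relevant bench_serving flag):

# online-style benchmark: fixed arrival rate instead of sending everything at once
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --request-rate 8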
