
docs: update SGLang status #173

Merged 1 commit on Jan 2, 2025
Conversation

@zhyncs (Contributor) commented Jan 2, 2025

SGLang has now finished supporting EAGLE speculative decoding. The following results were obtained on a single H100.
This work was done by @merrymercy and @yukavio.

The EAGLE implementation in SGLang is likely the most efficient among open-source LLM engines.

cc @Liyuhui-12 @hongyanz

Official EAGLE code: 200 token/s

see https://github.com/SafeAILab/EAGLE

Normal decoding speed (SGLang): 156 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf

EAGLE decoding speed (SGLang): 297 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7

EAGLE decoding speed (SGLang w/ torch.compile): 316 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --enable-torch-compile --cuda-graph-max-bs 2
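For reference, the implied speedups can be computed directly from the throughputs reported above; this small Python snippet (not part of the original comment) just divides the numbers:

baseline = 156   # normal decoding, token/s
eagle = 297      # EAGLE speculative decoding, token/s
eagle_tc = 316   # EAGLE + torch.compile, token/s

print(f"EAGLE vs. normal: {eagle / baseline:.2f}x")                    # ~1.90x
print(f"EAGLE + torch.compile vs. normal: {eagle_tc / baseline:.2f}x") # ~2.03x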

Benchmark script

import time
import requests

# Time a single non-streaming generation request against the local SGLang server.
tic = time.time()
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "[INST] Give me a simple FastAPI server. Show the python code. [/INST]",
        "sampling_params": {
            "temperature": 0,  # greedy decoding for reproducible output
            "max_new_tokens": 256,
        },
    },
)
latency = time.time() - tic
ret = response.json()

print(ret["text"])
# Decoding speed = completion tokens / end-to-end request latency.
speed = ret["meta_info"]["completion_tokens"]
print(f"speed: {speed / latency:.2f} token/s")

sgl-project/sglang#2150
