
docs: update SGLang status #173

Merged 1 commit on Jan 2, 2025
Conversation

@zhyncs (Contributor) commented Jan 2, 2025

SGLang has now finished supporting EAGLE speculative decoding. The following results were obtained on a single H100.
This work was done by @merrymercy and @yukavio.

The EAGLE implementation in SGLang is likely the most efficient among open-source LLM engines.

cc @Liyuhui-12 @hongyanz

Official EAGLE code: 200 token/s

see https://github.com/SafeAILab/EAGLE

Normal decoding speed (SGLang): 156 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf

EAGLE decoding speed (SGLang): 297 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7

EAGLE decoding speed (SGLang w/ torch.compile): 316 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --enable-torch-compile --cuda-graph-max-bs 2
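For reference, the implied speedups can be computed directly from the throughputs reported above; this small Python snippet (not part of the original comment) just divides the numbers:

baseline = 156   # normal decoding, token/s
eagle = 297      # EAGLE speculative decoding, token/s
eagle_tc = 316   # EAGLE + torch.compile, token/s

print(f"EAGLE vs. normal: {eagle / baseline:.2f}x")                    # ~1.90x
print(f"EAGLE + torch.compile vs. normal: {eagle_tc / baseline:.2f}x") # ~2.03x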

Benchmark script

import time
import requests

# Time a single non-streaming generation request against the local SGLang server.
tic = time.time()
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "[INST] Give me a simple FastAPI server. Show the python code. [/INST]",
        "sampling_params": {
            "temperature": 0,  # greedy decoding for reproducible output
            "max_new_tokens": 256,
        },
    },
)
latency = time.time() - tic
ret = response.json()

print(ret["text"])
# Decoding speed = completion tokens / end-to-end request latency.
speed = ret["meta_info"]["completion_tokens"]
print(f"speed: {speed / latency:.2f} token/s")

sgl-project/sglang#2150
