
Eagle speculative decoding part 4: Add EAGLE2 worker #2150

Merged · 94 commits · Jan 2, 2025

Conversation

@yukavio (Collaborator) commented on Nov 24, 2024

Support EAGLE speculative decoding. The following results were obtained on a single H100.

Official EAGLE code: 200 token/s

see https://github.com/SafeAILab/EAGLE

Normal decoding speed (SGLang): 156 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf

Eagle decoding speed (SGLang): 297 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7

Eagle decoding speed (SGLang w/ torch.compile): 316 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --enable-torch-compile --cuda-graph-max-bs 2
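For intuition on where the speedup comes from, here is a minimal greedy draft-then-verify sketch of speculative decoding. This is illustrative only, not the SGLang/EAGLE implementation: `target_next_token` and the draft tokens are stand-ins for real model calls, and real EAGLE verifies a top-k tree of drafts rather than a single chain.

```python
def speculative_step(target_next_token, draft_tokens, prefix):
    """Greedily accept draft tokens while they match the target model.

    target_next_token: callable taking a token list, returning the target
    model's greedy next token (a stand-in for a real forward pass).
    draft_tokens: tokens proposed by the cheap draft model.
    Returns the tokens actually emitted this step.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft verified: keep it for free
        else:
            accepted.append(expected)  # mismatch: take the target's token and stop
            break
    else:
        # All drafts accepted; the target contributes one bonus token.
        accepted.append(target_next_token(prefix + accepted))
    return accepted
```

Each verify pass scores all draft tokens in one batched target forward, so when most drafts are accepted, several tokens are emitted per target pass instead of one.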

Benchmark script

import time

import requests

# Send one greedy request to the local SGLang server and time it end to end.
tic = time.time()
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "[INST] Give me a simple FastAPI server. Show the python code. [/INST]",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 256,
        },
    },
)
latency = time.time() - tic
ret = response.json()

print(ret["text"])
# Decoding speed = completion tokens / wall-clock latency.
speed = ret["meta_info"]["completion_tokens"]
print(f"speed: {speed / latency:.2f} token/s")
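As a quick sanity check on the numbers reported above (values copied from this PR description), the relative speedups work out as follows:

```python
# Throughputs reported in this PR, single H100, Llama-2-7b-chat.
baseline = 156       # token/s, normal SGLang decoding
eagle = 297          # token/s, SGLang + EAGLE
eagle_compile = 316  # token/s, SGLang + EAGLE + torch.compile

print(f"EAGLE speedup over baseline: {eagle / baseline:.2f}x")
print(f"EAGLE + torch.compile speedup: {eagle_compile / baseline:.2f}x")
```

That is roughly a 1.90x speedup from EAGLE alone and about 2.03x with torch.compile enabled.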

Some sub PRs:

@merrymercy (Contributor) commented on Jan 2, 2025

TODO:

  • support tp > 1
  • support temp != 0
  • chunked prefill
  • misc: logprob, stop condition, log, metrics, ...
  • (optional) radix cache
  • (minor) support larger top-k
  • (advanced) dynamically set spec parameters according to batch size

@merrymercy merrymercy changed the title Speculative EAGLE2 Eagle speculative decoding part 4: Add EAGLE2 worker Jan 2, 2025
@merrymercy merrymercy merged commit 815dce0 into sgl-project:main Jan 2, 2025
15 checks passed
@zhyncs (Member) commented on Jan 2, 2025

🎉🎉🎉

YAMY1234 pushed a commit to YAMY1234/sglang that referenced this pull request Jan 2, 2025
XiaotongJiang pushed a commit to XiaotongJiang/sglang that referenced this pull request Jan 3, 2025
@Xu-Chen (Contributor) commented on Jan 16, 2025

When the batch size increases, the time taken by eagle_verify_retrive grows considerably. At batch size 10, eagle_verify_retrive takes 0.15 s for the 70B model on 4*A100, which slows overall throughput.
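One plausible reason for this scaling (a back-of-envelope model, not a measurement of the actual kernel): the verify pass must score every draft token for every request, so its work grows with batch_size * num_draft_tokens, and a draft tree tuned for batch size 1 can dominate step time at larger batches. This also motivates the TODO above about setting speculative parameters dynamically by batch size.

```python
def verify_tokens_per_step(batch_size, num_draft_tokens=64):
    """Tokens the verify pass must score per step under this simple model.

    64 is the --speculative-num-draft-tokens value used in this PR's
    example commands; the linear model is illustrative only.
    """
    return batch_size * num_draft_tokens
```

At batch size 10 with 64 draft tokens, that is 640 tokens to verify per step versus 64 at batch size 1.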

@yukavio (Collaborator, Author) commented on Jan 17, 2025

> When the batch size increases, the time taken by eagle_verify_retrive grows considerably. At batch size 10, eagle_verify_retrive takes 0.15 s for the 70B model on 4*A100, which slows overall throughput.

Thanks for your report. I'll confirm this and look for possible solutions.
