
Eagle speculative decoding part 4: Add EAGLE2 worker #2150

Merged · 94 commits · Jan 2, 2025

Conversation

@yukavio (Collaborator) commented on Nov 24, 2024

Support EAGLE speculative decoding. The following results were obtained on a single H100.

Official EAGLE code: 200 token/s

see https://github.com/SafeAILab/EAGLE

Normal decoding speed (SGLang): 156 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf

Eagle decoding speed (SGLang): 297 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7

Eagle decoding speed (SGLang w/ torch.compile): 316 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --enable-torch-compile --cuda-graph-max-bs 2
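For intuition on where the speedup comes from, here is a minimal greedy draft-then-verify sketch of speculative decoding. This is illustrative only, not the SGLang/EAGLE implementation: `target_next_token` and the draft tokens are stand-ins for real model calls, and real EAGLE verifies a top-k tree of drafts rather than a single chain.

```python
def speculative_step(target_next_token, draft_tokens, prefix):
    """Greedily accept draft tokens while they match the target model.

    target_next_token: callable taking a token list, returning the target
    model's greedy next token (a stand-in for a real forward pass).
    draft_tokens: tokens proposed by the cheap draft model.
    Returns the tokens actually emitted this step.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft verified: keep it for free
        else:
            accepted.append(expected)  # mismatch: take the target's token and stop
            break
    else:
        # All drafts accepted; the target contributes one bonus token.
        accepted.append(target_next_token(prefix + accepted))
    return accepted
```

Each verify pass scores all draft tokens in one batched target forward, so when most drafts are accepted, several tokens are emitted per target pass instead of one.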

Benchmark script

import time

import requests

# Send one greedy request to the local SGLang server and time it end to end.
tic = time.time()
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "[INST] Give me a simple FastAPI server. Show the python code. [/INST]",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 256,
        },
    },
)
latency = time.time() - tic
ret = response.json()

print(ret["text"])
# Decoding speed = completion tokens / wall-clock latency.
speed = ret["meta_info"]["completion_tokens"]
print(f"speed: {speed / latency:.2f} token/s")
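As a quick sanity check on the numbers reported above (values copied from this PR description), the relative speedups work out as follows:

```python
# Throughputs reported in this PR, single H100, Llama-2-7b-chat.
baseline = 156       # token/s, normal SGLang decoding
eagle = 297          # token/s, SGLang + EAGLE
eagle_compile = 316  # token/s, SGLang + EAGLE + torch.compile

print(f"EAGLE speedup over baseline: {eagle / baseline:.2f}x")
print(f"EAGLE + torch.compile speedup: {eagle_compile / baseline:.2f}x")
```

That is roughly a 1.90x speedup from EAGLE alone and about 2.03x with torch.compile enabled.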

Some sub PRs:

@merrymercy (Contributor) commented on Jan 2, 2025

TODO:

  • support tp > 1
  • support temp != 0
  • chunked prefill
  • misc: logprob, stop condition, log, metrics, ...
  • (optional) radix cache
  • (minor) support larger top-k
  • (advanced) dynamically set spec parameters according to batch size

@merrymercy merrymercy changed the title Speculative EAGLE2 Eagle speculative decoding part 4: Add EAGLE2 worker Jan 2, 2025
@merrymercy merrymercy merged commit 815dce0 into sgl-project:main Jan 2, 2025
15 checks passed
@zhyncs (Member) commented on Jan 2, 2025

🎉🎉🎉

YAMY1234 pushed a commit to YAMY1234/sglang that referenced this pull request Jan 2, 2025
XiaotongJiang pushed a commit to XiaotongJiang/sglang that referenced this pull request Jan 3, 2025
@Xu-Chen (Contributor) commented on Jan 16, 2025

When the batch size increases, the time taken by eagle_verify_retrive grows considerably. At batch size 10, eagle_verify_retrive takes 0.15 s for the 70B model on 4*A100, which slows overall throughput.
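One plausible reason for this scaling (a back-of-envelope model, not a measurement of the actual kernel): the verify pass must score every draft token for every request, so its work grows with batch_size * num_draft_tokens, and a draft tree tuned for batch size 1 can dominate step time at larger batches. This also motivates the TODO above about setting speculative parameters dynamically by batch size.

```python
def verify_tokens_per_step(batch_size, num_draft_tokens=64):
    """Tokens the verify pass must score per step under this simple model.

    64 is the --speculative-num-draft-tokens value used in this PR's
    example commands; the linear model is illustrative only.
    """
    return batch_size * num_draft_tokens
```

At batch size 10 with 64 draft tokens, that is 640 tokens to verify per step versus 64 at batch size 1.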

@yukavio (Collaborator, Author) commented on Jan 17, 2025

> When the batch size increases, the time taken by eagle_verify_retrive grows considerably. At batch size 10, eagle_verify_retrive takes 0.15 s for the 70B model on 4*A100, which slows overall throughput.

Thanks for your report. I'll confirm this and look for possible solutions.
