Eagle speculative decoding part 4: Add EAGLE2 worker #2150
Conversation
TODO:
🎉🎉🎉
Co-authored-by: kavioyu <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
When the batch size increases, the time taken by eagle_verify_retrive grows considerably. At batch size 10, eagle_verify_retrive takes 0.15 s for the 70B model on 4×A100, which slows overall throughput.
Thanks for the report. I'll confirm this and look for possible solutions.
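For anyone trying to reproduce the scaling, here is a minimal timing sketch under stated assumptions: `verify_step` below is a hypothetical stand-in for the real eagle_verify_retrive kernel (whose signature is not shown in this thread), and only the CUDA-event timing harness around it is the point.

```python
import torch

def time_op(fn, *args, warmup=3, iters=20):
    """Mean GPU latency of fn(*args) in seconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0 / iters  # elapsed_time is in ms

def verify_step(logits, draft_tokens):
    # Hypothetical stand-in for the verification step: accept the longest
    # prefix of draft tokens matching the target model's greedy choice.
    target = logits.argmax(dim=-1)                    # (bs, num_draft)
    match = (draft_tokens == target).int()
    return match.cumprod(dim=-1).sum(dim=-1)          # accepted length per sequence

vocab, num_draft = 32000, 64
for bs in (1, 2, 4, 8, 10):
    logits = torch.randn(bs, num_draft, vocab, device="cuda")
    draft = torch.randint(0, vocab, (bs, num_draft), device="cuda")
    print(f"bs={bs}: {time_op(verify_step, logits, draft) * 1e3:.3f} ms")
```

Substituting the real kernel call for `verify_step` should show whether the latency growth is in the kernel itself or elsewhere in the batch path.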
Support EAGLE speculative decoding. The following results were obtained on a single H100.
- Official EAGLE code: 200 token/s (see https://github.com/SafeAILab/EAGLE)
- Normal decoding speed (SGLang): 156 token/s
- EAGLE decoding speed (SGLang): 297 token/s
- EAGLE decoding speed (SGLang w/ torch.compile): 316 token/s
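For reference, a hedged sketch of how such a token/s number could be measured through sglang's offline Engine API. This is not the PR's benchmark script (collapsed below): the `speculative_*` argument names follow sglang's ServerArgs in later releases and may not match this PR, and the draft model path is only an example checkpoint.

```python
import time
import sglang as sgl

# Assumed engine arguments; check the sglang docs for the exact
# speculative-decoding options in your version.
llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-chat-hf",
    speculative_algorithm="EAGLE",
    speculative_draft_model_path="yuhuili/EAGLE-llama2-chat-7B",  # example draft model
    speculative_num_steps=5,          # autoregressive draft steps per round
    speculative_eagle_topk=8,         # branching factor of the draft token tree
    speculative_num_draft_tokens=64,  # tokens sent to the target model for verification
)

prompt = "Explain speculative decoding in two sentences."
t0 = time.perf_counter()
out = llm.generate(prompt, {"temperature": 0, "max_new_tokens": 256})
elapsed = time.perf_counter() - t0
print(f'{out["meta_info"]["completion_tokens"] / elapsed:.1f} token/s')
```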
Benchmark script
Some sub-PRs: