Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #5016

Closed
cadedaniel opened this issue May 23, 2024 · 5 comments
Closed

Comments

@cadedaniel
Copy link
Collaborator

🚀 The feature, motivation and pitch

Speculative decoding can achieve 50%+ latency reduction, but in vLLM it can suffer from the throughput-optimized default scheduling strategy where prefills are prioritized eagerly. Chunked prefill is a recent work in vLLM which optimizes this by spreading out the prefill work over many different decode batches. We can combine chunked prefill with speculative decoding's dynamic speculation length to get the best of both worlds.

This is a complex task that requires some design, if you're interested please reach out.

Alternatives

No response

Additional context

cc @LiuXiaoxuanPKU @comaniac @rkooo567

@Dbxwz
Copy link

Dbxwz commented Jun 7, 2024

Hello, you mentioned optimizations for scoring time in #4630

P1 (Large) Replace CPU-based batch expansion with multi-query attention kernel call

I think multi-query attention kernel is not equal to MQA here, it is more like the append stage in flashinfer, am I right?
And I notice that the calculation process of append is similar to that of chunked prefill's one step. So I use chunked prefill to implement the AppendTop1Scorer which get a 10% speedup compared to BatchExpansionTop1Scorer. It's a dirty solution, since I create a new SequenceGroupMetadata which change the scoring sequence to a chunked prefill sequence. This implementation conflicts with recompute and chunked prefille.

So the perfect implementation should be that ModelRunner and Backend support the append stage, Backend should already support it if it supports chunked prefill.

In addition, is this issue about solving the scheduling problem of speculative decoding? Can you give a detailed introduction to what needs to be done in this issue?

@cadedaniel
Copy link
Collaborator Author

cadedaniel commented Jun 7, 2024

That's awesome. You should chat with @LiuXiaoxuanPKU who is removing batch expansion from vLLM.

FYI this issue is about combining the ITL improvements obtained from chunked prefill scheduling with spec decode.

@NickLucche
Copy link
Contributor

I can look into this

Copy link

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Dec 24, 2024
@NickLucche
Copy link
Contributor

This issue can be closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants