[RFC]: hide continuous batching complexity through forward context #9098
Comments
There's one alternative: the top-level model (e.g., the top-level Llama model in the Llama case) owns and sets the forward context. This approach works better for encoder-decoder models or multi-modality models.
I'm a bit worried about the lifetime of the forward context. As in, will it be immutable until the forward pass finishes? In the case of asynchronous scheduling, when can this context be updated? Do we intend to perform the update right before dispatch? How about multi-step scheduling?
The lifetime is the same as the model forward. You can treat it as scratch space for the model: the model is free to modify it, but it will be destroyed after the model forward.
Note that the lifetime of the forward context is the same as the model forward, and we never execute two model forward passes concurrently. All of the scheduler and model runner logic is untouched; they can do whatever they want, just as in the current codebase.
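To make that lifetime contract concrete, here is a minimal sketch, assuming a hypothetical module-level variable plus a context manager; the names `set_forward_context` / `get_forward_context` are illustrative, not the exact vLLM API:

```python
# Hypothetical illustration of the lifetime contract described above:
# the forward context lives exactly as long as one model forward pass.
from contextlib import contextmanager
from typing import Any, Optional

_forward_context: Optional[Any] = None  # only one forward pass runs at a time


@contextmanager
def set_forward_context(context: Any):
    """Make `context` visible to the model for the duration of one forward."""
    global _forward_context
    _forward_context = context
    try:
        yield
    finally:
        # Destroyed right after the model forward finishes.
        _forward_context = None


def get_forward_context() -> Any:
    assert _forward_context is not None, "only valid inside a model forward"
    return _forward_context


# Model runner side (pseudo-usage): the context is installed right before
# dispatch and never shared across two concurrent forward passes.
# with set_forward_context(attn_metadata):
#     output = model(input_ids, positions)
```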
Motivation.
Take a look at the current Llama forward computation logic. If we don't consider `attn_metadata` and `kv_caches`, it can be simplified to plain token-wise computation on the input tensors.

Arguably, `attn_metadata` is the most complicated part of the forward computation logic, and it becomes even more complicated when we consider the `torch.compile` logic, where we want to hide the complexity of the attention layer from the compiler.

Therefore, I'm considering hiding the complexity of continuous batching through a forward context. The idea is to have a global forward context, which is set by the model runner during every forward pass. The forward context stores the attention metadata, and the model accesses the attention metadata through the forward context.
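For a rough illustration of the contrast described here — the signatures below are paraphrased, not the exact vLLM code:

```python
from typing import List

import torch


# Roughly what a model forward looks like today: continuous-batching details
# (kv_caches, attn_metadata) are threaded through every model and layer.
def forward_today(
    input_ids: torch.Tensor,
    positions: torch.Tensor,
    kv_caches: List[torch.Tensor],
    attn_metadata: "AttentionMetadata",  # per-step paging/batching details
) -> torch.Tensor:
    ...


# What the forward could be simplified to once attn_metadata and kv_caches
# are hidden behind the forward context: plain token-wise computation.
def forward_simplified(
    input_ids: torch.Tensor,
    positions: torch.Tensor,
) -> torch.Tensor:
    ...
```

The model runner would then install the removed pieces into the forward context right before calling the simplified forward.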
Proposed Change.
The changes are:

- Models under `vllm/model_executor/models` will know nothing about attention metadata and KV cache. They will only know about the input tensors and the output tensors, as if they are just doing token-wise computation.
- Every attention layer will have a new `self.layer_index` attribute, which will be used to index the attention metadata and KV cache in the forward context (see the sketch after this list).

See #9029 and #9097 for initial steps.
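A rough sketch of how an attention layer could use the new `self.layer_index` attribute; the `get_forward_context` helper, the `kv_caches` / `attn_metadata` fields on the context, and the SDPA stand-in are assumptions for illustration, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical accessor matching the lifetime sketch above: the model runner
# installs this object for the duration of a single forward pass.
_forward_context = None


def get_forward_context():
    assert _forward_context is not None, "only valid inside a model forward"
    return _forward_context


class Attention(nn.Module):
    """Illustrative attention layer: no attn_metadata/kv_caches arguments."""

    def __init__(self, layer_index: int) -> None:
        super().__init__()
        # New attribute from the proposal: indexes this layer's slot in the
        # attention metadata and KV cache stored on the forward context.
        self.layer_index = layer_index

    def forward(self, q: torch.Tensor, k: torch.Tensor,
                v: torch.Tensor) -> torch.Tensor:
        ctx = get_forward_context()
        kv_cache = ctx.kv_caches[self.layer_index]  # assumed per-layer list
        attn_metadata = ctx.attn_metadata           # assumed field
        # A real implementation would run paged attention against kv_cache
        # using attn_metadata; a plain SDPA call stands in here.
        return F.scaled_dot_product_attention(q, k, v)
```

With this shape, the layer's forward signature stays pure token-wise computation, which also keeps the attention complexity out of the `torch.compile` graph as described in the motivation.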
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response