[RFC]: Merge input processor and input mapper for multi-modal models #10114
Comments
This is great. In the You initiative here will fit very well with the One other note. The Very excited about this!
I would like to discuss an edge case where passing the input IDs and the multi-modal args is rather useful.
Maybe a solution is to explicitly include a superclass in the model definition that allows such behavior, and deprecate it otherwise?
Maybe we can make a special case and allow token IDs if all other inputs aren't processed by HF.
This seems a bit premature since this new multi-modal processor isn't even usable yet.
The purpose of that is to direct users to this RFC thread, so we can get more thoughts.
Sorry for the spam, it has been fixed in #10530 so the message is now only logged once.
Good news: since we need to apply our own prompt replacement logic anyway, I have opened #11900 to enable this.
One question here: we use our own multimodal LLM that we deploy and call using vLLM's OpenAI server. I'm forced to name our model the same as an existing supported model (like 'phi3_v') in its config.json even though the model has its own template, placeholder tokens etc. The only reason for that seems to be that vLLM tries to insert placeholder tokens if they don't exist in the prompt. Will this change also allow getting rid of that behaviour? This is the code I'm referring to
That is a separate issue regarding online serving.
I'm getting this warning for
Yes, we are planning to migrate all models eventually.
Motivation
Background
To provide more control over the model inputs, we currently define two methods for multi-modal models in vLLM:
- The input processor is called inside `LLMEngine` to extend the prompt with placeholder tokens which are reserved for vLLM features such as KV cache and chunked prefill.
- The input mapper is called inside `ModelRunner` to transform multi-modal inputs (e.g. `PIL` images) into tensor inputs, usually via the modality-specific processor (e.g. `AutoImageProcessor`) from HuggingFace.
Issues with the current design
1. The input processor accepts the output of HF `AutoTokenizer`, a list of token IDs, instead of the text prompt. Since HF `AutoProcessor` doesn’t accept token IDs, we have to write custom code to edit the list of token IDs based on the multi-modal inputs. For some models (such as Phi-3-vision), this means re-implementing code from their HF `AutoProcessor`, complicating the process of porting the model to vLLM.
2. The input mapper, being inside `ModelRunner`, lies on the critical path of vLLM’s model execution. Even when the input mapper is fast, the tail TTFT and TPOT suffer because of this. As the input mapper takes up more time, our overall throughput decreases proportionally, which can be avoided if we move it outside of the critical path. Nevertheless, we can do little if the `AutoProcessor` inside the input mapper is very slow, like in #9238. Hope that huggingface/transformers#33810 can help with that!
3. The current design duplicates work for models with an HF `AutoProcessor` that already performs most of the work for calculating the number of placeholder tokens.
Proposed Change
Unified multi-modal processor
We plan to merge our input processor and input mapper into a unified multi-modal processor (`BaseMultiModalProcessor`) that wraps HF `AutoProcessor`, and call it inside the `LLMEngine` (and thus benefit from #8779), taking the role of the existing tokenizer. After this change, each input type will be processed as follows:
- `AutoTokenizer`) [Unchanged]
- [Deprecated] Pass to vLLM multi-modal processor [NEW]
Automatic prompt replacement
- `BaseMultiModalProcessor._get_prompt_replacements` specifies HF's logic of replacing input placeholder tokens (e.g. `<image>` for a single image) with feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size). Given this specification, we can automatically detect whether HF has replaced the input placeholder tokens by checking whether the feature placeholder tokens exist in the prompt.
- `BaseMultiModalProcessor._apply_prompt_replacements` provides model-agnostic code for automatically replacing input placeholder tokens with feature placeholder tokens. This is only called if we find that HF hasn't done so yet.
This enables the multi-modal processor to accept text/token prompts and process them separately from the multi-modal data. The detailed logic is shown in `BaseMultiModalProcessor._apply_hf_processor_main`.
Processor caching
#11396 caches each item in the multi-modal output of HF processor and links them back to items in the input data.
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
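The lookup-then-merge flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not vLLM's implementation: `ProcessorCache`, `item_key`, and the `process_batch` callable are hypothetical names, items are represented as raw bytes hashed with SHA-256, and the stand-in processor is any function that returns one output per input item.

```python
import hashlib
from typing import Callable, Dict, List


def item_key(item: bytes) -> str:
    """Hypothetical cache key: a hash of the item's raw bytes."""
    return hashlib.sha256(item).hexdigest()


class ProcessorCache:
    """Caches per-item processor outputs (illustrative stand-in only)."""

    def __init__(self, process_batch: Callable[[List[bytes]], List[object]]):
        self._process_batch = process_batch
        self._cache: Dict[str, object] = {}
        self.batches: List[List[bytes]] = []  # every batch sent to the processor

    def process(self, items: List[bytes]) -> List[object]:
        keys = [item_key(item) for item in items]

        # Collect the items that are not cached yet (deduplicated, in order).
        missing: Dict[str, bytes] = {}
        for key, item in zip(keys, items):
            if key not in self._cache and key not in missing:
                missing[key] = item

        # Process all missing items in a single batch, then cache each output.
        if missing:
            batch = list(missing.values())
            self.batches.append(batch)
            outputs = self._process_batch(batch)
            self._cache.update(zip(missing, outputs))

        # Merge: every requested item is now served from the cache.
        return [self._cache[key] for key in keys]
```

With a stand-in processor such as `lambda batch: [len(item) for item in batch]`, a second call that repeats an earlier item only sends the new items to the processor.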
Note that the text/token prompt must be processed separately from the multi-modal data because HF processors expect the input placeholders in the text to correspond to each multi-modal data item, but we only want to process the items that are missing. We can handle this elegantly using automatic prompt replacement (see above).
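To make the automatic prompt replacement referenced here more concrete, below is a minimal sketch of the detect-then-replace idea on a list of token IDs. It does not reproduce `BaseMultiModalProcessor`; the function names, the count-based detection rule, and the token ID values are assumptions chosen for illustration.

```python
from typing import List


def is_already_replaced(token_ids: List[int], feature_id: int,
                        feature_sizes: List[int]) -> bool:
    """Assumed detection rule: the prompt already contains exactly as many
    feature placeholder tokens as the multi-modal items require."""
    return token_ids.count(feature_id) == sum(feature_sizes)


def apply_prompt_replacements(token_ids: List[int], placeholder_id: int,
                              feature_id: int,
                              feature_sizes: List[int]) -> List[int]:
    """Expand the i-th input placeholder into feature_sizes[i] feature tokens,
    unless the expansion has already been done."""
    if is_already_replaced(token_ids, feature_id, feature_sizes):
        return list(token_ids)

    if token_ids.count(placeholder_id) != len(feature_sizes):
        raise ValueError("number of placeholders does not match number of items")

    sizes = iter(feature_sizes)
    expanded: List[int] = []
    for token in token_ids:
        if token == placeholder_id:
            # One input placeholder becomes a run of feature placeholders.
            expanded.extend([feature_id] * next(sizes))
        else:
            expanded.append(token)
    return expanded
```

Because the function is a no-op on an already-expanded prompt, it can be applied regardless of whether an upstream processor has performed the replacement, which is what allows the prompt to be handled separately from the multi-modal data.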
Deprecate token IDs with multi-modal input
To be compatible with OpenAI’s (legacy) Completions API, we currently support passing token IDs directly to both the `LLM` class and the OpenAI-compatible server. However, the Completions API doesn’t support multi-modal inputs, so we will deprecate passing token IDs alongside multi-modal inputs to simplify model implementation (see Issue 1 above). Please tell us if you have a use case for this and don’t want to see it removed!
Feedback Period
Feel free to comment as the effort progresses!
Timeline
- `MultiModalInputs` to `MultiModalKwargs` #10040
The majority of our code will be called inside the existing `InputPreprocessor`, which is separated from the vLLM engine, making it easy to integrate with #8779.
CC List
@ywang96 @Isotr0py @WoosukKwon @robertgshaw2-neuralmagic
Any Other Things
Multi-modal plugins remain supported
Migrating multi-modal plugins
You can define additional input modalities (`ModalityDataItems`) and parse them in subclasses of `MultiModalDataParser` on a per-model basis. Afterwards, override `BaseMultiModalProcessor._get_data_parser` to construct your newly-defined parser.
Some users currently use multi-modal plugins to directly pass custom model inputs (#6260). Those inputs can be excluded from HF processing by returning them in `ModalityDataItems.get_passthrough_data` instead of `ModalityDataItems.get_processor_data`.
No batched preprocessing for now
Currently, preprocessing is performed per prompt in vLLM. While we can call the HF tokenizer and modality-specific processor on batched inputs separately, calling the wrapping HF `AutoProcessor` with both a list of texts and a list of multi-modal data results in the processed multi-modal data (e.g. image) being assigned to every text in the list, rather than the more intuitive `zip`-like behavior (e.g. the `i`th image only assigned to the `i`th text). To support batched preprocessing, we would have to write custom code for each model to combine the outputs of the HF tokenizer and modality-specific processors. Given that this can significantly complicate model implementation (see Issue 1 above), we will not consider batched preprocessing at this stage, even with this change.
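The difference between the two pairing behaviors described above can be illustrated in plain Python. This sketch does not call any HF processor; `broadcast_pairing` and `zip_pairing` are hypothetical helpers that only model which images end up attached to which text.

```python
from typing import List, Tuple


def broadcast_pairing(texts: List[str],
                      images: List[str]) -> List[Tuple[str, List[str]]]:
    """Models the behavior described above: every image goes to every text."""
    return [(text, list(images)) for text in texts]


def zip_pairing(texts: List[str],
                images: List[str]) -> List[Tuple[str, List[str]]]:
    """Models the zip-like alternative: the i-th image goes to the i-th text."""
    if len(texts) != len(images):
        raise ValueError("zip-like pairing needs one image per text")
    return [(text, [image]) for text, image in zip(texts, images)]
```

For two texts and two images, the first helper attaches both images to each text, while the second attaches exactly one image to each text.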