[Core] Consolidate prompt arguments to LLM engines #4328
Conversation
Force-pushed from d6e9aa2 to 5d42800.
@ywang96 Any thoughts about this?
Hey @DarkLight1337! Sorry I've been a bit busy lately, but I will surely take a look in the upcoming week! Apologies for the delay!
Thank you very much for the contribution @DarkLight1337 and sorry for the delayed review. I left a few comments and questions. Please take a look!
vllm/engine/llm_engine.py (Outdated)
# Create the sequence group.
seq_group = SequenceGroup(request_id, [seq], sampling_params,
                          arrival_time, lora_request, multi_modal_data)
processed_inputs = self.encode_request(request_id=request_id,
Right before we call `encode_request` is where I think we could combine `prompt`, `prompt_token_ids` and `multi_modal_data` into `PromptInputs`.
I think we should do this directly on the API class (`LLM`) so that we do not have to involve the internals when finally deprecating the old API.
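For illustration, consolidating the three per-prompt arguments on the `LLM` class might look roughly like the sketch below; the `PromptDict` shape is an assumption based on the keys that appear in the diff excerpts later in this thread, not the PR's actual definition.

```python
from typing import Any, List, TypedDict, Union


class PromptDict(TypedDict, total=False):
    # Keys assumed from the diff excerpts below; the real definition may differ.
    prompt: str
    prompt_token_ids: List[int]
    multi_modal_data: Any


# A plain string, or a dict bundling all per-prompt fields.
PromptInputs = Union[str, PromptDict]


def normalize_inputs(inputs: PromptInputs) -> PromptDict:
    # Wrap bare strings so downstream code only has to handle one shape.
    if isinstance(inputs, str):
        return {"prompt": inputs}
    return inputs
```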
- To facilitate equality tests, `CompletionOutput` is now a dataclass
Force-pushed from 0c57f28 to 37baf38.
I think we should keep the `prompts`/`prompt_token_ids` inputs for backward compatibility?
These arguments are still being maintained. I just decorated the relevant methods so we can deprecate them simply by turning on the corresponding flag.
Pretty sure the CI will pass now.
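For context, the decorator pattern being described might look roughly like this; `deprecate_kwargs` and `is_deprecated` are illustrative names, not the PR's actual helper.

```python
import functools
import warnings
from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])


def deprecate_kwargs(*names: str,
                     is_deprecated: Callable[[], bool] = lambda: False) -> Callable[[F], F]:
    """Warn about the named keyword arguments once the deprecation flag is on."""

    def wrapper(fn: F) -> F:

        @functools.wraps(fn)
        def inner(*args: Any, **kwargs: Any) -> Any:
            if is_deprecated():
                used = [name for name in names if kwargs.get(name) is not None]
                if used:
                    warnings.warn(
                        f"The keyword arguments {used} may be deprecated in a "
                        "future release.",
                        DeprecationWarning,
                        stacklevel=2)
            return fn(*args, **kwargs)

        return inner  # type: ignore[return-value]

    return wrapper


# Usage: the old keywords keep working; the warning only fires when the flag is on.
@deprecate_kwargs("prompt", "prompt_token_ids", is_deprecated=lambda: True)
def generate(inputs: Any = None, *, prompt: Any = None,
             prompt_token_ids: Any = None) -> None:
    ...
```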
Thanks @DarkLight1337, I added some comments inline
if isinstance(inputs, str):
    inputs = {"prompt": inputs}

if "prompt_token_ids" not in inputs:
nit: better to do

prompt_token_ids = inputs.get("prompt_token_ids")
if prompt_token_ids is None:
    # ...
Doing so actually fails the type checker (I'm using Pyright on VSCode), so I'm reluctant to change it to this form.
What does it complain about? Couldn't additional hints be added for that e.g.
prompt_token_ids: List[int] = inputs.get("prompt_token_ids")
if isinstance(inputs, str):
    inputs = {"prompt": inputs}

if "prompt_token_ids" not in inputs:
suggest

prompt_token_ids = inputs.get("prompt_token_ids")
if prompt_token_ids is not None:
    # ...
As above
    arrival_time: Optional[float] = None,
    lora_request: Optional[LoRARequest] = None,
    multi_modal_data: Optional[MultiModalData] = None,
) -> AsyncStream:
    if self.log_requests:
I guess not related to this PR, but I feel like the body of this `if` should go into a separate function.
This bothers me as well since it's otherwise redundant to pass both `prompt` and `prompt_token_ids` at the same time. Perhaps it would be better to move this to the OpenAI server in another PR?
Sure
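For illustration, the suggested extraction could be as small as a helper like the one below; the function name and logged fields are hypothetical, since the body of the `if` is not shown in the excerpt.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def log_request(request_id: str,
                prompt: Optional[str],
                prompt_token_ids: Optional[List[int]]) -> None:
    # Hypothetical helper wrapping the body of `if self.log_requests:` so that
    # the add_request path stays short.
    logger.info("Received request %s: prompt=%r, prompt_token_ids=%s",
                request_id, prompt, prompt_token_ids)
```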
vllm/engine/llm_engine.py (Outdated)
    eos_token_id = self.tokenizer.get_lora_tokenizer(
        lora_request).eos_token_id
else:
    logger.warning("Use None for EOS token id because tokenizer is "
I'm not sure that we want to log a warning on every request. This will be a lot of log output for the no-tokenizer cases.
Hmm - this is probably an oversight from me. I thought this warning was added at engine initialization time, not inference time. IMO we should move it to the initialization time for sure.
It may not be a good idea to warn during initialization time if the user indeed intends the tokenizer to not be used.
I have updated the code so that the warning still occurs on request but is only emitted at most once during the lifetime of the engine.
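For illustration, the at-most-once behaviour described here can be sketched with a simple instance flag; the class and attribute names are hypothetical.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class EngineSketch:
    """Warn on request, but at most once per engine instance."""

    def __init__(self, tokenizer=None) -> None:
        self.tokenizer = tokenizer
        self._warned_missing_tokenizer = False

    def get_eos_token_id(self) -> Optional[int]:
        if self.tokenizer is None:
            if not self._warned_missing_tokenizer:
                logger.warning("Using None for EOS token id because the "
                               "tokenizer is not initialized.")
                self._warned_missing_tokenizer = True
            return None
        return self.tokenizer.eos_token_id
```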
It may not be a good idea to warn during initialization time if the user indeed intends the tokenizer to not be used.
I'm not sure that I understand this, I think a single warning is ok in this case, it's only a warning. And a single warning will still be logged with this latest change right? Just at the time of the first request rather than llm engine construction
If the user didn't intend to use the tokenizer, then the warning might cause some confusion during startup. True that it's not a big deal either way though.
I personally vote for moving this to engine initialization time since that will make our codebase cleaner. (no need for another attribute of engine just for warning count)
Upon inspection of the docs, it seems that manually avoiding repeated warnings is actually unnecessary:
Warning messages are normally written to sys.stderr, but their disposition can be changed flexibly, from ignoring all warnings to turning them into exceptions. The disposition of warnings can vary based on the warning category, the text of the warning message, and the source location where it is issued. Repetitions of a particular warning for the same source location are typically suppressed.
Therefore, we can keep the warning in the code for handling the request.
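For reference, a minimal self-contained example of the stdlib behaviour quoted above, using `warnings.warn` (the message text is illustrative): repeated warnings from the same source location are only printed once under the default filters.

```python
import warnings


def handle_request() -> None:
    # Under the default filters, repeated warnings issued from the same
    # source location (module + line number) are printed only once.
    warnings.warn("Using None for EOS token id because tokenizer is not initialized.")


for _ in range(3):
    handle_request()  # the warning is shown for the first call only
```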
@njhill I have responded to your comments.
@DarkLight1337 I just went through this PR again and made a change to move the offline API reference under the developer doc. #4710 was a great addition, but I think we should have links in the examples to the developer doc instead of putting the API reference there directly.
I'm going to give this PR a greenlight, and thank you @DarkLight1337 for working on this PR and addressing the comments from reviewers.
@njhill @rkooo567 If you have any other concern feel free to leave a comment, otherwise I'd appreciate an approval from either of you as well. Thanks!
(P.S. @DarkLight1337 in the future, it would be nice to break a big PR like this down further into smaller PRs; that way it's easier for us to review and get things merged in a progressive way.)
Thanks again @DarkLight1337 @ywang96!
Co-authored-by: Roger Wang <[email protected]>
Currently, `LLM.generate` (and similar methods in `LLMEngine` and `AsyncLLMEngine`) accept `prompt`, `prompt_token_ids` and `multi_modal_data` separately. This PR consolidates them into `PromptInputs` so that only a single argument has to be passed in, using type annotations to ensure a consistent format. This reduces the chance for the user to accidentally pass in different lengths of `prompt`, `prompt_token_ids`, and `multi_modal_data` (the related checks have been removed to avoid redundant code). On the other hand, `sampling_params` remains untouched because it is common to pass only a single instance even for multiple prompts.

This would also make it easier to define the interface for processing the inputs using a HuggingFace processor, as mentioned in #4194.
API changes
The APIs of `LLM.generate` and `LLM.encode` have been updated, where the parameters `prompt`, `prompt_token_ids` and `multi_modal_data` are replaced with `inputs`. Currently, we still maintain the old API but it may be deprecated in a future release. Users may update their code as follows:

Single prompt:
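The original code snippets were not preserved in this excerpt; the following is a sketch of the before/after call pattern implied by the description above (the model name and the exact old keyword names are assumptions).

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8)

# Old API (still accepted for now): token IDs go through a separate keyword.
outputs = llm.generate(prompt_token_ids=[[1, 2, 3]], sampling_params=params)

# New API: a single consolidated argument, either a plain string...
outputs = llm.generate("Hello, my name is", sampling_params=params)

# ...or a dict bundling the fields for one prompt.
outputs = llm.generate({"prompt_token_ids": [1, 2, 3]}, sampling_params=params)
```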
Multiple prompts:
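Likewise, a sketch for the batched case, reusing `llm` and `params` from the previous snippet; the keyword names follow the discussion above and are not guaranteed to match the code exactly.

```python
# Old API: parallel lists whose lengths had to match.
outputs = llm.generate(
    prompts=["Hello, my name is", "The capital of France is"],
    prompt_token_ids=[[1, 2, 3], [4, 5, 6]],
    sampling_params=params,
)

# New API: one list, each element bundling the fields for a single prompt.
outputs = llm.generate(
    [
        {"prompt": "Hello, my name is"},
        {"prompt_token_ids": [4, 5, 6]},
    ],
    sampling_params=params,
)
```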
Based on the examples in the documentation, most users should already prefer the first way of calling `LLM.generate`; those users need not make any changes.

Other changes
By setting `gpu_memory_utilization` to a smaller value, we can now run multiple `LLM` instances and OpenAI servers at the same time. To better manage GPU utilization between entrypoints tests, I have grouped the tests using pytest markers so we no longer have to refer to the file name directly. This makes it easier to edit the commands used to test multiple files in the same pytest session.
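For reference, this is roughly how marker-based grouping works in pytest; the marker name `openai` is a placeholder rather than necessarily the one used in the PR.

```python
import pytest


# Markers are registered in the pytest configuration (e.g. the `markers` list
# under [tool.pytest.ini_options]) so that CI can select this group with
# `pytest -m openai` instead of listing test file paths.
@pytest.mark.openai
def test_completion_api() -> None:
    assert True
```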