-
-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Frontend][OpenAI] Add support for OpenAI tools calling #4656
Conversation
@Xwdit thanks for taking my feedback! |
efbcf5f
to
eb9bc07
Compare
if stream: | ||
text_message = "" | ||
for chunk in response: | ||
if chunk.choices[0].finish_reason is not None: | ||
if chunk.choices[0].finish_reason == "tool_calls": | ||
tool_calls += chunk.choices[0].delta.tool_calls | ||
# print("TEST : %s" % chunk.choices[0].delta.tool_calls) | ||
break | ||
if chunk.choices[0].delta.content is not None: | ||
text_message += chunk.choices[0].delta.content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, probably the better way to handle this is to handle the response stream one chunk or token at a time. If you get a token indicating a tool call (such as <tool_call>
) at the start of the response then you want to wait for the entire response from the LLM so that you can invoke the tool. if you get a non-meta or non-control token (e.g. a normal streaming text chat response) then you probably want to start showing the streaming tokens to the user immediately, avoiding the latency of the entire response. But, This is also an example so I'm aware it's not necessary for it to be optimized.
I have a couple ideas to add, that I can take a crack at this weekend if you would be open to it!
|
elif isinstance(request.messages, | ||
List) and len(request.messages) >= 1: | ||
request.messages[0].content = request.messages[ | ||
0].content.strip() + "\n\n" + text_inject |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just adds the system prompt to the first message. In almost every case tool calls should be a system prompt message (except for chat formats that don't do system prompts, but that's a marginal amount), so instead of checking for the first message and just adding the tool call prompt, it might be better to do something like this:
if isinstance(request.messages, str):
# suggested change: for string prompts, put the tool usage before the prompt
request.messages = text_inject + '\n\n' + request.messages.strip()
elif isinstance(request.messages, List) and len(request.messages) >= 1:
# suggested change: check to see if the first message is a system message. If so, edit it.
# otherwise, add a system prompt
if request.messages[0].role == 'system':
request.messages[0].content = request.messages[0].content.strip() + '\n\n' + text_inject
else:
request.messages = [{'role': 'system', 'content': text_inject}] + request.messages
Hello, sorry I just saw this message; thank you very much for being willing to contribute to this PR and VLLM. If you are willing, I would greatly appreciate it |
What is the preferred way to contribute to a PR? Should I create a new one with all the changes in this one and additions, or create a PR into your branch on your fork? |
It's fine with me either way, do as you prefer |
1a5acc4
to
664642a
Compare
I have rebased the PR to the latest commit on vllm, don't forget to sync before you start working ;) |
3a2e8ef
to
24e4c51
Compare
I have been thinking about this a lot and I have a few more thoughts: 1. Ensuring that tools are in a system promptI think that this would be a good change to make. I added it in the branch that I'm making, and will PR it into this branch on your fork @Xwdit - then if/when it's merged, it will be reflected in this PR 2. Guided Generation / OutlinesAfter digging through the code some more, I realized that guided output is only forced if you use 3. Tool Choice System Prompt FormattingI was thinking more about how to handle tool usage and tool calling, and I think that it's probably just easiest and best to let people provide a jinja template for the tool usage system prompt, with a FamiliarityUsers are in many cases familiar with Jinja already, since it is commonly used for chat message formatting including in this framework. (for OpenAI API-compatible chat completions, see the Additionally, many models in hugging face specify a chat prompt jinja template in their Flexibility & Future-ProofAllowing vLLM users to specify their own Jinja template for the tool usage system prompt, similar to allowing them to specify a Jinja template for the chat prompt formatting, is flexible and allows for a wide variety of models and templates. Users would not be locked in to the assumptions that we currently make about the tool usage system prompt instruction that are apparent in the structure of the Version ControlUsing Jinja template for function calling system prompt formatting makes it easier for people to contribute additional jinja templates for their favorite models & have those be tracked in version control for easy usage, similar to how popular chat model templates are tracked as jinja templates (e.g. Please let me know what you think @Xwdit and if you agree or disagree. I haven't finished the implementation yet since it won't be a small amount of work, and I don't want to waste effort on it if you feel like jinja templates for tool usage system prompts isn't the right path forward. |
If you agree that it's a good idea, I will be happy to take a crack at doing the jinja implementation myself :) |
Co-Authored-By: FlorianJoncour <[email protected]>
Sorry for any late response 😢; All these ideas look great. Thank you very much for taking the time to work on this, I have no problem with them😉 |
Any updates on this PR? |
work in progress still; I will try to knock this out in the next few days - got busy with some life stuff. |
WIP on implementing the changes in #5649 |
This PR updated from #3237 , supported latest version of vLLM (v0.4.2), added more flexibility and fixed some issues.