Move lm model to async infer #1425
glm4-nano-chat-v020-int4 & llm bench (streaming includes the tokenizer.decode part and putting text into the queue; printing is not included):
It looks like the time change happens just because async/wait was added. But infer takes ~59100 ms and streaming takes ~1500 ms, so it seems infer should fully cover streaming in time.
@@ -130,7 +131,7 @@ std::pair<EncodedResults, std::optional<int64_t>> get_lm_encoded_results(
beam_offets.insert({sequence_groups.at(i)->get_request_id(), i});

SamplerOutput sampler_output = sampler.sample(sequence_groups, logits);
stream_generated_tokens();
Probably, if we start streaming tokens here in a dedicated thread, it could be faster, since streaming would be overlapped with:
- embedding model
- llm model for next token
- sampling for next token
while currently streaming is overlapped with the LLM only.
We could re-use SynchronizedQueue from the GenAI source: the streaming callback would push to the queue, and a dedicated streaming thread would read from this queue using the pull method.
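To make the suggestion concrete, here is a minimal sketch of the push/pull pattern with a dedicated streaming thread. It is an illustration only: `SketchQueue`, `close()`, and `generate_with_streaming_thread` are hypothetical names written for this comment, not the actual `SynchronizedQueue` API from the GenAI source, and the "decode" step is simulated by formatting the token id.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a synchronized queue (NOT the GenAI class).
template <typename T>
class SketchQueue {
public:
    // Producer side: called from the generation loop, returns quickly.
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_ && "push after close");
            items_.push(std::move(value));
        }
        cv_.notify_one();
    }

    // Consumer side: blocks until an item arrives, or returns
    // std::nullopt once the queue is closed and fully drained.
    std::optional<T> pull() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

    // Signals that no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Simulated generation loop: each step only pushes a token id, while the
// dedicated thread does the (simulated) decode and collects the text.
std::string generate_with_streaming_thread(const std::vector<int>& token_ids) {
    SketchQueue<int> queue;
    std::string streamed;  // touched only by the streamer until join()

    std::thread streamer([&] {
        while (auto token = queue.pull()) {
            streamed += "<" + std::to_string(*token) + ">";  // stand-in for decode + callback
        }
    });

    for (int id : token_ids) {
        // Embedding, LLM infer, and sampling for the next token would run here.
        queue.push(id);
    }
    queue.close();
    streamer.join();
    return streamed;
}
```

With this shape, the generation loop pays only the cost of a `push` per token, so decode and callback work can overlap with every stage of the next step rather than with LLM inference alone. Whether that is a net win depends on how expensive the callback is relative to the thread hand-off.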
glm4-nano-chat-v020-int4 & c++ sample (streaming includes tokenizer.decode (TextCallbackStreamer) and print (callback function)):