My version

Assertion fails under load.

I don't know how this is possible, because for all my requests:

- `input_length <= 7168`
- `max_new_tokens = min(4096, 8192 - input_length)`

Moreover, the Executor additionally checks this invariant.

My only idea is that `tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep` is setting a wrong `max_new_tokens` for `decoder_batch::Request` (under certain conditions).
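For concreteness, here is a minimal sketch of the invariant described above (the constants `kMaxSeqLen` and `kMaxOutputLen` and the helper name are assumptions for illustration, not TensorRT-LLM code): every request is constructed so that prompt plus generation fits within the sequence budget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int32_t kMaxSeqLen = 8192;    // engine max sequence length (assumed)
constexpr int32_t kMaxOutputLen = 4096; // per-request output cap (assumed)

// Mirrors the rule from the report: max_new_tokens = min(4096, 8192 - input_length).
int32_t computeMaxNewTokens(int32_t inputLength)
{
    return std::min(kMaxOutputLen, kMaxSeqLen - inputLength);
}

int main()
{
    // input_length <= 7168 for all requests in the reported setup.
    for (int32_t inputLength : {128, 4096, 7168})
    {
        int32_t maxNewTokens = computeMaxNewTokens(inputLength);
        // The invariant the Executor re-checks: prompt + generation fits the budget.
        assert(inputLength + maxNewTokens <= kMaxSeqLen);
    }
    return 0;
}
```

Under these assumptions the invariant holds for every admissible `input_length`, which is why an in-flight violation would point at something rewriting `max_new_tokens` after submission.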
Hi @akhoroshev, thank you for taking the time to report the issue. From just looking at the code, the logic seems correct to me; I see no way `max_new_tokens` can be equal to 4095. The check in `GenericLlmRequest::validate` is called only via the Executor API; the old GptManager API does not call it.