
[bug] forwardAsync assertion failed #2494

Open
akhoroshev opened this issue Nov 25, 2024 · 3 comments
Labels: Generic Runtime, triaged (Issue has been triaged by maintainers)
akhoroshev (Contributor) commented Nov 25, 2024

My version

The assertion fails under load:

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: Input length (6973) + max new tokens (4095) + draft tokens (0) must be less than max sequence length (8192). (/sources/contrib/tensorrt-llm/cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp:444)
1       0x7fa8df465992 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 78
2       0x7fa8df66b693 tensorrt_llm::runtime::GptDecoderBatched::newRequest(int, tensorrt_llm::runtime::decoder_batch::Request const&, tensorrt_llm::runtime::SamplingConfig const&) + 4307
3       0x7fa8df66d7cc tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&) + 172
4       0x7fa8e15f93c5 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 725
5       0x7fa8e15fbb90 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 3792
6       0x7fa8e1625a71 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 353
7       0x7fa8e162a97f tensorrt_llm::executor::Executor::Impl::executionLoop() + 895
8       0x7fa8bafaba80 /opt/wmcore/lib/libtensorrt_llm_nvrtc_wrapper.so(+0x32c5a80) [0x7fa8bafaba80]
9       0x7fa8720d01ca /lib64/libpthread.so.0(+0x81ca) [0x7fa8720d01ca]
10      0x7fa87140de73 clone + 67
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191  = maxSequenceLen - 1 since chunked context is enabled

I don't know how this is possible, because:

  1. for all my requests, input_length <= 7168
  2. for all my requests, max_new_tokens = min(4096, 8192 - input_length) (see the sketch below)

Moreover, the Executor additionally checks this invariant.

My only idea is that tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep sets a wrong max_new_tokens on the decoder_batch::Request under certain conditions.
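A minimal sketch of the clamp from point 2 (a hypothetical helper, not my actual client code; the 4096 cap and the 8192 limit are the values from the formula and the engine config above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper illustrating the client-side clamp from point 2.
// kMaxSequenceLen mirrors maxSequenceLen (8192) from the engine config;
// kMaxNewTokensCap is the 4096 cap used in the formula above.
int32_t clampMaxNewTokens(int32_t inputLength)
{
    constexpr int32_t kMaxSequenceLen = 8192;
    constexpr int32_t kMaxNewTokensCap = 4096;
    int32_t const maxNewTokens = std::min(kMaxNewTokensCap, kMaxSequenceLen - inputLength);
    // With this clamp, inputLength + maxNewTokens never exceeds kMaxSequenceLen.
    assert(inputLength + maxNewTokens <= kMaxSequenceLen);
    return maxNewTokens;
}
```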

hello-11 added the triaged (Issue has been triaged by maintainers) and runtime labels on Nov 25, 2024
nekorobov (Collaborator) commented
Hi @akhoroshev, thank you for taking the time to report the issue. From just looking at the code, the logic seems correct to me; I see no way max_new_tokens could end up equal to 4095. The check in GenericLlmRequest::validate is called only via the Executor API; the old GptManager API does not call it.

Could you share a reproducer, please?

nekorobov self-assigned this on Nov 25, 2024
akhoroshev (Contributor, Author) commented Nov 25, 2024

@nekorobov

> From just looking at the code, the logic seems correct to me; I see no way max_new_tokens could end up equal to 4095.

It happens under load. For example, it is possible to have two (or more) requests in flight:

  1. input_length=4097, max_new_tokens=4095
  2. input_length=6973, max_new_tokens=1219

Both are valid (GenericLlmRequest::validate was called, since I use the Executor API), yet the assertion fails.
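To make the arithmetic explicit (a hypothetical illustration of the sums involved, not the decoder source):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // maxSequenceLen from the engine config in the issue description.
    constexpr int32_t kMaxSeqLen = 8192;
    struct Case { int32_t inputLen; int32_t maxNewTokens; } const cases[] = {
        {4097, 4095}, // request 1 as submitted: sum = 8192
        {6973, 1219}, // request 2 as submitted: sum = 8192
        {6973, 4095}, // pairing printed by the assertion: sum = 11068, well over the limit
    };
    for (auto const& c : cases)
    {
        std::printf("input=%d max_new=%d sum=%d (maxSequenceLen=%d)\n", c.inputLen,
                    c.maxNewTokens, c.inputLen + c.maxNewTokens, kMaxSeqLen);
    }
    return 0;
}
```

The third case matches the error message: request 2's input length paired with request 1's max_new_tokens, which is consistent with the decoder picking up the wrong max_new_tokens.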

akhoroshev (Contributor, Author) commented

> Could you share a reproducer, please?

I can't because it's a closed model.
