quantize.py fails to export important data to config.json (eg rotary scaling) #1676

Closed
1 of 4 tasks
janpetrov opened this issue May 25, 2024 · 23 comments
Labels: bug (Something isn't working), Investigating, triaged (Issue has been triaged by maintainers)

Comments

@janpetrov
Contributor

janpetrov commented May 25, 2024

System Info

4x NVIDIA H100, TensorRT-LLM backend 0.9.0

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

(1) Have a HF transformers model with linear rope scaling.

(2) Edit is_linear in /usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py as follows (adding the and ("Rotary"... part):

def is_linear(module: nn.Module) -> bool:
    """Returns whether the module is a linear layer."""
    return any([k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"]]) and ("Rotary" not in type(module).__name__)

so that the rope scaling model is exported (without crashing on an error that weights cannot be exported from the Rotary scaling layer, see this issue).

(3) Then run, as recommended here:

python examples/quantization/quantize.py \
    --model_dir "$MODEL_DIR" \
    --dtype bfloat16 \
    --output_dir "$TMP_DIR" \
    --tp_size 2 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512
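Why step (2) is needed can be illustrated without ammo or PyTorch installed. The sketch below uses empty stand-in classes (hypothetical, named after the real module types) and compares a name-based check without the "Rotary" exclusion against the patched check from step (2). It assumes the unpatched function is the same expression minus the exclusion, which is what step (2) implies but is not quoted in this report.

```python
# Stand-in classes: only their class names matter for the name-based check.
class Linear: ...
class LlamaLinearScalingRotaryEmbedding: ...
class LlamaRMSNorm: ...

LINEAR_NAME_PARTS = ["Linear", "Conv1D", "NormHead"]


def is_linear_original(module) -> bool:
    """Assumed unpatched check: any known substring in the class name."""
    name = type(module).__name__
    return any(part in name for part in LINEAR_NAME_PARTS)


def is_linear_patched(module) -> bool:
    """Patched check from step (2): additionally exclude Rotary modules."""
    name = type(module).__name__
    return any(part in name for part in LINEAR_NAME_PARTS) and "Rotary" not in name


for module in (Linear(), LlamaLinearScalingRotaryEmbedding(), LlamaRMSNorm()):
    print(type(module).__name__, is_linear_original(module), is_linear_patched(module))
```

The rotary embedding class name contains the substring "Linear", so the unpatched check treats it as a linear layer, and the exporter then looks for a weight attribute that the module does not have.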

Expected behavior

quantize.py should generate a detailed config.json file in the output dir. The subsequent run of

trtllm-build \
    --checkpoint_dir "$TMP_DIR" \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_input_len 16384 \
    --max_output_len 16384 \
    --max_batch_size 8 \
    --strongly_typed \
    --workers 2 \
    --output_dir "$OUTPUT_DIR" \
    --multi_block_mode enable

should build a well-working engine.

Actual behavior

The config.json generated by quantize.py contains just the following (note, e.g., that the rope scaling is missing). The engine built by trtllm-build generates nonsense.

{
    "producer": {
        "name": "ammo",
        "version": "0.7.4"
    },
    "architecture": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
    "norm_epsilon": 1e-05,
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "hidden_act": "silu",
    "use_parallel_embedding": true,
    "embedding_sharding_dim": 0,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "mapping": {
        "world_size": 2,
        "tp_size": 2,
        "pp_size": 1
    },
    "head_size": 128,
    "intermediate_size": 28672,
    "position_embedding_type": "rope_gpt_neox",
    "rotary_base": 10000.0
}

Additional notes

When I edit the config.json to have the following contents and then re-run trtllm-build, the resulting engine generates sensible text.

{
    "producer": {
        "name": "ammo",
        "version": "0.7.4"
    },
    "architecture": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "logits_dtype": "float32",
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "hidden_size": 8192,
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "head_size": 128,
    "hidden_act": "silu",
    "intermediate_size": 28672,
    "norm_epsilon": 1e-05,
    "position_embedding_type": "rope_gpt_neox",
    "use_parallel_embedding": true,
    "embedding_sharding_dim": 0,
    "mapping": {
        "world_size": 2,
        "tp_size": 2,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "rotary_scaling": {
        "factor": 4.0,
        "type": "linear"
    },
    "moe_normalization_mode": null,
    "rotary_base": 10000.0,
    "moe_num_experts": 0,
    "moe_top_k": 0,
    "moe_tp_mode": 2,
    "attn_bias": false,
    "disable_weight_only_quant_plugin": false,
    "mlp_bias": false
}
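The manual edit can also be scripted. The following is a minimal sketch, not part of TensorRT-LLM: the helper name add_rotary_scaling is hypothetical, and it only copies the Hugging Face rope_scaling entry into the checkpoint config under the rotary_scaling key seen above. The other fields added by hand (logits_dtype, the moe_* entries, and so on) are not handled.

```python
import json


def add_rotary_scaling(hf_config: dict, ckpt_config: dict) -> dict:
    """Return a copy of ckpt_config with rotary_scaling taken from hf_config.

    hf_config is the Hugging Face config.json content; ckpt_config is the
    config.json written by quantize.py. Neither input is modified.
    """
    patched = dict(ckpt_config)
    rope_scaling = hf_config.get("rope_scaling")
    if rope_scaling is not None:
        patched["rotary_scaling"] = dict(rope_scaling)
    return patched


# Abbreviated excerpts of the two configs shown in this report.
hf_config = {"rope_scaling": {"factor": 4.0, "type": "linear"}, "rope_theta": 10000.0}
ckpt_config = {"position_embedding_type": "rope_gpt_neox", "rotary_base": 10000.0}

print(json.dumps(add_rotary_scaling(hf_config, ckpt_config), indent=4))
```

In practice the two dictionaries would be loaded with json.load from $MODEL_DIR/config.json and $TMP_DIR/config.json, and the result written back before running trtllm-build.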

Please note that when the input to trtllm-build is generated by examples/llama/convert_checkpoint.py (and not by examples/quantization/quantize.py), the config.json looks as follows. This is for the same model but without quantization. Note the much richer data, including rotary scaling.

{
    "architecture": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "logits_dtype": "float32",
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "hidden_size": 8192,
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "head_size": 128,
    "hidden_act": "silu",
    "intermediate_size": 28672,
    "norm_epsilon": 1e-05,
    "position_embedding_type": "rope_gpt_neox",
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "mapping": {
        "world_size": 4,
        "tp_size": 4,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": null,
        "kv_cache_quant_algo": null,
        "group_size": 128,
        "smoothquant_val": null,
        "has_zero_point": false,
        "pre_quant_scale": false,
        "exclude_modules": [
            "lm_head"
        ]
    },
    "kv_dtype": "bfloat16",
    "rotary_scaling": {
        "factor": 4.0,
        "type": "linear"
    },
    "moe_normalization_mode": null,
    "rotary_base": 10000.0,
    "moe_num_experts": 0,
    "moe_top_k": 0,
    "moe_tp_mode": 2,
    "attn_bias": false,
    "disable_weight_only_quant_plugin": false,
    "mlp_bias": false
}
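To see at a glance which fields the quantize.py output lacks, the two config files can be compared key by key. This is only a sketch: missing_keys is a hypothetical helper, it looks at top-level keys only, and the dictionaries below are abbreviated excerpts of the configs above rather than the full files.

```python
def missing_keys(reference: dict, candidate: dict) -> list[str]:
    """Top-level keys present in reference but absent from candidate, sorted."""
    return sorted(set(reference) - set(candidate))


# Excerpt of the config.json written by convert_checkpoint.py.
from_convert_checkpoint = {
    "architecture": "LlamaForCausalLM",
    "logits_dtype": "float32",
    "rotary_scaling": {"factor": 4.0, "type": "linear"},
    "rotary_base": 10000.0,
    "attn_bias": False,
}

# Excerpt of the config.json written by quantize.py.
from_quantize = {
    "architecture": "LlamaForCausalLM",
    "rotary_base": 10000.0,
}

print(missing_keys(from_convert_checkpoint, from_quantize))
# → ['attn_bias', 'logits_dtype', 'rotary_scaling']
```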
@janpetrov janpetrov added the bug Something isn't working label May 25, 2024
@byshiue
Collaborator

byshiue commented May 29, 2024

Could you share which model you are using?

@janpetrov
Contributor Author

janpetrov commented May 29, 2024

Thank you. It is https://huggingface.co/meta-llama/Llama-2-70b-hf, finetuned (without any change in architecture) and exported in bfloat16.

@byshiue
Collaborator

byshiue commented May 29, 2024

It looks like the rope_scaling of llama-2-70b-hf is null:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000
}

@byshiue byshiue self-assigned this May 29, 2024
@byshiue byshiue added the triaged Issue has been triaged by maintainers label May 29, 2024
@janpetrov
Contributor Author

janpetrov commented May 29, 2024

Apologies for not mentioning this explicitly earlier. We finetuned the model with a changed rope scaling. Below is the config.json of our finetuned model, saved in the Hugging Face format. It lives in the $MODEL_DIR directory referred to above, see the

python examples/quantization/quantize.py \
    --model_dir "$MODEL_DIR"

part.

{
  "_name_or_path": "OUR_PATH_HERE",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.1",
  "use_cache": false,
  "vocab_size": 32000
}

@byshiue
Collaborator

byshiue commented May 30, 2024

Thank you for the reply. I tried changing the config.json of an existing HF model, but that leads to a failure during conversion. So it looks like I cannot change the config directly to reproduce this issue and instead need a finetuned model that was tuned with rope scaling. Do you know of any model with a non-null rope_scaling that would help reproduce the issue?

@janpetrov
Contributor Author

Thank you for your reply. Please give me a few days; I will prepare (simple instructions for obtaining) a model with rope_scaling in its config.json that converts.

@wxsms

wxsms commented Jun 4, 2024

The deepseek-coder 33b model uses rope scaling and the llama architecture, and it has the same problem described here. Maybe you can try this model directly: https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/config.json

@byshiue
Collaborator

byshiue commented Jun 6, 2024

Thank you for sharing. I will give it a try.

@byshiue
Collaborator

byshiue commented Jun 7, 2024

Unfortunately, TRT-LLM does not support deepseek yet, and hence I cannot reproduce the issue on the checkpoint.

@wxsms

wxsms commented Jun 12, 2024

> Unfortunately, TRT-LLM does not support deepseek yet, and hence I cannot reproduce the issue on the checkpoint.

You may use the Llama workflow for Deepseek models. It works for int8 weight-only quantization (engine build + inference), which is provided by llama/convert_checkpoint.py (you have to specify the RoPE params).

However, the FP8 quantization provided by quantization/quantize.py has the same problem described here, i.e. the engine build works but inference generates nonsense.

@chenxu2048

> Thank you for the reply. I tried changing the config.json of an existing HF model, but that leads to a failure during conversion. So it looks like I cannot change the config directly to reproduce this issue and instead need a finetuned model that was tuned with rope scaling. Do you know of any model with a non-null rope_scaling that would help reproduce the issue?

Hi @byshiue.
FYI you can also try this model: https://huggingface.co/Yukang/LongAlpaca-70B.
It is mentioned at https://github.com/NVIDIA/TensorRT-LLM/tree/db4edea/examples/llama#long-context-length.

@byshiue
Collaborator

byshiue commented Jun 13, 2024

Thank you for sharing. We can reproduce the issue now and are investigating it.

@fan-niu

fan-niu commented Jul 15, 2024

@byshiue Any progress? At present, I still have this problem using the latest version of the code (tensorrtllm: a96ccca).

@wxsms

wxsms commented Jul 15, 2024

> @byshiue Any progress? At present, I still have this problem using the latest version of the code (tensorrtllm: a96ccca).

it's fixed for my case, after #1793

@fan-niu

fan-niu commented Jul 15, 2024

> > @byshiue Any progress? At present, I still have this problem using the latest version of the code (tensorrtllm: a96ccca).
>
> it's fixed for my case, after #1793

Thanks for the reply. My model was obtained by finetuning llamav3-8b-instruct. rope_scaling was added during finetuning, which makes the fp8 conversion fail. Is there any way to solve this problem? Thanks @byshiue

@byshiue
Collaborator

byshiue commented Jul 17, 2024

We have added it here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_modelopt.py#L474-L478. Do you still encounter the same issue on the latest main branch? If so, could you try printing some debug messages there to make sure we add rope_scaling, and double-check the config.json after generating the checkpoint?

@fan-niu

fan-niu commented Jul 17, 2024

> We have added it here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_modelopt.py#L474-L478. Do you still encounter the same issue on the latest main branch? If so, could you try printing some debug messages there to make sure we add rope_scaling, and double-check the config.json after generating the checkpoint?

@byshiue First, thanks for your reply. Yes, I still encounter this problem using the latest code. The tensorrtllm version is a96ccca.

This is the conversion error log:

Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to /code/trt_weight_h100/converted/1gpu/float16/llama_3_fp8/modelopt_model.0.pth using torch.save for further inspection.
Detailed export error: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 377, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/model_config_export.py", line 224, in torch_to_tensorrt_llm_checkpoint
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 1191, in build_decoder_config
    config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 659, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py", line 602, in build_linear_config
    torch_weight = module.weight.detach()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Traceback (most recent call last):
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py", line 107, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 459, in quantize_and_export
    with open(f"{export_path}/config.json", "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/code/trt_weight_h100/converted/1gpu/float16/llama_3_fp8/config.json'

Conversion command:
python ./tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py \
    --model_dir ${model_path} \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ${convert_model_path} \
    --calib_size 512 \
    --tp_size 1

This is the Hugging Face model config:
"architectures": [
  "LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128009,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
  "factor": 4.0,
  "type": "linear"
},
"rope_theta": 1000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.42.3",
"use_cache": true,
"vocab_size": 128256

@wxsms

wxsms commented Jul 17, 2024

@fan-niu see step (2) in the first post: edit is_linear in /usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py (adding the and ("Rotary"... part).

@fan-niu

fan-niu commented Jul 17, 2024

> @fan-niu see step (2) in the first post: edit is_linear in /usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py (adding the and ("Rotary"... part).

@wxsms thanks, but I can't find this package:
ls: cannot access '/usr/local/lib/python3.10/dist-packages/ammo': No such file or directory

Do I need to install this package?

On my side, I manually built the image from the latest tensorrtllm_backend code (commit 6053a5d) and then performed the fp8 conversion inside that image.

@wxsms

wxsms commented Jul 17, 2024

> > @fan-niu see step (2) in the first post: edit is_linear in /usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py (adding the and ("Rotary"... part).
>
> @wxsms thanks, but I can't find this package: ls: cannot access '/usr/local/lib/python3.10/dist-packages/ammo': No such file or directory
>
> Do I need to install this package?

it was renamed to modelopt, i.e. /usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py
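Because the package name differs between releases (ammo in the original report, modelopt here), the file to patch can be located programmatically instead of hard-coding the path. This is a rough sketch using only the standard library; find_module_file is a hypothetical helper. Note that importlib.util.find_spec imports the parent packages of a dotted name, so this has to run in the same environment as quantize.py and may be slow on first call.

```python
import importlib.util
from typing import Optional, Sequence


def find_module_file(candidates: Sequence[str]) -> Optional[str]:
    """Return the source path of the first module in candidates that can be found."""
    for name in candidates:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # The parent package is not installed, or it has no usable spec.
            continue
        if spec is not None and spec.origin:
            return spec.origin
    return None


# Prints the path of layer_utils.py, or None if neither package is installed.
print(find_module_file([
    "modelopt.torch.export.layer_utils",  # newer package name
    "ammo.torch.export.layer_utils",      # older package name
]))
```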

@fan-niu

fan-niu commented Jul 17, 2024

> /usr/local/lib/python3.10/dist-packages/modelopt/torch/export/layer_utils.py

@wxsms Cool, thanks. After changing this code as well, the conversion works; thank you so much. But when I go on to build the TensorRT engine, I get this error:

`[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070900
[07/17/2024-09:59:20] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set gemm_plugin to auto.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set lookup_plugin to None.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set lora_plugin to None.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set moe_plugin to auto.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set context_fmha to True.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set remove_input_padding to True.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set reduce_fusion to False.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set multi_block_mode to False.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set enable_xqa to True.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set multiple_profiles to False.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set paged_state to True.
[07/17/2024-09:59:20] [TRT-LLM] [I] Set streamingllm to False.
[07/17/2024-09:59:20] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.13.1'}
[07/17/2024-09:59:20] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[07/17/2024-09:59:20] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[07/17/2024-09:59:20] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[07/17/2024-09:59:20] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[07/17/2024-09:59:20] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[07/17/2024-09:59:20] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[07/17/2024-09:59:20] [TRT-LLM] [W] max_seq_len is scaled to 32768.0 by rotary scaling 4.0
[07/17/2024-09:59:20] [TRT-LLM] [I] max_seq_len is not specified, using value 32768.0
[07/17/2024-09:59:20] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[07/17/2024-09:59:20] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
[07/17/2024-09:59:20] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/17/2024-09:59:20] [TRT-LLM] [I] Set dtype to float16.
[07/17/2024-09:59:20] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[... the same "Parameter dtype is None" warning is repeated 31 more times ...]
[07/17/2024-09:59:20] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 402, GPU 529 (MiB)
[07/17/2024-09:59:24] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4497, GPU +1242, now: CPU 5046, GPU 1771 (MiB)
[07/17/2024-09:59:24] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/17/2024-09:59:24] [TRT-LLM] [I] Set nccl_plugin to None.
[07/17/2024-09:59:24] [TRT-LLM] [I] Set use_custom_all_reduce to True.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 537, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 359, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 326, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 319, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 912, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 198, in decorated
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 380, in build_engine
    self._add_optimization_profile(network, builder_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 275, in _add_optimization_profile
    profile.set_shape(input_name, min_shape, opt_shape, max_shape)
TypeError: set_shape(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt_bindings.tensorrt.IOptimizationProfile, input: str, min: tensorrt_bindings.tensorrt.Dims, opt: tensorrt_bindings.tensorrt.Dims, max: tensorrt_bindings.tensorrt.Dims) -> None

Invoked with: <tensorrt_bindings.tensorrt.IOptimizationProfile object at 0x7f734cd733b0>, 'cache_indirection', [1, 1, 1], [32, 1, 16384.0], [64, 1, 32768.0]
`

This is the engine build script:
trtllm-build --checkpoint_dir $convert_model_path \
    --output_dir ${trt_model_path} \
    --remove_input_padding enable \
    --gemm_plugin auto \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 32768 \
    --use_fp8_context_fmha enable

@byshiue Please help look into this problem, thanks.
@byshiue Could you please consider including this change in the latest code? Applying it as a hotfix makes it harder to build the image and install the package. Thanks.
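One possible reading of the TypeError above, offered as an assumption rather than a confirmed diagnosis: the log reports max_seq_len as 32768.0, which matches max_position_embeddings (8192) multiplied by the float rotary scaling factor (4.0), and the shapes passed to set_shape contain the floats 16384.0 and 32768.0 where integer dimensions are expected. The sketch below reproduces only the arithmetic; scaled_max_seq_len is a hypothetical helper and not TensorRT-LLM code.

```python
def scaled_max_seq_len(max_position_embeddings: int, rotary_factor: float) -> int:
    """Scale the context length by the rotary factor and return an int."""
    return int(max_position_embeddings * rotary_factor)


# Multiplying an int by a float factor yields a float, as in the log above.
raw = 8192 * 4.0
print(raw, type(raw).__name__)  # 32768.0 float

# Casting back to int gives a value usable as a tensor dimension.
fixed = scaled_max_seq_len(8192, 4.0)
print(fixed, type(fixed).__name__)  # 32768 int
```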

@fan-niu

fan-niu commented Jul 18, 2024

@kaiyux @byshiue Thank you for mentioning this version of tensorrtllm, but when I built the engine with this version of the code, I still hit the same error as before. Could you suggest a solution? Thanks

tensorrtllm version:
commit ab49b93718b906030bcec0c817b10ebb373d4179 (HEAD -> rel, origin/rel)

convert script:
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${model_path} --output_dir ${convert_model_path} --dtype ${dtype} --smoothquant 0.5 --per_token --per_channel --1
trtllm-build --checkpoint_dir $convert_model_path \
    --output_dir ${trt_model_path} \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 32768 \
    --gpt_attention_plugin float16

This is the Hugging Face model config:
"architectures": [
  "LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128009,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
  "factor": 4.0,
  "type": "linear"
},
"rope_theta": 1000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.42.3",
"use_cache": true,
"vocab_size": 128256

convert engine error log:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards: 100%|██████████| 7/7 [00:26<00:00, 3.83s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1491: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|██████████| 9.27k/9.27k [00:00<00:00, 30.3MB/s]
Downloading readme: 100%|██████████| 13.9k/13.9k [00:00<00:00, 27.6MB/s]
Downloading data: 100%|██████████| 159M/159M [00:01<00:00, 93.7MB/s]
Downloading data: 100%|██████████| 376M/376M [00:04<00:00, 81.9MB/s]
Downloading data: 2.11MB [00:00, 69.4MB/s]
Downloading data: 46.4MB [00:00, 73.4MB/s]
Downloading data: 2.43MB [00:00, 77.3MB/s]
Generating train split: 287113 examples [01:02, 4616.50 examples/s]
Generating validation split: 13368 examples [00:02, 4784.73 examples/s]
Generating test split: 11490 examples [00:02, 4388.12 examples/s]
calibrating model: 0%| | 0/512 [00:00<?, ?it/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|██████████| 512/512 [00:47<00:00, 10.74it/s]
Weights loaded. Total time: 00:00:06
Total time of converting checkpoints: 00:04:53
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/18/2024-08:30:28] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set lookup_plugin to None.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set lora_plugin to None.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set moe_plugin to auto.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set context_fmha to True.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set remove_input_padding to True.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set reduce_fusion to False.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set multi_block_mode to False.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set enable_xqa to True.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set multiple_profiles to False.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set paged_state to True.
[07/18/2024-08:30:28] [TRT-LLM] [I] Set streamingllm to False.
[07/18/2024-08:30:28] [TRT-LLM] [W] max_seq_len is scaled to 32768.0 by rotary scaling 4.0
[07/18/2024-08:30:28] [TRT-LLM] [I] max_seq_len is not specified, using value 32768.0
[07/18/2024-08:30:28] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[07/18/2024-08:30:28] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
[07/18/2024-08:30:28] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[07/18/2024-08:30:31] [TRT-LLM] [I] Set dtype to float16.
[07/18/2024-08:30:31] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 272, GPU 423 (MiB)
[07/18/2024-08:30:35] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1930, GPU +354, now: CPU 2350, GPU 777 (MiB)
[07/18/2024-08:30:35] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/18/2024-08:30:35] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to float16.
[07/18/2024-08:30:35] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[07/18/2024-08:30:35] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[07/18/2024-08:30:35] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[07/18/2024-08:30:35] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[07/18/2024-08:30:35] [TRT-LLM] [I] Set nccl_plugin to None.
[07/18/2024-08:30:35] [TRT-LLM] [I] Set use_custom_all_reduce to True.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 551, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 373, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 340, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 333, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 912, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 201, in decorated
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 380, in build_engine
    self._add_optimization_profile(network, builder_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 275, in _add_optimization_profile
    profile.set_shape(input_name, min_shape, opt_shape, max_shape)
TypeError: set_shape(): incompatible function arguments. The following argument types are supported:
1. (self: tensorrt_bindings.tensorrt.IOptimizationProfile, input: str, min: tensorrt_bindings.tensorrt.Dims, opt: tensorrt_bindings.tensorrt.Dims, max: tensorrt_bindings.tensorrt.Dims) -> None

Invoked with: <tensorrt_bindings.tensorrt.IOptimizationProfile object at 0x7f9c42f420f0>, 'cache_indirection', [1, 1, 1], [16, 1, 16384.0], [32, 1, 32768.0]
`
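
The `Invoked with:` line shows float dimensions (`16384.0`, `32768.0`) in the `opt` and `max` shapes, which matches the earlier warning that `max_seq_len` was scaled to `32768.0` by rotary scaling `4.0`. A minimal sketch of how such a float can end up in a shape list; the pre-scaling length of 8192 is only inferred from 32768.0 / 4.0, and `non_int_dims` is a hypothetical helper, not part of TensorRT-LLM:

```python
# Values taken from the log; 8192 is inferred (32768.0 / 4.0) and is an
# assumption about the model config, not something the log states directly.
max_position_embeddings = 8192
rotary_scaling_factor = 4.0

# int * float yields a float, even when the result is a whole number.
max_seq_len = max_position_embeddings * rotary_scaling_factor
max_shape = [32, 1, max_seq_len]


def non_int_dims(shape):
    """Return (index, value) pairs for dimensions that are not plain ints."""
    return [(i, d) for i, d in enumerate(shape) if not isinstance(d, int)]


print(max_seq_len)              # 32768.0
print(non_int_dims(max_shape))  # [(2, 32768.0)]
```

Running a check like this on `min_shape`, `opt_shape` and `max_shape` before the `set_shape` call would point at the offending dimension directly.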

fan-niu commented Jul 18, 2024

@kaiyux @byshiue @janpetrov I also added the hotfix code after /usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py:276, and with it the engine builds successfully. But why is the output of the converted engine so poor? For comparison, I deployed the same model (before conversion) with vLLM, and the vLLM output was completely normal.

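Hotfix code:
# Cast every dimension to int so that trt.Dims accepts the shapes (float values such as 32768.0 trigger the TypeError above).
min_shape = trt.Dims([int(x) for x in min_shape])
opt_shape = trt.Dims([int(x) for x in opt_shape])
max_shape = trt.Dims([int(x) for x in max_shape])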
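
One caveat with a plain `int(x)` cast is that it silently truncates a value such as `16384.5`. A slightly stricter variant is sketched below, assuming the shapes are ordinary Python lists of numbers; `to_int_dims` is a hypothetical helper name, and its result would still be wrapped in `trt.Dims(...)` as in the hotfix:

```python
def to_int_dims(shape):
    """Convert each dimension to int, rejecting values that are not whole numbers."""
    dims = []
    for d in shape:
        # 32768.0 passes this check; 16384.5 would be rejected instead of truncated.
        if float(d) != int(d):
            raise ValueError(f"dimension {d!r} is not a whole number")
        dims.append(int(d))
    return dims


# Shapes copied from the "Invoked with:" line of the traceback.
print(to_int_dims([1, 1, 1]))         # [1, 1, 1]
print(to_int_dims([16, 1, 16384.0]))  # [16, 1, 16384]
print(to_int_dims([32, 1, 32768.0]))  # [32, 1, 32768]
```

This only addresses the `TypeError` at build time; it does not by itself explain the poor generation quality reported below.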

TensorRT-LLM engine output:
output1: Contract Clarification and Research Discussion Planned */>\\\"\</assistant\_"\>\"\"\\\\"\\\)
output2: Account and Support Instructions Discussion \u00bb] \n"Account Profile and Support Guidance JC Shares"\n """\n"""

vLLM output:
output1: Contract Clarification and Research Discussion Planned
output2: Account and Support Instructions Discussion
