Fix llama conversion with smooth quant #1650

lopuhin · 2024-05-22T18:52:19Z

This PR fixes a few errors which appear when following the README https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#smoothquant on current latest commit.

Note: the first commit looks quite obvious (although I'm not sure how this could have worked before), while I'm less sure about the second, I just just going by error messages during engine conversion, there might be a better place for the fix. So feel free to treat this as a bug report instead. I verified that engine built in this way has reasonable outputs and expected performance. The model I was testing this on is Mistral 7B (mistral-7b-v0.1-instruct) but I assume other llama 2 and 3 should also work (didn't get to llama 3 yet).

without this it errors out with: Traceback (most recent call last): File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 456, in <module> main() File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 448, in main convert_and_save_hf(args) File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 353, in convert_and_save_hf LLaMAForCausalLM.quantize(args.model_dir, File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 405, in quantize convert.quantize( File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1395, in quantize weights = load_weights_from_hf( File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1437, in load_weights_from_hf weights = convert_hf_llama( File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1089, in convert_hf_llama convert_layer(l) File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 725, in convert_layer get_tllm_linear_sq_weight(int8_weights, File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 610, in get_tllm_linear_sq_weight results[prefix + 'per_channel_scale'] = torch.Tensor([ ValueError: only one element tensors can be converted to Python scalars we can also check the shapes: cur_per_channel_value.shape -> torch.Size([6144]) col_shape -> [1, 6144] so it's clear that we meant to convert the tensor without []

with these the model works and provides sensible output

kaiyux · 2024-05-28T12:32:28Z

Hi @lopuhin , the changes are integrated in #1688 and we've credited you as co-author, hence I'm closing this PR now, thanks a lot.

lopuhin · 2024-06-04T15:52:04Z

hi @kaiyux great, thank you! I think only the first commit was integrated, but the two others were not, but they are also required -- although they fix the error which would happen when running the engine. I'm experimenting with smooth quant llama 3 right now and need all commits to get it working. Do you mind having another look?

lopuhin added 3 commits May 22, 2024 11:00

Merge branch 'main' into fix-llama-smooth-quant

b8d424f

two more fixes for int8 sq with llama (mistral 7b)

2c1c648

with these the model works and provides sensible output

kaiyux mentioned this pull request May 28, 2024

Update TensorRT-LLM #1688

Merged

kaiyux closed this May 28, 2024

kaiyux mentioned this pull request Jul 17, 2024

TensorRT-LLM v0.11 Update #1969

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix llama conversion with smooth quant #1650

Fix llama conversion with smooth quant #1650

lopuhin commented May 22, 2024

kaiyux commented May 28, 2024

lopuhin commented Jun 4, 2024

Fix llama conversion with smooth quant #1650

Fix llama conversion with smooth quant #1650

Conversation

lopuhin commented May 22, 2024

kaiyux commented May 28, 2024

lopuhin commented Jun 4, 2024