
qserve convert checkpoint raises an error #2507

Open
anaivebird opened this issue Nov 27, 2024 · 1 comment
Labels
bug Something isn't working

Comments


anaivebird commented Nov 27, 2024

System Info

  • GPU: NVIDIA H100 80G
  • TensorRT-LLM branch: main
  • TensorRT-LLM commit: 535c9cc

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir ./llama2-7b
git clone https://github.com/mit-han-lab/deepcompressor

cd /root/deepcompressor

conda env create -f environment.yml
poetry install

python -m deepcompressor.app.llm.ptq \
    examples/llm/configs/qoq-g128.yaml \
    --model-name llama-2-7b --model-path /root/llama2-7b \
    --smooth-proj-alpha 0 --smooth-proj-beta 1 \
    --smooth-attn-alpha 0.5 --smooth-attn-beta 0 \
    --save-model /root/quantized-llama2-7b

export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b  \
                             --dtype float16  \
                             --quant_ckpt_path  /root/quantized-llama2-7b \
                             --use_qserve  \
                             --per_group  \
                             --tp_size 1

Expected behavior

No error.

Actual behavior


user@/app/tensorrt_llm/examples/llama$ export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b  \
                             --dtype float16  \
                             --quant_ckpt_path  /root/quantized-llama2-7b \
                             --use_qserve  \
                             --per_group  \
                             --tp_size 1

[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
0.16.0.dev2024111900
[11/27/2024-11:19:05] [TRT-LLM] [I] Loading weights from lmquant torch checkpoint for QServe W4A8 inference...
[11/27/2024-11:19:12] [TRT-LLM] [I] Processing weights in layer: 0
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 555, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 547, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 488, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 495, in execute
    f(args, rank)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 472, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 416, in from_hugging_face
    weights = load_weights_from_lmquant(quant_ckpt_path, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2086, in load_weights_from_lmquant
    process_weight_and_params(qkv, f'{tllm_prex}.attention.qkv'))
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2015, in process_weight_and_params
    qweight = qserve_quantize_weight_per_group(weight, s1_scales,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 328, in qserve_quantize_weight_per_group
    linear_weight.max() <= 15), "Stage 2: Quantized weight out of range"
AssertionError: Stage 2: Quantized weight out of range

Additional notes

None.

anaivebird added the bug label on Nov 27, 2024

bobboli commented Nov 27, 2024

Hi, this is due to the update from lmquant to deepcompressor. We have updated our conversion scripts accordingly, and they will be merged in the next release.

For now, if you want to try QServe, please use the old lmquant version: https://github.com/mit-han-lab/deepcompressor/blob/lmquant-v0.0.0-deprecated/projects/llm/README.md
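
For reference, a minimal sketch of that workaround (the branch name and path are taken from the linked README's URL; the exact install and quantization commands are whatever that branch's README specifies):

git clone https://github.com/mit-han-lab/deepcompressor
cd deepcompressor
git checkout lmquant-v0.0.0-deprecated   # the old lmquant code lives on this branch
cd projects/llm
# Install and run the quantization by following this branch's README, saving the
# W4A8 QoQ checkpoint to a directory such as /root/quantized-llama2-7b, then
# re-run convert_checkpoint.py with --quant_ckpt_path pointing at that directory,
# exactly as in the reproduction above.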
