llama2-7b bad results for int8-kv-cache + per-channel-int8-weight #967
Comments
How about testing without int8_kv_cache?
@Tracin
So I think it is similar to #889.
Is int8-weight + int8-kv-cache MMLU accuracy tested on any model in your experiments? For quantization, this case is the official example of TensorRT-LLM.
Yes, we tested int8-kv and int8 weight-only separately with LLaMA1-7B, and the MMLU score is similar to FP16. According to your test, LLaMA2-7B + int8-kv has bad accuracy, right? I will check it.
@Tracin
Thanks! I think we can remove int8 weight-only for easier debugging. You also mentioned your own quantization code; did you use the same kv_cache_scaling_factors?
@Tracin
@Tracin
Launching a larger GEMM can be more efficient than launching three small kernels.
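(For illustration, a minimal NumPy sketch of the idea behind the fused QKV GEMM; the shapes and weights below are made up, and this is not TensorRT-LLM's actual kernel. Concatenating the Q, K and V weight matrices lets one larger matmul replace three smaller ones.)

```python
import numpy as np

hidden = 256
x = np.random.randn(8, hidden).astype(np.float32)          # [tokens, hidden]
wq = np.random.randn(hidden, hidden).astype(np.float32)
wk = np.random.randn(hidden, hidden).astype(np.float32)
wv = np.random.randn(hidden, hidden).astype(np.float32)

# Three separate projections: three kernel launches.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused projection: concatenate the weights once, launch a single GEMM,
# then slice the output back into Q, K and V.
w_qkv = np.concatenate([wq, wk, wv], axis=1)                # [hidden, 3*hidden]
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)

# Results match up to float32 summation-order differences.
assert np.allclose(q, q2, atol=1e-3) and np.allclose(k, k2, atol=1e-3) and np.allclose(v, v2, atol=1e-3)
```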
With separate K and V scales (per-tensor, static), the accuracy is fine. Have you done any experiments on llama2-7b int8-weight + int8-kv-cache?
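(For reference, a minimal sketch of what per-tensor, static int8 scaling of the KV cache means, with separate scales for K and V. The calibration data, shapes, and helper names here are hypothetical; they are not the scaling factors hf_llama_convert.py actually produces.)

```python
import numpy as np

def calibrate_scale(samples):
    """One static scale per tensor: max absolute value over calibration data / 127."""
    return max(np.abs(s).max() for s in samples) / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical calibration batches for K and V activations.
k_calib = [np.random.randn(16, 128).astype(np.float32) for _ in range(8)]
v_calib = [np.random.randn(16, 128).astype(np.float32) for _ in range(8)]
k_scale, v_scale = calibrate_scale(k_calib), calibrate_scale(v_calib)

# Round-trip a new K tensor through int8 with its static scale.
k = np.random.randn(16, 128).astype(np.float32)
k_err = np.abs(dequantize(quantize(k, k_scale), k_scale) - k).max()
print(f"k_scale={k_scale:.4f} v_scale={v_scale:.4f} max abs error={k_err:.4f}")
```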
I don't quite understand.
@brisker I mean that when using per-tensor weight quantization mode for SQ, QKV has three different weight scales.
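(A hedged illustration of why the three per-tensor weight scales matter: if the fused QKV weight were forced onto one shared scale, the projection with the smallest dynamic range would lose the most precision. The weight ranges and the max-merge rule below are assumptions for illustration only, not TensorRT-LLM's actual fusion logic.)

```python
import numpy as np

def per_tensor_int8_roundtrip(w, scale):
    """Quantize to int8 with one scale for the whole tensor, then dequantize."""
    return np.clip(np.round(w / scale), -128, 127).astype(np.int8) * scale

# Hypothetical Q/K/V weights with very different dynamic ranges.
wq = np.random.randn(64, 64).astype(np.float32) * 1.0
wk = np.random.randn(64, 64).astype(np.float32) * 0.1
wv = np.random.randn(64, 64).astype(np.float32) * 0.5

own_scales = [np.abs(w).max() / 127 for w in (wq, wk, wv)]
shared_scale = max(own_scales)   # one scale for the fused [hidden, 3*hidden] weight

for name, w, s in zip("QKV", (wq, wk, wv), own_scales):
    err_own = np.abs(per_tensor_int8_roundtrip(w, s) - w).mean()
    err_shared = np.abs(per_tensor_int8_roundtrip(w, shared_scale) - w).mean()
    print(f"{name}: mean abs error own-scale={err_own:.5f} shared-scale={err_shared:.5f}")
```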
@Tracin
The code above, which is the SmoothQuant MMLU test, gives me 37.7 accuracy, while the FP16 accuracy is 45.9. So far, SmoothQuant W8A8 and int8-kv-cache both seem to have bugs, with bad accuracy. Have you confirmed any bugs?
Thanks for your reply, it is very clear to me now. I will reproduce and fix it ASAP.
@Tracin Just regard this as a cross-check.
@Tracin
@Tracin I use the bin files generated by the command above to build a weight-only-quantized TRT engine like this (build command not shown), but the MMLU test accuracy is also bad. However, if I directly build a weight-only-quantized TRT engine like this (build command not shown), the accuracy is fine. So this is very weird. https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.1/examples/llama/build.py#L690
@kaiyux Given the discussions in this issue about Llama2-7B, the accuracy drop is about 8% in my experiments, which is not reasonable.
@brisker Hi, since you can reproduce good accuracy on LLaMA-7B and bad accuracy on LLaMA2-7B using SQ and INT8-KV respectively, it is clear that the different model parameters cause the difference, so there are no actual bugs, right? You can use AMMO to see if it produces better accuracy.
@Tracin
@Tracin
@Tracin So I do not think the "there are no actual bugs" conclusion you mentioned is convincing.
I mean if you want to test accuracy and compare to papers, please use |
@Tracin
@brisker As for the SQ problem, there is a bug; you can fix it manually, and I will push an MR later.
@Tracin
Sure, please use the latest main branch.
I will check that. Please try using the latest branch, or my branch is
System Info
3090 GPU
TensorRT-LLM 0.7.1
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
python hf_llama_convert.py -i /root/models/Llama-2-7b/ -o /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/ --calibrate-kv-cache -t fp16
python build.py --bin_model_dir /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/bin_model_dir/ --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --output_dir /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/1-gpu --int8_kv_cache --use_weight_only
python mmlu.py --hf_model_dir /root/models/Llama-2-7b/ --engine_dir /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/1-gpu/ --test_trt_llm
(mmlu.py is provided by TensorRT-LLM here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/mmlu.py)
Unfortunately, step 3 gives a final MMLU accuracy of 38.4, while the FP16 accuracy is 45.9, which is very bad. According to some LLM quantization papers, the accuracy should not drop this much in this case.
the config.json generated by build.py is something like this:
Is there any bug in the quantization code?
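(As a rough standalone cross-check, not part of the original reproduction: the sketch below measures the round-trip error of per-channel int8 quantization on a single Llama-2 projection weight, independent of TensorRT-LLM, to see how much error weight quantization alone introduces. The model path reuses the one from the commands above; the layer access follows the Hugging Face Llama module layout and is an assumption.)

```python
import torch
from transformers import AutoModelForCausalLM

# Load the same HF checkpoint used in the reproduction steps.
model = AutoModelForCausalLM.from_pretrained(
    "/root/models/Llama-2-7b/", torch_dtype=torch.float16)

# Pick one attention projection weight (HF Llama layout, assumed).
w = model.model.layers[0].self_attn.q_proj.weight.float()   # [out, in]

# Per-channel int8: one scale per output channel, then quantize/dequantize.
scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_int8 = torch.clamp(torch.round(w / scale), -128, 127)
rel_err = (w_int8 * scale - w).norm() / w.norm()
print(f"per-channel int8 relative weight error: {rel_err.item():.4e}")
```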
Expected behavior
The MMLU accuracy should not drop this much.
actual behavior
The MMLU accuracy drops significantly (38.4 vs. 45.9 for FP16).
additional notes
no more