Add Quantized_model + float LoRA model scenario to model builder #1043

apsonawane · 2024-11-07T01:24:47Z

Add Quantized_model + float LoRA model scenario to model builder

src/python/py/models/quantized_model.py

src/python/py/models/builder.py

jambayk · 2024-11-08T05:49:57Z

src/python/py/models/builder.py

@@ -437,7 +441,7 @@ def save_model(self, out_dir):
        # Quantize ONNX model to desired precision
        # TODO: Replace by quantizing the MatMuls as they are created
        already_quantized_in_qdq_format = self.quant_type is not None and self.quant_attrs["use_qdq"]  # Skip quantizing `MatMul` in `DequantizeLinear --> Transpose --> MatMul` path
-        if self.onnx_dtype == "int4" and not already_quantized_in_qdq_format:
+        if self.onnx_dtype == "int4" and not already_quantized_in_qdq_format and not self.matmul_attrs["use_lora"]:


The MatMul 4-bit quantizer has an option to exclude nodes from quantization: https://github.com/microsoft/onnxruntime/blob/e7987a6b0ba429c0bec248c4a471e1782da4be6c/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L1342

Maybe instead of a flag, you could keep a set of the LoRA MatMul names and pass it to the quantizer? Otherwise, if the user provides a float base model + float adapters with int4 as the precision, the output model will be fully float. But in that case you might still want to quantize the base model?
Also, for a quantized base model + float adapters, you might want to quantize the LM head as in #940? I'm not sure what effect always quantizing the LM head has on accuracy, though.
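
A minimal sketch of the "set of names" idea, in plain Python so it runs without onnxruntime. It assumes LoRA MatMul nodes can be recognized by `lora_A` / `lora_B` in their names; the node names, `LORA_NODE_PATTERN`, and `collect_lora_matmul_names` are hypothetical and not taken from the builder. The resulting set is what would be handed to the quantizer's exclusion option linked above (`nodes_to_exclude` on `MatMul4BitsQuantizer`, as far as I can tell).

```python
import re

# Assumption: LoRA adapter MatMuls carry "lora_A" or "lora_B" in their node names.
LORA_NODE_PATTERN = re.compile(r"lora_[AB]")


def collect_lora_matmul_names(nodes):
    """Return the names of MatMul nodes that belong to LoRA adapters.

    `nodes` is an iterable of (name, op_type) pairs, a stand-in for the
    nodes of an ONNX graph.
    """
    return {
        name
        for name, op_type in nodes
        if op_type == "MatMul" and LORA_NODE_PATTERN.search(name)
    }


# Hypothetical graph: one base projection, its two LoRA MatMuls, and an Add.
nodes = [
    ("/model/layers.0/attn/q_proj/MatMul", "MatMul"),
    ("/model/layers.0/attn/q_proj/lora_A/MatMul", "MatMul"),
    ("/model/layers.0/attn/q_proj/lora_B/MatMul", "MatMul"),
    ("/model/layers.0/attn/q_proj/Add", "Add"),
]

lora_names = collect_lora_matmul_names(nodes)

# Only base-model MatMuls remain candidates for int4 quantization.
to_quantize = [
    name for name, op_type in nodes
    if op_type == "MatMul" and name not in lora_names
]
```

With this shape, a float base + float adapters + int4 request would still quantize the base MatMuls and leave only the adapters in float, instead of skipping quantization entirely.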

jambayk · 2024-11-08T05:55:03Z

src/python/py/models/quantized_model.py

@@ -334,23 +369,51 @@ def __init__(self, quant_type, input_path, bits, group_size, q_size, kv_size, in
                            # model.layers.layer_id.mlp.dense_h_to_4h.bias
                            module.mlp.gate_proj.bias = tensor[: intermediate_size]
                            module.mlp.down_proj.bias = tensor[intermediate_size: ]
+                        elif bool(re.match(r"^model.layers\.\d+\.self_attn.q_proj.lora_A\.weight$", name)):


These patterns only cover Llama-type models. Phi-3 has `qkv_proj` and `gate_up_proj` instead of separate `q_proj`, `k_proj`, `v_proj`, `gate_proj`, and `up_proj`.

Yes, Kunal also mentioned that. We can add support for that as well.
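
One way to cover both layouts with a single pattern, as a rough sketch rather than what the PR does. It assumes tensor names follow the `model.layers.<id>.<block>.<proj>.lora_<A|B>.weight` shape seen in the diff; `LORA_WEIGHT_PATTERN` and `parse_lora_name` are hypothetical names. The dots are escaped throughout, since an unescaped `.` in a regex matches any character.

```python
import re

# Assumed name shape: model.layers.<id>.<block>.<proj>.lora_<A|B>.weight
# Covers split projections (Llama-type) and fused ones (Phi-3-type).
LORA_WEIGHT_PATTERN = re.compile(
    r"^model\.layers\.(?P<layer>\d+)\."
    r"(?P<block>self_attn|mlp)\."
    r"(?P<proj>q_proj|k_proj|v_proj|qkv_proj|o_proj"
    r"|gate_proj|up_proj|gate_up_proj|down_proj)\."
    r"(?P<adapter>lora_[AB])\.weight$"
)


def parse_lora_name(name):
    """Split a LoRA weight name into its parts, or return None if it does not match."""
    match = LORA_WEIGHT_PATTERN.match(name)
    return match.groupdict() if match else None


# Phi-3-style fused attention projection.
fused = parse_lora_name("model.layers.3.self_attn.qkv_proj.lora_A.weight")

# Llama-style split attention projection.
split = parse_lora_name("model.layers.0.self_attn.q_proj.lora_B.weight")
```

The pattern does not check that a projection belongs to its block (for example, it would accept `self_attn.gate_proj`), so a stricter version might use one pattern per block.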

Add Quantized_model + float LoRA model scenario to model builder

5746896

github-advanced-security bot found potential problems Nov 7, 2024

kunal-vaishnavi and others added 3 commits November 7, 2024 04:45

Re-use peft attributes

21b33dd

Fix GPTQModel function

728c651

Update scaling factor

0dadb37

apsonawane marked this pull request as ready for review November 7, 2024 19:02

apsonawane requested a review from kunal-vaishnavi November 7, 2024 19:02