update readme for v0.3.1 release (#283)
wenhuach21 authored Oct 21, 2024
1 parent 68138e8 commit a359222
Showing 3 changed files with 91 additions and 107 deletions.
192 changes: 88 additions & 104 deletions README.md
@@ -5,7 +5,7 @@ AutoRound
<h3> Advanced Quantization Algorithm for LLMs</h3>

[![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.3-green)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.3.1-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/auto-round/blob/main/LICENSE)
---
<div align="left">
@@ -29,7 +29,8 @@ more accuracy data and recipes across various models.

* [2024/10] Important update: We now support full-range symmetric quantization and have made it the default
configuration. This approach is typically better or comparable to asymmetric quantization and significantly
outperforms other symmetric variants, especially at low bit-widths like 2-bit. Also, compiling from source is no
longer required to run the AutoRound format.
* [2024/09] AutoRound format supports several LVM models, check out the
examples [Qwen2-Vl](./examples/multimodal-modeling/Qwen-VL),[Phi-3-vision](./examples/multimodal-modeling/Phi-3-vision), [Llava](./examples/multimodal-modeling/Llava)
@@ -56,6 +57,70 @@ pip install auto-round

## Model Quantization

### Basic Usage (Gaudi2/CPU/GPU)

A user guide detailing the full list of supported arguments is available by calling ```auto-round -h``` in the terminal.
Alternatively, you can use ```auto_round``` instead of ```auto-round```.

```bash
auto-round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format auto_round \
--disable_eval \
--output_dir ./tmp_autoround
```

We provide two recipes: one for the best accuracy and one for faster quantization with lower memory usage. Details are below.
<details>
<summary>Other Recipes</summary>

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 512 \
--iters 1000 \
--low_gpu_mem_usage \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 128 \
--iters 200 \
--seqlen 512 \
--batch_size 4 \
--disable_eval
```

</details>

#### Formats

**AutoRound Format**: This format is well-suited for CPU and HPU devices, 2-bit quantization, and mixed-precision
inference. [2,4] bits are supported. It also benefits from the Marlin kernel, which can boost inference performance
notably. However, it has not yet gained widespread community adoption. For CUDA support, you will need to install
from source.

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the
community. [2,3,4,8] bits are supported; for 3 bits, run pip install auto-gptq before quantization. It also benefits
from the Marlin kernel, which can boost inference performance notably. However, **the asymmetric kernel has issues**
that can cause considerable accuracy drops, particularly at 2-bit quantization and with small models. Additionally,
symmetric quantization tends to perform poorly at 2-bit precision.

**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted
within the community; only 4-bit quantization is supported. It features specialized layer fusion tailored for Llama
models.
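
To make the format choice concrete, here is a minimal end-to-end sketch that quantizes a small model and exports it to
each of the formats above. It is only a sketch that reuses the same API as the API Usage example below; the model name
and output directory mirror the other examples in this README, and only one of the `save_quantized` calls should be
left uncommented per run.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit, group size 128, full-range symmetric quantization (the current default)
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

output_dir = "./tmp_autoround"
# 'auto_round' (default in version>0.3.0): CPU/HPU, 2 bits and mixed-precision inference
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
# 'auto_gptq': symmetric quantization on CUDA, widely adopted by the community
# autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)
# 'auto_awq': asymmetric 4-bit quantization on CUDA, with Llama-specific layer fusion
# autoround.save_quantized(output_dir, format='auto_awq', inplace=True)
```

The exported directory can then be loaded with the matching inference recipe in the Model Inference section below.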

### API Usage (Gaudi2/CPU/GPU)

```python
@@ -67,18 +132,18 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
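# NOTE: the opening lines of this example are collapsed in the diff view above; they presumably
# load the model and tokenizer roughly as follows (model name taken from the CLI example):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model_name = "facebook/opt-125m"
# model = AutoModelForCausalLM.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)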
from auto_round import AutoRound
bits, group_size = 4, 128
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size)
bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)
## the best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size)
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size)
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym )
autoround.quantize()
output_dir = "./tmp_autoround"
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq'(default in version<=0.3.0), 'auto_awq'
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq', 'auto_awq'
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```

@@ -134,103 +199,26 @@ autoround.save_quantized(output_dir, format='auto_round', inplace=True)

</details>

### Basic Usage (version > 0.3.0)

A user guide detailing the full list of supported arguments is available by calling ```auto_round -h``` in the terminal.
Alternatively, you can use ```auto-round``` instead of ```auto_round```.

```bash
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format auto_round \
--disable_eval \
--output_dir ./tmp_autoround
```

We provide two recipes: one for the best accuracy and one for faster quantization with lower memory usage. Details are below.
<details>
<summary>Other Recipes</summary>

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 512 \
--iters 1000 \
--low_gpu_mem_usage \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 128 \
--iters 200 \
--seqlen 512 \
--batch_size 4 \
--disable_eval
```

</details>

#### Formats

**AutoRound Format**: This format is well-suited for CPU and HPU devices, 2-bit quantization, and mixed-precision
inference. [2,4] bits are supported. It also benefits from the Marlin kernel, which can boost inference performance
notably. However, it has not yet gained widespread community adoption. For CUDA support, you will need to install
from source.

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the
community. [2,3,4,8] bits are supported; for 3 bits, run pip install auto-gptq before quantization. It also benefits
from the Marlin kernel, which can boost inference performance notably. However, **the asymmetric kernel has issues**
that can cause considerable accuracy drops, particularly at 2-bit quantization and with small models. Additionally,
symmetric quantization tends to perform poorly at 2-bit precision.

**AutoAWQ Format**(>0.3.0): This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely
adopted within the community; only 4-bit quantization is supported. It features specialized layer fusion tailored for
Llama models.

## Model Inference

Please run the quantization code first.

### AutoGPTQ/AutoAWQ format

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

### AutoRound format

**CPU**: pip install intel-extension-for-transformers

**HPU**: docker image with Gaudi Software Stack is recommended. More details can be found
in [Gaudi Guide](https://docs.habana.ai/en/latest/).

**CUDA**: pip install auto-gptq for sym quantization (tuning needs auto-round 0.3.0+); for asym quantization, install auto-round from source
**CUDA**: no extra steps are needed for sym quantization; for asym quantization, install auto-round from source

#### CPU/HPU/CUDA on 0.3.0+
#### CPU/HPU/CUDA

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig
backend = "auto" ##cpu, hpu, cuda, cuda:marlin('pip install -v gptqmodel --no-build-isolation')
backend = "auto" ##cpu, hpu, cuda, cuda:marlin(supported in auto_round>0.3.1 'pip install -v gptqmodel --no-build-isolation')
quantization_config = AutoRoundConfig(
backend=backend
)
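# NOTE: the model-loading lines are collapsed in the diff view here; presumably they follow the
# other inference examples in this README, roughly:
# quantized_model_path = "./tmp_autoround"
# model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto",
#                                              quantization_config=quantization_config)
# tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
# text = "There is a girl who likes adventure,"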
@@ -243,13 +231,23 @@ inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

#### CPU/HPU/CUDA on 0.3.0
<br>
<details>
<summary>Evaluation</summary>

**CUDA**: need to install auto-round from source
```bash
auto-round --model saved_quantized_model \
--eval \
--task lambada_openai \
--eval_bs 1
```

</details>

### AutoGPTQ/AutoAWQ format

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round.auto_quantizer import AutoHfQuantizer ## must import
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
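# NOTE: the rest of this call and the next few lines are collapsed in the diff view; presumably,
# as in the other inference examples in this README:
#                                              device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
# text = "There is a girl who likes adventure,"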
@@ -260,20 +258,6 @@ inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

<br>
<details>
<summary>Evaluation</summary>

```bash
## version > 0.3.0
auto_round --model saved_quantized_model \
--eval \
--task lambada_openai \
--eval_bs 1
```

</details>

## Support List

AutoRound supports basically all the major large language models.
2 changes: 1 addition & 1 deletion auto_round/version.py
@@ -14,4 +14,4 @@
"""Intel® auto-round: An open-source Python library
supporting popular model weight only compression based on signround."""

__version__ = "0.3.1.dev"
__version__ = "0.4.0.dev"
4 changes: 2 additions & 2 deletions requirements.txt
@@ -3,11 +3,11 @@ datasets
py-cpuinfo
sentencepiece
torch
transformers
transformers<=4.45.2
triton
numpy < 2.0
threadpoolctl
lm-eval==0.4.4
lm-eval>=0.4.2,<=0.4.5
tqdm
packaging
auto-gptq>=0.7.1
