update readme for v0.3.1 release #283

Merged: 4 commits, Oct 21, 2024

README.md (192 changes: 88 additions & 104 deletions)
@@ -5,7 +5,7 @@ AutoRound
<h3> Advanced Quantization Algorithm for LLMs</h3>

[![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.3-green)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.3.1-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/auto-round/blob/main/LICENSE)
---
<div align="left">
@@ -29,7 +29,8 @@ more accuracy data and recipes across various models.

* [2024/10] Important update: We now support full-range symmetric quantization and have made it the default
configuration. This approach is typically better or comparable to asymmetric quantization and significantly
outperforms other symmetric variants, especially at low bit-widths like 2-bit. And, no need to compile from source to run
AutoRound format anymore.
* [2024/09] AutoRound format supports several LVM models; check out the examples [Qwen2-Vl](./examples/multimodal-modeling/Qwen-VL), [Phi-3-vision](./examples/multimodal-modeling/Phi-3-vision), [Llava](./examples/multimodal-modeling/Llava)
@@ -56,6 +57,70 @@ pip install auto-round

## Model Quantization

### Basic Usage (Gaudi2/CPU/GPU)

A user guide detailing the full list of supported arguments is provided by calling ```auto-round -h``` on the terminal.
Alternatively, you can use ```auto_round``` instead of ```auto-round```.

```bash
auto-round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format auto_round \
--disable_eval \
--output_dir ./tmp_autoround
```

We provide two recipes, one for best accuracy and one for fast running speed with low memory usage. Details are below.
<details>
<summary>Other Recipes</summary>

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 512 \
--iters 1000 \
--low_gpu_mem_usage \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 128 \
--iters 200 \
--seqlen 512 \
--batch_size 4 \
--disable_eval
```

</details>

#### Formats

**AutoRound Format**: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision inference. [2,4] bits are supported. It also benefits from the Marlin kernel, which can boost inference performance notably. However, it has not yet gained widespread community adoption. For CUDA support, you will need to install from the source.

> Collaborator (review comment): Missing space
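
For reference, a source install typically looks like the sketch below. The clone URL matches this project's repository, but the editable-install step (`pip install -e .`) is an assumption based on common Python packaging practice rather than something this section spells out, so check the repository's own build instructions for CUDA-specific prerequisites.

```bash
# Hypothetical source install for CUDA support of the AutoRound format.
# The URL is the project repository; `pip install -e .` is the usual
# editable-install pattern and may differ from the project's documented steps.
git clone https://github.com/intel/auto-round.git
cd auto-round
pip install -e .
```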

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the community. [2,3,4,8] bits are supported; for 3 bits, run `pip install auto-gptq` before quantization. It also benefits from the Marlin kernel, which can boost inference performance notably. However, **the asymmetric kernel has issues** that can cause considerable accuracy drops, particularly at 2-bit quantization and with small models. Additionally, symmetric quantization tends to perform poorly at 2-bit precision.

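As an illustration of the 3-bit workflow mentioned above, a command along the following lines should work. Note that passing `auto_gptq` to `--format` mirrors the Python API's `format='auto_gptq'` and is an assumption here, so verify the accepted values with `auto-round -h`.

```bash
# Hypothetical 3-bit export to the AutoGPTQ format.
# auto-gptq must be installed first for 3-bit quantization (per the note above);
# the --format value is assumed to accept 'auto_gptq', mirroring the Python API.
pip install auto-gptq
auto-round --model facebook/opt-125m \
    --bits 3 \
    --group_size 128 \
    --format auto_gptq \
    --disable_eval \
    --output_dir ./tmp_autoround
```
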
**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted within the community; only 4-bit quantization is supported. It features specialized layer fusion tailored for Llama models.

### API Usage (Gaudi2/CPU/GPU)

```python
# @@ -67,18 +132,18 @@
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size = 4, 128
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size)
bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

## the best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size)
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)

## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size)
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym )

autoround.quantize()
output_dir = "./tmp_autoround"
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq'(default in version<=0.3.0), 'auto_awq'
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq', 'auto_awq'
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```

@@ -134,103 +199,26 @@ autoround.save_quantized(output_dir, format='auto_round', inplace=True)

</details>

### Basic Usage (version > 0.3.0)

A user guide detailing the full list of supported arguments is provided by calling ```auto_round -h``` on the terminal.
Alternatively, you can use ```auto-round``` instead of ```auto_round```.

```bash
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format auto_round \
--disable_eval \
--output_dir ./tmp_autoround
```

We provide two recipes for best accuracy and fast running speed with low memory. Details as below.
<details>
<summary>Other Recipes</summary>

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 512 \
--iters 1000 \
--low_gpu_mem_usage \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 128 \
--iters 200 \
--seqlen 512 \
--batch_size 4 \
--disable_eval
```

</details>

#### Formats

**AutoRound Format**: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision inference. [2,4] bits are supported. It also benefits from the Marlin kernel, which can boost inference performance notably. However, it has not yet gained widespread community adoption. For CUDA support, you will need to install from the source.

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the
community, [2,3,4,8] bits are supported, for 3 bits, pip install auto-gptq first before quantization. It also benefits
from the Marlin kernel, which can boost inference performance notably. However, **the
asymmetric kernel has issues** that can cause considerable accuracy drops, particularly at 2-bit quantization and small
models.
Additionally, symmetric quantization tends to perform poorly at 2-bit precision.

**AutoAWQ Format**(>0.3.0): This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted
within the community, only 4-bits quantization is supported. It features
specialized layer fusion tailored for Llama models.

## Model Inference

Please run the quantization code first.

### AutoGPTQ/AutoAWQ format

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

### AutoRound format

**CPU**: pip install intel-extension-for-transformers

**HPU**: docker image with Gaudi Software Stack is recommended. More details can be found
in [Gaudi Guide](https://docs.habana.ai/en/latest/).

**CUDA**: pip install auto-gptq for sym quantization(tuning needs auto-round 0.30+), for asym quantization, need to install auto-round from source
**CUDA**: no extra operations are needed for sym quantization; for asym quantization, you need to install auto-round from source

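Taken together, the prerequisites above amount to roughly the following. Only the `intel-extension-for-transformers` install is stated verbatim in this section; the source-install commands for asymmetric CUDA quantization follow the usual git-clone pattern and are an assumption, not steps this README prescribes.

```bash
# CPU backend prerequisite (stated above)
pip install intel-extension-for-transformers

# CUDA, symmetric quantization: no extra packages needed.
# CUDA, asymmetric quantization: install auto-round from source
# (hypothetical steps; follow the repository's own instructions if they differ).
git clone https://github.com/intel/auto-round.git
cd auto-round
pip install -e .
```
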
#### CPU/HPU/CUDA on 0.3.0+
#### CPU/HPU/CUDA

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig

backend = "auto" ##cpu, hpu, cuda, cuda:marlin('pip install -v gptqmodel --no-build-isolation')
backend = "auto" ##cpu, hpu, cuda, cuda:marlin(supported in auto_round>0.3.1 'pip install -v gptqmodel --no-build-isolation')
quantization_config = AutoRoundConfig(
backend=backend
)
# @@ -243,13 +231,23 @@
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

#### CPU/HPU/CUDA on 0.3.0
<br>
<details>
<summary>Evaluation</summary>

**CUDA**: need to install auto-round from source
```bash
auto-round --model saved_quantized_model \
--eval \
--task lambada_openai \
--eval_bs 1
```

</details>

### AutoGPTQ/AutoAWQ format

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round.auto_quantizer import AutoHfQuantizer ## must import

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
# @@ -260,20 +258,6 @@
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

<br>
<details>
<summary>Evaluation</summary>

```bash
## version > 0.3.0
auto_round --model saved_quantized_model \
--eval \
--task lambada_openai \
--eval_bs 1
```

</details>

## Support List

AutoRound supports basically all the major large language models.
auto_round/version.py (2 changes: 1 addition & 1 deletion)
@@ -14,4 +14,4 @@
"""Intel® auto-round: An open-source Python library
supporting popular model weight only compression based on signround."""

__version__ = "0.3.1.dev"
__version__ = "0.4.0.dev"
requirements.txt (4 changes: 2 additions & 2 deletions)
@@ -3,11 +3,11 @@ datasets
py-cpuinfo
sentencepiece
torch
transformers
transformers<=4.45.2
triton
numpy < 2.0
threadpoolctl
lm-eval==0.4.4
lm-eval>=0.4.2,<=0.4.5
tqdm
packaging
auto-gptq>=0.7.1