Releases: intel/auto-round
v0.4
Highlights
[Experimental Feature] We provide API support for VLM models (a hedged usage sketch follows this list)
[Kernel] We add IPEX support for Intel CPU
[Bug fix] We fix a tuning bug for the GLM-4 model
[Enhancement] Better alignment of gradient_accumulate_steps behavior for varied-length inputs
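For the experimental VLM support, a minimal usage sketch is below. The `AutoRoundMLLM` entry point and its keyword arguments are assumptions inferred from the PR titles in this release (e.g. #276, #296, #334); consult the repository examples for the exact, current API.

```python
# Hedged sketch: quantizing a vision-language model with the experimental multimodal API.
# `AutoRoundMLLM` and the keyword arguments shown are assumptions drawn from the PR
# titles in this release; check the repository examples for the exact signature.
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
from auto_round import AutoRoundMLLM  # assumed entry point for multimodal models

model_name = "Qwen/Qwen2-VL-2B-Instruct"  # illustrative choice, not an official recipe
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

autoround = AutoRoundMLLM(
    model,
    tokenizer,
    processor=processor,  # assumed: the processor supplies image preprocessing
    bits=4,
    group_size=128,
)
autoround.quantize()
autoround.save_quantized("./Qwen2-VL-2B-Instruct-int4", format="auto_round")
```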
What's Changed
- refine AutoRound format and support marlin repacking by @wenhuach21 in #280
- update readme for v0.3.1 release by @wenhuach21 in #283
- update readme for cpu inference by @wenhuach21 in #284
- avoid deterministic algorithm warning in inference by @wenhuach21 in #285
- fix mx_fp issues by @wenhuach21 in #286
- update torch ao integration information by @wenhuach21 in #287
- Refine code by @wenhuach21 in #291
- Add ipex support for intel cpu by @wenhuach21 in #292
- fix ipex tqdm mismatch issue by @wenhuach21 in #293
- fix bug of backend by @wenhuach21 in #294
- [Experimental Feature] support for common hf multimodal models by @n1ck-guo in #276
- use torch.compile by default for PyTorch versions 2.6 and above by @wenhuach21 in #295
- refine forward hook by @WeiweiZhang1 in #290
- eval for MLLMs by @n1ck-guo in #296
- mllm eval bug fix by @n1ck-guo in #297
- Port Numba-based packing from INC by @yiliu30 in #301
- refine model config file for mixed precision quantization by @wenhuach21 in #300
- fix glm4-9b batch dim issue by @wenhuach21 in #304
- better align gradient_accumulate_steps for varied length input by @wenhuach21 in #309
- Enable torch.compile on HPU by @yiliu30 in #307
- Update autogptq exporting by @wenhuach21 in #310
- fix typo by @wenhuach21 in #311
- qwen2 vision quantization bugfix by @WeiweiZhang1 in #313
- multiple gpu evaluation/calibration refine by @wenhuach21 in #312
- HPU only release binary by @yiliu30 in #302
- patch 1 for mllm by @n1ck-guo in #298
- add torch compile arg by @wenhuach21 in #314
- fix merge error by @n1ck-guo in #316
- Update the check for HPU by @yiliu30 in #318
- fix eval device issue by @wenhuach21 in #319
- fix multiple device bug by @wenhuach21 in #321
- add warning for no gptq exllamav2 kernel by @wenhuach21 in #324
- add pile calib, rename quant_block_list to to_quant_block_names by @WeiweiZhang1 in #322
- fix autogptq version error by @wenhuach21 in #325
- new mllm eval by @n1ck-guo in #317
- Add cpu only version by @XuehaoSun in #315
- set default mllm dataset by @n1ck-guo in #327
- fix fp_layers issue and force to FP16 on cuda for autoround format inference by @wenhuach21 in #326
- fix the bug of test model support for test-only by @n1ck-guo in #328
- Increase unit test timeout to 120 minutes by @XuehaoSun in #330
- fix mllm dataset config bug and add gptq cuda backend by @wenhuach21 in #329
- add tips and tricks for llm&mllm quantization by @wenhuach21 in #333
- fix eval_bs in fake format and reset auto-gptq exporting max_shard_size by @wenhuach21 in #332
- fix model_dtype issue and reformat mllm code by @wenhuach21 in #335
- Exclude markdown files from unit test pipelines by @XuehaoSun in #337
- refine mllm docs by @WeiweiZhang1 in #336
- cogvlm doc by @n1ck-guo in #339
- add qwen2.5 recipe and refine readme by @WeiweiZhang1 in #338
- add cogvlm recipe and refine readme by @WeiweiZhang1 in #340
- refine mllm API and add help info by @n1ck-guo in #334
Full Changelog: v0.3.1...v0.4
Intel® auto-round v0.3.1 Release
Release Highlights:
New Features:
Full-Range Symmetric Quantization: We've introduced full-range symmetric quantization, which often matches or even exceeds the performance of asymmetric quantization, especially at lower bit widths such as 2 bits (a short illustration follows this list).
Command-Line Support: You can now quantize models using the command auto-round --model xxx --format xxx.
Default Exporting Format Change: The default format has been updated to auto_round instead of auto_gptq.
Multi-threaded Packing: Up to a 2X speedup in the packing phase.
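To make "full-range" concrete: conventional symmetric quantization derives the scale from the largest positive code (2^(b-1) - 1), leaving the most negative code essentially unused, while full-range derives it from -2^(b-1), so the whole integer grid is usable; the difference matters most at 2 bits. The snippet below is a simplified per-tensor illustration of that idea, not the library's exact implementation.

```python
import torch

def quantize_sym(w: torch.Tensor, bits: int = 2, full_range: bool = True) -> torch.Tensor:
    """Simplified per-tensor symmetric quantize-dequantize sketch (not the auto-round code)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if full_range:
        # Full-range: scale from the most negative code, so -2 is representable at 2 bits.
        scale = w.abs().max() / abs(qmin)
    else:
        # Conventional: scale from the largest positive code, leaving -2^(b-1) mostly unused.
        scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), qmin, qmax)
    return q * scale  # dequantized ("qdq") weights

w = torch.randn(8)
print(quantize_sym(w, bits=2, full_range=True))
print(quantize_sym(w, bits=2, full_range=False))
```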
Bug Fixes:
Resolved Missing Cached Position Embeddings: Fixed an issue with missing cached position embeddings in Transformers version 4.45.2.
Mutable Default Values Issue: Addressed problems related to mutable default values.
3-bit Packing Bug: Fixed 3-bit packing for the AutoGPTQ format.
What's Changed
- Add setseed in autoround by @WeiweiZhang1 in #201
- support autoawq format by @yintong-lu in #115
- Remove UT coverage check by @XuehaoSun in #202
- set autoround format as default to unify CPU/HPU/CUDA by @wenhuach21 in #205
- add local file of pile-10k by @WeiweiZhang1 in #198
- modify setup.py by @n1ck-guo in #206
- limit the scale minimum value not to 0 by @WeiweiZhang1 in #211
- fix example dataset regression by @WeiweiZhang1 in #212
- remove local pile file by @WeiweiZhang1 in #213
- update xpu format exporting by @WeiweiZhang1 in #214
- fix a bug in autoround format inference by @wenhuach21 in #215
- avoid underflow and overflow for exllamav2 by @wenhuach21 in #218
- add qwen int4 model, refine example by @WeiweiZhang1 in #217
- [Experimental Feature]fast tuning norm/bias at 2 bits by @wenhuach21 in #208
- update readme by @wenhuach21 in #220
- refine eval_042 to enable parallelize evaluation by @WeiweiZhang1 in #221
- Enable phi3v tuning by @WeiweiZhang1 in #197
- Bump setuptools from 69.5.1 to 70.0.0 in /examples/multimodal-modeling/Phi-3-vision by @dependabot in #223
- refine example by @WeiweiZhang1 in #224
- change the scale thresh generally by @WeiweiZhang1 in #229
- add quantized models by 3rd party by @WeiweiZhang1 in #230
- add meta3.1-70B-instruct model, refine docs by @WeiweiZhang1 in #231
- fix model link by @WeiweiZhang1 in #232
- refine docs, add accuracy data, add receip and eval scripts by @WeiweiZhang1 in #226
- add brief formats introduction by @wenhuach21 in #236
- update readme and add itrex in the requirements.txt by @wenhuach21 in #238
- add tritonv2, improve packing and pbar by @wenhuach21 in #239
- refine the code and the speedup is notable by @wenhuach21 in #240
- move some settings from example to main by @wenhuach21 in #241
- add runable script for autoround by @n1ck-guo in #225
- update readme by @n1ck-guo in #242
- Add MANIFEST.in file to include requirements.txt by @XuehaoSun in #243
- fix example bug by @n1ck-guo in #245
- enable llava int4 inference with autoround format by @WeiweiZhang1 in #237
- remove autoawq requirement at packing stage by @n1ck-guo in #249
- remove unused log by @n1ck-guo in #252
- support INC API by @WeiweiZhang1 in #255
- avoid potential bug for auto-gptq 0.8 by @wenhuach21 in #250
- fix example by @n1ck-guo in #256
- fix preci by @n1ck-guo in #258
- enable_qwen2-vl_quantization by @WeiweiZhang1 in #248
- update eval and fix example by @n1ck-guo in #260
- refine autoawq exporting code by @wenhuach21 in #261
- better support quant_lm_head for larger models by @wenhuach21 in #263
- Fix 3bit packing for auto-gptq format by @wenhuach21 in #264
- Add a warning for improper export formats. by @wenhuach21 in #265
- Update readme for VLM support and integration by @wenhuach21 in #266
- remove g_idx in gptq format by @wenhuach21 in #267
- keep the dtype after qdq by @wenhuach21 in #268
- enable llama3.2-vision model quantization by @WeiweiZhang1 in #269
- fix mutable default value by @wenhuach21 in #272
- change to even rounding for mantissa of mx_fp by @wenhuach21 in #277
- adamround bugfix, refine import by @WeiweiZhang1 in #275
- [Important Change]set full range sym as the default by @wenhuach21 in #278
- refine eval by @wenhuach21 in #282
- qwen2_bugfix, add adamround vision UT by @WeiweiZhang1 in #281
New Contributors
- @dependabot made their first contribution in #223
Full Changelog: v0.3...v0.3.1
Intel® auto-round v0.3 Release
Highlights:
- Broader Device Support:
  - Expanded support for CPU, HPU, and CUDA inference in the AutoRound format, resolving the 2-bit accuracy issue.
- New Recipes and Model Releases:
  - Published numerous recipes on the Low Bit Open LLM Leaderboard, showcasing impressive results on LLaMa 3.1 and other leading models.
- Experimental Features:
  - Introduced several experimental features, including activation quantization and mx_fp, with promising outcomes with AutoRound (a hedged usage sketch appears at the end of this section).
- Multimodal Model Support:
  - Extended capabilities for tuning and inference across several multimodal models.
Lowlights:
- Implemented support for low_cpu_mem_usage, the auto_awq format, calibration dataset concatenation, and calibration datasets with chat templates.
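As a rough illustration of how the experimental features above might be requested through the tuning API, see the sketch below. The `act_bits` and `data_type` keyword arguments, and the model choice, are assumptions; verify them against the AutoRound signature of the version you install.

```python
# Hedged sketch: requesting the experimental options named above. `act_bits` and
# `data_type` are assumed keyword arguments -- verify against the installed version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    act_bits=8,            # experimental activation quantization (assumed flag)
    # data_type="mx_fp",   # experimental mx_fp data type (assumed value), as an alternative
)
autoround.quantize()
```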
Intel® auto-round v0.2 Release
Overview
We added support for the Intel XPU format and implemented lm-head quantization and inference, reducing the model size from 5.4GB to 4.7GB for LLaMA3 at W4G128. Additionally, we support both local and mixed online datasets for calibration. By optimizing memory usage and tuning cost, the calibration process now takes approximately 20 minutes for 7B models and 2.5 hours for 70B models with 512 samples when disable_low_gpu_mem_usage is set.
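The lm-head saving quoted above can be sanity-checked with a quick back-of-the-envelope calculation. The numbers below assume the LLaMA3-8B lm-head dimensions (vocab 128256, hidden 4096) and ignore per-group scales, so they are approximate.

```python
# Back-of-the-envelope check of the lm-head saving quoted above (LLaMA3-8B assumed):
# the lm-head is a vocab_size x hidden_size matrix, so quantizing it from FP16 to
# ~4 bits recovers roughly the 0.7 GB difference between 5.4 GB and 4.7 GB.
vocab_size, hidden_size = 128_256, 4_096   # LLaMA3-8B dimensions
params = vocab_size * hidden_size          # ~5.25e8 weights in the lm-head
fp16_gb = params * 2 / 1e9                 # ~1.05 GB at 16 bits per weight
int4_gb = params * 0.5 / 1e9               # ~0.26 GB at 4 bits (scales ignored)
print(f"lm-head FP16: {fp16_gb:.2f} GB, W4: {int4_gb:.2f} GB, saved: {fp16_gb - int4_gb:.2f} GB")
```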
Others:
More accuracy data is available in the [paper](https://arxiv.org/pdf/2309.05516) and on the [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)
More technical details are available in the [paper](https://arxiv.org/pdf/2309.05516)
Known issues:
There is a large discrepancy between the GPTQ model and the QDQ model for asymmetric quantization in some scenarios. We are working on it.
Intel® auto-round v0.1 Release
Overview
AutoRound introduces an innovative weight-only quantization algorithm designed specifically for low-bit LLM inference, approaching near-lossless compression for a range of popular models including Gemma-7B, Mistral-7B, Mixtral-8x7B-v0.1, Mixtral-8x7B-Instruct-v0.1, Phi-2, LLaMA2, and more at W4G128. AutoRound consistently outperforms established methods across the majority of scenarios at W4G128, W4G-1, W3G128, and W2G128.
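For context, the core recipe from the AutoRound paper (arXiv:2309.05516, linked above) keeps standard uniform quantization but learns a small per-weight rounding offset (and min/max adjustments) via signed gradient descent. The sketch below shows only the quantize-dequantize step for a W4G128-style configuration; variable names are illustrative and not the library's internals.

```python
import torch

def autoround_qdq(w: torch.Tensor, v: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Minimal sketch of the quantize-dequantize step: uniform asymmetric quantization
    with a learned per-weight rounding offset v in [-0.5, 0.5] (illustrative only)."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)   # "G128" means groups of 128 weights share a scale
    v = v.reshape(-1, group_size)
    qmax = 2 ** bits - 1            # e.g. 15 for W4 (16 integer levels)
    wmin = w.min(dim=1, keepdim=True).values
    wmax = w.max(dim=1, keepdim=True).values
    scale = (wmax - wmin) / qmax
    zp = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale + v) + zp, 0, qmax)  # v nudges hard rounding decisions
    return ((q - zp) * scale).reshape(orig_shape)

# Usage: v is tuned by signed gradient descent during calibration; here it is just zeros.
w = torch.randn(256, 128)
print(autoround_qdq(w, torch.zeros_like(w)).shape)
```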
Key Features
- Wide Model Support: AutoRound caters to a diverse range of model families. About 20 model families have been verified.
- Export Flexibility: Effortlessly export quantized models to ITREX[1] and AutoGPTQ[2] formats for seamless deployment on Intel CPU and Nvidia GPU platforms respectively.
- Device Compatibility: Compatible with tuning devices including Intel CPUs, Intel Gaudi2, and Nvidia GPUs.
- Dataset Flexibility: AutoRound supports calibration with Pile10k and MBPP datasets, with easy extensibility to incorporate additional datasets.
Examples
- Explore language modeling and code generation examples to unlock the full potential of AutoRound.
Additional Benefits
- PreQuantized Models: Access a variety of pre-quantized models on Hugging Face for immediate integration into your projects, with more models under review and coming soon.
- Comprehensive Accuracy Data: Simplified user deployment with extensive accuracy data provided.
Known issues:
- baichuan-inc/Baichuan2-13B-Chat has some issues; we will support it soon.
Reference:
[1] https://github.com/intel/intel-extension-for-transformers