
New GEMM kernels for weight-only quantization #2090

Merged · 225 commits · Aug 19, 2024
Conversation

@lzhangzz (Collaborator) commented on Jul 19, 2024

f16*u4g128 vs. cublasGemmEx f16*f16, both using HMMA with f32 accumulators, on 32 weight matrices from 8 models ranging from 7B to 72B:

[Figure: benchmark results across the 32 weight matrices]

  • sm90 features are not used yet
  • the sm70 tensor-core and fp16 paths share a pipeline
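For reference, here is a minimal sketch of the cublasGemmEx f16*f16 baseline with f32 accumulation that the new kernels are compared against. The shapes, layout, and the plain GEMM setup below are illustrative assumptions, not taken from the actual benchmark harness.

```cpp
// Minimal sketch of the f16*f16 cuBLAS baseline with f32 accumulation.
// Sizes are illustrative; error handling is omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 128, k = 4096;  // assumed shapes, not from the benchmark

    half *A, *B, *C;
    cudaMalloc(&A, sizeof(half) * m * k);
    cudaMalloc(&B, sizeof(half) * k * n);
    cudaMalloc(&C, sizeof(half) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // alpha/beta are f32 because the compute type is CUBLAS_COMPUTE_32F.
    float alpha = 1.f, beta = 0.f;

    // Column-major GEMM: C(m,n) = A(m,k) * B(k,n), f16 in/out, f32 accumulators.
    // On Volta and newer GPUs cuBLAS dispatches this to HMMA tensor-core kernels.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_32F,
                 CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```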

@lvhan028 (Collaborator) commented:

The main branch has an issue with OpenGVLab/InternVL2-40B-inner-4bits; it is unrelated to this PR. @zhulinJulia24

@zhyncs (Collaborator) commented on Aug 16, 2024

Verified LMDeploy fp16 and AWQ on H100 with @lzhangzz; both are blazing fast.

@zhyncs (Collaborator) commented on Aug 16, 2024

TODO:

  • Set the default max-batch-size to 256 when VRAM > 40 GB (a sketch follows this list)
  • Resolve the runtime issue on SM90 by disabling certain code sections
  • Consider supporting H100-specific features for quantized inference (can be implemented in follow-up PRs)
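A minimal sketch of how the first item could be decided at startup, using the device's total memory as the threshold. The helper name and the 128 fallback are hypothetical, not LMDeploy's actual API.

```cpp
// Hypothetical helper (not LMDeploy's actual API): choose the default
// max batch size from the device's total VRAM, per the TODO above.
#include <cuda_runtime.h>
#include <cstdio>

int default_max_batch_size(int device_id) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device_id);
    const double total_gib = static_cast<double>(prop.totalGlobalMem) / (1ull << 30);
    // 256 when the card has more than ~40 GB of VRAM (e.g. A100/H100 80 GB);
    // otherwise fall back to a smaller default (128 is an assumption here).
    return total_gib > 40.0 ? 256 : 128;
}

int main() {
    printf("default max_batch_size = %d\n", default_max_batch_size(0));
    return 0;
}
```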

@zhyncs (Collaborator) left a review comment:


Overall LGTM; we just need to resolve the minor H100 issues mentioned above.

Labels: enhancement (New feature or request)
5 participants