
[Feature]: FP6 #4515

Open
nivibilla opened this issue May 1, 2024 · 16 comments · May be fixed by #8751

@nivibilla

🚀 The feature, motivation and pitch

FP6 allows models such as Llama 70B to fit on a single A100 GPU. 6-bit is also often the sweet spot between model quality and speed. This comes from a DeepSpeed paper and is integrated into DeepSpeed-MII.

But they also publish the code and kernels separately:
https://github.com/usyd-fsalab/fp6_llm
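
A quick back-of-envelope check of the single-A100 claim above (weights only, ignoring KV cache and activations; assumes an 80 GB A100 and a round 70e9 parameter count):

```python
# Rough weight-memory estimate for a 70B-parameter model.
params = 70e9
print(f"FP16: {params * 2 / 2**30:.1f} GiB")     # ~130.4 GiB -> does not fit on one 80 GB A100
print(f"FP6:  {params * 0.75 / 2**30:.1f} GiB")  # ~48.9 GiB  -> fits, leaving room for KV cache
```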

Alternatives

No response

Additional context

No response

@mgoin (Member) commented May 1, 2024

@nivibilla thanks for sharing the standalone kernel implementation - this makes it a lot more straightforward to understand, and I would be interested in implementing this within vLLM.
We will face the same problem we have with dynamic FP8 quantization: we have to fully load the model weights in their original precision (i.e. FP16) before we can quantize them down. This means the peak memory consumption will still be equivalent to the original model weights. We will address this soon with a weight-loader refactor.
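
A minimal sketch of that load-then-quantize flow, with a simple per-group INT8 round-trip standing in for the actual FP6 packing (the helpers below are illustrative, not an existing vLLM or DeepSpeed API):

```python
import torch

def quantize_weight(w: torch.Tensor, group_size: int = 128):
    """Per-group symmetric INT8 quantization, a stand-in for 6-bit packing.
    Assumes w.numel() is divisible by group_size."""
    groups = w.reshape(-1, group_size)
    scales = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    packed = torch.clamp((groups / scales).round(), -127, 127).to(torch.int8)
    return packed, scales.half()

def load_and_quantize(state_dict_fp16: dict) -> dict:
    quantized = {}
    for name in list(state_dict_fp16):
        w = state_dict_fp16.pop(name)
        # Peak memory: the full FP16 tensor must exist here before it can be
        # packed down, so loading still costs as much as the original checkpoint.
        quantized[name] = quantize_weight(w.half())
        del w  # drop the last reference so the FP16 copy can be freed
    return quantized
```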

I just want to share that they did compare against fine-grained and coarse-grained W4A16 INT4 kernels, for which we already have very good implementations, and saw performance slightly below them, as expected. So while I don't think this will offer a particularly new capability in vLLM, it would be very nice to get relatively accurate 6-bit model compression with just a runtime flag.
[image: FP6-LLM benchmark comparing against fine-grained and coarse-grained W4A16 INT4 kernels]

@nivibilla (Author)

Thanks @mgoin, yes the kernel performance isn't as good as INT4. However, model quality is nearly indistinguishable from FP16, which is really nice. I hope FP6 becomes the new standard instead of FP8; there's no need to keep the weights in any higher precision. I think it's a nice tradeoff versus INT4: better quality, slightly slower.

And yes, the weight loading may be an issue. I want to load two replicas of a model on the same GPU, so if one of them takes up the entire GPU's memory while loading, the other will fail. Hope this can be fixed too.

@rkooo567 (Collaborator) commented May 1, 2024

cc @comaniac

@comaniac (Collaborator) commented May 1, 2024

Thanks for the request. We can definitely integrate FP6 quantization into vLLM as another supported quantization method to run FP6 models. It shouldn't be too hard given that it only quantizes linear layers, and the FP6 linear kernels are open source.

Meanwhile, I'd still keep FP8 as the standard (actually the term "standard" is not really important, because FP8 is also implemented as a quantization method and the choice is up to users). The reason is that FP8 is officially supported by GPU vendors (both NVIDIA and AMD) at the instruction level, meaning that 1) the vendors will maintain compatibility and performance in future GPU releases, and 2) more workloads (e.g., FP8 flash attention and the KV cache) can be covered.

@mgoin (Member) commented May 9, 2024

It seems support for this will land in #4652 as quantization="deepspeedfp"
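
Assuming that flag lands as described in #4652, runtime quantization from an FP16 checkpoint would look something like the sketch below with the offline LLM entrypoint (note that @twaka's recipe later in this thread also drops a quant_config.json into the model directory):

```python
from vllm import LLM, SamplingParams

# Load an FP16 checkpoint and quantize the weights down at load time.
llm = LLM(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    quantization="deepspeedfp",
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```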

@comaniac (Collaborator) commented May 9, 2024

Not really. The PR you pointed out only uses FP6/8 checkpoints. The compute is still in FP16.

@mgoin (Member) commented May 9, 2024

@comaniac FP6_LLM is weight-only quantization, i.e. W6A16; you can see this in the graph I shared in my comment above. There are no compute savings with this method compared to FP16. Also, the PR I pointed to allows quantizing at runtime, like our FP8 quantization, not just loading pre-quantized checkpoints.

@comaniac (Collaborator) commented May 9, 2024

Thanks for the clarification. Then we can close this issue I suppose?

@nivibilla (Author)

@mgoin I'm a bit confused, why does FP6 not save VRAM? Even if the activations are in FP16, surely the weights being in FP6 save memory, right?

@twaka (Contributor) commented May 10, 2024

Is this feature now usable with non-Arctic models?

Installed from source and it works as expected, amazing!

  1. Download a model, e.g. https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct
  2. Put https://huggingface.co/Snowflake/snowflake-arctic-instruct/blob/main/quant_config.json into the model dir
  3. python -m vllm.entrypoints.openai.api_server --model ./NousResearch/Meta-Llama-3-8B-Instruct --quantization deepspeedfp

Loaded model weights are reduced as well:
no quantization: model_runner.py:167] Loading model weights took 14.9595 GB
8 bits: model_runner.py:167] Loading model weights took 8.6860 GB
6 bits: model_runner.py:167] Loading model weights took 7.0610 GB
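
Those sizes are roughly consistent with a back-of-envelope estimate, assuming (not confirmed in this thread) that the ~1.05B embedding/LM-head parameters stay in FP16 and the remaining ~7B transformer parameters are packed to 8 or 6 bits; the small remaining gap would be per-group scale overhead:

```python
# Rough size estimate for Llama-3-8B under weight-only quantization.
total = 8.03e9                 # total parameters
embed = 2 * 128_256 * 4_096    # embed_tokens + lm_head (assumed to stay FP16)
body = total - embed

gib = lambda nbytes: nbytes / 2**30
print(f"FP16:  {gib(total * 2):.2f} GiB")                # ~14.96 (reported 14.9595)
print(f"8-bit: {gib(embed * 2 + body * 1.0):.2f} GiB")   # ~8.46  (reported 8.6860)
print(f"6-bit: {gib(embed * 2 + body * 0.75):.2f} GiB")  # ~6.83  (reported 7.0610)
```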

@mgoin (Member) commented May 10, 2024

Hi @nivibilla, you just misunderstood what I said. I said there are no compute savings, meaning the computation still happens entirely at FP16 precision. This does not imply there are no memory savings, which very much happen.

@nivibilla (Author)

@mgoin ohhh I see. Lol mb

@wearegolden

@twaka
Hi there, I've tried it myself and also observed the reduced model weights using Llama-3-8B.
But for me, although the memory consumption decreased, the latency seems to increase significantly when using DeepSpeed FP6/FP8.
I was just wondering if you observed the same thing!

@twaka (Contributor) commented Jul 21, 2024

I think the increase in latency is expected until fp6_llm's kernel is integrated, since dequantize and matmul are not fused in the current deepspeedfp implementation.
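
A minimal sketch of that unfused path, with an INT8 round-trip standing in for the FP6 packing: each forward pass materializes a full FP16 weight tensor before the GEMM, which is the extra allocation and memory traffic a fused weight-only kernel like fp6_llm's avoids by dequantizing inside the matmul:

```python
import torch

def unfused_weight_only_linear(x, packed, scales, group_size=128):
    # packed: (out_features, in_features) int8; scales: (out*in // group_size, 1) fp16
    out_features, in_features = packed.shape
    # Step 1: dequantize to a full FP16 weight tensor (extra allocation + traffic).
    w_fp16 = (packed.reshape(-1, group_size).half() * scales).reshape(out_features, in_features)
    # Step 2: standard FP16 GEMM -- W6A16, so compute is unchanged vs. plain FP16.
    return x @ w_fp16.t()
```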

@AlpinDale (Contributor)

Support is being added in #8751.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Dec 25, 2024