
ExLlamaV2: exl2 support #3203

Open
pabl-o-ce opened this issue Mar 5, 2024 · 29 comments · May be fixed by #11348
@pabl-o-ce

If possible, please add support for ExLlamaV2. It is a very fast, high-quality library for running LLMs.

ExLlamaV2 Repo

@mgoin
Member

mgoin commented Mar 5, 2024

@pabl-o-ce What specifically would you like to see? vLLM already integrates kernels from exllamav2 - see the GPTQ kernels, for example

Copied from https://github.com/turboderp/exllamav2
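
For anyone who wants to exercise that existing GPTQ path (the one the exllamav2-derived kernels currently power), here is a minimal sketch using vLLM's Python API; the checkpoint name is only an example placeholder, substitute any GPTQ-quantized model you actually have:

```python
# Minimal sketch: loading a GPTQ-quantized model with vLLM, which runs
# through the GPTQ kernels mentioned above. The checkpoint name is an
# example placeholder, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # example GPTQ checkpoint
    quantization="gptq",               # select the GPTQ kernel path
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Summarize EXL2 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```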

@pabl-o-ce
Author

Hi @mgoin, thanks for the response.

So is it possible to use the exl2 format with vLLM, or does it only use the GPTQ format?
Sorry if this is a very n00b question.

@tobiajung

Hi,
it seems like vLLM 0.3.3 has exl2 support, but I'm not able to get a model up and running. I use the Docker environment with the following args: `--model LoneStriker/CodeFuse-DeepSeek-33B-4.0bpw-h6-exl2 --gpu-memory-utilization 0.65 --max-model-len 2048`.
But vLLM seems to allocate much more memory than the given 0.65 (of 48 GB) and fails with an error.

Is exl2 properly supported? How do I start the Docker container correctly to run exl2 models?

@wxupjack

Hi @mgoin, I think the feature submitted by @chu-tianxiang in #2330 and #916 just utilizes the shuffle and dequant functions from the exllamav2 repo for GPTQ. It does not mean that vLLM (main branch) is compatible with the dynamic-precision exl2 format.

So exl2 still isn't properly supported. @tobiajung
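
To make the distinction concrete: GPTQ applies one uniform bit-width across the model, while EXL2 mixes per-layer bit-widths so the average bits per weight can hit a fractional target. A rough illustration (the layer sizes and bit assignments below are invented for the example):

```python
# Illustrative only: how EXL2-style mixed per-layer bit-widths yield a
# fractional average bpw, versus a single uniform GPTQ width.
# All numbers below are made up for the example.
layers = {
    "attn_qkv": (50_000_000, 6),   # (parameter count, bits assigned)
    "attn_out": (20_000_000, 4),
    "mlp_up":   (90_000_000, 4),
    "mlp_down": (90_000_000, 5),
}

total_bits = sum(n * b for n, b in layers.values())
total_params = sum(n for n, _ in layers.values())
print(f"EXL2-style average bpw: {total_bits / total_params:.2f}")  # ~4.76

gptq_bits = 4  # GPTQ uses one width everywhere
print(f"GPTQ bpw: {gptq_bits}")
```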

@sapountzis

We are also interested in the exl2 dynamic-precision format. What steps would be needed to support it?

@nkeilar

nkeilar commented Apr 5, 2024

Support for this would allow loading exl2-quantized weights for the DBRX model, enabling inference on a dual 24 GB GPU system.

@zminer123

Hello there! Has any more thought/attention been given to the idea of exl2 support? The newest derivatives of Llama 3 (such as Dolphin 70B) use it, and it seems no one is quantizing them to AWQ or GPTQ. I love vLLM regardless! Thank you guys for all the work you put in.

@saucebing

Hi, we are also interested in the EXL2 format, which is quite flexible and fast. As for flexibility, you can quantize a model to, e.g., 3.2, 4.5, or 8.5 bpw (bits per weight). And EXL2 inference is much faster than GPTQ at 8-bit precision.
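
As a rough back-of-the-envelope on what those bpw values buy you (weights only, ignoring KV cache and runtime overhead; the 33B parameter count is just an example):

```python
# Back-of-the-envelope weight memory at the bpw values mentioned above,
# for a hypothetical 33B-parameter model (weights only; no KV cache or overhead).
params = 33e9

for bpw in (3.2, 4.5, 8.5):
    gib = params * bpw / 8 / 2**30
    print(f"{bpw} bpw -> ~{gib:.1f} GiB of weights")
```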

@houmie

houmie commented Apr 28, 2024

Fully agree. Supporting exl2 would be perfect!

@belladoreai

I would also love to see exl2 support

@mku-wedoai

I would love to see EXL2 support in vLLM!

@DenisSergeevitch

exl2 is a needed feature, 100%

@sparsh35

I support it too

@chopin1998

Any update?
I really hope to see full support for the exl2 format ASAP.

@meditans

meditans commented Jun 9, 2024

Also voicing my support!

@kulievvitaly

vLLM is great. Would like to see exl2 support too!

@Respaired

yeah this would be great to have

@paulb-seldon

Would love to see this!

@rjmehta1993

+1. EXL2 quants are unbeatable

@fablerq

fablerq commented Sep 2, 2024

Is there any chance to see exl2? 👀

@DaBossCoda

Can we get this? No one is making AWQ and GPTQ quants anymore :(

@javAlborz

exl2 would be nice 😃

@alkeryn

alkeryn commented Dec 9, 2024

@javAlborz it would !

@SlapDrone

+1!

@alkeryn

alkeryn commented Dec 13, 2024

vLLM's CLI is my favorite so far because it just works, and the API is better than Tabby's.
But god, exl2 is better than AWQ.

@Originalimoc

+1. Most new models are in GGUF and EXL2.

@drexample

Give exl2 support pls

@rsxdalv

rsxdalv commented Dec 19, 2024

Although I am waiting for exl2 support myself, the stream of +1 messages really doesn't help.
If you truly wish to add a +1, state your reasons so that the developers or forkers have a real incentive.
Yes, for inactive issues a +1 might amplify your voice and is better than nothing, but a stream of +1s can be worse than nothing. The probability that the devs have unsubscribed from this issue is fairly high.

@AlpinDale linked a pull request Dec 20, 2024 that will close this issue
@JohnConnor123

we are waiting for exl2 support!
