Possible solution to allow K-quants on models with n_vocab!=32000 #2148
Conversation
…nts to function mostly as normal. This happens when a model has a vocab != 32000, e.g. 32001, which means it is not divisible by 256 or 64. Since the problematic dimensions only apply to `tok_embeddings.weight` and `output.weight` (dimensions 4096 x n_vocab), we can simply quantize these layers to Q8_0, while the majority of the hidden layers are still K-quanted since they have compatible dimensions.
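For readers skimming the thread, here is a minimal sketch of the idea, assuming a llama.cpp-style per-tensor quantization loop. `QK_K`, the `GGML_TYPE_*` enums and `ggml_tensor` are real llama.cpp/ggml names; the helper `pick_tensor_type` and its placement are illustrative, not the literal diff:

```cpp
#include <cstdio>
#include <cstdint>
#include "ggml.h"

#ifndef QK_K
#define QK_K 256   // k-quant super-block size used by llama.cpp
#endif

// Illustrative helper: if a tensor's dimensions are not multiples of QK_K,
// fall back to Q8_0 (which has no such restriction and is understood by all
// existing ggjtv3 clients) instead of aborting the whole quantization.
static ggml_type pick_tensor_type(const ggml_tensor * t, ggml_type wanted_k_type) {
    const int64_t nx = t->ne[0];
    const int64_t ny = t->ne[1];

    if (nx % QK_K != 0 || ny % QK_K != 0) {
        fprintf(stderr,
                "warning: tensor %s has dimensions %lld x %lld, not divisible by %d; "
                "falling back to Q8_0 for this tensor\n",
                t->name, (long long) nx, (long long) ny, QK_K);
        return GGML_TYPE_Q8_0;
    }
    return wanted_k_type;  // e.g. GGML_TYPE_Q4_K for the compatible hidden layers
}
```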
In #2001, @ikawrakow said:
> So perhaps there are other changes to k-quants planned that would also fix this?
Hmm yeah, but this does not exclude K-quants eventually supporting arbitrary dimensions. I do think this can work as a decent stopgap solution, since it's both backwards and forwards compatible. Backwards compatible, because all existing ggjtv3 clients already know the Q8_0 format and will be able to use it seamlessly (for just these 2 tensors) mixed in with the rest of the k-quantized tensors, no changes required. Forwards compatible, because once K-quants do support tensors that aren't multiples of 256/64, the layer doesn't get marked as incompatible and converted. So it's more of a fallback: instead of K-quants failing completely with an error, it provides a warning plus a slightly sub-optimal (but fully functional) mostly-K-quant output, with no new parts needed.
I think it is OK to quantize
Thank you, LostRuins! As discussed on Discord, this change is excellent for me because it allows me to safely put out k-quants for non-32000 Llama models, without requiring any special work by the user to use them. They will work with whatever client/UI/library the user is using, and then they will be able to benefit from k-quants on these non-standard models with very minimal cost. I can warn them that the file sizes are slightly larger than a normal k-quant, but because it's extremely minimal I think they will be very happy. I tested the change and confirmed that a q4_K_M of a 13B 32,001 model (WizardLM 13B 1.1) quantized with your fix inferred fine in base llama.cpp. The file was around 150MB bigger than the same quant made with llama.cpp. It's really great that llama.cpp is able to handle these alternative quantisations on a per-tensor basis. That's really flexible.
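As a rough back-of-the-envelope check on that size difference (my own numbers, not from the thread): a 13B LLaMA has n_embd = 5120, so `tok_embeddings.weight` and `output.weight` each hold about 164M weights, and moving them to Q8_0 (8.5 bits/weight) from the roughly 4.5 and 6.5 bits/weight a q4_K_M mix would normally use lands in the same ballpark as the observed ~150 MB:

```cpp
#include <cstdio>
#include <cstdint>

// Back-of-the-envelope estimate. Assumptions (not from the thread): in a
// normal q4_K_M mix, tok_embeddings.weight would be Q4_K (144 bytes / 256
// weights) and output.weight would be Q6_K (210 bytes / 256 weights).
int main() {
    const int64_t n_embd  = 5120;                      // 13B LLaMA
    const int64_t n_vocab = 32001;                     // e.g. WizardLM 13B 1.1
    const double  n       = double(n_embd) * n_vocab;  // weights per tensor (~164M)

    const double bytes_per_weight_q8_0 = 34.0  / 32;   // 1.0625 (8.5 bits)
    const double bytes_per_weight_q4_k = 144.0 / 256;  // 0.5625 (4.5 bits)
    const double bytes_per_weight_q6_k = 210.0 / 256;  // ~0.82  (6.56 bits)

    const double extra = n * (bytes_per_weight_q8_0 - bytes_per_weight_q4_k)   // tok_embeddings
                       + n * (bytes_per_weight_q8_0 - bytes_per_weight_q6_k);  // output
    printf("estimated overhead: %.0f MiB\n", extra / (1024.0 * 1024.0));       // ~116 MiB
    return 0;
}
```

The estimate coming in a little under the observed figure is expected, since the exact baseline mix differs per model.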
I think that would be fine too. Perhaps the best compromise.
Unfortunately, Q8_0 is not implemented in Metal.
Co-authored-by: Georgi Gerganov <[email protected]>
…ort, instead quantize `tok_embeddings.weight` to Q4_0 and retain `output.weight` as F16. This results in a net gain of about 55 MB for a 7B model compared to the previous approach, but should minimize the adverse impact on model quality.
Ah, I didn't realise Metal couldn't do q8_0. Yes, that sounds like a good idea then. Thanks for working on this!
As an alternative, to avoid failing due to lack of Q8_0 support on Metal, instead quantize `tok_embeddings.weight` to Q4_0 and retain `output.weight` as F16.
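A sketch of what the revised fallback could look like (illustrative only, not the literal patch; the tensor names and `GGML_TYPE_*` values come from the discussion above, the helper itself is hypothetical):

```cpp
#include <stdexcept>
#include <string>
#include "ggml.h"

// Revised fallback sketch: avoid Q8_0 (not implemented in the Metal backend)
// and instead keep output.weight in F16 and drop tok_embeddings.weight to
// Q4_0, both of which Metal already handles.
static ggml_type fallback_type_for(const std::string & name) {
    if (name == "output.weight") {
        return GGML_TYPE_F16;
    }
    if (name == "tok_embeddings.weight") {
        return GGML_TYPE_Q4_0;
    }
    // any other k-quant-incompatible tensor is unexpected for a LLaMA model
    throw std::runtime_error("unsupported tensor size encountered: " + name);
}
```

The commit's ~55 MB figure for a 7B checks out: with n_embd = 4096 and n_vocab = 32001 each tensor holds ~131M weights, so Q8_0 + Q8_0 costs about 279 MB while Q4_0 + F16 costs about 336 MB, roughly 57 MB more.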
If no other issues, will merge in a few hours.
This allows LLAMA models that were previously incompatible with K-quants to function mostly as normal. This happens when a model has a vocab != 32000, e.g. 32001, which means it is not divisible by 256 or 64.

Since the problematic dimensions only apply to `tok_embeddings.weight` and `output.weight` (dimensions 4096 x n_vocab), which are a comparatively small part of a llama model, we can simply quantize these layers to Q8_0, while the majority of the hidden layers are still K-quanted since they have compatible dimensions.

What do you all think? @TheBloke @ikawrakow