
Possible solution to allow K-quants on models with n_vocab!=32000 #2148

Merged · 3 commits · Jul 11, 2023

Conversation

LostRuins
Collaborator

This allows LLaMA models that were previously incompatible with K-quants to function mostly as normal. The incompatibility happens when a model has n_vocab != 32000, e.g. 32001, which is not divisible by 256 or 64.

Since the problematic dimensions only affect tok_embeddings.weight and output.weight (dimensions 4096 x n_vocab), which are a comparatively small part of a LLaMA model, we can simply quantize these two tensors to Q8_0, while the majority of the hidden layers are still K-quanted since they have compatible dimensions.
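Roughly, the type selection during quantization would look like the sketch below (a simplified, hypothetical version; `pick_quant_type` and the type enum are illustrative stand-ins, not the actual llama.cpp code):

```cpp
// Simplified sketch of the fallback, not the actual PR diff.
#include <cstdint>
#include <cstdio>
#include <string>

enum fake_type { TYPE_Q4_K, TYPE_Q8_0 };   // illustrative stand-ins for the real ggml type enums

static const int64_t QK_K = 256;           // k-quant super-block size

// Decide the quantization type for one tensor given its name, shape (nx x ny),
// and the k-quant type the user asked for.
static fake_type pick_quant_type(const std::string & name, int64_t nx, int64_t ny,
                                 fake_type requested_k_type) {
    const bool k_compatible = (nx % QK_K == 0) && (ny % QK_K == 0);
    if (!k_compatible &&
        (name == "tok_embeddings.weight" || name == "output.weight")) {
        // e.g. 4096 x 32001: n_vocab is not a multiple of 256/64, so fall back to
        // Q8_0 (32-wide blocks) for just these two tensors.
        fprintf(stderr, "warning: %s (%lld x %lld) is not k-quant compatible, using Q8_0 instead\n",
                name.c_str(), (long long) nx, (long long) ny);
        return TYPE_Q8_0;
    }
    return requested_k_type;               // all other layers keep the requested k-quant type
}
```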

What do you all think? @TheBloke @ikawrakow

LostRuins marked this pull request as ready for review on July 8, 2023 at 13:02
@JohannesGaessler
Collaborator

In #2001, @ikawrakow said:

The initial idea for handling this case was that we will simply pad tensor rows to be multiple of 256 and I initially started along these lines. But the change was turning much too big for my taste and was not just in isolated places related to the k-quants (as this PR), but was massively affecting the entire ggml. Hence, I eventually abandoned this approach.

Then there was the idea to use row-wise padding to 256 but get away without major changes to the code by simply using appropriate tensor views before operations that depend on the row size (e.g., rms norm). But that did not (easily) work because of the attention heads, which make it necessary to have the padding in the middle so as not to change the embeddings each head is seeing, so this becomes too complicated as well.

[...]

In retrospect, it would have been better to finish the ggml modifications necessary for general support of tensor sizes not divisible by the quantization block size. I think this should be the longer-term goal. This will make ggml future proof against somebody coming up with the idea that tensor sizes should be picked, say, from the Fibonacci sequence rather than the currently more common approach of using powers of 2.

So perhaps there are other changes to k-quants planned that would also fix this?
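For context, the row-padding idea described in the quote amounts to something like the following (a sketch with hypothetical names, not code from ggml):

```cpp
// Sketch of padding a tensor row up to the next multiple of the quantization block size.
// Hypothetical helper, shown only to illustrate the quoted (and abandoned) idea.
#include <algorithm>
#include <cstdint>
#include <vector>

static std::vector<float> pad_row_to_block(const float * row, int64_t n, int64_t block = 256) {
    const int64_t padded = ((n + block - 1) / block) * block; // round n up to a multiple of block
    std::vector<float> out(padded, 0.0f);                     // zero-fill the padding
    std::copy(row, row + n, out.begin());
    // As the quote explains, this alone is not enough: ops such as rms_norm would need
    // views of the original (unpadded) size, and multi-head attention would need the
    // padding distributed per head rather than appended at the end of the row.
    return out;
}
```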

@LostRuins
Collaborator Author

Hmm yeah, but this does not exclude K-quants eventually supporting arbitrary dimensions. I do think this can work as a decent stopgap solution, since it's both backwards and forwards compatible.

Backwards compatible, because all existing ggjtv3 clients already know the Q8_0 format and will be able to use it seamlessly (for just these 2 tensors) mixed in with the rest of the k-quantized tensors, no changes required.

Forwards compatible, because once K-quants do support tensors whose dimensions are not multiples of 256/64, these layers will simply no longer be marked as incompatible and converted.

So it's more of a fallback: instead of K-quants failing completely with an error, it emits a warning plus a slightly sub-optimal (but fully functional) mostly-K-quant output, with no new parts needed.
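The backwards-compatibility point comes down to the fact that the file format already records a type per tensor, so a loader simply dispatches on whatever type each tensor declares; schematically (illustrative only, not the real ggml loader):

```cpp
// Schematic per-tensor dispatch: each tensor carries its own type in the file,
// so Q8_0 tensors can be freely mixed with k-quant tensors in one model.
#include <cstdint>
#include <stdexcept>

enum tensor_type : uint32_t { T_F16, T_Q4_0, T_Q8_0, T_Q4_K, T_Q6_K };

static void dequantize_row(tensor_type t, const void * src, float * dst, int64_t n) {
    (void) src; (void) dst; (void) n;  // real code would call the per-type kernels here
    switch (t) {
        case T_Q8_0: /* dequantize Q8_0 row  */ break;
        case T_Q4_K: /* dequantize Q4_K row  */ break;
        case T_Q6_K: /* dequantize Q6_K row  */ break;
        case T_Q4_0: /* dequantize Q4_0 row  */ break;
        case T_F16:  /* convert fp16 -> fp32 */ break;
        default: throw std::runtime_error("unsupported tensor type");
    }
}
```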

@ikawrakow
Contributor

I think it is OK to quantize output.weight and tok_embeddings.weight with Q8_0 when the dimensions are not divisible by 256/64 for now. This will increase the size of a 7B model by ~62 MiB. At least for Meta LLaMA, tok_embeddings.weight could even be done with Q4_0, as more accurate quantization of this particular tensor has the least impact on generation quality (and if so, the model size increase would be only ~31 MiB).
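As a back-of-the-envelope check of the ~62 MiB figure (a rough sketch, assuming the two tensors would otherwise have been Q6_K at 6.5625 bits/weight versus Q8_0 at 8.5 bits/weight; that baseline is an assumption, not stated above):

```cpp
// Rough size arithmetic for the two affected tensors of a 7B LLaMA (n_vocab = 32000, n_embd = 4096).
// Assumes Q6_K blocks of 210 bytes / 256 weights and Q8_0 blocks of 34 bytes / 32 weights.
#include <cstdio>

int main() {
    const double n_weights = 32000.0 * 4096;            // weights in one of the two tensors
    const double q6k_bytes = n_weights / 256 * 210;     // ~102.5 MiB
    const double q80_bytes = n_weights /  32 *  34;     // ~132.8 MiB
    const double delta_mib = (q80_bytes - q6k_bytes) / (1024.0 * 1024.0);
    printf("per tensor: +%.1f MiB, both tensors: +%.1f MiB\n", delta_mib, 2 * delta_mib);
    // Prints roughly +30.3 MiB per tensor, ~60.5 MiB for both, in line with the ~62 MiB above.
    return 0;
}
```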

@TheBloke
Contributor

TheBloke commented Jul 8, 2023

Thank you, LostRuins!

As discussed on Discord, this change is excellent for me because it allows me to safely put out k-quants for non-32000 Llama models, without requiring any special work by the user to use them. They will work with whatever client/UI/library the user is using, and then they will be able to benefit from k-quants on these non-standard models with very minimal cost.

I can warn them that the file sizes are slightly larger than a normal k-quant, but because the increase is so small I think they will be very happy.

I tested the change and confirmed that a q4_K_M of a 13B model with a 32,001-token vocab (WizardLM 13B 1.1), quantized with your fix, inferred fine in base llama.cpp. The file was around 150 MB bigger than the same quant made with llama.cpp.

It's really great that llama.cpp is able to handle these alternative quantisations on a per-tensor basis. That's really flexible.

At least for Meta LLaMA, tok_embeddings.weight could be even done with Q4_0 as more accurate quantization of this particular tensor has the least impact on generation quality (and if so, the model size increase would be only ~31 MiB).

I think that would be fine too. Perhaps the best compromise.

@jxy
Contributor

jxy commented Jul 10, 2023

Unfortunately, Q8_0 is not implemented in Metal.

LostRuins and others added 2 commits July 10, 2023 22:57
Co-authored-by: Georgi Gerganov <[email protected]>
@TheBloke
Contributor

Ah, I didn't realise Metal couldn't do q8_0. Yes that sounds like a good idea then.

Thanks for working on this!

@LostRuins
Collaborator Author

As an alternative, to avoid failing due to the lack of Q8_0 support on Metal, quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16 instead. This results in a net size increase of about 55 MB for a 7B model compared to the previous approach, but should minimize the adverse impact on model quality. Thoughts?
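For what it's worth, the ~55 MB figure is consistent with simple bits-per-weight arithmetic (a rough sketch, assuming 7B LLaMA shapes of 32000 x 4096 per tensor, F16 = 16 bpw, Q8_0 = 8.5 bpw, Q4_0 = 4.5 bpw; these values are assumptions, not taken from the thread):

```cpp
// Rough check of the "about 55 MB" net change versus the previous (both-Q8_0) approach:
// output.weight grows (Q8_0 -> F16) while tok_embeddings.weight shrinks (Q8_0 -> Q4_0).
#include <cstdio>

int main() {
    const double n   = 32000.0 * 4096;                        // weights per tensor (7B LLaMA)
    const double mib = 1024.0 * 1024.0;
    const double out_delta = n * (16.0 - 8.5) / 8.0 / mib;    // output.weight:         ~ +117.2 MiB
    const double emb_delta = n * ( 4.5 - 8.5) / 8.0 / mib;    // tok_embeddings.weight: ~  -62.5 MiB
    printf("net change: %+.1f MiB\n", out_delta + emb_delta); // ~ +54.7 MiB
    return 0;
}
```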

@LostRuins
Collaborator Author

If no other issues, will merge in a few hours.
