
Possible solution to allow K-quants on models with n_vocab!=32000 #2148

Merged · 3 commits · Jul 11, 2023

Conversation

LostRuins
Collaborator

This allows LLaMA models that were previously incompatible with K-quants to function mostly as normal. The incompatibility happens when a model has n_vocab != 32000, e.g. 32001, which is not divisible by 256 or 64.

Since the problematic dimensions only affect tok_embeddings.weight and output.weight (dimensions 4096 x n_vocab), which are a comparatively small part of a LLaMA model, we can simply quantize these two tensors to Q8_0, while the majority of the hidden layers are still K-quanted since they have compatible dimensions.
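Roughly, the type selection during quantization would look like the sketch below (a simplified, hypothetical version; `pick_quant_type` and the type enum are illustrative stand-ins, not the actual llama.cpp code):

```cpp
// Simplified sketch of the fallback, not the actual PR diff.
#include <cstdint>
#include <cstdio>
#include <string>

enum fake_type { TYPE_Q4_K, TYPE_Q8_0 };   // illustrative stand-ins for the real ggml type enums

static const int64_t QK_K = 256;           // k-quant super-block size

// Decide the quantization type for one tensor given its name, shape (nx x ny),
// and the k-quant type the user asked for.
static fake_type pick_quant_type(const std::string & name, int64_t nx, int64_t ny,
                                 fake_type requested_k_type) {
    const bool k_compatible = (nx % QK_K == 0) && (ny % QK_K == 0);
    if (!k_compatible &&
        (name == "tok_embeddings.weight" || name == "output.weight")) {
        // e.g. 4096 x 32001: n_vocab is not a multiple of 256/64, so fall back to
        // Q8_0 (32-wide blocks) for just these two tensors.
        fprintf(stderr, "warning: %s (%lld x %lld) is not k-quant compatible, using Q8_0 instead\n",
                name.c_str(), (long long) nx, (long long) ny);
        return TYPE_Q8_0;
    }
    return requested_k_type;               // all other layers keep the requested k-quant type
}
```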

What do you all think? @TheBloke @ikawrakow

LostRuins marked this pull request as ready for review on July 8, 2023 at 13:02
@JohannesGaessler
Collaborator

In #2001, @ikawrakow said:

The initial idea for handling this case was that we will simply pad tensor rows to be multiple of 256 and I initially started along these lines. But the change was turning much too big for my taste and was not just in isolated places related to the k-quants (as this PR), but was massively affecting the entire ggml. Hence, I eventually abandoned this approach.

Then there was the idea to use row-wise padding to 256 but get away without major changes to the code by simply using appropriate tensor views before operations that depend on the row size (e.g., rms norm). But that did not (easily) work because of the attention heads, which make it necessary to have the padding in the middle so as not to change the embeddings each head is seeing, so this becomes too complicated as well.

[...]

In retrospect, it would have been better to finish the ggml modifications necessary for general support of tensor sizes not divisible by the quantization block size. I think this should be the longer-term goal. This will make ggml future proof against somebody coming up with the idea that tensor sizes should be picked, say, from the Fibonacci sequence rather than the currently more common approach of using powers of 2.

So perhaps there are other changes to k-quants planned that would also fix this?
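For context, the row-padding idea described in the quote amounts to something like the following (a sketch with hypothetical names, not code from ggml):

```cpp
// Sketch of padding a tensor row up to the next multiple of the quantization block size.
// Hypothetical helper, shown only to illustrate the quoted (and abandoned) idea.
#include <algorithm>
#include <cstdint>
#include <vector>

static std::vector<float> pad_row_to_block(const float * row, int64_t n, int64_t block = 256) {
    const int64_t padded = ((n + block - 1) / block) * block; // round n up to a multiple of block
    std::vector<float> out(padded, 0.0f);                     // zero-fill the padding
    std::copy(row, row + n, out.begin());
    // As the quote explains, this alone is not enough: ops such as rms_norm would need
    // views of the original (unpadded) size, and multi-head attention would need the
    // padding distributed per head rather than appended at the end of the row.
    return out;
}
```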

@LostRuins
Collaborator Author

Hmm yeah, but this does not exclude K-quants eventually supporting arbitrary dimensions. I do think this can work as a decent stopgap solution, since it's both backwards and forwards compatible.

Backwards compatible, because all existing ggjtv3 clients already know the Q8_0 format and will be able to use it seamlessly (for just these 2 tensors) mixed in with the rest of the k-quantized tensors, no changes required.

Forwards compatible, because once K-quants do support tensors whose dimensions are not multiples of 256/64, these layers will simply no longer be marked as incompatible and converted.

So it's more of a fallback: instead of K-quants failing completely with an error, it emits a warning plus a slightly sub-optimal (but fully functional) mostly-K-quant output, with no new parts needed.
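The backwards-compatibility point comes down to the fact that the file format already records a type per tensor, so a loader simply dispatches on whatever type each tensor declares; schematically (illustrative only, not the real ggml loader):

```cpp
// Schematic per-tensor dispatch: each tensor carries its own type in the file,
// so Q8_0 tensors can be freely mixed with k-quant tensors in one model.
#include <cstdint>
#include <stdexcept>

enum tensor_type : uint32_t { T_F16, T_Q4_0, T_Q8_0, T_Q4_K, T_Q6_K };

static void dequantize_row(tensor_type t, const void * src, float * dst, int64_t n) {
    (void) src; (void) dst; (void) n;  // real code would call the per-type kernels here
    switch (t) {
        case T_Q8_0: /* dequantize Q8_0 row  */ break;
        case T_Q4_K: /* dequantize Q4_K row  */ break;
        case T_Q6_K: /* dequantize Q6_K row  */ break;
        case T_Q4_0: /* dequantize Q4_0 row  */ break;
        case T_F16:  /* convert fp16 -> fp32 */ break;
        default: throw std::runtime_error("unsupported tensor type");
    }
}
```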

@ikawrakow
Contributor

I think it is OK to quantize output.weight and tok_embeddings.weight with Q8_0 when the dimensions are not divisible by 256/64 for now. This will increase the size of a 7B model by ~62 MiB. At least for Meta LLaMA, tok_embeddings.weight could even be done with Q4_0, as more accurate quantization of this particular tensor has the least impact on generation quality (and if so, the model size increase would be only ~31 MiB).
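As a back-of-the-envelope check of the ~62 MiB figure (a rough sketch, assuming the two tensors would otherwise have been Q6_K at 6.5625 bits/weight versus Q8_0 at 8.5 bits/weight; that baseline is an assumption, not stated above):

```cpp
// Rough size arithmetic for the two affected tensors of a 7B LLaMA (n_vocab = 32000, n_embd = 4096).
// Assumes Q6_K blocks of 210 bytes / 256 weights and Q8_0 blocks of 34 bytes / 32 weights.
#include <cstdio>

int main() {
    const double n_weights = 32000.0 * 4096;            // weights in one of the two tensors
    const double q6k_bytes = n_weights / 256 * 210;     // ~102.5 MiB
    const double q80_bytes = n_weights /  32 *  34;     // ~132.8 MiB
    const double delta_mib = (q80_bytes - q6k_bytes) / (1024.0 * 1024.0);
    printf("per tensor: +%.1f MiB, both tensors: +%.1f MiB\n", delta_mib, 2 * delta_mib);
    // Prints roughly +30.3 MiB per tensor, ~60.5 MiB for both, in line with the ~62 MiB above.
    return 0;
}
```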

@TheBloke
Contributor

TheBloke commented Jul 8, 2023

Thank you, LostRuins!

As discussed on Discord, this change is excellent for me because it allows me to safely put out k-quants for non-32000 Llama models, without requiring any special work by the user to use them. They will work with whatever client/UI/library the user is using, and then they will be able to benefit from k-quants on these non-standard models with very minimal cost.

I can warn them that the file sizes are slightly larger than a normal k-quant, but because the increase is so small I think they will be very happy.

I tested the change and confirmed that a q4_K_M of a 13B model with a 32,001-token vocab (WizardLM 13B 1.1), quantized with your fix, inferred fine in base llama.cpp. The file was around 150 MB bigger than the same quant made with llama.cpp.

It's really great that llama.cpp is able to handle these alternative quantisations on a per-tensor basis. That's really flexible.

At least for Meta LLaMA, tok_embeddings.weight could be even done with Q4_0 as more accurate quantization of this particular tensor has the least impact on generation quality (and if so, the model size increase would be only ~31 MiB).

I think that would be fine too. Perhaps the best compromise.

@jxy
Contributor

jxy commented Jul 10, 2023

Unfortunately, Q8_0 is not implemented in Metal.

LostRuins and others added 2 commits July 10, 2023 22:57
Co-authored-by: Georgi Gerganov <[email protected]>
@TheBloke
Contributor

Ah, I didn't realise Metal couldn't do q8_0. Yes that sounds like a good idea then.

Thanks for working on this!

@LostRuins
Collaborator Author

As an alternative, to avoid failing due to the lack of Q8_0 support on Metal, quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16 instead. This results in a net size increase of about 55 MB for a 7B model compared to the previous approach, but should minimize the adverse impact on model quality. Thoughts?
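For what it's worth, the ~55 MB figure is consistent with simple bits-per-weight arithmetic (a rough sketch, assuming 7B LLaMA shapes of 32000 x 4096 per tensor, F16 = 16 bpw, Q8_0 = 8.5 bpw, Q4_0 = 4.5 bpw; these values are assumptions, not taken from the thread):

```cpp
// Rough check of the "about 55 MB" net change versus the previous (both-Q8_0) approach:
// output.weight grows (Q8_0 -> F16) while tok_embeddings.weight shrinks (Q8_0 -> Q4_0).
#include <cstdio>

int main() {
    const double n   = 32000.0 * 4096;                        // weights per tensor (7B LLaMA)
    const double mib = 1024.0 * 1024.0;
    const double out_delta = n * (16.0 - 8.5) / 8.0 / mib;    // output.weight:         ~ +117.2 MiB
    const double emb_delta = n * ( 4.5 - 8.5) / 8.0 / mib;    // tok_embeddings.weight: ~  -62.5 MiB
    printf("net change: %+.1f MiB\n", out_delta + emb_delta); // ~ +54.7 MiB
    return 0;
}
```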

@LostRuins
Collaborator Author

If no other issues, will merge in a few hours.
