Something might be wrong with either llama.cpp or the Llama 3 GGUFs #6914

Closed
coder543 opened this issue Apr 25, 2024 · 14 comments

@coder543

coder543 commented Apr 25, 2024

Try this query: "What is 3333+777?"

Yes, yes, LLMs are bad at math. That's not what I'm getting at. Someone mentioned this on Reddit, and I have to agree that I'm seeing weird stuff too.

Let's get a baseline. Here is what meta.ai yields:

[screenshot: meta.ai gives the correct answer]

This is likely running on Llama 3 70B.

Here is what Groq yields:

[screenshot: Groq 70B gives the correct answer]

and at 8B:

[screenshot: Groq 8B gives the correct answer]

Now, here's where things get weird. Using Open WebUI on top of Ollama (which runs llama.cpp under the hood), let's run the Llama 3 GGUFs.

First, 8B at fp16:

[screenshot: wrong answer, with an extra digit added to the problem]

Then 8B at Q8_0:

[screenshot: same wrong answer]

Then 70B at Q4_0:

[screenshot: same wrong answer]

I think the problem should be clear. All of the non-llama.cpp instances that were not using GGUFs did the math problem correctly. All of the llama.cpp instances got the problem wrong in exactly the same way. This issue is extremely repeatable on both ends. I have never seen the cloud instances make this mistake, and I have never seen the llama.cpp instances not make this exact mistake of adding an extra digit to the problem and then getting it wrong.

To me, it appears that something is degrading the accuracy of Llama 3 when run under llama.cpp.

Any ideas of what's going wrong here?
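For anyone who wants to reproduce this outside of Open WebUI, here is a minimal sketch against Ollama's local REST API. It assumes Ollama is serving on the default localhost:11434 and that a Llama 3 tag backed by one of these GGUFs has been pulled; the exact tag name below is an assumption, so substitute whichever one you use.

# Minimal reproduction sketch against Ollama's local REST API.
# Assumes Ollama is running on the default port and `llama3:8b` has been pulled;
# the model tag is an assumption, substitute your own GGUF-backed tag.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3:8b",
    "prompt": "What is 3333+777?",
    "options": {"temperature": 0},  # greedy decoding to keep the repro deterministic
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])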

@Jeximo
Contributor

Jeximo commented Apr 25, 2024

I reproduced this with ./main:

./main -m ~/Meta-Llama-3-8B-Instruct-IQ3_M.gguf -s 7 -i --color --penalize-nl -e --temp 0 -r "<|eot_id|>" --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"

I think the result is related to the way the llama.cpp tokenizer chunks 3333+777, see here:

buffer: '3333+777?'

input tokens: [ ... '33':1644, '33':1644, '+':10, '777':15831, ... ]

It works as expected with 3,333 + 777 (comma added):
[screenshot: correct answer once the comma is added]

4333+777 (no comma) also worked as expected, so it's not a GGUF problem.
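As far as I can tell, the chunking comes from the pre-tokenizer: Llama 3 uses a GPT-4-style regex that splits runs of digits into groups of at most three before any BPE merging, and llama.cpp was not applying that split at the time. A minimal sketch of just the digit rule, using a simplified pattern for illustration rather than the full Llama 3 pre-tokenizer regex:

# Simplified illustration of the digit-chunking rule in the Llama 3 pre-tokenizer.
# The real pattern is much longer; only the \p{N}{1,3} behaviour matters here.
import regex  # third-party `regex` module, which supports \p{...} classes

pattern = regex.compile(r"\p{L}+|\p{N}{1,3}|[^\s\p{L}\p{N}]+|\s+")

print(pattern.findall("3333+777?"))    # ['333', '3', '+', '777', '?']
print(pattern.findall("3,333 + 777"))  # ['3', ',', '333', ' ', '+', ' ', '777']

The first output matches the '333', '3', '+', '777' split shown in the log from #6920 further down the thread; the second shows why adding the comma sidesteps the problem even without the fix.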

@LostRuins
Collaborator

likely related to BPE token merging behavior #6809

@itodorovic

Same on my side with a fresh Bartowski conversion (Q8_0 quant) from https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/tree/main:

[screenshot: same wrong answer]

@Dampfinchen

I did these tests as well and can confirm your findings. Exllama v2 also did better in my own test, which I mentioned in the discussion yesterday.

@Jeximo
Contributor

Jeximo commented Apr 26, 2024

Fixed by #6920. Log with the PR applied:

buffer: '3333+777?'

input tokens: [ ..., '333':8765, '3':18, '+':10, '777':15831, ...]

[screenshot: correct answer with the PR applied]

@LostRuins
Collaborator

@Jeximo some other tests you can try, taken from the JS library sample reference:

// Simple test case
test("grabbed",                           [59312, 2788])

// Naive implementation produces inconsistent tokenization for " grabbed" 
test(" grabbed",                          [30418])

// Naive implementation (in LLAMA 1 TOKENIZER) uses incorrect merge order for multiple consecutive space merges
test("           grabbed",                [1881, 30418])

// Linebreak and tab handling
test("\n",                                [198])
test(" \n",                               [720])
test("	tabs				out here",   [3324, 3518, 573, 14294, 1618])

// Equal prio merges are performed left-to-right 
test("ax\n####\nboo",                     [710, 198, 71050, 34093])

// UTF-8 multipoint character that should be found in vocabulary
test('镇',                                [104643])

// UTF-8 multipoint character that should NOT be found in vocabulary
test('🦙',                               [9468, 99, 247])

// Consecutive UTF-8 multipoint characters that are NOT found in the vocabulary and use different numbers of bytes
test('🦙Ꙋ',                             [9468, 99, 247, 166, 247, 232])
test('Ꙋ🦙',                             [166, 247, 232, 9468, 99, 247])

// Test input from official LLaMA 3 library
test('This is a test sentence.',         [2028, 374, 264, 1296, 11914, 13])
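A rough harness for cross-checking a few of these against the Hugging Face tokenizer for Llama 3. This is a sketch only: the meta-llama/Meta-Llama-3-8B-Instruct model id and access to that gated repo are assumptions, the expected ids are simply the ones quoted above, and the remaining cases can be pasted in the same way as long as whitespace is copied exactly.

# Rough cross-check of a few of the cases above against the Hugging Face
# tokenizer. Assumes access to the gated meta-llama/Meta-Llama-3-8B-Instruct
# repo; any copy of Llama 3's tokenizer.json should behave the same way.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

CASES = [
    ("grabbed", [59312, 2788]),
    (" grabbed", [30418]),
    ("\n", [198]),
    (" \n", [720]),
    ("镇", [104643]),
    ("🦙", [9468, 99, 247]),
    ("This is a test sentence.", [2028, 374, 264, 1296, 11914, 13]),
    # ... add the remaining cases from the list above, copying whitespace exactly
]

for text, expected in CASES:
    got = tok.encode(text, add_special_tokens=False)  # raw BPE ids, no BOS token
    print("ok  " if got == expected else "FAIL", repr(text), expected, got)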

@belladoreai

belladoreai commented Apr 27, 2024

Note regarding the test cases above: I wrote those test cases for LLaMA 1 tokenization and updated the "expected results" for LLaMA 3 (with the exception of the last case, which is copied from the official repo). So those tests are definitely better than nothing, but they are not "optimal test cases" for uncovering potential bugs in LLaMA 3 tokenization. It would be good to find some adversarial inputs specific to LLaMA 3 tokenization and update the tests accordingly.

@nkeilar

nkeilar commented Apr 30, 2024

I just tested the current llama3 models, and despite the changes to support Llama 3 in the latest Ollama release, the official llama3 model that is uploaded still gives the wrong result:

What is 3333 + 777? 

Let me calculate that for you!

33,333 + 777 = 34,110

This is with llama3:70b-instruct-q4_K_M (5338d7c58d8d).

@coder543
Author

@nkeilar what is “the official one”? If you mean ollama… ollama isn’t official, and also they won’t have regenerated their GGUFs yet. Wherever you’re looking, when was the file generated? The changes made require the GGUFs to be generated again. It isn’t just a change to llama.cpp.

I haven’t tried to test the changes, but the discussion around the changes sounded convincing to me.
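One way to tell whether a given GGUF has already been regenerated since the fix is to look for the pre-tokenizer metadata that the updated converter writes. A sketch using llama.cpp's gguf Python package; the key name "tokenizer.ggml.pre" is my understanding of what #6920 adds, so treat it as an assumption and check the PR if it doesn't match:

# Sketch: check whether a GGUF file carries the pre-tokenizer metadata written
# by the updated converter. Assumes `pip install gguf` (llama.cpp's gguf-py)
# and that the relevant key is "tokenizer.ggml.pre"; both are assumptions here.
import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])  # path to the .gguf file
if "tokenizer.ggml.pre" in reader.fields:
    print("pre-tokenizer metadata present: file was converted after the fix")
else:
    print("no pre-tokenizer metadata: file likely needs to be re-converted")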

@coder543
Author

Someone here apparently had good luck with a new GGUF of the 8B model.

@Jeximo
Contributor

Jeximo commented Apr 30, 2024

llama.cpp + 8B IQ3_M:

[screenshot: correct answer]

@coder543
Author

@Jeximo is that a 3-bit quantization? 🤔

regardless, looks correct.

@ddh0
Contributor

ddh0 commented Apr 30, 2024

After pulling the latest llama.cpp repo and the latest Llama 3 8B Instruct HF repo, then converting and quantizing, I can confirm that the issue is indeed fixed:
[screenshot: Llama 3 working]

@Galunid
Collaborator

Galunid commented May 11, 2024

fixed in #6920

@Galunid Galunid closed this as completed May 11, 2024