Bad results for LLaMA #443
Comments
@juletx Hi, I have a similar issue. I ran several tasks and got the following results. Do you have any solutions?
No, I don't have solutions.
I can look into this! For some tasks this may not be "fixable", in the sense that we don't know exactly what the LLaMA team did to evaluate, but for others like LAMBADA this is very much not expected.
Yes, I agree. We can't expect exactly the same results because the LLaMA prompts are not published. However, tasks where the accuracy is 0 indicate that there might be a problem. LAMBADA is a clear example, but there are more, such as the math tasks and some QA tasks.
One source of inconsistency is special token handling in the harness. LLaMA models are trained with a BOS token, so you probably want to encode with it to give the model a "fair" shot. See the feature TODO in lm-evaluation-harness/lm_eval/models/huggingface.py, lines 147 to 155 at 602abce.
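As a quick way to check this on a given checkpoint, here is a minimal sketch (not the harness's code; the checkpoint name is an illustrative assumption) showing whether the tokenizer prepends BOS and how to control it explicitly:

```python
# Minimal sketch, not the harness's code: see whether a converted LLaMA
# tokenizer prepends BOS, and how to request/suppress it explicitly.
# The checkpoint name is an assumption for illustration only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

text = "The quick brown fox"
with_special = tok.encode(text, add_special_tokens=True)    # should start with tok.bos_token_id
without_special = tok.encode(text, add_special_tokens=False)

print("BOS token:", tok.bos_token, tok.bos_token_id)
print("with special tokens:   ", with_special[:5])
print("without special tokens:", without_special[:5])
```

If the two encodings come out identical, the checkpoint's tokenizer config has probably lost its special tokens, which would fit the symptoms in this thread.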
Another possibility worth keeping in mind is that the LLaMA implementation in HF could be bugged. I'm not sure how well tested it is against the original codebase, but it's not an official implementation and (for licensing reasons) had to be written without reference to the original implementation.
For reference, I ran HellaSwag and PiQA on lit-llama (https://github.com/Lightning-AI/lit-llama) and got comparable numbers. This is an independent nanoGPT-based reimplementation of LLaMA, so the results are confirmed (slightly higher for lit-llama, but that's within uncertainty). Evaluation for
It was recently pointed out on Twitter that in the allegedly zero-shot examples they "provide a textual description of the task and a test example." I am comfortable assuming that this explains the discrepancy.
Probably related to tokenizer issues, solved by specifying the special tokens: #442
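For anyone landing here, a hedged sketch of what explicitly specifying the special tokens could look like; the token strings are the usual SentencePiece defaults and the checkpoint name is an assumption, not taken from #442:

```python
# Sketch only: register special tokens on a LLaMA tokenizer whose converted
# config is missing them. Token strings are the common SentencePiece defaults,
# assumed here rather than copied from #442.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # illustrative checkpoint
tok.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
})
print(tok.bos_token_id, tok.eos_token_id, tok.unk_token_id)
```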
@upunaprosk If correcting the tokenizer solves the problem, it seems like this issue should be opened on the HF transformers repo instead of this one. We are loading the model the way we are told to; it's just that the transformers library doesn't know how to load the model. Can you share your evaluation results with this correction?
Closing, since the tokenizer fixes seem to resolve most of the wildly off results. The others, like TriviaQA, have also required some minor modifications to the tasks.
I have evaluated LLaMA (7B, 13B and 30B) on most of the tasks available in this library, and the results are bad for some tasks. I will give some examples with the 7B model. I haven't checked all the results yet; I'm putting them here so that we can fix the problems we find. I can share more results and configs if you need more information.
This is the script I used for evaluation.
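Roughly, an equivalent run through the harness's Python API looks like the sketch below; the backend name, checkpoint path and task list are illustrative assumptions rather than the exact configuration used here:

```python
# Rough sketch of an evaluation run via the harness's Python API.
# Backend name, checkpoint and task list are illustrative assumptions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                            # HF causal-LM backend
    model_args="pretrained=huggyllama/llama-7b",  # assumed checkpoint path
    tasks=["hellaswag", "piqa", "arc_easy", "lambada_openai"],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)
print(results["results"])
```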
Common Sense Reasoning
Results are similar to the paper, generally a bit lower. This is expected because of the differences in prompts. Some exceptions are ARC and OpenBookQA, where the results are much lower.
Mathematical Reasoning
Very low accuracies are obtained, 0 in some cases. GSM8K and MATH results are much lower than in the paper.
Reading Comprehension
RACE results are much lower than in the paper.
Question Answering
0 accuracy for TriviaQA and webqs
LAMBADA
LAMBADA does not work properly; accuracy is 0.
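For context on why tokenizer problems can zero this out: LAMBADA accuracy in the harness counts an example as correct only if greedy decoding reproduces the final word's tokens exactly, so a missing BOS or a shifted word boundary fails every example. Below is a hedged sketch of that check (not the harness's implementation; the checkpoint name is an assumption):

```python
# Hedged sketch of a LAMBADA-style last-word accuracy check for a causal LM.
# Not the harness's implementation; the checkpoint name is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def last_word_correct(context: str, target: str) -> bool:
    ctx = tok(context, return_tensors="pt").input_ids      # includes BOS if the config has one
    tgt = tok(target, add_special_tokens=False).input_ids  # continuation tokens only
    inp = torch.cat([ctx, torch.tensor([tgt])], dim=1)
    with torch.no_grad():
        logits = model(inp).logits
    # Logits at position i predict token i + 1, so this slice predicts the target tokens.
    preds = logits[0, ctx.shape[1] - 1 : -1].argmax(dim=-1).tolist()
    return preds == tgt  # correct only if every target token is the greedy choice
```

With a broken special-token config, the target is tokenized differently than during training, so this match essentially never succeeds, which would explain an accuracy of 0.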
Arithmetic
Another task that returns 0 accuracy.
BLIMP
Human alignment
ETHICS, ToxiGen and CrowS-Pairs
MMLU
MMLU results seem to be ok.