Tried 7B and 13B models, can't get any decent result from inference #69
Comments
Yes, I'm getting some pretty erratic responses on the 7B. But considering it is so much smaller than GPT 3 or the 65B, that's to be expected. |
Don't forget, it has not been trained to be a chatbot. All it knows how to do is predict the next word in the sequence. ChatGPT also has a lot of hidden prompts that you don't see, with examples of how it should behave. So to get it to answer questions, first try giving it examples of questions and answers, e.g.
Maybe that will work better. IDK. I haven't tried it myself. Because I haven't been accepted 😭😭😭 |
Human> QUESTION: What colour is the sky? ANSWER: It is commonly thought to be blue. QUESTION: Who first landed on the moon? ANSWER: A good question, this is known to be Neil Armstrong. QUESTION: What is the captial of France? |
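A minimal sketch of how a few-shot prompt like the one above could be assembled programmatically. The helper name `build_qa_prompt` is hypothetical, and ending the prompt with an open `ANSWER:` label is an assumption (the prompt above stops at the final question); no model is called here.

```python
# Hypothetical helper: builds a QUESTION/ANSWER few-shot prompt as plain text.
EXAMPLES = [
    ("What colour is the sky?", "It is commonly thought to be blue."),
    ("Who first landed on the moon?",
     "A good question, this is known to be Neil Armstrong."),
]


def build_qa_prompt(examples, question):
    """Join solved QUESTION/ANSWER pairs, then leave the last ANSWER open."""
    parts = [f"QUESTION: {q} ANSWER: {a}" for q, a in examples]
    # Assumption: an open "ANSWER:" nudges the model to continue with an answer.
    parts.append(f"QUESTION: {question} ANSWER:")
    return " ".join(parts)


prompt = build_qa_prompt(EXAMPLES, "What is the capital of France?")
print(prompt)
```

The resulting string would be passed to the model as the text to continue; whether this improves the 7B or 13B output is not something the thread establishes.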
QUESTION: What colour is the sky? These are the default settings, and this is really bad! |
Thank you ChatGPT |
Seems good to me? Just take the substring before it gets to the next "QUESTION:" and it's a working chatbot... almost. 😁 Ooh - just saw it said Russia is the largest island 😂😂😂 OK. You are right, it's not very good!!!! Well, perhaps it saw my bad grammar and decided to imitate someone who is not very clever. How about if you did it with longer words: QUESTION: What is the capital of France? Or perhaps add this to the top of the prompt: |
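The truncation idea from the start of this comment (keep only the text before the next "QUESTION:") could be sketched as follows. The function name `cut_at_stop` and the sample completion are made up for illustration; the sample is not real model output.

```python
def cut_at_stop(completion: str, stop: str = "QUESTION:") -> str:
    """Return the text before the first `stop` marker (everything if absent)."""
    head, _sep, _tail = completion.partition(stop)
    return head.strip()


# Made-up completion standing in for whatever the model generated.
sample = " Paris. QUESTION: What is the largest island? ANSWER: Russia."
print(cut_at_stop(sample))
```

Using `str.partition` means a completion that never reaches the marker is returned unchanged (apart from surrounding whitespace).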
Y'all really got accepted as machine learning researchers? Smh. First, let's stop comparing this to ChatGPT right now, unless y'all are talking about your 6-GPU cluster running the 65B version. GPT-3 is 175B and we know ChatGPT is a new model of unknown size. If you're hoping a 7B model can compete at its level, you can pack up and go home now. Now, apparently the 65B licks it in every metric, which is like, okay, wow. But y'all gamer GPU kiddies need to go home. Secondly, this model is not a chatbot. ChatGPT is; well, it's in the name tho, innit? This is a standard language model, like GPT-3 was. Its function is basically a glorified autofill. If you prompt it sufficiently, it will act like a chatbot (which is a little of what ChatGPT is doing behind the scenes), but you guys coming in like boomers talking to it, as if it takes instructions like your old mate, are actually embarrassing. It can do way more than ChatGPT can in this format, but it's also more awkward to deal with. That's how it works, go finetune about it. |
I know it is not an instruct model or a model trained with RLHF, but even a "glorified autofill" can do better than this. By the way, personal attacks and derogatory language are really not helpful in fostering productive dialogue. Anyway, are you telling me this is a good output? Look at the benchmarks: this doesn't make sense for a 13B model with these scores. You're right, we should fine-tune it, but for now I feel GPT-NeoX-20B is better, and it's not as if fine-tuning were free.
|
@allaccs @jwnsu Thanks for the feedback :) Since the model has not been finetuned on instructions, it is not unexpected that it would perform poorly when prompted with instructions. Without finetuning, these models are very sensitive to the prompts. Modifying your prompts slightly yields better results:
Here is what I get with the 7B model (prompts are in bold):
I believe the meaning of life is to find happiness and be satisfied with what you have.
==================================
Simply put, the theory of relativity states that 1) there is no absolute time or space and 2) the speed of light in a vacuum is the fastest speed possible. There are two key principles in relativity:
==================================
Building a website can be done in 10 simple steps:
We've added these suggestions to the FAQ: https://github.com/facebookresearch/llama/blob/main/FAQ.md#2-generations-are-bad and adapted example.py accordingly. |
@timlacroix Thanks for the FAQ update. Changing the prompt formulation did give much better outputs. I still find that few-shot prompting isn't giving very good results, so I will try to fine-tune the model. Are there any recommendations for the data-set formatting? (I don't know if I should open another issue for this.) |
By "data-set" formatting, do you mean the right format for few-shot prompting? I think it's best to find a format that clearly separates the prompt you're giving from text that could be found on the internet "in the wild". For instance, I found that the "tweet sentiment" example, with "###" as separators, worked much better than the "translation" example, which has no separators between the examples in the few-shot prompt. Hopefully that helps? |
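As a rough illustration of the separator idea, here is a sketch that joins few-shot examples with "###" lines. The `Tweet:`/`Sentiment:` labels, the helper name `build_sentiment_prompt`, and the tweets themselves are assumptions made up for this example; only the "###" separator comes from the comment above.

```python
SEPARATOR = "###"


def build_sentiment_prompt(examples, new_tweet):
    """Join labelled examples with '###' lines and leave the last label open."""
    blocks = [f'Tweet: "{tweet}"\nSentiment: {label}' for tweet, label in examples]
    # The final block has no label, so the model is expected to fill it in.
    blocks.append(f'Tweet: "{new_tweet}"\nSentiment:')
    return f"\n{SEPARATOR}\n".join(blocks)


# Made-up examples, for illustration only.
shots = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been great.", "Positive"),
]
sentiment_prompt = build_sentiment_prompt(shots, "The new update is amazing.")
print(sentiment_prompt)
```

Because the separator sits on its own line, the same string can also serve as a stop marker when trimming the generated text.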
No, I mean actual fine-tuning; few-shot prompting seems to have its limits. For GPT-NeoX I have a 2 GB JSONL file with many documents, each document with a prompt/answer structure and tags such as "<|endoftext|>". For LLaMA I can't seem to find much info. |
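For reference, the kind of prompt/answer JSONL file described above might be produced along these lines. The `text` field name, the `QUESTION:`/`ANSWER:` labels, and the helper `write_jsonl` are assumptions for illustration; nothing in this thread says that LLaMA fine-tuning expects this layout.

```python
import io
import json

EOT = "<|endoftext|>"  # end-of-document tag mentioned in the comment above


def write_jsonl(pairs, out):
    """Write one JSON object per line, each holding a prompt/answer document."""
    for prompt_text, answer_text in pairs:
        # Assumption: a single "text" field holding prompt, answer, and end tag.
        record = {"text": f"{prompt_text}\n{answer_text}{EOT}"}
        out.write(json.dumps(record) + "\n")


# Made-up documents, for illustration only.
pairs = [
    ("QUESTION: What is the capital of France?", "ANSWER: Paris."),
    ("QUESTION: What colour is the sky?", "ANSWER: Blue."),
]
buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
write_jsonl(pairs, buffer)
jsonl_lines = buffer.getvalue().splitlines()
print(jsonl_lines[0])
```

`json.dumps` escapes the newline inside each document, so every record stays on a single line, which is what makes the file valid JSONL.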
@allaccs I think the main reason why LLaMA behaves so unexpectedly is that no Reinforcement Learning from Human Feedback (RLHF) has been done so far. This could be done using the following repo, I suppose... However, I'm still trying to figure out how to really get the training going 😅 also see: I think the key to quality is lots and lots of RLHF hours, and I think OpenAI paid about 35 experts for about half a year to throw question-answer pairs at ChatGPT before releasing it to the public. Since these big language models currently just predict web tokens, which could hardly look more random, at least to the average human, you'll get lots of disappointing results at the moment... This is because LLaMA is simply not trained on conversational prompts. You can read the paper that explains how ChatGPT's predecessor was built. I think the key to getting ChatGPT-like quality is some crowd-sourced effort to train LLaMA, and I suppose we could get even better quality with that than ChatGPT.... Just imagine the 65B-parameter model fine-tuned by thousands of random people instead of a meager 30-ish, probably more geeky, people 😅 |
Model benchmarks (ARC, HellaSwag, MMLU, etc.) show that even Falcon 40B (which seems to be the best at the moment) is nowhere near chatgpt-3.5-turbo: it scores 63/100, while chatgpt-3.5-turbo scores 84.5/100. That's probably why. It also tells me that in AI research the most important thing seems to be training data quality. |
3 months late, mate: this issue was posted shortly after LLaMA came out. I figured out (as did others here, for sure) that LLaMA needed fine-tuning and RLHF to reach GPT-level or better answers to instructions. Also, those benchmarks are not the most useful or precise; you can clearly train an expert on LLaMA that would beat GPT in some domain.
On Fri, Jul 7, 2023, 1:59 AM Gediz GÜRSU wrote:
Model benchmarks (ARC, HellaSwag, MMLU, etc.) show that even Falcon 40B (which seems to be the best at the moment) is nowhere near chatgpt-3.5-turbo: it scores 63/100, while chatgpt-3.5-turbo scores 84.5/100. That's probably why?
|
Closing this issue as solved: the behavior is explained by the lack of RLHF in the original models. For future reference, check both the llama and llama-recipes repos for getting-started guides. |
Well, both models seem to be unable to follow any instruction, answer any question, or even continue text. Do we need to fine-tune them or add more functions in order to get decent results?
A few examples; everything is like this:
How weird is that?
Ten easy steps to build a website...