What are the eos_token_id and bos_token_id #279
Comments
Same question: does fine-tuning need the same configuration?
Same question. I fine-tuned an alpaca-lora using the author's code and found that it will generate a
This is a huge issue. The Transformers head now fixes it but broke backward compatibility. You can use
Everyone needs to check out
For reference, the following is the token mapping generated by the transformers git head when converting the llama weights:
If the model you downloaded or are referencing has a tokenizer that does not match the above, don't use it; just throw it away.
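A minimal way to check this yourself, assuming a checkpoint converted with a current transformers (yahma/llama-7b-hf is only an example name taken from this thread):

from transformers import LlamaTokenizer

# Load the tokenizer and print its special tokens and ids. Per the discussion
# later in this thread, a correctly converted llama tokenizer should report
# <unk> = 0, <s> (bos) = 1, </s> (eos) = 2.
tokenizer = LlamaTokenizer.from_pretrained("yahma/llama-7b-hf")
print(tokenizer.unk_token, tokenizer.unk_token_id)  # expected: <unk> 0
print(tokenizer.bos_token, tokenizer.bos_token_id)  # expected: <s> 1
print(tokenizer.eos_token, tokenizer.eos_token_id)  # expected: </s> 2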
@diegomontoya Thanks for your prompt reply, which addresses my confusion. I have another question: according to finetune.py, each training sequence is appended with an EOS token during preprocessing, so models trained on this data should tend to generate sentences ending with [EOS]. However, when I use the checkpoint provided in this repo to generate, the generated sentences end with [EOS][BOS] instead of a single [EOS]. Is that normal?
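For context, the EOS-appending step being referred to looks roughly like the sketch below (paraphrased, not copied verbatim from finetune.py; check the repo for the exact code):

def tokenize(tokenizer, prompt, cutoff_len=256, add_eos_token=True):
    # Tokenize without padding; truncate at cutoff_len.
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=cutoff_len,
        padding=False,
        return_tensors=None,
    )
    # Append the EOS token id unless the sequence was truncated or already ends with EOS.
    if (
        add_eos_token
        and result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < cutoff_len
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    result["labels"] = result["input_ids"].copy()
    return result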
For everyone's convenience, I've uploaded llama models converted with the latest transformers git head here: 7B - https://huggingface.co/yahma/llama-7b-hf
Is decapoda aware? They might be willing to update their models.
For the 13B model, the decapoda-research upload is 38 GB while the model here is about 26 GB. Could you tell me what the difference between them is, please?
Interesting observation.
Also got the same result after fine-tuning on my end. Has anybody found a workaround?
For me, doing:
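A hedged sketch of the kind of token-id override discussed in this thread, mirroring the generate.py lines quoted at the bottom of the issue; the checkpoint name is only an example:

from transformers import LlamaForCausalLM, LlamaTokenizer

base = "yahma/llama-7b-hf"  # example checkpoint; substitute your own
tokenizer = LlamaTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(base)

# Pin the special token ids explicitly so padding stays distinct from EOS.
model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
model.config.bos_token_id = 1
model.config.eos_token_id = 2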
Thank you very much! This solved a very annoying inference bug related to the padding token of the tokenizer that would sometimes show up. If I changed the padding token, it would just show up in another batch after a while. For people who might land on this page via Google, this is the error I used to (only sometimes) get:
Thanks to your uploaded models, the issue somehow got fixed!
Any updates on this? Is everything good now? Can we fix old models by changing the tokenizer config, or not?
@teknium1 You need to retrain on the fixed/updated base HF models. Anything trained using the old transformers code on the decapoda models is bound to break. You can hack your way around the differing token ids, but I wouldn't recommend it.
Does this mean that I have to download a new llama-hf model and retrain, or can I just use the old one with the newest transformers code and LlamaTokenizer?
I think it means either train on a llama model that was converted to HF format recently, or do the conversion yourself with the latest transformers. Unfortunately, the best fine-tuned models right now are all based on the old format. The only thing I can do at the moment is revert to an older transformers commit to resolve it.
Hi @gururise, is it possible to upload llama-30B and llama-65B as well? Thanks!
I would like to report that all of Neko's tokenizers are current and match https://huggingface.co/oobabooga/llama-tokenizer. Also, if you want me to update anything in the future, just bug me here or on Neko.
@USBhost Your contributions are appreciated!
Unfortunately, unlike the decapoda-research/llama-7b-hf model, the new yahma/llama-7b-hf does not load in a free Google Colab notebook (using a Tesla T4 GPU). It just aborts with "^C" during the "loading checkpoint shards" stage (which can be demonstrated using test.py from #364). I suspect that it runs out of RAM because of the shard size and the Python process gets killed. Would it be possible for you to (re)publish this model split into several smaller shards (or is there some simple procedure to split it after downloading)?
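One possible way to split a checkpoint after downloading, sketched here under the assumption that you can load the model once on a machine with enough memory, is to re-save it with a smaller max_shard_size:

from transformers import LlamaForCausalLM, LlamaTokenizer

name = "yahma/llama-7b-hf"  # example; any HF llama checkpoint
model = LlamaForCausalLM.from_pretrained(name, torch_dtype="auto")
model.save_pretrained("llama-7b-hf-resharded", max_shard_size="500MB")

# Save the tokenizer alongside it so the directory is self-contained.
LlamaTokenizer.from_pretrained(name).save_pretrained("llama-7b-hf-resharded")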
Try Neko's or Elinas' repos.
elinas/llama-7b-hf-transformers-4.29 and Neko-Institute-of-Science/LLaMA-7B-HF both suffer from the same problem. They also both use the same two-big-shards configuration, which confirms my suspicion that it is the cause (I can also see the RAM peaking and the process aborting when the 12.68 GB limit is hit; I'm talking about system RAM, not GPU RAM here). So to sum up, it would be nice to have a test configuration that can execute in the free Google Colab notebook - which I know is technically possible, because decapoda-research/llama-7b-hf can be trained there (although the training produces wrong results). (I also tried Kaggle, but there it fails because of the 20 GB disk space limit.)
I uploaded jploski/llama-7b-hf, which allows just this. It uses 34 checkpoint shards but is otherwise identical to yahma/llama-7b-hf. (And the results of test.py from #364 are OK when the final LoRA weights from it are fed to generate.py.)
Do we have alpaca-lora weights based on these new models?
Hi @gururise. Thanks for sharing the model! I guess these two lora weights are based on the new llama models, am I right? 7B - https://huggingface.co/yahma/alpaca-7b-lora
Yes, they are both based on the new llama models.
I used alpaca-lora to fine-tune on top of openlm-research's open llama model. Now I'm getting lots of unk tokens. Can someone please help me understand what actually changed in the tokens? Which token ids changed, and which are "correct"? And if anyone knows whether openlm's model uses the "correct" tokenizer, that would also help me a tonne. Appreciated.
There is still something wrong. I replaced decapoda-research/llama-7b-hf with yahma/llama-7b-hf, and I found that its tokenizer has no pad_token or pad_token_id. Its special tokens are as follows: <unk> 0, <bos> 1, <eos> 2. So what, exactly, are the special tokens and their ids in the original llama? Am I misunderstanding something?
You should use huggyllama
@nevercast I think your issue stems from setting the pad token equal to the unk token, which leads to generating unk tokens more frequently if the fine-tuning hasn't been done properly. Can someone explain why this method was chosen despite HF staff using pad token = eos token in various places? Is there any empirical validation behind this?
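For concreteness, these are the two conventions being compared; this is only a sketch of the options, not a recommendation from the thread:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("yahma/llama-7b-hf")  # example checkpoint

# Option A (what this repo's finetune.py does): pad with the unk id (0),
# keeping the pad token distinct from eos so eos still gets learned.
tokenizer.pad_token_id = 0

# Option B (seen in various HF examples): reuse eos as the pad token.
# tokenizer.pad_token = tokenizer.eos_token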
https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/generation_config.json
@louisoutin Oh, I wish I could have seen your post earlier. I spent the past several days working around this and found a similar solution to yours. Any other unexpected side effects?
Not sure if the decapoda llama 7B was trained with (pad=0, bos=1, eos=2).
Were you able to find a solution for this?
You can use https://huggingface.co/jploski/llama-7b-hf instead of yahma/llama-7b-hf.
I've just tried https://huggingface.co/jploski/llama-7b-hf with the following code:
from open_flamingo import create_model_and_transforms
but it didn't work.
In generate.py, bos_token_id=1 and eos_token_id=2:
model.config.bos_token_id = 1
model.config.eos_token_id = 2
However, in finetune.py, the tokenizer is loaded directly from the official llama checkpoint, where bos_token_id=0 and eos_token_id=0.
How should I understand this discrepancy? Thank you!
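One way to see the mismatch for yourself is to print the model config's special token ids next to the tokenizer's; a minimal sketch, using the decapoda checkpoint named in this thread as an example:

from transformers import AutoConfig, LlamaTokenizer

name = "decapoda-research/llama-7b-hf"  # example from this thread
config = AutoConfig.from_pretrained(name)
tokenizer = LlamaTokenizer.from_pretrained(name)

# If these disagree, generation can end (or fail to end) on the wrong token id.
print("config:    bos =", config.bos_token_id, " eos =", config.eos_token_id)
print("tokenizer: bos =", tokenizer.bos_token_id, " eos =", tokenizer.eos_token_id)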