Add Huggingface tokenizer support #189
Conversation
Thank you so much for this work @DOGEwbx. I've been waiting for DeepSeek Coder support for a while. This is very helpful.
Is creating EXL2 quants with this also possible?
@CyberTimon Thanks for your interest in our work. I haven't run tests on EXL2 quant files, but since all the modifications are on the tokenizer side, I don't think there will be problems with that specific data format.
At the very least, this is going to take some time to review. Transformers is a massive dependency to include just to support one model (Falcon still wouldn't work, as there are other architectural differences). As for remote code, my guess would be that 90% of users are unaware of the risks involved, so it should at least be opt-in. I'll need a moment to think about it, test that this doesn't break functionality like constrained sampling, and make sure there really isn't a better way.
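For context, the remote-code risk mentioned above comes from Transformers executing model-repo Python when `trust_remote_code=True` is passed. A minimal sketch of what an opt-in could look like on the loading side; the `allow_remote_code` flag name is hypothetical and not part of this PR:

```python
# Sketch of an opt-in guard around Transformers' remote-code execution.
# `allow_remote_code` is a hypothetical caller-supplied flag, off by default.
from transformers import AutoTokenizer

def load_hf_tokenizer(model_dir: str, allow_remote_code: bool = False):
    # trust_remote_code=True lets Transformers run Python shipped with the
    # model repo, so it should never be enabled silently on the user's behalf.
    return AutoTokenizer.from_pretrained(model_dir, trust_remote_code=allow_remote_code)
```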
Thanks for your reply.
Is there any specific way to use the fork? With
6.7B shows very similar behaviour, but most of the time it results in an invisible output loop in the chat example. I get the same behaviour no matter what prompt format I use (I also tested the DeepSeek instruct format). Maybe I am just doing something wrong; I'd appreciate any help.
@SinanAkkoyun The model seems to use a linear RoPE scaling factor of 4. I've been able to get coherent output out of the 1.3B model at least, using that.

@DOGEwbx The Tokenizers library seems like a more reasonable dependency, especially if it's optional. It largely mirrors Transformers, so it should be possible to adapt it to the code in this PR. There are still a few things I need to sort out and verify, like how control symbols are encoded, optional BOS/EOS tokens, that the vocabulary is preprocessed correctly, how UTF-8 characters are emitted and so on. I'll get to that in a few hours. It's definitely not a trivial issue; I see over on the llama.cpp repo a whole bunch of people have been working on it for some weeks now. As for remote code, the issue is that with the option enabled,
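For anyone else hitting the incoherent-output issue: the linear RoPE scaling factor mentioned above is declared in the model's config.json under `rope_scaling` in the standard Hugging Face layout. A minimal sketch of reading it so a loader can apply it instead of hard-coding a value; which loader attribute you then set it on depends on the ExLlama version you run:

```python
# Read the RoPE scaling declared by the model, assuming the standard
# Hugging Face config.json layout ({"rope_scaling": {"type": ..., "factor": ...}}).
import json, os

def read_linear_rope_factor(model_dir: str) -> float:
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        cfg = json.load(f)
    scaling = cfg.get("rope_scaling") or {}
    if scaling.get("type") == "linear":
        return float(scaling.get("factor", 1.0))
    return 1.0  # no scaling declared
```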
@turboderp Thank you, 6.7B is working coherently :)
@turboderp However, I can't seem to get 1.3B to output coherent responses. What params did you use? EXL2 GPTQ:
Or is this just due to 4-bit quantization? The bf16 model responds with great answers for its 1.3B size.
I think maybe you're just asking too much of a tiny model. And quantization is known to affect smaller models more severely anyway. Remember you can also just run the FP16 version to compare.
There. I rewrote it to use the Tokenizers library instead, as an optional dependency, and it seems to run okay now. It seems to consistently encode and decode the same as a HF AutoTokenizer, and encoding works correctly during quantization as well. I also added a workaround for the Tokenizers bug where some added tokens would decode incorrectly. I still need to test it with some of the other models that lack a SentencePiece tokenizer model.
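A quick way to sanity-check the "encodes and decodes the same as a HF AutoTokenizer" claim locally is a round-trip comparison between the two libraries. A minimal sketch, assuming the model directory contains a tokenizer.json; the path and test strings below are placeholders, not part of this PR:

```python
# Compare the standalone Tokenizers library against Transformers' AutoTokenizer
# on the same inputs, to catch mismatches in added tokens, UTF-8 handling, etc.
from tokenizers import Tokenizer
from transformers import AutoTokenizer

model_dir = "models/deepseek-coder-6.7b"  # placeholder path

fast_tok = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
hf_tok = AutoTokenizer.from_pretrained(model_dir)

for text in ["def fizzbuzz(n):", "print('héllo – 世界')", "<|EOT|> trailing"]:
    ids_fast = fast_tok.encode(text, add_special_tokens=False).ids
    ids_hf = hf_tok.encode(text, add_special_tokens=False)
    assert ids_fast == ids_hf, (text, ids_fast, ids_hf)
    assert fast_tok.decode(ids_fast, skip_special_tokens=False) == \
           hf_tok.decode(ids_hf, skip_special_tokens=False), text
```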
Thank you |
Yes, that's what puzzled me: the FP16 model ran perfectly fine and conquered most basic coding tasks easily.
Thank you so much! |
Add logic to decide whether to use the Hugging Face tokenizer or the SentencePiece tokenizer.
This adds support for models that use a Hugging Face tokenizer, such as Falcon and DeepSeek Coder.
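The selection logic described above can be as simple as probing which tokenizer artifacts ship with the model. A minimal sketch of that decision, following the usual file conventions (tokenizer.model for SentencePiece, tokenizer.json for Hugging Face Tokenizers); this mirrors the idea of the PR, not its exact code:

```python
# Decide which tokenizer backend to use based on the files present in the
# model directory. Illustrative only; not the implementation in this PR.
import os

def pick_tokenizer(model_dir: str):
    sp_path = os.path.join(model_dir, "tokenizer.model")  # SentencePiece model
    hf_path = os.path.join(model_dir, "tokenizer.json")   # HF Tokenizers file

    if os.path.exists(sp_path):
        from sentencepiece import SentencePieceProcessor
        return SentencePieceProcessor(model_file=sp_path)

    if os.path.exists(hf_path):
        from tokenizers import Tokenizer  # optional dependency
        return Tokenizer.from_file(hf_path)

    raise FileNotFoundError(
        f"No tokenizer.model or tokenizer.json found in {model_dir}")
```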