I started working on this, but ran into a series of difficulties:
Tiktoken files are designed to be used together with a regex, and tokenizer.json does not define one. I'm trying to generate them from the Vocab, but that doesn't work with the compiled regex. Switching to a plain key-based replacement instead of the regex would require changes to the Core part of the library.
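For reference, this is roughly the pre-tokenization step a tiktoken-style encoder expects. The split pattern below is the published GPT-2 one from openai/gpt-2's encoder.py; the surrounding Python is only an illustrative sketch, not this library's API:

```python
# Minimal sketch of the pre-tokenization step a tiktoken-style encoder expects.
# The pattern is the published GPT-2 split regex (openai/gpt-2, encoder.py);
# tokenizer.json itself does not ship a compiled form of it.
import regex  # third-party "regex" package; stdlib "re" lacks \p{L}/\p{N}

GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges are then applied within each chunk."""
    return regex.findall(GPT2_SPLIT_PATTERN, text)

print(pre_tokenize("Hello, world!  How's it going?"))
# ['Hello', ',', ' world', '!', ' ', ' How', "'s", ' it', ' going', '?']
```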
When deserializing the JSON, model.merges and added_tokens end up empty for some reason.
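One possible cause, assuming the target classes are meant to mirror the file layout: in tokenizer.json, merges is nested under the top-level model object, while added_tokens sits at the top level, so a mismatch in nesting or field names would leave both empty. A minimal Python sketch of the actual layout:

```python
# Sketch of where the fields actually live in tokenizer.json: "merges" is
# nested under the top-level "model" object, while "added_tokens" is a
# top-level array of objects. A DTO that expects both at the same level
# (or with different casing) will deserialize them as empty.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]        # dict: token string -> id
merges = data["model"]["merges"]      # "left right" strings (pairs of strings in newer dumps)
added_tokens = data["added_tokens"]   # [{"id": ..., "content": ..., "special": ...}, ...]

print(len(vocab), len(merges), len(added_tokens))
```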
If we try to work with the generated regex, there is a problem with spaces.
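If the vocab comes straight out of tokenizer.json, one plausible source of the space problem (an assumption on my part, not confirmed above) is GPT-2's byte-level escaping: the stored tokens encode a leading space as Ġ (U+0120), so a regex matched against raw text never lines up with the vocab entries. The mapping, ported from openai/gpt-2's encoder.py:

```python
# GPT-2's byte-to-unicode mapping (from openai/gpt-2, encoder.py).
# tokenizer.json vocabs for GPT-2-style models store tokens in this escaped
# form, so the space byte appears as "Ġ" (U+0120) in every vocab entry.
def bytes_to_unicode() -> dict[int, str]:
    """Map every byte 0-255 to a printable unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])   # 'Ġ' -- the space byte as stored in the vocab
print("".join(byte_encoder[b] for b in " hello".encode("utf-8")))  # 'Ġhello'
```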
I'm a little out of context now; the bulk of the work on this library was done over a year ago, but I'd be glad for any help.
What would you like to be added:
It would be great to generate/load an encoder from a tokenizer.json file (see the reference sketch after the links), e.g.
https://huggingface.co/CohereForAI/aya-101/resolve/main/tokenizer.json
or
https://huggingface.co/openai-community/gpt2/raw/main/tokenizer.json
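For comparison, this is what the requested behavior looks like with Hugging Face's own tokenizers package in Python; the goal would be an equivalent loader in this library (the Python API is only the reference point, not a proposed signature):

```python
# Reference behavior: Hugging Face's `tokenizers` package loads the same file.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # e.g. the gpt2 file linked above
enc = tok.encode("Hello world")
print(enc.ids)     # token ids
print(enc.tokens)  # byte-level token strings, e.g. ['Hello', 'Ġworld']
```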
Why is this needed:
Easy use of the specific tokenizer for a given (mostly open-source) model.
Anything else we need to know?