AutoTikTokenizer

🚀 Accelerate your HuggingFace tokenizers by converting them to TikToken format with AutoTikTokenizer - get TikToken's speed while keeping HuggingFace's flexibility.

Features • Installation • Examples • Supported Models • Benchmarks • Sharp Bits • Citation

Key Features

🚀 High Performance - Built on TikToken's efficient tokenization engine
🔄 HuggingFace Compatible - Seamless integration with the HuggingFace ecosystem
📦 Lightweight - Minimal dependencies, just TikToken and huggingface-hub
🎯 Easy to Use - Simple, intuitive API that works out of the box
💻 Well Tested - Comprehensive test suite across supported models

Installation

Install autotiktokenizer from PyPI via the following command:

pip install autotiktokenizer

You can also install it from source with the following command:

pip install git+https://github.com/bhavnicksm/autotiktokenizer

Examples

This section provides a basic usage example of the project. Follow these simple steps to get started quickly.

# step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) step 4: Decode the outputs
text = tokenizer.decode(encodings)

Supported Models

AutoTikTokenizer should ideally support all models on the HF Hub, but because of the vast diversity of models out there, we cannot test every single one. The models below are the ones we have already validated and know AutoTikTokenizer works well with. If you have a model you wish to see here, raise an issue and we will validate it and add it to the list. Thanks :)

GPT2
GPT-J Family
SmolLM Family: SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B etc.
Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct etc.
Deepseek Family: Deepseek-v2.5 etc.
Gemma2 Family: Gemma2-2b-it, Gemma2-9b-it etc.
Mistral Family: Mistral-7B-Instruct-v0.3 etc.
Aya Family: Aya-23B, Aya Expanse etc.
BERT Family: BERT, RoBERTa, MiniLM, TinyBERT, DeBERTa etc.

NOTE: Some models use unigram tokenizers, which TikToken does not support, so AutoTikTokenizer cannot convert the tokenizers for such models. Models that use unigram tokenizers include T5, ALBERT, Marian and XLNet.
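
One way to spot such a model before attempting a conversion is to look at the model.type field of its tokenizer.json file, where HuggingFace fast tokenizers usually record the algorithm (for example BPE, WordPiece or Unigram). The sketch below is a minimal illustration under that assumption. The helper names tokenizer_model_type and is_unigram are hypothetical and not part of AutoTikTokenizer, and nothing here claims to be how the library itself performs the check.

```python
import json


def tokenizer_model_type(tokenizer_json: str):
    """Return the model.type string from tokenizer.json text, or None if absent."""
    config = json.loads(tokenizer_json)
    model = config.get("model") or {}
    return model.get("type")


def is_unigram(tokenizer_json: str) -> bool:
    """True when the tokenizer declares itself as a Unigram model."""
    return tokenizer_model_type(tokenizer_json) == "Unigram"


# Tiny hand-written stand-ins for real tokenizer.json files (illustrative only).
bpe_example = json.dumps({"model": {"type": "BPE", "vocab": {}, "merges": []}})
unigram_example = json.dumps({"model": {"type": "Unigram", "vocab": []}})

print(is_unigram(bpe_example))
print(is_unigram(unigram_example))
```

A file with no model.type entry yields None here, so treat that case as "unknown" rather than as proof that the tokenizer can be converted.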

Benchmarks

Benchmark results for tokenizing 1 billion tokens from the fineweb-edu dataset using the Llama 3.2 tokenizer on CPU (Google Colab):

Configuration	Processing Type	AutoTikTokenizer	HuggingFace	Speed Ratio
Single Thread	Sequential	14:58 (898s)	40:43 (2443s)	2.72x faster
Batch x1	Batched	15:58 (958s)	10:30 (630s)	0.66x (slower)
Batch x4	Batched	8:00 (480s)	10:30 (630s)	1.31x faster
Batch x8	Batched	6:32 (392s)	10:30 (630s)	1.61x faster
4 Processes	Parallel	2:34 (154s)	8:59 (539s)	3.50x faster

The table above shows that AutoTikTokenizer's tokenizer (TikToken) is 1.6 to 3.5 times faster than HuggingFace's tokenizer under fair comparison. The exception is a batch size of 1, where it is slower. While it does not yet make optimal use of TikToken, it is still considerably faster than the stock solutions you might otherwise be using.
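
The Speed Ratio column appears to be the HuggingFace time divided by the AutoTikTokenizer time, so a value above 1 means AutoTikTokenizer finished sooner. That reading is an assumption based on the table's figures, not a documented formula. A short sketch that recomputes the ratios from the timings in seconds, with speed_ratio as a hypothetical helper name:

```python
# Timings in seconds, copied from the benchmark table: (AutoTikTokenizer, HuggingFace).
timings = {
    "Single Thread": (898, 2443),
    "Batch x1": (958, 630),
    "Batch x4": (480, 630),
    "Batch x8": (392, 630),
    "4 Processes": (154, 539),
}


def speed_ratio(autotik_seconds: float, hf_seconds: float) -> float:
    """HuggingFace time divided by AutoTikTokenizer time; above 1.0 means AutoTikTokenizer is faster."""
    return hf_seconds / autotik_seconds


for name, (autotik, hf) in timings.items():
    ratio = speed_ratio(autotik, hf)
    verdict = "faster" if ratio > 1 else "slower"
    print(f"{name}: {ratio:.2f}x ({verdict})")
```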

Sharp Bits

A known issue of the repository is that it does not do any pre-processing or post-processing. If a tokenizer (like MiniLM) expects lower-case input only, you need to convert the text to lower case manually. Similarly, any spaces added in the process are not removed during decoding, so you need to handle them on your own.
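
The manual handling described above can be sketched as two small wrappers around encode and decode. This is a minimal illustration, not part of the library: encode_lowercased and decode_stripped are hypothetical helper names, and ToyTokenizer is a stand-in so the snippet runs without downloading a model. Which characters a real tokenizer adds depends on the model, and stripping only leading and trailing whitespace is an assumption that may not cover every case.

```python
def encode_lowercased(tokenizer, text: str):
    """Lower-case the text manually before encoding, for tokenizers that expect uncased input."""
    return tokenizer.encode(text.lower())


def decode_stripped(tokenizer, token_ids) -> str:
    """Decode, then trim leading and trailing whitespace that decoding left in place."""
    return tokenizer.decode(token_ids).strip()


class ToyTokenizer:
    """Stand-in with the same encode/decode shape; NOT a real AutoTikTokenizer object."""

    def encode(self, text: str):
        # Mimics a tokenizer that adds a leading space during encoding.
        return [ord(ch) for ch in " " + text]

    def decode(self, token_ids) -> str:
        return "".join(chr(i) for i in token_ids)


toy = ToyTokenizer()
ids = encode_lowercased(toy, "Hello World")
print(repr(decode_stripped(toy, ids)))
```

With a real model, pass the object returned by AutoTikTokenizer.from_pretrained in place of the toy tokenizer, since it exposes encode and decode as shown in the Examples section.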

There might be more sharp bits in the repository that are unknown at the moment. Please raise an issue if you encounter any!

Acknowledgement

Special thanks to HuggingFace and OpenAI for creating the open-source libraries that make this work possible. I hope they will continue to support the developer ecosystem for LLMs in the future!

If you found this repository useful, give it a ⭐️! Thank You :)

Citation

If you use autotiktokenizer in your research, please cite it as follows:

@misc{autotiktokenizer,
    author = {Bhavnick Minhas},
    title = {AutoTikTokenizer},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}
