supotrois/tokenizer

Unicode Tokenization and Encoding

Overview

This script implements tokenization and encoding functions for Unicode text. It demonstrates tokenizing text, generating token pairs, merging tokens, and encoding text with different encoding schemes.
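
Byte-level tokenization starts by turning Unicode text into raw bytes, and the byte sequence depends on the encoding scheme. A minimal sketch of the two encodings compared here (the sample string and variable names are illustrative, not taken from the script):

```python
# Encode a Unicode string to raw bytes before tokenization.
text = "héllo"

utf8_bytes = list(text.encode("utf-8"))    # variable-width: 1-4 bytes per code point
utf16_bytes = list(text.encode("utf-16"))  # 2 bytes per BMP code point, plus a BOM

print(utf8_bytes)   # "é" becomes the two bytes 0xC3 0xA9
print(utf16_bytes)  # starts with the byte-order mark 0xFF 0xFE
```

Note that the same five-character string produces 6 bytes in UTF-8 but 12 in UTF-16, which is why byte-level tokenizers such as GPT-4's operate on UTF-8.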

Features

  • Tokenization and encoding of Unicode text
  • Token pair generation and merging
  • Encoding using UTF-8 and UTF-16
  • Implementation of GPT-4 style tokenization
  • Demonstration of whitespace handling in different tokenization schemes
  • Downloading and using a pre-trained encoder and vocabulary for tokenization
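
The pair-generation and merging steps above are the core of byte-pair encoding. A minimal sketch of one merge round, assuming a BPE-style loop (the names `get_stats` and `merge` are illustrative and may differ from the script's):

```python
def get_stats(ids):
    """Count how often each adjacent token pair occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes, find the most frequent pair, and merge it.
ids = list("aaabdaaabac".encode("utf-8"))
stats = get_stats(ids)
top_pair = max(stats, key=stats.get)
ids = merge(ids, top_pair, 256)  # 256 = first token id beyond the byte range
print(top_pair, ids)
```

Repeating this loop, minting a fresh id each round, grows the vocabulary from the 256 raw bytes up to the target size.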

Usage

  1. Tokenization and Encoding: The script can be run to tokenize and encode Unicode text. Simply execute the script and observe the output.

  2. GPT-4 Style Tokenization: The script demonstrates GPT-4 style tokenization and whitespace handling. Modify the text input and observe the tokenization results.

  3. Pre-trained Encoder and Vocabulary: The script downloads pre-trained encoder and vocabulary files for tokenization. An internet connection is required to download these files.

Dependencies

  • Python 3.x
  • tiktoken library (install using pip install tiktoken)

About

gpt4-esque tokenizer
