supotrois/tokenizer

Unicode Tokenization and Encoding

Overview

This script implements tokenization and encoding functions for Unicode text. It demonstrates tokenizing text, generating token pairs, merging tokens, and encoding text with different encoding schemes.
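
Byte-level tokenization starts by turning Unicode text into raw bytes, and the byte sequence depends on the encoding scheme. A minimal sketch of the two encodings compared here (the sample string and variable names are illustrative, not taken from the script):

```python
# Encode a Unicode string to raw bytes before tokenization.
text = "héllo"

utf8_bytes = list(text.encode("utf-8"))    # variable-width: 1-4 bytes per code point
utf16_bytes = list(text.encode("utf-16"))  # 2 bytes per BMP code point, plus a BOM

print(utf8_bytes)   # "é" becomes the two bytes 0xC3 0xA9
print(utf16_bytes)  # starts with the byte-order mark 0xFF 0xFE
```

Note that the same five-character string produces 6 bytes in UTF-8 but 12 in UTF-16, which is why byte-level tokenizers such as GPT-4's operate on UTF-8.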

Features

  • Tokenization and encoding of Unicode text
  • Token pair generation and merging
  • Encoding using UTF-8 and UTF-16
  • Implementation of GPT-4 style tokenization
  • Demonstration of whitespace handling in different tokenization schemes
  • Downloading and using a pre-trained encoder and vocabulary for tokenization
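
The pair-generation and merging steps above are the core of byte-pair encoding. A minimal sketch of one merge round, assuming a BPE-style loop (the names `get_stats` and `merge` are illustrative and may differ from the script's):

```python
def get_stats(ids):
    """Count how often each adjacent token pair occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes, find the most frequent pair, and merge it.
ids = list("aaabdaaabac".encode("utf-8"))
stats = get_stats(ids)
top_pair = max(stats, key=stats.get)
ids = merge(ids, top_pair, 256)  # 256 = first token id beyond the byte range
print(top_pair, ids)
```

Repeating this loop, minting a fresh id each round, grows the vocabulary from the 256 raw bytes up to the target size.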

Usage

  1. Tokenization and Encoding: The script can be run to tokenize and encode Unicode text. Simply execute the script and observe the output.

  2. GPT-4 Style Tokenization: The script demonstrates GPT-4 style tokenization and whitespace handling. Modify the text input and observe the tokenization results.

  3. Pre-trained Encoder and Vocabulary: The script downloads pre-trained encoder and vocabulary files for tokenization. An internet connection is required to download these files.

Dependencies

  • Python 3.x
  • tiktoken library (install using pip install tiktoken)

About

gpt4-esque tokenizer
