
Set of classes to tokenize Amharic language sentences.

ymitiku/amtokenizers

Amharic Language Tokenizers

This package contains a set of classes for encoding Amharic sentences into tokens that language models can consume. The tokenizers are trained on the Contemporary Amharic Corpus (CACO) dataset.

Installing

Pip installation

pip install -i https://test.pypi.org/simple/ amtokenizers==0.0.5

Sample Code

Variable length

from amtokenizers import AmTokenizer

a = AmTokenizer(10000, 5, "byte_bpe")
encoded = a.encode("አበበ በሶ በላ።")
# tokens() is called as a method here, matching the fixed-length sample
print("encoded", encoded.tokens())
# encoded ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį', '¢', '</s>']
print("decoded:", a.decode(encoded.input_ids))
# decoded: <s>አበበ በሶ በላ።</s>
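The token strings in the sample output (for example 'áĬł') are not corrupted text. They are consistent with the byte-level representation used by GPT-2 style byte-level BPE, where every UTF-8 byte is shown as one printable character and a leading space becomes 'Ġ'. The sketch below reproduces that mapping in plain Python. It assumes the "byte_bpe" mode uses the standard GPT-2 byte-to-character table, which the source does not state explicitly; the helper names `to_byte_level` and `from_byte_level` are illustrative and are not part of amtokenizers.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each byte value (0-255) to a printable character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to an unused code point starting at 256.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Show text the way a byte-level BPE vocabulary displays it (hypothetical helper)."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(tokens):
    """Turn byte-level token strings back into readable text (hypothetical helper)."""
    data = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return data.decode("utf-8")


print(to_byte_level("አ"))
# áĬł
print(to_byte_level(" በላ"))
# ĠáīłáĪĭ
print(from_byte_level([to_byte_level("አበበ"), to_byte_level(" በሶ")]))
# አበበ በሶ
```

Special tokens such as `<s>` and `</s>` are not byte-level strings, so they should be filtered out before passing a token list to `from_byte_level`.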

Fixed length

from amtokenizers import AmTokenizer

# max_length pads the encoding to a fixed number of tokens
a = AmTokenizer(10000, 5, "byte_bpe", max_length=16)
encoded = a.encode("አበበ በሶ በላ።")
print("encoded", encoded.tokens())
# encoded ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį', '¢', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
print(encoded.input_ids)
# [0, 337, 3251, 3598, 3486, 270, 100, 2, 1, 1, 1, 1, 1, 1, 1, 1]
print("decoded:", a.decode(encoded.input_ids))
# decoded: <s>አበበ በሶ በላ።</s><pad><pad><pad><pad><pad><pad><pad><pad>
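The fixed-length output above is the variable-length encoding followed by `<pad>` tokens, which have id 1 in the sample. A minimal sketch of that right-padding step is shown below. It is not the package's own implementation: `pad_to_length` is a hypothetical helper, and the attention mask is an addition commonly used with padded inputs rather than something the sample returns. How the package treats inputs longer than `max_length` is not stated, so this sketch leaves them unchanged.

```python
def pad_to_length(ids, max_length, pad_id=1):
    """Right-pad ids with pad_id up to max_length; return (padded_ids, attention_mask)."""
    n_pad = max(0, max_length - len(ids))
    padded = list(ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids) + [0] * n_pad
    return padded, mask


# Ids of the unpadded tokens from the sample output above.
ids = [0, 337, 3251, 3598, 3486, 270, 100, 2]
padded, mask = pad_to_length(ids, 16)
print(padded)
# [0, 337, 3251, 3598, 3486, 270, 100, 2, 1, 1, 1, 1, 1, 1, 1, 1]
print(mask)
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```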

Disclaimer

This package is heavily inspired by Hugging Face's tutorial "How to train a new language model from scratch using Transformers and Tokenizers".
