Bump version to 0.1.1 and enhance _AutoTikTokenizer functionality #4
This pull request includes several updates to the `autotiktokenizer` package, focusing on version updates, improvements to the `AutoTikTokenizer` class, and enhancements to the test suite.

Version Updates:
- Bumped the version from `0.1.0` to `0.1.1` in `pyproject.toml` and `__init__.py` to reflect the new changes. [1] [2]

Codebase Enhancements:
- Updated the `__init__` method in `src/autotiktokenizer/autotiktokenizer.py` to initialize `bytes_encoder` and `bytes_decoder`, and added new methods `_bytes_to_unicode` and `_normalize_token_bytes` for handling byte-to-unicode conversions.
- Modified `get_mergable_ranks` to accept `vocab` and `special_tokens` as parameters and updated its logic to use the new `_normalize_token_bytes` method.
- Updated `get_tiktoken_encoding` to accept `vocab` as a parameter and adjusted related method calls accordingly.
- Updated the `from_pretrained` and `__call__` methods to improve their functionality and added a `__repr__` method for better representation.

Testing Improvements:
- Added `sample_text` in `tests/test_gpt2.py` to provide a comprehensive text sample for testing.
- Added `test_long_text` to validate the encoding and decoding of long texts, ensuring the tokenizer's accuracy and robustness.
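
For readers unfamiliar with the byte-to-unicode handling mentioned under Codebase Enhancements, the following is a minimal sketch of the general technique, based on the well-known GPT-2 byte-level mapping. It is not the code from this pull request: the function names, signatures, and the toy vocabulary are assumptions made for illustration, and the actual `_bytes_to_unicode`, `_normalize_token_bytes`, and `get_mergable_ranks` methods may differ (for example in how they treat tokens outside the mapped alphabet).

```python
from typing import Dict, Set


def bytes_to_unicode() -> Dict[int, str]:
    """GPT-2 style table: map every byte value (0-255) to a printable character."""
    # Bytes that are already printable, non-whitespace characters map to themselves.
    printable = (
        list(range(0x21, 0x7F))    # '!' .. '~'
        + list(range(0xA1, 0xAD))  # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100)) # U+00AE .. U+00FF
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to a code point starting at 256.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


# Hypothetical module-level tables, playing the role of bytes_encoder / bytes_decoder.
bytes_encoder = bytes_to_unicode()
bytes_decoder = {ch: b for b, ch in bytes_encoder.items()}


def normalize_token_bytes(token: str) -> bytes:
    """Convert a vocabulary token written in the mapped alphabet back to raw bytes.

    Assumption: every character of the token is in the mapped alphabet.
    """
    return bytes(bytes_decoder[ch] for ch in token)


def build_mergeable_ranks(vocab: Dict[str, int], special_tokens: Set[str]) -> Dict[bytes, int]:
    """Build a bytes -> rank table from a string vocabulary, skipping special tokens."""
    return {
        normalize_token_bytes(token): rank
        for token, rank in vocab.items()
        if token not in special_tokens
    }


# Toy vocabulary (hypothetical): "Ġ" stands for a space and "Ċ" for a newline.
toy_vocab = {"Ġhello": 0, "Ċ": 1, "<|endoftext|>": 2}
ranks = build_mergeable_ranks(toy_vocab, special_tokens={"<|endoftext|>"})
```

A bytes-to-rank table of this shape is what `tiktoken.Encoding` expects for its `mergeable_ranks` argument, which is presumably why the vocabulary tokens need to be normalized to bytes first.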