Bump version to 0.1.1 and enhance _AutoTikTokenizer functionality #4

bhavnicksm · 2024-11-05T19:27:14Z

This pull request includes several updates to the autotiktokenizer package, focusing on version updates, improvements to the AutoTikTokenizer class, and enhancements to the test suite.

Version Updates:

Updated the version from 0.1.0 to 0.1.1 in pyproject.toml and __init__.py to reflect the new changes.

Codebase Enhancements:

Refactored the __init__ method in src/autotiktokenizer/autotiktokenizer.py to initialize bytes_encoder and bytes_decoder, and added two new methods, _bytes_to_unicode and _normalize_token_bytes, for handling byte-to-unicode conversions.
Modified get_mergable_ranks to accept vocab and special_tokens as parameters and updated its logic to use the new _normalize_token_bytes method.
Updated get_tiktoken_encoding to accept vocab as a parameter and adjusted related method calls accordingly.
Refactored from_pretrained and __call__ methods to improve their functionality and added a __repr__ method for better representation.
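
For readers unfamiliar with the byte-to-unicode step, here is a minimal sketch of the idea, not the code in this PR. It assumes the package follows the well-known GPT-2 byte-level BPE convention, where every raw byte is mapped to a printable unicode character in the vocabulary. The function names, signatures, and the UTF-8 fallback below are illustrative assumptions; the actual private methods may differ.

```python
from __future__ import annotations


def bytes_to_unicode() -> dict[int, str]:
    """Build the GPT-2 style table: each of the 256 byte values -> one printable character."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Assumed counterparts of the bytes_encoder / bytes_decoder attributes.
BYTES_ENCODER = bytes_to_unicode()
BYTES_DECODER = {ch: b for b, ch in BYTES_ENCODER.items()}


def normalize_token_bytes(token: str) -> bytes:
    """Turn a vocab token string back into the raw bytes it stands for."""
    try:
        return bytes(BYTES_DECODER[ch] for ch in token)
    except KeyError:
        # Assumption: tokens containing characters outside the table are taken as UTF-8.
        return token.encode("utf-8")


def mergeable_ranks(vocab: dict[str, int], special_tokens: dict[str, int]) -> dict[bytes, int]:
    """Map token bytes -> rank, leaving special tokens to be registered separately."""
    return {
        normalize_token_bytes(tok): rank
        for tok, rank in vocab.items()
        if tok not in special_tokens
    }
```

In this convention the character "Ġ" stands for a leading space, so a vocab entry such as "Ġhello" becomes the byte string b" hello" before it is handed to a tiktoken-style encoding.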

Testing Improvements:

Added a new fixture sample_text in tests/test_gpt2.py to provide a comprehensive text sample for testing.
Introduced a new test test_long_text to validate the encoding and decoding of long texts, ensuring the tokenizer's accuracy and robustness.
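
The shape of such a round-trip check can be sketched as follows. This is an illustration only: the real sample_text fixture and the assertions in tests/test_gpt2.py are not shown in this description, so the text, the ByteTokenizer stand-in, and the check_round_trip helper are all hypothetical. In the actual suite, sample_text would be a pytest fixture and the tokenizer would be the one under test.

```python
from __future__ import annotations


def sample_text() -> str:
    """Hypothetical stand-in for the sample_text fixture: long, mixed-content text."""
    paragraph = (
        "The quick brown fox jumps over the lazy dog. "
        "Ünïcödé characters, an emoji 🙂, a tab\tand a newline.\n"
    )
    return paragraph * 50


class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token id per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


def check_round_trip(tokenizer, text: str) -> list[int]:
    """Encode then decode, and assert that the original text comes back unchanged."""
    ids = tokenizer.encode(text)
    assert len(ids) > 0, "long text should produce at least one token"
    assert tokenizer.decode(ids) == text, "decode(encode(text)) must equal text"
    return ids


def test_long_text() -> None:
    check_round_trip(ByteTokenizer(), sample_text())
```

Swapping ByteTokenizer for any object with encode and decode methods reuses the same check.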

Bump version to 0.1.1 and enhance _AutoTikTokenizer functionality

d32dc77

bhavnicksm merged commit d37b10b into main Nov 5, 2024
