Bump version to 0.1.1 and enhance _AutoTikTokenizer functionality #4

Merged: 1 commit into main on Nov 5, 2024

bhavnicksm
Collaborator

This pull request bumps the version of the autotiktokenizer package, reworks how the AutoTikTokenizer class handles byte-level tokens, and extends the test suite.

Version Updates:

  • Bumped the version from 0.1.0 to 0.1.1 in pyproject.toml and __init__.py.

Codebase Enhancements:

  • Refactored the __init__ method in src/autotiktokenizer/autotiktokenizer.py to initialize bytes_encoder and bytes_decoder, and added two methods, _bytes_to_unicode and _normalize_token_bytes, for byte-to-unicode conversion.
  • Modified get_mergable_ranks to accept vocab and special_tokens as parameters and updated its logic to use the new _normalize_token_bytes method.
  • Updated get_tiktoken_encoding to accept vocab as a parameter and adjusted related method calls accordingly.
  • Refactored the from_pretrained and __call__ methods and added a __repr__ method for a readable string representation.
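
For readers unfamiliar with byte-to-unicode handling, the following is a minimal sketch of the idea, assuming the GPT-2 style byte table. It is not the code from this PR: the function names mirror the ones listed above, but the bodies, the module-level BYTES_ENCODER and BYTES_DECODER tables, and the UTF-8 fallback are assumptions made for illustration.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map every byte value (0-255) to a printable character, GPT-2 style."""
    # Bytes that already render as visible characters keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are shifted to code points >= 256.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Hypothetical module-level tables standing in for bytes_encoder / bytes_decoder.
BYTES_ENCODER = bytes_to_unicode()
BYTES_DECODER = {ch: b for b, ch in BYTES_ENCODER.items()}


def normalize_token_bytes(token: str) -> bytes:
    """Convert a vocab token written in the printable alphabet back to raw bytes."""
    try:
        return bytes(BYTES_DECODER[ch] for ch in token)
    except KeyError:
        # Assumed fallback: the token has characters outside the byte table.
        return token.encode("utf-8")


def get_mergeable_ranks(vocab: dict[str, int], special_tokens: set[str]) -> dict[bytes, int]:
    """Build a bytes -> rank table, leaving special tokens out."""
    return {
        normalize_token_bytes(token): rank
        for token, rank in vocab.items()
        if token not in special_tokens
    }


ranks = get_mergeable_ranks(
    {"Ġhello": 5, "<|endoftext|>": 50256},
    {"<|endoftext|>"},
)
```

In this sketch the token "Ġhello" normalizes to the bytes of " hello", because the space byte is one of the non-printable bytes that the table shifts to a visible character.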

Testing Improvements:

  • Added a new fixture sample_text in tests/test_gpt2.py to provide a comprehensive text sample for testing.
  • Added a new test, test_long_text, that validates encoding and decoding of long texts.
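
The shape of such a long-text check can be sketched as a round trip: encode the text, decode the ids, and compare with the input. The sketch below is an assumption about the general pattern, not the test added in this PR. The check_round_trip helper and the UTF-8 stand-in tokenizer are hypothetical and are used only so the example runs without downloading a vocabulary.

```python
def check_round_trip(encode, decode, text: str) -> int:
    """Encode then decode `text` and verify nothing was lost; return the token count."""
    token_ids = encode(text)
    assert len(token_ids) > 0, "encoding produced no tokens"
    assert decode(token_ids) == text, "decoded text differs from the input"
    return len(token_ids)


# Hypothetical stand-in tokenizer: one token per UTF-8 byte.
def utf8_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))


def utf8_decode(token_ids: list[int]) -> str:
    return bytes(token_ids).decode("utf-8")


# A long sample mixing ASCII, accented characters, an emoji, and newlines.
sample_text = (
    "The quick brown fox jumps over the lazy dog. "
    "Ünïcödé, emoji 🚀, and\nnewlines.\n"
) * 50

n_tokens = check_round_trip(utf8_encode, utf8_decode, sample_text)
```

With a real tokenizer, the encode and decode methods of the tokenizer object would be passed in place of the stand-in functions.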

@bhavnicksm bhavnicksm merged commit d37b10b into main Nov 5, 2024