Add initial support for Recursive Chunking (RecursiveChunker) #107

Merged: 11 commits, Dec 27, 2024

bhavnicksm
Collaborator

This pull request introduces a new RecursiveChunker class and makes several related updates to the codebase. RecursiveChunker chunks text hierarchically using customizable rules. The PR also updates README.md, the package imports, and other existing components (the refinery modules).

New Feature:

  • RecursiveChunker class: A new class that chunks text hierarchically, applying customizable rules level by level to produce semantically meaningful chunks. It includes methods for splitting text, merging splits, and the recursive chunking logic itself. (src/chonkie/chunker/recursive.py)
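To illustrate the split, merge, and recurse steps described above, here is a minimal sketch. It is not the code in src/chonkie/chunker/recursive.py: the function names, the whitespace-based `count_tokens`, and the use of a plain separator list as the "rules" are all assumptions made for illustration.

```python
from typing import List, Sequence


def count_tokens(text: str) -> int:
    """Hypothetical token counter: counts whitespace-separated words."""
    return len(text.split())


def split_keep_separator(text: str, sep: str) -> List[str]:
    """Split on `sep`, keeping the separator attached to the preceding piece."""
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def merge_splits(splits: Sequence[str], chunk_size: int) -> List[str]:
    """Greedily merge adjacent splits while the merged text fits in chunk_size."""
    merged: List[str] = []
    current = ""
    for piece in splits:
        if current and count_tokens(current + piece) > chunk_size:
            merged.append(current)
            current = piece
        else:
            current += piece
    if current:
        merged.append(current)
    return merged


def recursive_chunk(text: str, separators: Sequence[str], chunk_size: int) -> List[str]:
    """Chunk `text`, trying coarse separators first and finer ones on oversized pieces."""
    if count_tokens(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No rules left (assumed fallback): fixed-size windows of words.
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in merge_splits(split_keep_separator(text, head), chunk_size):
        if count_tokens(piece) > chunk_size:
            # Still too large: recurse with the next, finer separator.
            chunks.extend(recursive_chunk(piece, rest, chunk_size))
        else:
            chunks.append(piece)
    return chunks


if __name__ == "__main__":
    sample = "a b. c d.\n\ne f. g h."
    print(recursive_chunk(sample, ["\n\n", ". "], chunk_size=2))
```

Because separators stay attached to the preceding piece, concatenating the chunks reproduces the input whenever the word-window fallback is not triggered.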

Documentation Updates:

  • README.md: Added RecursiveChunker to the list of available chunkers and changed the citation format to BibTeX.

Import and Export Adjustments:

  • __init__.py files: Updated import statements to include RecursiveChunker and related types, so the new class is exported from the package. (src/chonkie/__init__.py, src/chonkie/chunker/__init__.py)

Refinery Enhancements:

  • base.py: Added a refine_batch method to handle batches of chunks and updated the __call__ method to support both single and batch processing of chunks. (src/chonkie/refinery/base.py)
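The single-versus-batch dispatch described above could look roughly like the following. This is a sketch under assumptions, not the contents of src/chonkie/refinery/base.py: the `Chunk` dataclass, the `SketchRefinery` and `StripRefinery` names, and the rule "a list of lists is a batch" are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Chunk:
    """Hypothetical minimal chunk type used only in this sketch."""
    text: str


class SketchRefinery(ABC):
    """Illustrative base class; the name and signatures are assumptions."""

    @abstractmethod
    def refine(self, chunks: List[Chunk]) -> List[Chunk]:
        """Refine the chunks of a single document."""

    def refine_batch(self, batch: List[List[Chunk]]) -> List[List[Chunk]]:
        """Refine several documents' chunks by applying refine to each."""
        return [self.refine(chunks) for chunks in batch]

    def __call__(
        self, chunks: Union[List[Chunk], List[List[Chunk]]]
    ) -> Union[List[Chunk], List[List[Chunk]]]:
        # Assumed dispatch rule: a list whose first element is a list is a batch.
        if chunks and isinstance(chunks[0], list):
            return self.refine_batch(chunks)
        return self.refine(chunks)


class StripRefinery(SketchRefinery):
    """Toy refinery that strips surrounding whitespace from each chunk."""

    def refine(self, chunks: List[Chunk]) -> List[Chunk]:
        return [Chunk(c.text.strip()) for c in chunks]


if __name__ == "__main__":
    refinery = StripRefinery()
    print(refinery([Chunk("  one  ")]))
    print(refinery([[Chunk(" a ")], [Chunk("b ")]]))
```

Keeping `refine` as the only abstract method means subclasses get batch support for free, while still being able to override `refine_batch` if a faster batched path exists.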

Other Refinements:

  • overlap.py: Improved token handling by introducing an _AVG_CHAR_PER_TOKEN constant and updating methods to use it for more accurate token estimates. (src/chonkie/refinery/overlap.py)
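As a rough illustration of how an average characters-per-token constant can drive token estimates, consider the sketch below. Only the constant's name comes from this PR; its value here (4.0) and both helper functions are assumptions, and the real logic in src/chonkie/refinery/overlap.py may differ.

```python
import math

# Assumed value for illustration; the PR does not state the actual number.
_AVG_CHAR_PER_TOKEN = 4.0


def estimate_token_count(text: str) -> int:
    """Estimate tokens from character length without running a tokenizer."""
    return math.ceil(len(text) / _AVG_CHAR_PER_TOKEN)


def approximate_overlap_suffix(text: str, overlap_tokens: int) -> str:
    """Return roughly the last `overlap_tokens` tokens of `text`, measured in characters."""
    n_chars = int(overlap_tokens * _AVG_CHAR_PER_TOKEN)
    # Guard against text[-0:], which would return the whole string.
    return text[-n_chars:] if n_chars > 0 else ""


if __name__ == "__main__":
    print(estimate_token_count("abcdefgh"))
    print(repr(approximate_overlap_suffix("hello world!", 2)))
```

A character-based estimate trades precision for speed: it avoids tokenizer calls, at the cost of being off for text whose token lengths differ from the assumed average.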

@bhavnicksm bhavnicksm merged commit 3f9632a into development Dec 27, 2024
1 check passed
@bhavnicksm bhavnicksm deleted the add-recursive branch December 27, 2024 10:44