Add initial support for Recursive Chunking (RecursiveChunker) #107
This pull request introduces a new `RecursiveChunker` class and makes several related updates to the codebase. The `RecursiveChunker` class provides a hierarchical approach to text chunking using customizable rules. Additionally, there are updates to the `README.md` file, imports, and other existing chunkers.

New Feature:
`RecursiveChunker` class: A new class that chunks text hierarchically using customizable rules to create semantically meaningful chunks. This includes methods for splitting text, merging splits, and recursive chunking logic. (`src/chonkie/chunker/recursive.py`)

Documentation Updates:
`README.md`: Added `RecursiveChunker` to the list of available chunkers and updated the citation format to `bibtex`. [1] [2]

Import and Export Adjustments:
`__init__.py` files: Updated import statements to include `RecursiveChunker` and related types, ensuring the new class is properly integrated into the module. (`src/chonkie/__init__.py`, `src/chonkie/chunker/__init__.py`) [1] [2] [3] [4] [5]

Refinery Enhancements:
`base.py`: Added a `refine_batch` method to handle batches of chunks and updated the `__call__` method to support both single and batch processing of chunks. (`src/chonkie/refinery/base.py`) [1] [2]

Other Refinements:
`overlap.py`: Improved token handling by introducing `_AVG_CHAR_PER_TOKEN` and updating methods to use this constant for more accurate token estimates. (`src/chonkie/refinery/overlap.py`) [1] [2] [3] [4]
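
For readers unfamiliar with the split, merge, and recurse pattern that the `RecursiveChunker` item describes, here is a minimal standalone sketch. It is not the code in `src/chonkie/chunker/recursive.py`: the function names (`count_tokens`, `split_keep_delimiter`, `merge_splits`, `recursive_chunk`), the delimiter hierarchy, and the whitespace-based token count are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


def split_keep_delimiter(text: str, delimiter: str) -> List[str]:
    """Split on `delimiter`, keeping it attached to the end of each piece."""
    parts = text.split(delimiter)
    pieces = [part + delimiter for part in parts[:-1]] + [parts[-1]]
    return [piece for piece in pieces if piece]


def merge_splits(splits: Sequence[str], chunk_size: int) -> List[str]:
    """Greedily merge adjacent splits while the merged text fits in chunk_size."""
    merged: List[str] = []
    current = ""
    for piece in splits:
        if current and count_tokens(current + piece) > chunk_size:
            merged.append(current)
            current = piece
        else:
            current += piece
    if current:
        merged.append(current)
    return merged


def recursive_chunk(
    text: str,
    chunk_size: int,
    delimiters: Tuple[str, ...] = ("\n\n", ". ", " "),  # hypothetical rule hierarchy
) -> List[str]:
    """Split with the coarsest rule first; recurse with finer rules on oversized pieces."""
    if count_tokens(text) <= chunk_size:
        return [text] if text else []
    if not delimiters:
        # No finer rule is left, so this sketch accepts an oversized chunk.
        return [text]
    head, rest = delimiters[0], delimiters[1:]
    chunks: List[str] = []
    for piece in merge_splits(split_keep_delimiter(text, head), chunk_size):
        if count_tokens(piece) > chunk_size:
            chunks.extend(recursive_chunk(piece, chunk_size, rest))
        else:
            chunks.append(piece)
    return chunks
```

Because each delimiter stays attached to the piece it terminates, concatenating the returned chunks reproduces the input text exactly, which makes the sketch easy to check.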
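
The single-versus-batch dispatch described for the refinery `__call__` method could look roughly like the sketch below. The `Chunk` dataclass, the class names, and the rule "a list whose first element is a list is a batch" are assumptions made for illustration; the actual logic in `src/chonkie/refinery/base.py` may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Chunk:
    # Hypothetical minimal chunk type, used only in this sketch.
    text: str


class BaseRefinerySketch(ABC):
    """Illustrative base class: subclasses implement `refine` for one document."""

    @abstractmethod
    def refine(self, chunks: List[Chunk]) -> List[Chunk]:
        """Refine the chunks of a single document."""

    def refine_batch(self, batch: List[List[Chunk]]) -> List[List[Chunk]]:
        # Apply the single-document method to every document in the batch.
        return [self.refine(chunks) for chunks in batch]

    def __call__(
        self, chunks: Union[List[Chunk], List[List[Chunk]]]
    ) -> Union[List[Chunk], List[List[Chunk]]]:
        # Assumption: a list whose first element is itself a list is a batch.
        if chunks and isinstance(chunks[0], list):
            return self.refine_batch(chunks)
        return self.refine(chunks)


class StripRefinery(BaseRefinerySketch):
    """Toy refinery that trims surrounding whitespace from each chunk."""

    def refine(self, chunks: List[Chunk]) -> List[Chunk]:
        return [Chunk(chunk.text.strip()) for chunk in chunks]
```

With this shape, `StripRefinery()(chunks)` accepts either the chunks of one document or a list of such lists. An empty list is treated as a single empty document.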
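
The description does not show how `_AVG_CHAR_PER_TOKEN` is used, so the following is only one plausible reading: a character-to-token ratio that lets overlap logic estimate token counts without running a tokenizer. The value `4.0` and both helper functions are placeholders invented for this sketch, not taken from `src/chonkie/refinery/overlap.py`.

```python
import math

# Assumption: placeholder ratio for illustration; the PR's actual value is not shown.
_AVG_CHAR_PER_TOKEN = 4.0


def estimate_token_count(text: str) -> int:
    """Estimate the token count of `text` from its character length."""
    return math.ceil(len(text) / _AVG_CHAR_PER_TOKEN)


def approx_overlap_suffix(text: str, overlap_tokens: int) -> str:
    """Return roughly the last `overlap_tokens` tokens of `text`, measured in characters."""
    n_chars = int(overlap_tokens * _AVG_CHAR_PER_TOKEN)
    return text[-n_chars:] if n_chars > 0 else ""
```

An estimate like this trades precision for speed, so a caller that needs an exact token budget would still have to verify the result with the real tokenizer.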