Core features:
- 21MB default install vs 80-171MB alternatives
- 33x faster token chunking than popular alternatives
- Supports multiple chunking strategies: token, word, sentence, and semantic (quick usage sketch below)
- Works with all major tokenizers (transformers, tokenizers, tiktoken)
- Zero external dependencies for basic functionality
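Roughly, using it looks like this (a quick sketch; the README has the exact API and parameter names):

```python
# Rough usage sketch of token-based chunking with Chonkie; see the README for exact defaults.
from tokenizers import Tokenizer
from chonkie import TokenChunker

tokenizer = Tokenizer.from_pretrained("gpt2")                # any supported tokenizer
chunker = TokenChunker(tokenizer, chunk_size=512, chunk_overlap=64)

chunks = chunker.chunk("Some long document text that needs splitting...")
for chunk in chunks:
    print(chunk.token_count, chunk.text[:40])
```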
Technical optimizations:
- Uses tiktoken with multi-threading for faster tokenization
- Implements aggressive caching and precomputation
- Running mean pooling for efficient semantic chunking (sketched below)
- Modular dependency system (install only what you need)
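The running-mean trick in isolation looks roughly like this (a simplified sketch, not the library's actual code): keep a running sum of the sentence embeddings in the current chunk and divide by the count, so growing a chunk by one sentence doesn't require re-pooling everything.

```python
import numpy as np

def extend_chunk_mean(running_sum: np.ndarray, count: int, new_emb: np.ndarray):
    """Update a chunk's mean embedding in O(dim) when one more sentence is added,
    instead of re-averaging every sentence embedding from scratch."""
    running_sum = running_sum + new_emb
    count += 1
    return running_sum, count, running_sum / count
```

The updated mean can then be compared (e.g. by cosine similarity) against the next sentence's embedding to decide whether the chunk should keep growing.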
Benchmarks and code: https://github.com/bhavnicksm/chonkie
Looking for feedback on the architecture and performance optimizations. What other chunking strategies would be useful for RAG applications?
I've been hoping to find an ultra light-weight chunking library that can do things like very simple regex-based sentence/paragraph/markdown-aware chunking with minimal additional dependencies.
The more complicated part is the effective bin-packing problem that emerges depending on how many different contextual sources you have.
[1] https://gist.github.com/LukasKriesch/e75a0132e93ca989f8870c4...
[2] https://jina.ai/segmenter/
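To illustrate the kind of thing I mean, the regex-based splitting (plus a naive greedy version of the packing step) fits in a few lines; the size limit and patterns here are arbitrary:

```python
import re

def simple_chunk(text: str, max_chars: int = 1200) -> list[str]:
    """Split on blank lines (paragraphs), then on sentence-ending punctuation,
    and greedily pack the pieces into chunks of up to max_chars."""
    pieces = []
    for para in re.split(r"\n\s*\n", text):
        pieces.extend(p for p in re.split(r"(?<=[.!?])\s+", para.strip()) if p)

    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```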
I just removed one sentence at a time from the left until there was a jump in the embedding distance. Then repeated for the right side.
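One way to implement the left-side pass looks roughly like this (a sketch; `embed` is whatever sentence-embedding function you use, and the jump threshold is hand-tuned):

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def trim_left(sentences: list[str], embed, jump_threshold: float = 0.15) -> list[str]:
    """Drop sentences from the left, one at a time, until dropping the next one
    causes a jump in embedding distance; then stop and keep the rest."""
    prev = embed(" ".join(sentences))
    start = 0
    for i in range(1, len(sentences)):
        cur = embed(" ".join(sentences[i:]))
        if cosine_dist(prev, cur) > jump_threshold:
            break                      # removing sentences[i-1] changed the meaning too much
        start, prev = i, cur
    return sentences[start:]
# The right side is handled symmetrically, iterating from the end.
```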
I hope that you will stick with Chonkie for the journey of making the 'perfect' chunking library!
Thanks again!
1) what
Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...
Algorithmically, there's not much difference in TokenChunking between Chonkie, LangChain, or any other TokenChunking implementation you might want to use (except LlamaIndex; I don't know what mess they made to end up with a 33x slower algorithm).
If you only want TokenChunking (which I don't entirely recommend), then rather than Chonkie or LangChain, just write your own for production :) At the very least, don't install 80 MiB packages just for TokenChunking; Chonkie is 4x smaller than they are.
That's just my honest response... And these benchmarks are just the beginning; future optimizations to SemanticChunking should push its speed-up beyond the current 2.5x (2nd place right now).
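For what it's worth, "write your own" really is just a sliding window over token IDs; something like this (a sketch using tiktoken; the sizes are arbitrary):

```python
import tiktoken

def token_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Bare-bones TokenChunking: encode once, slide a fixed-size window with
    overlap over the token IDs, and decode each window back to text."""
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(ids[i:i + chunk_size]) for i in range(0, len(ids), step)]
```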
I’m using o1-preview for chunking, creating summary subdocuments.
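Concretely, each section gets sent through the model and the returned summary is stored as its own sub-document alongside the original text; a rough sketch (the prompt and section splitting here are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def summary_subdocument(section: str) -> str:
    """Produce a short, retrieval-friendly summary of one section; the summary is
    indexed as its own sub-document while the full section is kept as the payload."""
    resp = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user",
                   "content": f"Write a short summary of this section for retrieval:\n\n{section}"}],
    )
    return resp.choices[0].message.content
```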
Thanks for responding, I'll try to make it easier to use something like that in Chonkie in the future!
Chunking is easily where all of these problems die beyond PoC scale.
I’ve talked to multiple code generation companies in the past week — most are stuck with BM25 and taking in whole files.
But, it's on the roadmap, so please hold on!